Models
IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources
IHUBERT is a newly released monolingual Persian pretrained language model built on the RoBERTa-base architecture (125 million parameters). It was trained on a 45 GB subset of the Sepahr-Danesh collection, approximately 7-8 billion tokens. A multi-stage preprocessing pipeline applies semantic deduplication and distribution balancing to the training data, and the model achieves state-of-the-art results on several Persian NLU benchmarks, particularly in extractive question answering. For practitioners working on Persian language processing, it offers a semantically curated pretrained resource that addresses earlier limitations in training-data availability and evaluation diversity.
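The summary does not spell out how the preprocessing pipeline works, so the following is only a minimal sketch of what vector-based semantic deduplication followed by domain balancing can look like. It assumes each document already has an embedding vector and a domain label. The function names, the 0.9 similarity threshold, and the per-domain cap are hypothetical choices for illustration, not details taken from IHUBERT.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_dedup(vectors: Sequence[Sequence[float]],
                   threshold: float = 0.9) -> List[int]:
    """Greedy deduplication: keep a document only if its similarity to every
    previously kept document is below `threshold` (hypothetical value)."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def balance_domains(domains: Sequence[str],
                    kept: Sequence[int],
                    cap: int) -> List[int]:
    """Keep at most `cap` documents per domain, in original order
    (a simple stand-in for distribution balancing)."""
    counts: Dict[str, int] = {}
    balanced: List[int] = []
    for i in kept:
        domain = domains[i]
        if counts.get(domain, 0) < cap:
            counts[domain] = counts.get(domain, 0) + 1
            balanced.append(i)
    return balanced


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real embeddings would come from an encoder model.
    vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.1, 1.0], [1.0, 1.0]]
    domains = ["news", "news", "wiki", "wiki", "news"]
    kept = semantic_dedup(vectors, threshold=0.9)
    print(kept)
    print(balance_domains(domains, kept, cap=1))
```

The greedy pass compares each document against everything kept so far, which is quadratic in corpus size. At the scale of tens of gigabytes, this comparison would normally be replaced by clustering or approximate nearest-neighbor search.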
persian, language models, pretraining