Hugging Face has introduced mmBERT, a groundbreaking multilingual encoder trained on over 3 trillion tokens across 1833 languages. Built on the ModernBERT architecture, it surpasses XLM-R—long the standard in multilingual understanding tasks—by a significant margin.
Rather than training on all languages at once, mmBERT follows a progressive training strategy: it begins with 60 high-resource languages, expands to 110, and ultimately covers the full set of 1833 languages. Over the course of training, the masking rate is annealed from 30% down to 5%, and the language sampling distribution is adjusted to better represent low-resource languages.
This “progressive language addition” approach proves crucial for broad coverage without overfitting. For example, languages like Faroese and Tigrinya—introduced only during the final 100 billion token decay phase—still demonstrate notable performance gains thanks to this strategy.
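The released recipe does not need to be reproduced to see how these two schedules interact; a few lines of Python can sketch the idea. In the snippet below, a temperature-style exponent over per-language token counts is lowered in later phases so that low-resource languages receive a larger sampling share, and the masking rate is annealed from 30% down to 5%. The function names, exponent values, and token counts are illustrative assumptions, not the authors' actual training code.

```python
# Illustrative sketch of the two schedules described above; all values and
# names are assumptions, not the released mmBERT training recipe.

def language_sampling_probs(token_counts: dict[str, float], tau: float) -> dict[str, float]:
    """Sample languages proportionally to count**tau; a smaller tau flattens
    the distribution, giving low-resource languages a larger share."""
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

def mask_rate(progress: float, start: float = 0.30, end: float = 0.05) -> float:
    """Anneal the masking rate from 30% down to 5% over training progress in [0, 1]."""
    return start + (end - start) * progress

counts = {"en": 1.0e12, "de": 2.0e11, "fo": 5.0e7}   # hypothetical token counts
print(language_sampling_probs(counts, tau=0.7))       # early phase: favors high-resource
print(language_sampling_probs(counts, tau=0.3))       # late phase: flatter distribution
print(mask_rate(0.0), mask_rate(1.0))                 # 0.30 -> 0.05
```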
Members of the AI community expressed interest in this balanced approach. AI practitioner Yasir Altaf posed the following questions:
How do you ensure that low-resource languages aren’t overshadowed during the 1833-language phase? Is there a minimum signal threshold for each language? Additionally, how can we be certain the model won’t be dominated by the top 50 languages, even if technically trained on 1833?
Tom Aarsen, a Hugging Face engineer and maintainer of Sentence Transformers, responded:
This is verified by evaluating some low-resource languages introduced only in the final 100 billion token phase, such as Tigrinya and Faroese. When added at this stage, significant improvements were observed.
mmBERT builds on the ModernBERT architecture, inheriting its fast and memory-efficient backbone, enhanced with Flash Attention 2 and padding-free sequence processing for up to 8192 tokens of context.
Despite having only 110 million non-embedding parameters (307 million total), the 22-layer base model competes effectively with much larger multilingual models. A smaller variant with 140 million total parameters is also available for less demanding applications.
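Because mmBERT ships as a standard encoder checkpoint, it can be loaded through the usual transformers APIs. The snippet below is a minimal sketch for masked-token prediction; the Hub model id is an assumption and should be replaced with the actual released checkpoint name.

```python
from transformers import pipeline

# Minimal sketch: the model id below is an assumption, not a confirmed Hub id.
fill_mask = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")

# Use the tokenizer's own mask token so the example works regardless of which
# special token the checkpoint defines.
mask = fill_mask.tokenizer.mask_token
print(fill_mask(f"Paris is the {mask} of France."))
print(fill_mask(f"París es la {mask} de Francia."))
```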
One standout feature is model merging. Instead of relying on a single trained model, the team combined three variants—English-centric, 110-language, and full multilingual—using TIES merging to preserve performance across domains.
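TIES merging treats each variant as a set of weight deltas relative to a shared starting checkpoint, trims small-magnitude deltas, elects a per-weight sign by total magnitude, and averages only the deltas that agree with that sign. The sketch below illustrates the procedure for a single parameter tensor; it is not the team's merging code, and the density fraction and function names are assumptions. In practice, full-checkpoint TIES merging is available in tools such as mergekit.

```python
import torch

def ties_merge(base: torch.Tensor, finetuned: list[torch.Tensor], density: float = 0.2) -> torch.Tensor:
    """Merge several fine-tuned variants of one tensor back into the base weights
    using the TIES procedure (trim, elect sign, disjoint merge)."""
    deltas = [ft - base for ft in finetuned]              # task vectors relative to the base

    # 1. Trim: keep only the largest-magnitude `density` fraction of each delta.
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))

    # 2. Elect sign: per weight, keep the sign carrying the most total magnitude.
    stacked = torch.stack(trimmed)                        # (num_models, *shape)
    sign = torch.sign(stacked.sum(dim=0))

    # 3. Disjoint merge: average only the deltas whose sign matches the elected one.
    agree = (torch.sign(stacked) == sign) & (stacked != 0)
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)

    return base + merged_delta
```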
In evaluations, mmBERT consistently outperforms earlier multilingual encoders. On GLUE benchmarks, it matches English-only baselines despite having less than a quarter of its training data in English. On XTREME, it shows clear improvements in cross-lingual tasks such as XNLI and TyDiQA while maintaining strong performance in structured prediction. In retrieval tasks, mmBERT sets a new record on the MTEBv2 multilingual benchmark, even matching monolingual English models on the English track.
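For readers who want a feel for mmBERT as an embedding backbone, the sketch below mean-pools the encoder's hidden states into sentence vectors and compares them with cosine similarity. The reported MTEB v2 numbers come from retrieval models fine-tuned on top of mmBERT, so this raw-encoder example is purely illustrative, and the checkpoint id is again an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "How do I reset my password?",
    "¿Cómo restablezco mi contraseña?",   # Spanish paraphrase of the first sentence
    "The weather is nice today.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state             # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

# Cosine similarities: the cross-lingual paraphrase pair should score highest.
print(embeddings @ embeddings.T)
```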
mmBERT demonstrates that scaling multilingual encoders doesn’t have to come at the cost of efficiency. By balancing broad coverage with targeted enhancements, it sets a new standard for multilingual models across retrieval, classification, and cross-lingual tasks.