Hugging Face has introduced mmBERT, a groundbreaking multilingual encoder trained on over 3 trillion tokens across 1833 languages. Built on the ModernBERT architecture, it surpasses XLM-R—long the standard in multilingual understanding tasks—by a significant margin.
Rather than training on all languages at once, mmBERT follows a progressive training strategy: it begins with 60 high-resource languages, expands to 110, and ultimately covers the full set of 1833 languages. Over the course of training, the masking rate is annealed from 30% down to 5%, and the language sampling distribution is adjusted to better represent low-resource languages.
This “progressive language addition” approach proves crucial for broad coverage without overfitting. For example, languages like Faroese and Tigrinya—introduced only during the final 100 billion token decay phase—still demonstrate notable performance gains thanks to this strategy.
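The released recipe does not need to be reproduced to see how these two schedules interact; a few lines of Python can sketch the idea. In the snippet below, a temperature-style exponent over per-language token counts is lowered in later phases so that low-resource languages receive a larger sampling share, and the masking rate is annealed from 30% down to 5%. The function names, exponent values, and token counts are illustrative assumptions, not the authors' actual training code.

```python
# Illustrative sketch of the two schedules described above; all values and
# names are assumptions, not the released mmBERT training recipe.

def language_sampling_probs(token_counts: dict[str, float], tau: float) -> dict[str, float]:
    """Sample languages proportionally to count**tau; a smaller tau flattens
    the distribution, giving low-resource languages a larger share."""
    weights = {lang: count ** tau for lang, count in token_counts.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

def mask_rate(progress: float, start: float = 0.30, end: float = 0.05) -> float:
    """Anneal the masking rate from 30% down to 5% over training progress in [0, 1]."""
    return start + (end - start) * progress

counts = {"en": 1.0e12, "de": 2.0e11, "fo": 5.0e7}   # hypothetical token counts
print(language_sampling_probs(counts, tau=0.7))       # early phase: favors high-resource
print(language_sampling_probs(counts, tau=0.3))       # late phase: flatter distribution
print(mask_rate(0.0), mask_rate(1.0))                 # 0.30 -> 0.05
```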
Members of the AI community expressed interest in this balanced approach. AI practitioner Yasir Altaf posed the following questions:
How do you ensure that low-resource languages aren’t overshadowed during the 1833-language phase? Is there a minimum signal threshold for each language? Additionally, how can we be certain the model won’t be dominated by the top 50 languages, even if technically trained on 1833?
Tom Aarsen, a Hugging Face engineer and maintainer of Sentence Transformers, responded:
This is verified by evaluating some low-resource languages introduced only in the final 100 billion token phase, such as Tigrinya and Faroese. When added at this stage, significant improvements were observed.
mmBERT builds on the ModernBERT architecture, inheriting its fast and memory-efficient backbone, enhanced with Flash Attention 2 and padding-free sequence processing for up to 8192 tokens of context.
Despite having only 110 million non-embedding parameters (307 million total), the 22-layer base model competes effectively with much larger multilingual models. A smaller variant with 140 million total parameters is also available for less demanding applications.
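Because mmBERT ships as a standard encoder checkpoint, it can be loaded through the usual transformers APIs. The snippet below is a minimal sketch for masked-token prediction; the Hub model id is an assumption and should be replaced with the actual released checkpoint name.

```python
from transformers import pipeline

# Minimal sketch: the model id below is an assumption, not a confirmed Hub id.
fill_mask = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")

# Use the tokenizer's own mask token so the example works regardless of which
# special token the checkpoint defines.
mask = fill_mask.tokenizer.mask_token
print(fill_mask(f"Paris is the {mask} of France."))
print(fill_mask(f"París es la {mask} de Francia."))
```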
One standout feature is model merging. Instead of relying on a single trained model, the team combined three variants—English-centric, 110-language, and full multilingual—using TIES merging to preserve performance across domains.
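TIES merging treats each variant as a set of weight deltas relative to a shared starting checkpoint, trims small-magnitude deltas, elects a per-weight sign by total magnitude, and averages only the deltas that agree with that sign. The sketch below illustrates the procedure for a single parameter tensor; it is not the team's merging code, and the density fraction and function names are assumptions. In practice, full-checkpoint TIES merging is available in tools such as mergekit.

```python
import torch

def ties_merge(base: torch.Tensor, finetuned: list[torch.Tensor], density: float = 0.2) -> torch.Tensor:
    """Merge several fine-tuned variants of one tensor back into the base weights
    using the TIES procedure (trim, elect sign, disjoint merge)."""
    deltas = [ft - base for ft in finetuned]              # task vectors relative to the base

    # 1. Trim: keep only the largest-magnitude `density` fraction of each delta.
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))

    # 2. Elect sign: per weight, keep the sign carrying the most total magnitude.
    stacked = torch.stack(trimmed)                        # (num_models, *shape)
    sign = torch.sign(stacked.sum(dim=0))

    # 3. Disjoint merge: average only the deltas whose sign matches the elected one.
    agree = (torch.sign(stacked) == sign) & (stacked != 0)
    merged_delta = (stacked * agree).sum(dim=0) / agree.sum(dim=0).clamp(min=1)

    return base + merged_delta
```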
In evaluations, mmBERT consistently outperforms earlier multilingual encoders. On GLUE benchmarks, it matches English-only baselines despite having less than a quarter of its training data in English. On XTREME, it shows clear improvements in cross-lingual tasks such as XNLI and TyDiQA while maintaining strong performance in structured prediction. In retrieval tasks, mmBERT sets a new record on the MTEBv2 multilingual benchmark, even matching monolingual English models on the English track.
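For readers who want a feel for mmBERT as an embedding backbone, the sketch below mean-pools the encoder's hidden states into sentence vectors and compares them with cosine similarity. The reported MTEB v2 numbers come from retrieval models fine-tuned on top of mmBERT, so this raw-encoder example is purely illustrative, and the checkpoint id is again an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jhu-clsp/mmBERT-base"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = [
    "How do I reset my password?",
    "¿Cómo restablezco mi contraseña?",   # Spanish paraphrase of the first sentence
    "The weather is nice today.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state             # (batch, seq_len, dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

# Cosine similarities: the cross-lingual paraphrase pair should score highest.
print(embeddings @ embeddings.T)
```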
mmBERT demonstrates that scaling multilingual encoders doesn’t have to come at the cost of efficiency. By balancing broad coverage with targeted enhancements, it sets a new standard for multilingual models across retrieval, classification, and cross-lingual tasks.