Chinese artificial-intelligence lab DeepSeek has unveiled a new open-weight model, DeepSeekMath-V2.
According to the lab, the model demonstrates exceptional theorem-proving capabilities in mathematics and achieved gold-level performance at the 2025 International Mathematical Olympiad (IMO), solving five of the six problems presented.
“Imagine having free access to the brain of one of the world’s top mathematicians,” said Clément Delangue, co-founder and CEO of Hugging Face, in a post on X.
“As far as I know, no chatbot or API currently offers access to an IMO 2025 gold-medal-winning model,” he added.
In July of this year, advanced versions of Google DeepMind’s Gemini model and OpenAI’s experimental reasoning model also earned gold-level scores at IMO 2025. Like DeepSeek’s new model, both solved five of the six competition problems—marking them as the first AI systems to achieve such high scores.
The IMO is widely regarded as the most challenging high school mathematics competition globally. Among the 630 participants in the 2025 edition, only 72 earned gold medals.
Beyond its IMO 2025 success, DeepSeekMath-V2 also delivered top-tier results in China’s most rigorous national contest—the Chinese Mathematical Olympiad (CMO)—and nearly aced the undergraduate-level Putnam Competition.
“In Putnam 2024, one of the most prestigious undergraduate math contests, our model fully solved 11 of the 12 problems, with minor errors in the remaining one, achieving a score of 118/120—surpassing the highest human score of 90,” DeepSeek stated.
DeepSeek noted that while recent AI models have excelled on mathematical benchmarks like AIME and HMMT, they often lack sound, step-by-step reasoning.
“Many mathematical tasks, such as theorem proving, require rigorous deductive reasoning rather than just numerical answers—making final-answer-based rewards ineffective.”
To address this, DeepSeek emphasized the need for models capable of evaluating and refining their own reasoning. The team believes “self-verification is especially critical for scaling test-time computation, particularly when tackling unsolved problems with no known solutions.”
Test-time computation refers to allocating significant computational resources during inference—not training—to allow models more time to explore multiple solution paths and refine their outputs.
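One common form of test-time computation is best-of-n sampling: generate several candidate solutions and keep the one a scoring function rates highest. The minimal sketch below illustrates only that idea; the generator and verifier here are toy placeholders (all names hypothetical), not DeepSeek's actual models.

```python
import random

def generate_candidate(problem: str, rng: random.Random) -> str:
    """Stand-in for a proof generator: samples one candidate solution."""
    return f"proof of {problem!r} (draft {rng.randint(0, 999)})"

def verifier_score(candidate: str) -> float:
    """Stand-in for a learned verifier that rates proof quality in [0, 1].
    Here the draft number doubles as a toy quality signal."""
    return int(candidate.split("draft ")[1].rstrip(")")) / 999

def best_of_n(problem: str, n: int, seed: int = 0) -> str:
    """Test-time computation: sample n candidates, keep the best-scoring one.
    Spending more inference-time compute (a larger n) buys a better
    expected answer without any further training."""
    rng = random.Random(seed)
    candidates = [generate_candidate(problem, rng) for _ in range(n)]
    return max(candidates, key=verifier_score)

answer = best_of_n("x^2 >= 0 for real x", n=16)
```

With a fixed seed, the first candidates of a small run are a prefix of a larger run's candidates, so raising n can only improve (never worsen) the selected score.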
DeepSeek’s approach involves training a dedicated verifier that evaluates proof quality rather than correctness of the final answer. This verifier then guides a separate proof-generation model, which receives rewards only when it corrects its own mistakes—not when it conceals them.
As explained in their paper, they “use the verifier as a reward model to train the proof generator, incentivizing it to identify and resolve as many of its own proof issues as possible before finalizing a solution.”
To prevent the system from overfitting to its own checker, DeepSeek continuously increases the difficulty of the verification process.
This is achieved by scaling computational resources and automatically labeling increasingly complex proofs, ensuring the verifier and generator evolve in tandem.
In their words, this strategy enables them “to scale verification computation, automatically label newly emerging hard-to-verify proofs, and generate training data to further enhance the verifier.”
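The reward scheme described above can be sketched in a few lines. This is a hedged illustration under toy assumptions, not DeepSeek's implementation: issue detection is reduced to string matching, and the reward simply pays out only when a revision leaves no flagged issues, so resolving real flaws earns more than submitting a clean first draft.

```python
def find_issues(proof: list[str]) -> list[int]:
    """Stand-in verifier: returns indices of steps flagged as unjustified."""
    return [i for i, step in enumerate(proof) if "unjustified" in step]

def reward(draft: list[str], revision: list[str]) -> float:
    """Toy verifier-as-reward-model signal: the generator earns credit
    only when the issues flagged in its draft are gone from the revision.
    Any remaining flagged issue zeroes the reward, so hiding problems
    (resubmitting them unfixed) pays nothing."""
    if find_issues(revision):          # unresolved issues: no reward
        return 0.0
    # Full credit for genuinely fixing flagged issues, partial otherwise.
    return 1.0 if find_issues(draft) else 0.5

draft = ["assume x > 0", "unjustified: hence x^2 > x", "conclude"]
fixed = ["assume x > 0", "split on x >= 1 and 0 < x < 1", "conclude"]
```

Training the generator against such a signal incentivizes exactly what the paper describes: identifying and resolving as many of its own proof issues as possible before finalizing a solution.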
The model weights are available for download on Hugging Face. “This is democratization of AI and knowledge—in the literal sense,” Delangue remarked.
DeepSeek rose to prominence after releasing a low-cost, open-source model capable of rivaling U.S.-based AI systems. The launch of its DeepSeek-R1 reasoning model sparked debate over whether open models could erode the commercial edge of closed systems, briefly unsettling investor confidence in AI giants like NVIDIA.