A newly developed physics benchmark called "CritPt" has put leading AI models to the test on early-stage doctoral research tasks. The findings reveal that even top-tier systems like Gemini 3 Pro and GPT-5 fall far short of functioning as autonomous scientific researchers.
Over 50 physicists from more than 30 institutions collaborated to create the CritPt benchmark, aiming to assess whether AI can genuinely assist in pushing the frontiers of modern physics. Rather than testing rote textbook knowledge, the benchmark challenges models with original, unpublished research problems akin to those a capable graduate student might tackle when launching an independent project.
Initial results set a sobering baseline. According to an independent evaluation by Artificial Analysis, Google’s “Gemini 3 Pro Preview” achieved only a 9.1% accuracy rate, while using 10% fewer tokens than OpenAI’s “GPT-5.1 (High),” which placed second with just 4.9%. Even at the top of the leaderboard, AI systems failed the vast majority of tasks.
Doctoral-level reasoning remains a formidable hurdle
CritPt comprises 71 full research challenges spanning 11 physics domains, including quantum physics, astrophysics, high-energy physics, and biophysics. To prevent guessing or retrieval-based answers, all problems are based on previously undisclosed material. The challenges are further broken down into 190 smaller “checkpoints” that measure partial progress.
These results offer a reality check: current large language models lack the rigor, creativity, and precision needed to independently solve open-ended physics problems. However, they perform measurably better on simpler, well-defined subtasks than on full research challenges, suggesting a more realistic near-term role as specialized research assistants rather than autonomous scientists.
The team also introduced a stricter metric called the “consistent solution rate,” which requires a model to produce the correct answer in at least four out of five attempts. Under this criterion, performance collapsed across the board, revealing that even when models occasionally succeed, their reasoning remains highly fragile.
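The benchmark's actual grading pipeline is not reproduced here, but the logic of the stricter metric is easy to illustrate. The sketch below is a minimal, hypothetical Python example: each checkpoint is assumed to be graded pass/fail over five independent attempts, plain accuracy averages the per-item pass rate, and the consistent solution rate only credits items solved in at least four of the five runs. All names and data are illustrative assumptions, not part of CritPt itself.

```python
from collections import defaultdict

# Hypothetical attempt records: (checkpoint_id, passed), graded pass/fail per run.
attempts = [
    ("qc-01", True), ("qc-01", False), ("qc-01", True), ("qc-01", True), ("qc-01", True),
    ("astro-07", False), ("astro-07", True), ("astro-07", False), ("astro-07", False), ("astro-07", False),
]

def score(attempts, consistency_threshold=4):
    passes = defaultdict(int)
    runs = defaultdict(int)
    for item, passed in attempts:
        runs[item] += 1
        passes[item] += int(passed)

    items = list(runs)
    # Mean per-item accuracy across attempts (credits occasional successes).
    accuracy = sum(passes[i] / runs[i] for i in items) / len(items)
    # "Consistent solution rate": an item counts only if solved in >= 4 of 5 attempts.
    consistent = sum(passes[i] >= consistency_threshold for i in items) / len(items)
    return accuracy, consistent

acc, consistent = score(attempts)
print(f"mean accuracy: {acc:.2f}, consistent solution rate: {consistent:.2f}")
# qc-01 (solved 4/5 times) counts as consistent; astro-07 (1/5) does not.
```

Under a criterion like this, a model that sometimes stumbles onto the right answer gets no credit, which is why the reported scores collapse relative to plain accuracy.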
This inconsistency poses serious challenges for real-world research workflows. Models frequently generate responses that appear convincing but contain subtle, hard-to-detect errors—potentially misleading researchers and necessitating time-consuming expert verification.
Researchers argue that, for the foreseeable future, a more practical goal is not replacing human experts with “AI scientists,” but deploying AI as a “research assistant” to automate specific steps in scientific workflows. This aligns with current industry roadmaps: OpenAI plans to launch an intern-level research assistant in September 2026 and to deliver a fully autonomous researcher by March 2028. The company claims GPT-5 is already saving researchers significant time.