A newly developed physics benchmark called "CritPt" has put leading AI models to the test on early-stage doctoral research tasks. The findings reveal that even top-tier systems like Gemini 3 Pro and GPT-5 fall far short of functioning as autonomous scientific researchers.
Over 50 physicists from more than 30 institutions collaborated to create the CritPt benchmark, aiming to assess whether AI can genuinely assist in pushing the frontiers of modern physics. Rather than testing rote textbook knowledge, the benchmark challenges models with original, unpublished research problems akin to those a capable graduate student might tackle when launching an independent project.
Initial results set a sobering baseline. According to an independent evaluation by Artificial Analysis, Google’s “Gemini 3 Pro Preview” achieved only a 9.1% accuracy rate, while using 10% fewer tokens than OpenAI’s “GPT-5.1 (High),” which placed second with just 4.9%. Even at the top of the leaderboard, AI systems failed the vast majority of tasks.
Doctoral-level reasoning remains a formidable hurdle
CritPt comprises 71 full research challenges spanning 11 physics domains, including quantum physics, astrophysics, high-energy physics, and biophysics. To prevent guessing or retrieval-based answers, all problems are based on previously undisclosed material. The challenges are further broken down into 190 smaller “checkpoints” that measure partial progress.
These results offer a reality check: current large language models lack the rigor, creativity, and precision needed to independently solve open-ended physics problems. However, they perform measurably better on simpler, well-defined subtasks than on full research challenges, suggesting a more realistic near-term role as specialized research assistants rather than autonomous scientists.
The team also introduced a stricter metric called the “consistent solution rate,” which requires a model to produce the correct answer in at least four out of five attempts. Under this criterion, performance collapsed across the board, revealing that even when models occasionally succeed, their reasoning remains highly fragile.
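The benchmark's actual grading pipeline is not reproduced here, but the logic of the stricter metric is easy to illustrate. The sketch below is a minimal, hypothetical Python example: each checkpoint is assumed to be graded pass/fail over five independent attempts, plain accuracy averages the per-item pass rate, and the consistent solution rate only credits items solved in at least four of the five runs. All names and data are illustrative assumptions, not part of CritPt itself.

```python
from collections import defaultdict

# Hypothetical attempt records: (checkpoint_id, passed), graded pass/fail per run.
attempts = [
    ("qc-01", True), ("qc-01", False), ("qc-01", True), ("qc-01", True), ("qc-01", True),
    ("astro-07", False), ("astro-07", True), ("astro-07", False), ("astro-07", False), ("astro-07", False),
]

def score(attempts, consistency_threshold=4):
    passes = defaultdict(int)
    runs = defaultdict(int)
    for item, passed in attempts:
        runs[item] += 1
        passes[item] += int(passed)

    items = list(runs)
    # Mean per-item accuracy across attempts (credits occasional successes).
    accuracy = sum(passes[i] / runs[i] for i in items) / len(items)
    # "Consistent solution rate": an item counts only if solved in >= 4 of 5 attempts.
    consistent = sum(passes[i] >= consistency_threshold for i in items) / len(items)
    return accuracy, consistent

acc, consistent = score(attempts)
print(f"mean accuracy: {acc:.2f}, consistent solution rate: {consistent:.2f}")
# qc-01 (solved 4/5 times) counts as consistent; astro-07 (1/5) does not.
```

Under a criterion like this, a model that sometimes stumbles onto the right answer gets no credit, which is why the reported scores collapse relative to plain accuracy.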
This inconsistency poses serious challenges for real-world research workflows. Models frequently generate responses that appear convincing but contain subtle, hard-to-detect errors—potentially misleading researchers and necessitating time-consuming expert verification.
Researchers argue that, for the foreseeable future, a more practical goal is not replacing human experts with “AI scientists,” but deploying AI as a “research assistant” to automate specific steps in scientific workflows. This aligns with current industry roadmaps: OpenAI plans to launch an intern-level research assistant in September 2026 and to deliver a fully autonomous researcher by March 2028. The company claims GPT-5 is already saving researchers significant time.