OpenAI has officially launched o3-pro, the latest and most advanced model in its o series product line. Earlier versions of this model family have consistently excelled in standard AI benchmark tests, particularly in mathematics, programming, and scientific tasks, with o3-pro further enhancing these strengths.
According to OpenAI's release notes for o3-pro: "Like o1-pro, o3-pro is a version of our most intelligent model, o3, designed for deeper reasoning and to deliver the most reliable responses. Since the release of o1-pro, users have increasingly favored this model for tasks in mathematics, science, and programming—o3-pro continues to excel in these areas in academic evaluations."
The o3-pro model is currently available to Pro and Team users through ChatGPT and the API, with a rollout to Edu and Enterprise accounts planned for the following week, on a schedule similar to previous models.
Evaluation Comparisons
Prior to releasing benchmark data, OpenAI allowed human testers to try o3-pro and compare it with o3. Most testers preferred o3-pro in key areas, including:
- All queries (64%)
- Scientific analysis (64.9%)
- Personal writing (66.7%)
- Computer programming (62.7%)
- Data analysis (64.3%)
Pass@1 Accuracy and Efficiency Benchmarks
The Pass@1 benchmark measures a model's ability to produce a correct answer on its first attempt, with no retries. Unsurprisingly, o3-pro outperformed both o3 and o1-pro across the benchmarks below.
| Model | Competition Math (AIME 2024) | PhD-Level Science (GPQA Diamond) | Competition Programming (Codeforces Elo) |
|---|---|---|---|
| o3-pro | 93% | 84% | 2748 |
| o3 | 90% | 81% | 2517 |
| o1-pro | 86% | 79% | 1707 |
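To make the Pass@1 metric concrete, here is a minimal sketch of the standard unbiased pass@k estimator (popularized by OpenAI's HumanEval work), which reduces to the plain first-try success rate when k = 1. This is an illustrative implementation, not OpenAI's actual evaluation harness, and the sample numbers are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    samples, drawn from n total samples of which c were correct, passes."""
    if n - c < k:
        return 1.0  # too few failures for any k-sized draw to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k=1 this is simply c/n, the fraction of first attempts that succeed.
print(pass_at_k(n=4, c=3, k=1))  # 0.75
```

For k = 1 the combinatorics collapse to c/n, which is why Pass@1 is often described simply as first-attempt accuracy.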
4/4 Reliability Benchmark
OpenAI's team also ran a 4/4 reliability benchmark on these models. In this assessment, a model is given four attempts at each problem and is credited only if all four responses are correct; a single miss counts as a failure for that problem.
| Model | Competition Math (AIME 2024) | PhD-Level Science (GPQA Diamond) | Competition Programming (Codeforces Elo) |
|---|---|---|---|
| o3-pro | 90% | 76% | 2301 |
| o3 | 80% | 67% | 2011 |
| o1-pro | 80% | 74% | 1423 |
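The 4/4 scoring rule described above can be sketched in a few lines. This is an assumption-laden illustration (the function name and the toy data are invented for this example), showing only the aggregation logic: a question counts only if all four attempts pass.

```python
def reliability_4_of_4(attempts: list[list[bool]]) -> float:
    """Fraction of questions answered correctly on every one of four attempts.

    `attempts` holds one list of four pass/fail results per question."""
    solved = sum(all(per_question) for per_question in attempts)
    return solved / len(attempts)

results = [
    [True, True, True, True],   # all four correct -> counts as solved
    [True, True, False, True],  # one miss -> the whole question fails
    [True, True, True, True],
]
print(reliability_4_of_4(results))  # 0.666...
```

Because one wrong answer zeroes out a question, 4/4 scores sit below Pass@1 scores for the same model, which matches the gap between the two tables above.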
Limitations of o3-pro
Some limitations of o3-pro to consider include:
- At the time of writing, o3-pro’s temporary chat feature is disabled as the OpenAI team addresses technical issues.
- o3-pro does not support image generation. Users requiring this functionality are advised to use GPT-4o, OpenAI o3, or OpenAI o4-mini.
- o3-pro does not support OpenAI’s Canvas interface, and it remains unclear if support will be added in the future.
Weighing the Pros and Cons of o3-pro
Although OpenAI acknowledges that o3-pro can run slower than o1-pro in some scenarios, this is a trade-off of the deeper reasoning the model performs to deliver more reliable responses.