OpenAI has unveiled a new benchmark designed to evaluate its AI models' performance relative to human professionals across multiple industries and job functions. Released on Thursday, the benchmark, named GDPval, represents OpenAI's initial effort to understand how its systems perform on economically valuable tasks compared to humans. It also aligns with the company's founding mission of developing artificial general intelligence (AGI).
According to OpenAI, early research findings indicate that both the GPT-5 model and Anthropic's Claude Opus 4.1 "are approaching the quality of work produced by industry experts."
However, this does not mean that AI will immediately replace human labor. While some CEOs predict AI could replace human jobs within a few years, OpenAI acknowledges that GDPval currently covers only a narrow slice of the tasks humans perform in real-world jobs. Nevertheless, GDPval is one of the latest tools the company is using to track AI's progress toward that goal.
GDPval focuses on nine industries that contribute significantly to U.S. GDP, including the healthcare, finance, manufacturing, and government sectors. The benchmark evaluates AI models across 44 professions within these industries, ranging from software engineers to nurses and journalists.
In the first version of the test, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with those produced by human experts and choose the better one. For example, one prompt asked investment bankers to create a competitive landscape report for the last-mile delivery industry and compare it with an AI-generated version. OpenAI then calculated the AI models' "win rate" across all 44 professions relative to human-generated reports.
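The win-rate tally described above can be sketched in a few lines of Python. This is an illustrative reconstruction only, assuming blind pairwise gradings labeled "ai_better", "tie", or "human_better"; the profession names, verdict labels, and data format here are hypothetical, not OpenAI's actual grading schema.

```python
# Illustrative sketch of a per-profession "win rate" tally, where a win
# means the AI deliverable was rated as good as or better than the
# human expert's. Data and labels are hypothetical placeholders.
from collections import defaultdict

# Each record: (profession, grader_verdict) from a blind expert comparison.
gradings = [
    ("investment_banker", "ai_better"),
    ("investment_banker", "human_better"),
    ("nurse", "tie"),
    ("nurse", "human_better"),
]

def win_rates(records):
    """Fraction of comparisons per profession where the AI output
    was judged superior or comparable (win or tie)."""
    counts = defaultdict(lambda: [0, 0])  # profession -> [wins_or_ties, total]
    for profession, verdict in records:
        counts[profession][1] += 1
        if verdict in ("ai_better", "tie"):
            counts[profession][0] += 1
    return {p: wins / total for p, (wins, total) in counts.items()}

print(win_rates(gradings))
# {'investment_banker': 0.5, 'nurse': 0.5}
```

Counting ties alongside outright wins matches how the article reports scores ("superior or comparable"), which is why the 40.6% and 49% figures bundle both outcomes.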
GPT-5-high, a configuration of GPT-5 that spends more compute on reasoning, was rated as superior or comparable to industry experts 40.6% of the time.
OpenAI also tested Anthropic's Claude Opus 4.1, which was rated as superior or comparable to human experts in 49% of the tasks. OpenAI noted that Claude's high score may be due to its tendency to produce visually appealing outputs, not just functional ones.
It's important to note that most professional jobs involve far more than handing reports to a manager, and report-style deliverables are essentially all GDPval-v0 tests. OpenAI acknowledges this limitation and plans to develop more robust versions that cover additional industries and interactive workflows.
Nonetheless, the company believes the progress measured by GDPval is worth monitoring.
"As models continue to improve in certain areas," said OpenAI chief economist Ronnie Chatterji, "professionals in these roles can now leverage AI capabilities to gradually offload parts of their work and focus on potentially higher-value tasks."
Tejal Patwardhan, OpenAI's head of evaluation, said she was encouraged by the pace of progress on GDPval. Roughly 15 months ago, the company's GPT-4o model scored only 13.7% (wins and ties against human experts); GPT-5's score is now nearly three times higher, and Patwardhan expects that upward trend to continue.
The tech industry already employs a wide range of benchmarks to measure AI progress and determine whether a model is state-of-the-art. Among the most popular are AIME 2025 (a competitive math problem test) and GPQA Diamond (a PhD-level science question test). However, several AI models have approached saturation on these benchmarks, prompting many AI researchers to call for better tests that assess AI's proficiency in real-world tasks.
Benchmarks like GDPval may grow increasingly relevant in this conversation as OpenAI demonstrates the value of its AI models across diverse industries. However, OpenAI may need more comprehensive versions of the test to definitively claim that its models outperform humans.