The recently released o3 model scored 75.7% on the ARC-AGI benchmark, and reached 87.5% under a high-compute configuration. This result has drawn considerable attention in the AI research community. Remarkable as the progress is, it does not mean the path to artificial general intelligence (AGI) has been unlocked.
The ARC-AGI benchmark is built on the Abstraction and Reasoning Corpus (ARC) and evaluates the adaptability and fluid intelligence of AI systems on novel tasks. It consists of visual puzzles that require grasping basic concepts such as objects, boundaries, and spatial relationships. Humans can solve these problems easily from a few examples, while existing AI systems struggle. ARC is designed to resist being "gamed" through training on massive datasets, which keeps the evaluation meaningful.
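To make the puzzle format concrete, each ARC task pairs a handful of demonstration grids with one or more test grids, with cells holding integers 0-9 that encode colors. The toy task below is invented for illustration and far simpler than real ARC puzzles, but it shows the general shape of the data:

```python
# Illustrative ARC-style task: "train" holds demonstration input/output grid
# pairs, "test" holds grids the solver must transform. Cell values 0-9 encode
# colors. This toy task is invented; real ARC puzzles are larger and more varied.
toy_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
        {"input": [[2, 2], [0, 2]], "output": [[0, 2], [2, 2]]},
    ],
    "test": [
        {"input": [[3, 0], [3, 3]]},  # solver must infer the rule (here: flip the rows)
    ],
}
```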
ARC-AGI includes a public training set of 400 relatively simple examples and a public evaluation set of 400 more challenging puzzles that test how well AI systems generalize. In addition, there are private and semi-private test sets of 100 puzzles each, used to evaluate candidate systems without exposing the data and to cap computational resources, which prevents brute-force solutions.
Previously, the o1-preview and o1 models scored 32% on ARC-AGI, while researcher Jeremy Berman reached 53% with a hybrid strategy, the best result before o3. François Chollet believes o3's performance demonstrates unprecedented task adaptability, a qualitative change relative to earlier LLMs.
It is worth noting that simply adding computational resources did not meaningfully improve the performance of previous models: from GPT-3 to GPT-4o, it took four years for the score to climb from 0% to 5%. Details of the o3 architecture are scarce, but it is clear that o3 is not several orders of magnitude larger than its predecessors.
The success of o3 rests on program synthesis: constructing small programs that solve specific sub-problems and composing them to tackle more complex ones. Traditional language models store a rich repertoire of internal programs but lack the compositionality needed to solve puzzles outside their training distribution.
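To illustrate what program synthesis means in this setting, here is a minimal sketch (not OpenAI's method, and far simpler than anything o3 does): enumerate short compositions of hand-written grid primitives and keep a program that reproduces every training example.

```python
from itertools import product

def identity(g):  return g
def flip_rows(g): return g[::-1]
def flip_cols(g): return [row[::-1] for row in g]
def transpose(g): return [list(r) for r in zip(*g)]

PRIMITIVES = [identity, flip_rows, flip_cols, transpose]

def synthesize(train_pairs, max_depth=2):
    """Search for a composition of primitives that maps every training input
    to its training output; return the first one found, or None."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            def run(grid, steps=program):
                for step in steps:
                    grid = step(grid)
                return grid
            if all(run(inp) == out for inp, out in train_pairs):
                return program
    return None

# Toy usage: the hidden rule in this invented task is "flip the rows".
train = [([[0, 1], [1, 0]], [[1, 0], [0, 1]]),
         ([[2, 2], [0, 2]], [[0, 2], [2, 2]])]
program = synthesize(train)
print([f.__name__ for f in program])  # -> ['flip_rows']
```

The found program can then be reused as a building block for harder tasks, which is the compositionality that plain pattern-matching lacks.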
There is disagreement in the research community about how o3 works. Chollet speculates that o3 combines chain-of-thought reasoning with a search mechanism and a reward model that evaluates and refines candidate solutions. Others counter that o3 may simply be the result of extended reinforcement learning.
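One concrete reading of Chollet's speculation is a best-of-N search loop: sample many chain-of-thought candidates, score each with a reward model, and keep the highest-scoring ones. The sketch below uses placeholder generate_cot and reward_model functions; it illustrates the general pattern, not o3's actual internals.

```python
def best_of_n(task, generate_cot, reward_model, n=64, keep=5):
    """Generic best-of-N sketch: sample n chain-of-thought candidates,
    score each with a reward model, and return the top `keep` of them.
    `generate_cot(task)` and `reward_model(task, cot)` are placeholders,
    not real o3 components."""
    candidates = [generate_cot(task) for _ in range(n)]  # sample reasoning chains
    ranked = sorted(candidates, key=lambda c: reward_model(task, c), reverse=True)
    return ranked[:keep]
```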
Chollet emphasizes that ARC-AGI is not a definitive measure of whether AGI has been achieved. o3 still fails at some simple tasks, highlighting fundamental differences from human intelligence. Additionally, o3 requires external validators for the reasoning process and depends on human-annotated reasoning chains during training.
Some scientists question the validity of OpenAI's reported results, noting that the model was fine-tuned on the ARC training set. To genuinely test the abstraction and reasoning abilities of these models, they suggest checking whether a system can adapt to variations of a given task or apply the same concept in a different domain.
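One way to probe this, assuming grid-based tasks like ARC's, is to apply surface-level variations (recoloring, rotation) to a task the model has already solved and check whether its solution still transfers. The helpers below are hypothetical illustrations of that idea, not part of any official evaluation.

```python
import random

def recolor(grid, seed=0):
    """Apply a consistent random permutation of the colors 0-9; the underlying
    transformation rule of the task is unchanged."""
    rng = random.Random(seed)
    palette = list(range(10))
    rng.shuffle(palette)
    return [[palette[cell] for cell in row] for row in grid]

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

# A system that has truly abstracted the rule should handle the recolored or
# rotated variant as easily as the original; one relying on memorized surface
# patterns may not.
variant = rotate90(recolor([[0, 1], [1, 0]], seed=42))
```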
Chollet and his team are developing new benchmarks intended to challenge o3, potentially pushing its score below 30% even at high compute budgets, on puzzles that humans can solve 95% of without any training. When it becomes impossible to design tasks that are easy for ordinary people but hard for AI, he argues, AGI will have arrived.