DeepMind, Google's AI research lab, has announced a new AI system designed to tackle problems with "machine-scorable" solutions.
In experimental trials, the system, called AlphaEvolve, helped optimize some of the infrastructure Google uses to train its AI models. DeepMind is building a user interface for interacting with AlphaEvolve and plans to launch an early access program for select academics before rolling it out more broadly.
A common challenge with most AI models is their tendency to hallucinate: because they are probabilistic, they sometimes confidently present false or fabricated content. Notably, some newer models, such as OpenAI's o3, hallucinate more than their predecessors, underscoring how stubborn the problem is.
AlphaEvolve's mechanism for curbing hallucinations is an automated evaluation pipeline: the system generates candidate answers, critiques them, and maintains a pool of possible solutions, with each candidate automatically assessed and scored for accuracy.
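To make that loop concrete, here is a minimal sketch in Python of a generate-and-score cycle in the spirit of what DeepMind describes. Everything in it is illustrative: the proposer stands in for the Gemini model generating candidate solutions, the evaluator stands in for the user-supplied scoring formula, and the numerical target is invented for the toy example.

```python
import random
from dataclasses import dataclass, field

# Illustrative stand-ins: in AlphaEvolve the proposer is a Gemini model
# producing candidate programs and the evaluator is the user-supplied
# scoring formula; here both are deliberately trivial.

TARGET = 3.14159  # invented quantity the candidates should approximate

@dataclass(order=True)
class Candidate:
    score: float                        # higher is better; drives selection
    value: float = field(compare=False)

def propose(pool: list[Candidate]) -> list[float]:
    """Mutate surviving candidates into new ones (stands in for the
    language model generating variations of promising solutions)."""
    parents = [c.value for c in pool] or [random.uniform(0.0, 10.0)]
    return [random.choice(parents) + random.gauss(0.0, 0.5) for _ in range(8)]

def evaluate(value: float) -> float:
    """Automated, machine-scorable check: negative distance to the target."""
    return -abs(value - TARGET)

def evolve(generations: int = 50, pool_size: int = 10) -> Candidate:
    pool: list[Candidate] = []
    for _ in range(generations):
        for v in propose(pool):
            pool.append(Candidate(score=evaluate(v), value=v))
        pool.sort(reverse=True)   # selection: keep the highest-scoring pool
        pool = pool[:pool_size]
    return pool[0]

print(evolve())  # the best candidate converges toward TARGET
```

The key property the sketch captures is that only machine-checked candidates survive each round, so a hallucinated answer that fails the evaluator is simply discarded rather than presented to the user.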
AlphaEvolve is not the first system to employ such an approach. Researchers, including a team from DeepMind a few years ago, have applied similar techniques across various mathematical domains. However, DeepMind asserts that AlphaEvolve leverages cutting-edge models, particularly the Gemini model, making it more capable than earlier iterations.
To use AlphaEvolve, a user provides a problem prompt, optionally supplemented with descriptions, equations, code snippets, and relevant literature. They must also supply an automatic evaluation mechanism, expressed as a formula, for scoring the generated solutions.
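As an illustration of what such an evaluation formula might look like, below is a hedged sketch in Python. The bin-packing task, the scoring rule, and every name are invented for the example; DeepMind has not published AlphaEvolve's evaluator interface.

```python
def evaluate(candidate_fn) -> float:
    """Hypothetical user-supplied evaluator: score a candidate bin-packing
    heuristic, where a candidate maps (items, bin_capacity) to a list of
    bins. Fewer bins is better; invalid packings score -inf."""
    items = [0.4, 0.7, 0.2, 0.5, 0.9, 0.1]
    bins = candidate_fn(items, bin_capacity=1.0)
    overflow = any(sum(b) > 1.0 for b in bins)
    lost_items = sorted(sum(bins, [])) != sorted(items)
    if overflow or lost_items:
        return float("-inf")   # reject invalid solutions outright
    return -float(len(bins))   # machine-scorable: fewer bins, higher score

def first_fit(items, bin_capacity):
    """A baseline candidate the system would then try to improve upon."""
    bins: list[list[float]] = []
    for item in items:
        for b in bins:
            if sum(b) + item <= bin_capacity:
                b.append(item)
                break
        else:
            bins.append([item])
    return bins

print(evaluate(first_fit))  # -4.0: first-fit uses four bins on this input
```

Anything expressible as such a formula, from bin counts to chip area to training throughput, can in principle serve as the score the system optimizes against.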
Because AlphaEvolve can only tackle problems it can evaluate automatically, its scope is limited to certain classes of questions, particularly in computer science and systems optimization. Another significant limitation is that the system can only express solutions as algorithms, making it ill-suited for problems without numerical answers.
To benchmark AlphaEvolve, DeepMind tested the system on a set of approximately 50 math problems spanning branches like geometry and combinatorics. According to DeepMind, AlphaEvolve "rediscovered" the best-known solutions 75% of the time and found improved solutions in 20% of cases.
DeepMind also evaluated AlphaEvolve's performance on practical challenges, such as improving efficiency at Google data centers and accelerating model training runs. The lab reports that AlphaEvolve devised an algorithm that continuously recovers, on average, 0.7% of Google's worldwide computing resources. Additionally, the system proposed an optimization that reduced the total time required to train Google's Gemini model by 1%.
It’s important to note that AlphaEvolve did not achieve any groundbreaking discoveries. In one experiment, the system identified a method to improve the design of Google’s TPU AI accelerator chips—a solution that had already been flagged by other tools.
Nevertheless, like many AI labs, DeepMind argues that AlphaEvolve offers significant utility: it can save time while allowing experts to focus on higher-priority tasks.