Meta Launches AutoPatchBench to Evaluate LLM Agents in Security Patching

2025-05-08

AutoPatchBench is a standardized benchmark that helps researchers and developers evaluate and compare how effectively LLM-based agents automatically fix security vulnerabilities found in C/C++ native code.

It includes a series of tests aimed at assessing the ability of LLMs to autonomously generate secure patches for vulnerabilities identified through fuzz testing.

This benchmark seeks to foster a comprehensive understanding of the capabilities and limitations of various AI-driven approaches to fixing vulnerabilities uncovered by fuzz testing. By offering a consistent set of evaluation criteria, AutoPatchBench promotes transparency and reproducibility in research.

In contrast to general-purpose benchmarks such as SWE-Bench and SWE-Bench Verified, which evaluate software engineering agents, AutoPatchBench focuses on the distinct challenges posed by vulnerabilities discovered through fuzzing, which often carry significant security implications.

AutoPatchBench is based on a subset of ARVO, a dataset comprising over 5,000 real-world C/C++ vulnerabilities identified by Google's OSS-Fuzz across more than 250 projects. Each vulnerability in ARVO comes with a triggering input and a developer-written patch.

Meta curated 136 samples that meet the conditions required for patch generation and validation; these form AutoPatchBench. From this set it derived AutoPatchBench-Lite, a 113-sample subset intended as a more focused benchmark for AI-driven patch generation tools. Both sets preserve the diversity and complexity of real-world vulnerabilities, covering 11 different types of crash scenarios, and thus lay a solid foundation for advancing AI-powered security solutions.

Fuzz testing is a technique that uncovers security vulnerabilities and flaws by reaching edge cases that human testers would rarely encounter. As noted by the creators of OpenSSF’s Fuzz Introspector, fuzz testing is a promising approach, but its challenge lies in crafting effective fuzzers that achieve good coverage.
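
As a concrete illustration, here is a minimal libFuzzer-style harness around a hypothetical parse_record function (invented for this sketch, not taken from AutoPatchBench or ARVO); the fuzzer mutates inputs until it hits an edge case, here an out-of-bounds read, that a sanitizer reports as a crash.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical function under test; real harnesses wrap a project's own API.
// The bug is contrived: the loop trusts the length declared in the first byte
// and can read past the end of the buffer.
static int parse_record(const uint8_t *data, size_t size) {
    if (size < 2) return -1;
    size_t declared_len = data[0];
    uint8_t checksum = 0;
    for (size_t i = 0; i < declared_len; ++i)   // missing bounds check against `size`
        checksum ^= data[1 + i];
    return checksum;
}

// libFuzzer entry point: the fuzzer repeatedly mutates `data`, searching for
// inputs that reach unusual paths such as declared_len > size - 1.
extern "C" int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    parse_record(data, size);
    return 0;
}
```

Built with clang++ -g -fsanitize=fuzzer,address, the fuzzer quickly finds an input whose declared length exceeds the actual buffer, and AddressSanitizer reports the out-of-bounds read along with a crash stack trace.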

Moreover, once a crash is identified through fuzz testing, resolving it is not straightforward. It requires a thorough analysis of the crash stack trace to pinpoint the root cause, followed by code patching and validation of the fix's effectiveness. This is where AI systems can play a crucial role, as demonstrated by Google in its technical report on AI-driven patching and the recent GITS-Eval benchmark.
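
Continuing the hypothetical harness above, the patching step would add the bounds check that the crash stack trace points to; the sketch below is purely illustrative rather than a patch drawn from the benchmark.

```cpp
// Illustrative fix for the contrived parse_record bug above: reject inputs
// whose header claims more payload bytes than the buffer actually holds.
static int parse_record(const uint8_t *data, size_t size) {
    if (size < 2) return -1;
    size_t declared_len = data[0];
    if (declared_len > size - 1) return -1;  // bounds check removes the out-of-bounds read
    uint8_t checksum = 0;
    for (size_t i = 0; i < declared_len; ++i)
        checksum ^= data[1 + i];
    return checksum;
}
```

Re-running the fuzzer and replaying the original triggering input shows the crash no longer reproduces; confirming that the fix also preserves intended behavior is the harder problem, which the validation step below addresses.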

A critical aspect of patch validation is ensuring that the patched program retains its intended behavior, which goes beyond simply checking that the program builds and does not crash on the original triggering input. To address this, AutoPatchBench checks whether a generated patch produces the same program state as the reference, developer-written fix after the patched function returns.
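
A minimal sketch of that idea, assuming two hypothetical hooks (reference for the developer-patched build and candidate for the LLM-patched build) that run the fixed function and serialize the state it leaves behind; this illustrates differential checking in general, not AutoPatchBench's actual implementation.

```cpp
#include <cstdint>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical signature: each patched build runs the fixed function on an
// input and serializes the observable state it leaves behind.
using PatchedFn = int (*)(const uint8_t *data, size_t size, std::string &state_out);

// Replay a corpus of inputs through both patched versions and require
// identical return values and observable state after the function returns.
bool states_match(PatchedFn reference, PatchedFn candidate,
                  const std::vector<std::vector<uint8_t>> &corpus) {
    for (const auto &input : corpus) {
        std::string ref_state, cand_state;
        int ref_rc  = reference(input.data(), input.size(), ref_state);
        int cand_rc = candidate(input.data(), input.size(), cand_state);
        if (ref_rc != cand_rc || ref_state != cand_state) {
            std::cerr << "Divergence on input of size " << input.size() << "\n";
            return false;  // candidate patch changed intended behavior
        }
    }
    return true;  // no observable divergence on the tested corpus
}

// Toy stand-ins for the two patched builds, included so the sketch compiles
// and runs on its own.
static int ref_impl(const uint8_t *d, size_t n, std::string &s)  { s.assign(reinterpret_cast<const char *>(d), n); return 0; }
static int cand_impl(const uint8_t *d, size_t n, std::string &s) { s.assign(reinterpret_cast<const char *>(d), n); return 0; }

int main() {
    std::vector<std::vector<uint8_t>> corpus = {{1, 2, 3}, {0xff, 0x00}};
    std::cout << (states_match(ref_impl, cand_impl, corpus) ? "match\n" : "diverge\n");
}
```

In practice such a check needs inputs beyond the original triggering one, so that a patch that merely suppresses the crash, for example by returning early, shows up as a behavioral divergence rather than a pass.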

While AutoPatchBench contains the full set of 136 ARVO samples, AutoPatchBench-Lite is restricted to the 113 samples whose crash root cause is confined to a single function, making it more suitable for early-stage development tools or those focusing on simpler crash scenarios.

AutoPatchBench is part of CyberSecEval 4, an extensive benchmark suite for evaluating the vulnerability-defense capabilities of LLMs. Meta has open-sourced the reference implementation so the community can use it or build on it, for example in open-source projects that rely on fuzz testing, to improve patch-generation models.