Anthropic has launched Claude Sonnet 4.5, its most advanced coding model to date, offering major enhancements in agent tasks, long-duration performance, and computer interaction capabilities. According to the company, improved training and safety protocols have significantly refined the model’s behavior, reducing tendencies toward flattery, deception, power-seeking, and delusional reasoning. The model is now available via the Claude API, as well as desktop and mobile apps, at the same price as its predecessor.
Claude Sonnet 4.5 builds on Anthropic's strategy of incremental model improvements while maintaining consistency and safety. The model can sustain complex multi-step reasoning and code-execution tasks for more than 30 hours. On the SWE-bench Verified benchmark, which measures an AI model's ability to solve real-world software issues, Sonnet 4.5 scored 77.2%, up from 72.7% for Sonnet 4, a notable leap in autonomous coding capability. On the OSWorld benchmark, which evaluates real-world computer-use skills, Sonnet 4.5 achieved 61.4%, a significant increase from 42.2% four months prior.
Anthropic describes Sonnet 4.5 as its “most consistent frontier model,” highlighting the balance between enhanced capabilities and rigorous safety measures. Under ASL-3, the company has improved automated classifiers to detect and block potentially harmful instructions, including those related to chemical, biological, radiological, or nuclear (CBRN) risks. According to Anthropic, false positive rates have dropped tenfold compared to when Claude Opus 4 was launched in May 2025.
To evaluate Claude Sonnet 4.5’s behavior in autonomous, tool-supported scenarios, Anthropic ran a series of agent safety tests covering malicious code generation and adversarial prompt-injection attacks. Of 150 malicious coding requests prohibited under Anthropic’s usage policies, Sonnet 4.5 refused all but two, reflecting enhanced safety training. The model achieved a safety score of 98.7%, compared to 89.3% for Claude Sonnet 4, demonstrating significantly stronger refusal behavior and resistance to malicious agent use.
Anthropic encourages all users to upgrade to Claude Sonnet 4.5, positioning it as a “drop-in replacement” that delivers superior performance without added cost.
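In practice, a "drop-in replacement" means the upgrade is a change to the model identifier in an API request. The sketch below, based on the public `anthropic` Python SDK, illustrates this; the exact model alias strings are assumptions and should be checked against Anthropic's current model list.

```python
# Sketch: upgrading to Sonnet 4.5 by swapping the model identifier.
# The alias strings below are assumptions; verify the exact versioned
# names in Anthropic's model documentation before use.

OLD_MODEL = "claude-sonnet-4-0"   # previous Sonnet (assumed alias)
NEW_MODEL = "claude-sonnet-4-5"   # Sonnet 4.5 (assumed alias)

def build_request(prompt: str, model: str = NEW_MODEL) -> dict:
    """Assemble keyword arguments for client.messages.create();
    nothing else in the request needs to change for the upgrade."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }

# With the official SDK (pip install anthropic), usage would be:
#
#   import anthropic
#   client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
#   response = client.messages.create(**build_request("Refactor this function"))
```

Because pricing is unchanged from the predecessor, this one-line model swap is the entire migration for most API users.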
Early adopters report measurable gains in coding workflows:
Scott Wu, co-founder and CEO of Cognition, noted, “For Devin, Claude Sonnet 4.5 improved planning performance by 18% and end-to-end evaluation scores by 12%—the largest improvement we’ve seen since the release of Claude Sonnet 3.6. It excels at testing its own code, allowing Devin to run longer, tackle more difficult tasks, and deliver production-ready code.”
Michele Catasta, president of Replit, shared, “Claude Sonnet 4.5’s editing capabilities are outstanding. We reduced error rates from 9% with Sonnet 4 to 0% on our internal code editing benchmark. Higher tool success rates and lower costs represent a major leap forward for agent-based coding. Claude Sonnet 4.5 strikes the perfect balance between creativity and control.”
Independent open-source developer Simon Willison wrote on his blog, “My initial impression is that it feels more suited to coding than GPT-5-Codex. It has been my go-to coding model since its release a few weeks ago.”
Anthropic’s push for safer, more autonomous coding models aligns with similar advancements in the AI ecosystem. OpenAI recently released GPT-5-Codex, a version of GPT-5 optimized for complex software engineering tasks such as large-scale code refactoring and extended code review workflows.