Anthropic has launched Claude Opus 4 and Claude Sonnet 4, the newest additions to its Claude family of large language models (LLMs). Both models feature enhanced capabilities in extended reasoning, tool use, and memory. Claude Opus 4 outperforms other LLMs on coding benchmarks.
The announcement was made at Anthropic's "Code with Claude" developer event. The Claude 4 models are described as "hybrid" systems, capable of both near-instant responses and extended reasoning. In extended reasoning mode, the models can use tools such as web search, execute multiple tools in parallel, and store information in local files that serve as memory. Claude Opus 4 scores 72.5% on SWE-bench and 43.2% on Terminal-bench, surpassing all other models on these coding benchmarks. Anthropic has also made Claude Code, its coding agent, generally available, along with beta extensions for VS Code and JetBrains. According to Anthropic,
These models represent a significant step toward virtual collaborators — maintaining full context, focusing on longer projects, and driving transformative impact. They have undergone extensive testing and evaluation to minimize risks and enhance safety, including the implementation of higher AI safety standards such as ASL-3. We look forward to seeing what you create.
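The hybrid behavior described above is exposed through Anthropic's Messages API, where extended reasoning is enabled per request and tools such as web search run server-side. The following is a minimal sketch, assuming the Anthropic Python SDK; the model identifier, the web search tool type string, and the token budgets are drawn from Anthropic's public API documentation rather than the announcement itself, so they may need adjusting.

```python
# Minimal sketch: Claude Opus 4 in extended reasoning mode with the
# server-side web search tool, via the Anthropic Python SDK.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed Opus 4 model identifier
    max_tokens=4096,
    # Enabling "thinking" switches the hybrid model from a quick reply to
    # extended reasoning, capped by a token budget.
    thinking={"type": "enabled", "budget_tokens": 2048},
    # Server-side web search tool; the model decides when and how often to call it.
    tools=[{"type": "web_search_20250305", "name": "web_search", "max_uses": 3}],
    messages=[
        {"role": "user", "content": "Summarize the current SWE-bench leaderboard."}
    ],
)

# The response interleaves thinking, tool-use, and text blocks;
# here we only print the final text.
for block in response.content:
    if block.type == "text":
        print(block.text)
```

Exposing reasoning as a per-request token budget, rather than a simple on/off flag, lets developers trade latency and cost against reasoning depth for each call.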
Claude 4 introduces several improvements over its predecessors. Anthropic states that Claude 4 is 65% less likely to take "shortcuts" when completing tasks. It also demonstrates "significantly superior memory capabilities compared to all previous models" when given access to local files for storing information. In extended reasoning mode, chain-of-thought output is condensed into a summary roughly "5% of the time," keeping long reasoning traces compact for display.
Image Credit: Anthropic's Claude 4 Announcement
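The memory behavior described above depends on developers giving the model tools for reading and writing local files, which it can use to keep notes across steps. The sketch below is illustrative only: the read_memory and write_memory tool names and their schemas are hypothetical, while the custom-tool format follows the standard Messages API; the calling application is responsible for actually performing the file operations.

```python
# Illustrative sketch of local-file "memory" tools passed to Claude Opus 4.
import anthropic

memory_tools = [
    {
        "name": "read_memory",  # hypothetical tool name
        "description": "Read the contents of a local memory file, if it exists.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "File path to read"}},
            "required": ["path"],
        },
    },
    {
        "name": "write_memory",  # hypothetical tool name
        "description": "Write notes to a local memory file for use in later turns.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path to write"},
                "content": {"type": "string", "description": "Notes to persist"},
            },
            "required": ["path", "content"],
        },
    },
]

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-opus-4-20250514",  # assumed model identifier
    max_tokens=1024,
    tools=memory_tools,
    messages=[{"role": "user", "content": "Record the key decisions from this session."}],
)

# When the model emits a tool_use block, the caller executes the file
# read/write and returns the result in a follow-up tool_result message.
for block in response.content:
    if block.type == "tool_use":
        print(block.name, block.input)
```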
In a Hacker News discussion, users debated whether the new models' improvements justify a full major-version increment. One user commented:
I’m a developer, and I’ve been experimenting with AI for writing application code for two years. This is the first time I’ve been able to write application code without significant manual intervention at every step. It’s not perfect, and I wouldn’t trust it completely without human review, but I did manage to build a production-ready iOS/Android/web app capable of accepting payments in under 24 hours, with almost no manual intervention beyond telling it what I wanted to do next.
Open-source developer Simon Willison live-blogged the release event and provided an in-depth analysis of the Claude 4 system card, which documents various scenarios and results from Anthropic's safety testing.
Anthropic’s system cards are always worth checking out, and this one for Opus 4 and Sonnet 4 contains some particularly fascinating insights. It’s also 120 pages long — nearly three times the length of the Claude 3.7 Sonnet system card! If you’re into hard sci-fi… this document will definitely satisfy your curiosity.
Anthropic's testing reveals that its models may take "extreme actions" in certain scenarios; while these instances are "rare and difficult to trigger," they occur more frequently than in earlier models. As part of its Responsible Scaling Policy (RSP), Anthropic has activated its AI Safety Level 3 (ASL-3) deployment and security standards with the release of Claude 4. This includes enhanced internal security measures aimed at preventing theft of the model weights.