Anthropic Launches Claude 4, Claiming the World's Best Coding Model Able to Work Steadily for Seven Hours

2025-05-23

Brief Overview

  • After a prolonged delay, Claude 4 has finally been released, surpassing GPT-4.1 and Gemini 2.5 Pro in the SWE-bench coding benchmark.
  • The new model can autonomously code for up to seven hours and handle context windows of nearly one million tokens.
  • Anthropic has priced Claude Opus 4 at $75 per million output tokens—25 times more expensive than open-source alternatives like DeepSeek R1.

Anthropic unveiled its much-anticipated Claude 4 AI model family on Thursday, after months of delays. The San Francisco-based company, a major player in the fiercely competitive AI industry with a valuation exceeding $61 billion, claims the new models set top marks in coding benchmarks and autonomous task execution.

The newly launched models replace the two most powerful models in the Claude lineup: Opus, the high-end model built for complex tasks, and Sonnet, a mid-sized model suited to everyday work. The smallest and most efficient model, Claude Haiku, remains unchanged at version 3.5.

Claude Opus 4 scored 72.5% on SWE-bench Verified, significantly outperforming competitors on the coding benchmark. OpenAI's GPT-4.1 scored just 54.6%, while Google's Gemini 2.5 Pro reached 63.2%. In reasoning, Opus 4 scored 74.9% on GPQA Diamond, a benchmark of graduate-level science questions, versus GPT-4.1's 66.3%.

The model also outperformed rivals in other benchmarks measuring agent tasks, mathematics, and multilingual query capabilities.

Anthropic tailored Opus 4 with developers in mind, emphasizing sustained autonomous work sessions.

Rakuten’s AI team reported that the model coded independently for nearly seven hours on a complex open-source project. Yusuke Kaji, the team's general manager, described it as "a giant leap in AI capability, leaving the team in awe," according to a statement Anthropic shared with Decrypt. That endurance far exceeds the typical task-duration limits of earlier AI models.

The two Claude 4 models are hybrids that offer either near-instant responses or an extended thinking mode for complex reasoning, a design close to what OpenAI plans to achieve with GPT-5 by merging the “o” and “GPT” families into a single model.

Opus 4 supports up to 128,000 output tokens for extended analysis and integrates tool usage during the thinking phase, allowing it to pause reasoning to search the web or access databases before resuming. These models can handle a full context window of nearly one million tokens.
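
For developers curious what that looks like in practice, below is a minimal sketch of invoking extended thinking together with a tool via Anthropic's Python SDK. The model identifier, token budgets, and the query_orders_db tool are illustrative assumptions, not details confirmed by the article.

```python
# Minimal sketch: extended thinking plus a custom tool with the Anthropic Python SDK.
# Model id, budgets, and the tool definition are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

response = client.messages.create(
    model="claude-opus-4-20250514",          # assumed Opus 4 model identifier
    max_tokens=8192,                         # must leave room beyond the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4096},  # extended thinking mode
    tools=[{
        "name": "query_orders_db",           # hypothetical tool, for illustration only
        "description": "Run a read-only SQL query against the orders database.",
        "input_schema": {
            "type": "object",
            "properties": {"sql": {"type": "string"}},
            "required": ["sql"],
        },
    }],
    messages=[{
        "role": "user",
        "content": "Find last quarter's ten largest orders and summarize them.",
    }],
)

# The response interleaves thinking, tool_use, and text blocks; the caller runs
# any requested tool and sends back a tool_result message to continue the loop.
for block in response.content:
    print(block.type)
```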

Anthropic prices Claude Opus 4 at $15 per million input tokens and $75 per million output tokens. Claude Sonnet 4 costs $3 per million input tokens and $15 per million output tokens. Companies benefit from up to 90% cost savings via prompt caching and a 50% reduction through batching, although base rates remain higher than some competitors.
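
As a rough illustration of how those rates and discounts combine, here is a back-of-the-envelope sketch. Applying the 90% caching and 50% batching figures as flat multipliers is a simplifying assumption for arithmetic clarity; real savings depend on how much of a workload each mechanism actually covers.

```python
# Illustrative cost arithmetic for Claude Opus 4 at the quoted list prices.
INPUT_RATE = 15 / 1_000_000    # USD per input token
OUTPUT_RATE = 75 / 1_000_000   # USD per output token

def opus4_cost(input_tokens, output_tokens, cached_share=0.0, batched=False):
    """Rough USD estimate. cached_share is the fraction of input tokens assumed to
    hit the prompt cache (treated as 90% cheaper); batched applies the 50% batch
    discount to the whole request. Both are simplifying assumptions."""
    input_cost = input_tokens * INPUT_RATE * (1 - 0.9 * cached_share)
    output_cost = output_tokens * OUTPUT_RATE
    total = input_cost + output_cost
    return total * 0.5 if batched else total

# 200K input / 20K output tokens: about $4.50 at list price,
# roughly $1.03 if 90% of the input is cached and the job runs through batching.
print(round(opus4_cost(200_000, 20_000), 2))
print(round(opus4_cost(200_000, 20_000, cached_share=0.9, batched=True), 2))
```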

Even so, that remains a steep price compared with open-source options such as DeepSeek R1, which costs less than $3 per million output tokens. A Claude 4 version of Haiku, which should be more affordable, has not yet been announced.

Another Year in AI

Anthropic’s release coincides with the general availability of Claude Code, an agentic command-line tool that lets developers delegate large-scale engineering tasks directly from the terminal. The tool can search codebases, edit files, write tests, and commit changes to GitHub under developer supervision.

GitHub announced that Claude Sonnet 4 will serve as the base model for the new coding agent in GitHub Copilot. CEO Thomas Dohmke said early internal evaluations show improvements of up to 10% over the previous Sonnet version, thanks to what he calls "adaptive tool use, precise instruction following, and strong coding intuition."

This positions Anthropic in direct competition with recent releases from OpenAI and Google. Last week, OpenAI introduced Codex, a cloud-based software engineering agent, while this week, Google previewed Jules and its new Gemini model family, designed for extensive coding sessions.

Several enterprise clients have validated specific use cases. Triple Whale’s CEO AJ Orbach stated that Opus 4 "excels in text-to-SQL use cases—ranked as the best model we’ve