Anthropic Releases Claude Opus 4.6 with 1 Million Token Context Support

2026-02-06

Anthropic has introduced Claude Opus 4.6, the successor to its flagship model Opus 4.5. It is the first Opus model to offer a one-million-token context window, currently available in beta.

Larger context windows come with a known challenge: as the amount of information a model must track grows, its performance tends to degrade, a phenomenon researchers call "context rot." Anthropic says it has mitigated this issue through architectural improvements and a new "compression" feature that automatically summarizes older context before the window fills up.

In the MRCR v2 benchmark, which evaluates a model's ability to locate specific information within extensive texts, Opus 4.6 achieved a score of 76% with a one-million-token context. In contrast, the smaller Sonnet 4.5 model scored only 18.5% under the same conditions.

The model is accessible via claude.ai, its API, and major cloud platforms. Standard pricing is set at $5 per million input tokens and $25 per million output tokens. For prompts exceeding 200,000 tokens, premium rates apply: $10 per million input tokens and $37.50 per million output tokens.
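The two pricing tiers can be illustrated with a small cost estimator. This is a sketch based only on the rates stated above; it assumes the premium tier is selected per request once the prompt exceeds 200,000 input tokens, and actual billing details may differ.

```python
# Sketch: estimating Opus 4.6 request cost under the two pricing tiers
# described in the article. Rates are USD per million tokens.
STANDARD = {"input": 5.00, "output": 25.00}
PREMIUM = {"input": 10.00, "output": 37.50}  # prompts over 200K input tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request (assumed tier rule)."""
    rates = PREMIUM if input_tokens > 200_000 else STANDARD
    return (input_tokens * rates["input"]
            + output_tokens * rates["output"]) / 1_000_000

# A 100K-token prompt with a 4K-token reply bills at standard rates:
print(round(estimate_cost(100_000, 4_000), 2))  # 0.6
# A 500K-token prompt crosses the 200K threshold, so premium rates apply:
print(round(estimate_cost(500_000, 4_000), 2))  # 5.15
```

At these rates, a single full one-million-token prompt alone would cost about $10 in input tokens before any output is generated.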

Opus 4.6 Outperforms GPT-5.2 on Knowledge Work Benchmarks

On the GDPval-AA benchmark, which assesses knowledge work in fields like finance and law, Opus 4.6 achieved an Elo rating of 1606. This surpasses OpenAI's GPT-5.2 (1462) by 144 Elo points and exceeds its predecessor, Opus 4.5 (1416), by 190 points.

On "Humanity's Last Exam," a multidisciplinary reasoning test, the model scored 53.1% with tool usage, leading all competitors. Opus 4.6 also reached 65.4% on the agentic coding benchmark Terminal-Bench 2.0 and 84% on BrowseComp, which measures the ability to find hard-to-locate online information. As always, benchmark results offer only a rough approximation of real-world performance.

The company has also focused on enhancing the model's programming capabilities. According to Anthropic, Opus 4.6 plans more thoroughly, sustains autonomous task execution for longer periods, and operates more reliably in large codebases. On SWE-bench, Opus 4.6 with standard prompting did not outperform Opus 4.5; with tailored prompting, however, it achieved a slightly higher score of 81.42%.

Opus 4.6 also tends to verify its conclusions more often than necessary, a behavior researchers call overthinking, which can drive up costs and response times on straightforward queries. For simpler tasks, Anthropic recommends lowering the effort parameter from "high" to "medium."
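In practice, that recommendation amounts to routing requests by expected difficulty. The sketch below shows one hypothetical heuristic for choosing an effort level before calling the model; the routing rule (prompt length, an explicit flag) is an illustrative assumption, not Anthropic's logic, and how the value is passed to the API is not shown.

```python
# Hypothetical effort router: simple queries get "medium", hard ones "high",
# following the article's recommendation. The threshold below is an
# assumption for illustration only.
def pick_effort(prompt: str, needs_deep_reasoning: bool = False) -> str:
    """Choose an effort level for a request based on a rough difficulty guess."""
    if needs_deep_reasoning or len(prompt.split()) > 200:
        return "high"
    return "medium"

print(pick_effort("What is the capital of France?"))  # medium
print(pick_effort("Refactor this module for thread safety",
                  needs_deep_reasoning=True))  # high
```

The trade-off the article describes is cost and latency versus thoroughness: "medium" avoids paying for verification passes that a simple lookup does not need.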

New API Features and Office Integrations

Anthropic is rolling out several new API features. "Adaptive Reasoning" allows the model to determine when deeper analysis is necessary. The "Compression" feature automatically summarizes older dialogue as it approaches the context window limit. Maximum output is now 128,000 tokens. Within Claude Code, users can now employ "Agent Teams," where multiple AI agents work in parallel to complete tasks. This feature is currently in research preview.
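The idea behind the compression feature can be sketched in a few lines: once the running transcript nears the context limit, older turns are collapsed into a summary while recent turns stay verbatim. This is an illustrative sketch, not Anthropic's implementation; the summary string is a stand-in for what would be a model-generated summary, and `keep_recent` is an assumed knob.

```python
# Illustrative sketch of context compression: when the transcript's token
# count exceeds a limit, replace the oldest turns with a single summary
# entry and keep the most recent turns verbatim.
def compress_history(turns, token_counts, limit, keep_recent=4):
    """Return the turn list, compressing older turns if over the limit."""
    if sum(token_counts) <= limit:
        return turns  # still fits; nothing to do
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    # Stand-in for a model-generated summary of the older turns:
    summary = "[summary of %d earlier turns]" % len(old)
    return [summary] + recent

history = ["turn %d" % i for i in range(10)]
counts = [100] * 10  # pretend each turn is 100 tokens
print(compress_history(history, counts, limit=500))
```

A real implementation would also need to track the token cost of the summary itself and re-compress as the conversation keeps growing.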

For office users, Anthropic has updated its Excel integration and released a PowerPoint integration in research preview. The company states that within Excel, Claude can now process unstructured data, determine the appropriate structure, and execute multi-step changes in a single operation.

Prompt Injection Vulnerabilities Persist

Anthropic states that performance improvements have not compromised security. In automated behavior audits, Opus 4.6 exhibited a low incidence of problematic behaviors, such as deception or aiding misuse. However, Opus 4.6 is more susceptible to indirect prompt injection attacks compared to its predecessor, a particular concern for agent-based AI applications.

Notably, Anthropic has stopped reporting results for direct prompt injections, an area where Opus 4.5 had posted the best score in a field of otherwise weak results. The company justifies dropping the metric by stating that direct injections "involve a malicious user, whereas this section focuses on third-party threats that hijack a user's original intent." This suggests the model's robustness may be weaker than the published figures imply.