Caylent Catalysts™
Generative AI Strategy
Accelerate your generative AI initiatives with ideation sessions for use case prioritization, foundation model selection, and an assessment of your data landscape and organizational readiness.
Explore the newly released Claude Sonnet 4.6, Anthropic's best general-purpose model in terms of price-performance.
Anthropic has released Claude Sonnet 4.6, and it’s the best model in terms of price-performance for teams using coding assistants. On the headline engineering benchmarks, Sonnet 4.6 hits 79.6% on SWE-bench Verified, 59.1% on Terminal-Bench 2.0, and 72.5% on OSWorld-Verified, all improvements over Sonnet 4.5, with pricing staying at $3 per million input tokens and $15 per million output tokens.
Benchmarks are useful when they resemble real work: messy repos, brittle tests, ambiguous tickets, and multi-step tool interaction under constraints. The reason SWE-bench Verified remains the most legible metric for “can this model ship code?” is that it tests on real GitHub issues that have been curated as solvable by human engineers, and success is measured by whether the proposed fix passes the repository’s tests.
Terminal-Bench 2.0 is the other benchmark worth paying attention to if you build with coding agents. It evaluates real tasks in terminal/CLI environments, which maps closely to how modern coding agents actually operate (shell commands, repo navigation, tests, build tooling, etc.). Sonnet 4.6 posts 59.1% on Terminal-Bench 2.0 (default thinking configuration).
These are the improvements:

- SWE-bench Verified: 79.6%
- Terminal-Bench 2.0: 59.1%
- OSWorld-Verified: 72.5%, up from Sonnet 4.5's 61.4%
The OSWorld-Verified gain is the reason it’s not a throwaway mention in a Sonnet 4.6 writeup: it’s the biggest single-step capability jump in the release.
OSWorld-Verified measures whether an agent can complete real computer tasks such as editing documents, browsing the web, and managing files, by controlling a live Ubuntu VM via mouse and keyboard actions. In the evaluation setup, tasks run at 1080p with up to 100 action steps per task, and the score is a first-attempt success rate averaged over multiple runs.
Sonnet 4.6 achieves 72.5% on OSWorld-Verified, which puts it within 0.2 points of Opus 4.6 (72.7%) and well above Sonnet 4.5 (61.4%).
At a ~61% first-attempt success rate (Sonnet 4.5), many teams can demo computer-use agents but struggle to justify real automation because failure handling dominates the product work. However, with a ~72% first-attempt success rate (Sonnet 4.6), you’re much closer to a regime where approval workflows + retries + validation layers turn computer-use into reliable throughput.
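The difference compounds with retries. Under a simple independence assumption (a deliberate simplification; real failures correlate, so treat this as a back-of-the-envelope model), the probability that at least one of k attempts succeeds is 1 - (1 - p)^k:

```python
def success_within(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds,
    given a per-attempt success rate `p`."""
    return 1 - (1 - p) ** attempts

# First-attempt rates from the OSWorld-Verified results above
sonnet_45 = 0.614
sonnet_46 = 0.725

print(round(success_within(sonnet_45, 3), 3))  # 0.942
print(round(success_within(sonnet_46, 3), 3))  # 0.979
```

With three attempts, the residual failure rate drops from roughly 5.8% to roughly 2.1%, which is what "retries plus validation turn computer-use into reliable throughput" means in concrete terms.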
It’s still not capable of running unattended in production, but it's one more step in changing the economics of where you can deploy agents: internal ops automation, procurement portals, legacy vendor systems, and workflows where an API either doesn’t exist or costs more than the automation is worth.
A lot of model comparison commentary fails because it collapses capability into a single “better/worse” claim. Sonnet 4.6 forces a more specific framing: on computer-use automation it sits within 0.2 points of Opus 4.6 (72.5% vs. 72.7% on OSWorld-Verified), and on agentic coding it lands close enough that price and latency matter more than raw capability. That’s near-parity on two workloads that dominate many enterprise deployments: agentic coding and computer-use automation.
However, you can see the gap in deeper reasoning-heavy benchmarks. For example, on ARC-AGI-2 (Verified), Sonnet 4.6 is 58.3% while Opus 4.6 is 68.8%. And on Terminal-Bench 2.0, Opus 4.6 leads at 65.4% vs Sonnet's 59.1%. (Source: Claude Sonnet 4.6 System Card, Table 2.1.A.)
So the comparison is useful, but only when it’s tied to the specific tasks you’re buying the model for. If your product is dominated by interactive tool use and software engineering loops, Sonnet 4.6 is close enough to Opus 4.6 that pricing and latency often become the deciding factors. If your product is dominated by hard reasoning, research-grade synthesis, or high-stakes “get it exactly right” analysis, Opus 4.6 is your best choice.
The best safety metrics are those that translate into fewer production incidents and fewer “why did it refuse that?” support tickets. Sonnet 4.6 shows concrete movement on both dimensions, but the shape of that movement is important.
On single-turn violative requests, Sonnet 4.6 records a 99.38% harmless response rate overall, compared to 97.89% for Sonnet 4.5. In practice, this means fewer incidents where the model produces content that violates usage policy, which translates directly into reduced moderation overhead and fewer safety-related escalations for teams running Claude in user-facing products.
On higher-difficulty violative prompts (harder adversarial cases), Sonnet 4.6 is at 99.40% overall, compared to 98.40% for Sonnet 4.5. For teams building products that face adversarial or edge-case inputs, such as red-teaming, security research tools, or any application where users push boundaries, this means the model is materially harder to manipulate into producing harmful outputs, even under deliberate pressure.
Over-refusal is where models often trade safety for usability, and Sonnet 4.6 is more nuanced than “it refuses less”. On straightforward benign requests, Sonnet 4.6 shows an overall refusal rate of 0.41% (vs. 0.08% for Sonnet 4.5). That's technically a higher number, which looks like a regression if you only read the table. But this metric covers only the easiest benign prompts, i.e. requests that are unambiguously safe and where virtually all modern models already perform well. The increase likely reflects calibration tradeoffs from tightening the model's behavior on harder, more ambiguous prompts, where the gains are far more significant (see below).
But on higher-difficulty benign prompts (the realistic category, where legitimate requests get framed with sensitive terms, ambiguity, or edge-case structure), Sonnet 4.6 drops the refusal rate to 0.18%, while Sonnet 4.5 sits at 8.50%.
That second number is the one most teams feel in production. If you’ve ever had a model refuse a legitimate task because it sounded like “security”, “medical”, “weapons”, or “policy”, the high-difficulty benign suite is closer to that real-world boundary region. Sonnet 4.6 is dramatically better there. Moving from 8.50% to 0.18% on higher-difficulty benign prompts means that teams building in sensitive domains like healthcare, legal, security, and education will see far fewer cases where a legitimate user request gets incorrectly blocked because the model misreads the intent behind ambiguous or technical phrasing. This was one of the most common pain points with Sonnet 4.5 in production, and the ~47x reduction in false refusals on hard prompts is the single biggest quality-of-life improvement for developers working in these areas.
For agentic risk, Sonnet 4.6 shows strong results in malicious agentic coding refusal rate tests (reported as 100.0% on the relevant evaluation table).
The model is deployed with ASL-3 safeguards under Anthropic’s safety framework, which matters to teams that need a defensible baseline when security and compliance stakeholders ask about the model's safety posture.
If you deploy computer-use (browser control, app control), the primary failure mode is prompt injection: a hostile page or document convinces the agent to take actions you didn’t intend. Sonnet 4.6 materially improves here in the data that matters most: Best-of-N prompt injection attacks in browser environments.
Without safeguards, Sonnet 4.6 shows an attack success rate of 1.29% of scenarios (0.29% of attempts) under standard thinking, compared to Sonnet 4.5 at 49.36% of scenarios (16.23% of attempts). With safeguards enabled, Sonnet 4.6 improves further. Under updated safeguards, the attack success rate drops to 0.51% of scenarios and 0.08% of attempts.
That’s a strong improvement curve, but it does not remove the need for standard agent hardening: treating all page and document content as untrusted input, allowlisting the domains and actions the agent can touch, gating irreversible steps behind human approval, and logging every action for audit.
Sonnet 4.6's price is the same as Sonnet 4.5's: $3 per million input tokens and $15 per million output tokens.
Sonnet 4.6 uses adaptive thinking, and the API exposes an effort parameter. The default effort is high, and medium is recommended as a practical balance between capability and latency/cost. Thinking tokens are billed as output tokens, so if a request triggers 30,000 thinking tokens, the thinking component alone costs $0.45.
That’s not a problem when the request replaces an hour of engineering time. It is a problem when the request is to rename a variable. It’s often best to set effort appropriately and avoid paying for reasoning you don’t need.
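The arithmetic is worth making explicit. Here is a minimal cost sketch using the $3/$15 per-million rates stated above and treating thinking tokens as billed output; the function name and structure are illustrative, not an official SDK API:

```python
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token (thinking tokens bill here too)

def request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimated on-demand cost for one Sonnet 4.6 request, in dollars."""
    return (input_tokens * INPUT_PRICE
            + (output_tokens + thinking_tokens) * OUTPUT_PRICE)

# 30,000 thinking tokens alone cost $0.45, as noted above
print(round(request_cost(0, 0, thinking_tokens=30_000), 2))  # 0.45
```

If a workload averages thousands of small requests per day, an unnecessarily high effort setting turns that $0.45 into a recurring line item, which is why tuning effort per task class pays off.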
Just like Sonnet 4.5 and Opus 4.6, Sonnet 4.6 supports a 1M token context window (beta). Even with the 1M context flag enabled, requests with ≤ 200K input tokens are charged at standard rates. If your request exceeds 200K input tokens, all tokens (input and output) are charged at premium long-context rates. For Sonnet 4.6, the premium long-context pricing is $6 per million input tokens and $22.50 per million output tokens.
The 200K threshold is based solely on input tokens. Output length does not determine whether you cross into premium pricing (but output will be billed at the premium rate once you do).
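A sketch of how the tier switch works, assuming the rates quoted above (hypothetical helper, for cost estimation only); note that the whole request flips to premium rates once input exceeds 200K:

```python
def long_context_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in dollars with the 1M-context flag enabled.
    <= 200K input tokens: standard rates; above that, premium rates apply to ALL tokens."""
    if input_tokens <= 200_000:
        in_rate, out_rate = 3.00, 15.00    # standard $/M
    else:
        in_rate, out_rate = 6.00, 22.50    # premium long-context $/M
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(round(long_context_cost(200_000, 5_000), 4))  # 0.675  (standard tier)
print(round(long_context_cost(200_001, 5_000), 4))  # 1.3125 (premium tier)
```

Crossing the threshold by a single token nearly doubles the bill in this example, which is why trimming context below 200K (or caching a shared prefix) is often worth real engineering effort.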
Prompt caching is still the highest-leverage cost optimization for repeated large context. Cache read tokens cost 0.1 * base input price, 5-minute cache write tokens cost 1.25 * base input price, and 1-hour cache write tokens cost 2 * base input price.
Batch processing is the other major lever when immediate responses are not required. The Batch API pricing for Sonnet 4.6 is $1.50 per million input tokens and $7.50 per million output tokens, 50% of the on-demand price.
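Both levers are easy to quantify. A sketch using the multipliers and batch rates above (with Sonnet 4.6's $3/M base input price, cache reads come to $0.30/M; the helper function is illustrative):

```python
BASE_IN, BASE_OUT = 3.00, 15.00  # $/M, on-demand

CACHE_READ = 0.1 * BASE_IN       # 0.30 $/M for cached prefix reads
CACHE_WRITE_5M = 1.25 * BASE_IN  # 3.75 $/M for 5-minute cache writes
CACHE_WRITE_1H = 2.0 * BASE_IN   # 6.00 $/M for 1-hour cache writes
BATCH_IN, BATCH_OUT = 0.5 * BASE_IN, 0.5 * BASE_OUT  # 1.50 / 7.50 $/M

def cached_request_cost(cached_tokens: int, fresh_in: int, out: int) -> float:
    """Cost in dollars of a request that reads `cached_tokens` from a warm prompt cache."""
    return (cached_tokens * CACHE_READ + fresh_in * BASE_IN + out * BASE_OUT) / 1_000_000

# 100K-token system prompt served from cache, 5K fresh input, 2K output:
print(round(cached_request_cost(100_000, 5_000, 2_000), 3))  # 0.075
# The same request with no caching at all:
print(round((105_000 * BASE_IN + 2_000 * BASE_OUT) / 1_000_000, 3))  # 0.345
```

In this example a warm cache cuts the request from $0.345 to $0.075, roughly a 4.6x reduction, and batch processing halves whatever remains for workloads that can wait.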
Sonnet 4.6 is available across Claude’s first-party products and the API, and on Amazon Bedrock under the model ID anthropic.claude-sonnet-4-6. Bedrock offers global and regional endpoints, with regional endpoints carrying a 10% pricing premium over global endpoints. The global inference profile ID for Sonnet 4.6 is global.anthropic.claude-sonnet-4-6.
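A minimal invocation sketch for Bedrock, using the model IDs from this article. The request body follows the Bedrock Anthropic Messages format; treat the exact field names and the commented boto3 call as assumptions to verify against the current Bedrock documentation:

```python
import json

# Global inference profile ID, as given above
MODEL_ID = "global.anthropic.claude-sonnet-4-6"

def build_body(prompt: str, max_tokens: int = 1024) -> str:
    """Build an InvokeModel request body in the Anthropic Messages format."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

body = build_body("Summarize the failing test output and propose a fix.")
# With boto3 (not executed here):
#   client = boto3.client("bedrock-runtime")
#   resp = client.invoke_model(modelId=MODEL_ID, body=body)
#   answer = json.loads(resp["body"].read())["content"][0]["text"]
```

Routing through the global inference profile keeps you on the cheaper global endpoint; switch to a regional profile only when data residency requires it, and expect the 10% premium.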
Model selection should be driven by capability requirements, not release recency. Sonnet 4.6 just makes that decision easy in many cases.
For most teams running coding agents or computer-use workflows, upgrading to Sonnet 4.6 is the straightforward call. It achieved meaningful gains on SWE-bench Verified, Terminal-Bench 2.0, and OSWorld-Verified. It delivers substantially stronger performance on difficult prompt injection scenarios, materially better behavior in higher-difficulty over-refusal conditions, and comes with no increase in base token pricing.
The main operational adjustment is cost discipline around thinking and long context: Sonnet 4.6 can spend a lot of output tokens reasoning, and the 1M context capability is only “cheap” if you stay under the 200K input threshold.
If you're evaluating where Claude Sonnet 4.6 fits in your architecture or deciding how to balance capability, safety, and cost in production, now is the right time to reassess your model strategy. The gains in agentic coding, computer-use automation, and prompt injection resilience meaningfully shift what’s viable at scale. If you’d like help benchmarking Sonnet 4.6 in your environment, tuning effort and long-context usage, or deploying it securely on Amazon Bedrock, our team can help you design, validate, and optimize for real production outcomes. Reach out to us today to get started.
Guille Ojeda is a Senior Innovation Architect at Caylent, a speaker, author, and content creator. He has published 2 books, over 200 blog articles, and writes a free newsletter called Simple AWS with more than 45,000 subscribers. He's spoken at multiple AWS Summits and other events, and was recognized as AWS Builder of the Year in 2025.