Caylent Catalysts™
Generative AI Strategy
Accelerate your generative AI initiatives with ideation sessions for use case prioritization, foundation model selection, and an assessment of your data landscape and organizational readiness.
Explore the newly released Claude Sonnet 4.6, Anthropic's best general-purpose model in terms of price-performance.
Anthropic has released Claude Sonnet 4.6, and it’s the best model in terms of price-performance for teams using coding assistants. On the headline engineering benchmarks, Sonnet 4.6 hits 79.6% on SWE-bench Verified, 59.1% on Terminal-Bench 2.0, and 72.5% on OSWorld-Verified, all improvements over Sonnet 4.5, with pricing staying at $3 per million input tokens and $15 per million output tokens.
Benchmarks are useful when they resemble real work: messy repos, brittle tests, ambiguous tickets, and multi-step tool interaction under constraints. The reason SWE-bench Verified remains the most legible metric for “can this model ship code?” is that it tests on real GitHub issues that have been curated as solvable by human engineers, and success is measured by whether the proposed fix passes the repository’s tests.
Terminal-Bench 2.0 is the other benchmark worth paying attention to if you build with coding agents. It evaluates real tasks in terminal/CLI environments, which maps closely to how modern coding agents actually operate (shell commands, repo navigation, tests, build tooling, etc.). Sonnet 4.6 posts 59.1% on Terminal-Bench 2.0 (default thinking configuration).
These are the improvements:

- SWE-bench Verified: 79.6%
- Terminal-Bench 2.0: 59.1%
- OSWorld-Verified: 72.5%, up from Sonnet 4.5's 61.4%
The OSWorld-Verified gain is the reason it’s not a throwaway mention in a Sonnet 4.6 writeup: it’s the biggest single-step capability jump in the release.
OSWorld-Verified measures whether an agent can complete real computer tasks such as editing documents, browsing the web, and managing files, by controlling a live Ubuntu VM via mouse and keyboard actions. In the evaluation setup, tasks run at 1080p with up to 100 action steps per task, and the score is a first-attempt success rate averaged over multiple runs.
Sonnet 4.6 achieves 72.5% on OSWorld-Verified, which puts it within 0.2 points of Opus 4.6 (72.7%) and well above Sonnet 4.5 (61.4%).
At a ~61% first-attempt success rate (Sonnet 4.5), many teams can demo computer-use agents but struggle to justify real automation because failure handling dominates the product work. However, with a ~72% first-attempt success rate (Sonnet 4.6), you’re much closer to a regime where approval workflows + retries + validation layers turn computer-use into reliable throughput.
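The difference compounds with retries. Under a simple independence assumption (a deliberate simplification; real failures correlate, so treat this as a back-of-the-envelope model), the probability that at least one of k attempts succeeds is 1 - (1 - p)^k:

```python
def success_within(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds,
    given a per-attempt success rate `p`."""
    return 1 - (1 - p) ** attempts

# First-attempt rates from the OSWorld-Verified results above
sonnet_45 = 0.614
sonnet_46 = 0.725

print(round(success_within(sonnet_45, 3), 3))  # 0.942
print(round(success_within(sonnet_46, 3), 3))  # 0.979
```

With three attempts, the residual failure rate drops from roughly 5.8% to roughly 2.1%, which is what "retries plus validation turn computer-use into reliable throughput" means in concrete terms.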
It’s still not capable of running unattended in production, but it's one more step in changing the economics of where you can deploy agents: internal ops automation, procurement portals, legacy vendor systems, and workflows where an API either doesn’t exist or costs more than the automation is worth.
A lot of model comparison commentary fails because it collapses capability into a single “better/worse” claim. Sonnet 4.6 forces a more specific framing: on computer-use automation it sits within 0.2 points of Opus 4.6 (72.5% vs. 72.7% on OSWorld-Verified), and on agentic coding it lands close enough that price and latency matter more than raw capability. That’s near-parity on two workloads that dominate many enterprise deployments: agentic coding and computer-use automation.
However, you can see the gap in deeper reasoning-heavy benchmarks. For example, on ARC-AGI-2 (Verified), Sonnet 4.6 is 58.3% while Opus 4.6 is 68.8%. And on Terminal-Bench 2.0, Opus 4.6 leads at 65.4% vs Sonnet's 59.1%. (Source: Claude Sonnet 4.6 System Card, Table 2.1.A.)
So the comparison is useful, but only when it’s tied to the specific tasks you’re buying the model for. If your product is dominated by interactive tool use and software engineering loops, Sonnet 4.6 is close enough to Opus 4.6 that pricing and latency often become the deciding factors. If your product is dominated by hard reasoning, research-grade synthesis, or high-stakes “get it exactly right” analysis, Opus 4.6 is your best choice.
The best safety metrics are those that translate into fewer production incidents and fewer “why did it refuse that?” support tickets. Sonnet 4.6 shows concrete movement on both dimensions, but the shape of that movement is important.
On single-turn violative requests, Sonnet 4.6 records a 99.38% harmless response rate overall, compared to 97.89% for Sonnet 4.5. In practice, this means fewer incidents where the model produces content that violates usage policy, which translates directly into reduced moderation overhead and fewer safety-related escalations for teams running Claude in user-facing products.
On higher-difficulty violative prompts (harder adversarial cases), Sonnet 4.6 is at 99.40% overall, compared to 98.40% for Sonnet 4.5. For teams building products that face adversarial or edge-case inputs, such as red-teaming, security research tools, or any application where users push boundaries, this means the model is materially harder to manipulate into producing harmful outputs, even under deliberate pressure.
Over-refusal is where models often trade safety for usability, and Sonnet 4.6 is more nuanced than “it refuses less”. On straightforward benign requests, Sonnet 4.6 shows an overall refusal rate of 0.41% (vs. 0.08% for Sonnet 4.5). That's technically a higher number, which looks like a regression if you only read the table. But this metric covers only the easiest benign prompts, i.e. requests that are unambiguously safe and where virtually all modern models already perform well. The increase likely reflects calibration tradeoffs from tightening the model's behavior on harder, more ambiguous prompts, where the gains are far more significant (see below).
But on higher-difficulty benign prompts (the realistic category, where legitimate requests get framed with sensitive terms, ambiguity, or edge-case structure), Sonnet 4.6 drops the refusal rate to 0.18%, while Sonnet 4.5 sits at 8.50%.
That second number is the one most teams feel in production. If you’ve ever had a model refuse a legitimate task because it sounded like “security”, “medical”, “weapons”, or “policy”, the high-difficulty benign suite is closer to that real-world boundary region. Sonnet 4.6 is dramatically better there. Moving from 8.50% to 0.18% on higher-difficulty benign prompts means that teams building in sensitive domains like healthcare, legal, security, and education will see far fewer cases where a legitimate user request gets incorrectly blocked because the model misreads the intent behind ambiguous or technical phrasing. This was one of the most common pain points with Sonnet 4.5 in production, and the ~47x reduction in false refusals on hard prompts is the single biggest quality-of-life improvement for developers working in these areas.
For agentic risk, Sonnet 4.6 shows strong results in malicious agentic coding refusal rate tests (reported as 100.0% on the relevant evaluation table).
The model is deployed with ASL-3 safeguards under Anthropic’s safety framework, which matters to teams that need a defensible baseline when security and compliance stakeholders ask about the model's safety posture.
If you deploy computer-use (browser control, app control), the primary failure mode is prompt injection: a hostile page or document convinces the agent to take actions you didn’t intend. Sonnet 4.6 materially improves here in the data that matters most: Best-of-N prompt injection attacks in browser environments.
Without safeguards, Sonnet 4.6 shows an attack success rate of 1.29% of scenarios (0.29% of attempts) under standard thinking, compared to Sonnet 4.5 at 49.36% of scenarios (16.23% of attempts). With safeguards enabled, Sonnet 4.6 improves further. Under updated safeguards, the attack success rate drops to 0.51% of scenarios and 0.08% of attempts.
That’s a strong improvement curve, but it does not remove the need for standard agent hardening: treating all page and document content as untrusted input, allowlisting the domains and actions the agent can touch, gating irreversible steps behind human approval, and logging every action for audit.
Sonnet 4.6's price is the same as Sonnet 4.5's: $3 per million input tokens and $15 per million output tokens.
Sonnet 4.6 uses adaptive thinking, and the API exposes an effort parameter. The default effort is high, and medium is recommended as a practical balance between capability and latency/cost. Thinking tokens are billed as output tokens, so if a request triggers 30,000 thinking tokens, the thinking component alone costs $0.45.
That’s not a problem when the request replaces an hour of engineering time. It is a problem when the request is to rename a variable. It’s often best to set effort appropriately and avoid paying for reasoning you don’t need.
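The arithmetic is worth making explicit. Here is a minimal cost sketch using the $3/$15 per-million rates stated above and treating thinking tokens as billed output; the function name and structure are illustrative, not an official SDK API:

```python
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token (thinking tokens bill here too)

def request_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Estimated on-demand cost for one Sonnet 4.6 request, in dollars."""
    return (input_tokens * INPUT_PRICE
            + (output_tokens + thinking_tokens) * OUTPUT_PRICE)

# 30,000 thinking tokens alone cost $0.45, as noted above
print(round(request_cost(0, 0, thinking_tokens=30_000), 2))  # 0.45
```

If a workload averages thousands of small requests per day, an unnecessarily high effort setting turns that $0.45 into a recurring line item, which is why tuning effort per task class pays off.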
Just like Sonnet 4.5 and Opus 4.6, Sonnet 4.6 supports a 1M token context window (beta). Even with the 1M context flag enabled, requests with ≤ 200K input tokens are charged at standard rates. If your request exceeds 200K input tokens, all tokens (input and output) are charged at premium long-context rates. For Sonnet 4.6, the premium long-context pricing is $6 per million input tokens and $22.50 per million output tokens.
The 200K threshold is based solely on input tokens. Output length does not determine whether you cross into premium pricing (but output will be billed at the premium rate once you do).
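A sketch of how the tier switch works, assuming the rates quoted above (hypothetical helper, for cost estimation only); note that the whole request flips to premium rates once input exceeds 200K:

```python
def long_context_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in dollars with the 1M-context flag enabled.
    <= 200K input tokens: standard rates; above that, premium rates apply to ALL tokens."""
    if input_tokens <= 200_000:
        in_rate, out_rate = 3.00, 15.00    # standard $/M
    else:
        in_rate, out_rate = 6.00, 22.50    # premium long-context $/M
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

print(round(long_context_cost(200_000, 5_000), 4))  # 0.675  (standard tier)
print(round(long_context_cost(200_001, 5_000), 4))  # 1.3125 (premium tier)
```

Crossing the threshold by a single token nearly doubles the bill in this example, which is why trimming context below 200K (or caching a shared prefix) is often worth real engineering effort.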
Prompt caching is still the highest-leverage cost optimization for repeated large context. Cache read tokens cost 0.1 * base input price, 5-minute cache write tokens cost 1.25 * base input price, and 1-hour cache write tokens cost 2 * base input price.
Batch processing is the other major lever when immediate responses are not required. The Batch API pricing for Sonnet 4.6 is $1.50 per million input tokens and $7.50 per million output tokens, 50% of the on-demand price.
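Both levers are easy to quantify. A sketch using the multipliers and batch rates above (with Sonnet 4.6's $3/M base input price, cache reads come to $0.30/M; the helper function is illustrative):

```python
BASE_IN, BASE_OUT = 3.00, 15.00  # $/M, on-demand

CACHE_READ = 0.1 * BASE_IN       # 0.30 $/M for cached prefix reads
CACHE_WRITE_5M = 1.25 * BASE_IN  # 3.75 $/M for 5-minute cache writes
CACHE_WRITE_1H = 2.0 * BASE_IN   # 6.00 $/M for 1-hour cache writes
BATCH_IN, BATCH_OUT = 0.5 * BASE_IN, 0.5 * BASE_OUT  # 1.50 / 7.50 $/M

def cached_request_cost(cached_tokens: int, fresh_in: int, out: int) -> float:
    """Cost in dollars of a request that reads `cached_tokens` from a warm prompt cache."""
    return (cached_tokens * CACHE_READ + fresh_in * BASE_IN + out * BASE_OUT) / 1_000_000

# 100K-token system prompt served from cache, 5K fresh input, 2K output:
print(round(cached_request_cost(100_000, 5_000, 2_000), 3))  # 0.075
# The same request with no caching at all:
print(round((105_000 * BASE_IN + 2_000 * BASE_OUT) / 1_000_000, 3))  # 0.345
```

In this example a warm cache cuts the request from $0.345 to $0.075, roughly a 4.6x reduction, and batch processing halves whatever remains for workloads that can wait.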
Sonnet 4.6 is available across Claude’s first-party products and the API, and on Amazon Bedrock under the model ID anthropic.claude-sonnet-4-6. Bedrock offers global and regional endpoints, with regional endpoints carrying a 10% pricing premium over global endpoints. The global inference profile ID for Sonnet 4.6 is global.anthropic.claude-sonnet-4-6.
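A minimal invocation sketch for Bedrock, using the model IDs from this article. The request body follows the Bedrock Anthropic Messages format; treat the exact field names and the commented boto3 call as assumptions to verify against the current Bedrock documentation:

```python
import json

# Global inference profile ID, as given above
MODEL_ID = "global.anthropic.claude-sonnet-4-6"

def build_body(prompt: str, max_tokens: int = 1024) -> str:
    """Build an InvokeModel request body in the Anthropic Messages format."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    })

body = build_body("Summarize the failing test output and propose a fix.")
# With boto3 (not executed here):
#   client = boto3.client("bedrock-runtime")
#   resp = client.invoke_model(modelId=MODEL_ID, body=body)
#   answer = json.loads(resp["body"].read())["content"][0]["text"]
```

Routing through the global inference profile keeps you on the cheaper global endpoint; switch to a regional profile only when data residency requires it, and expect the 10% premium.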
Model selection should be driven by capability requirements, not release recency. Sonnet 4.6 just makes that decision easy in many cases.
For most teams running coding agents or computer-use workflows, upgrading to Sonnet 4.6 is the straightforward call. It achieved meaningful gains on SWE-bench Verified, Terminal-Bench 2.0, and OSWorld-Verified. It delivers substantially stronger performance on difficult prompt injection scenarios, materially better behavior in higher-difficulty over-refusal conditions, and comes with no increase in base token pricing.
The main operational adjustment is cost discipline around thinking and long context: Sonnet 4.6 can spend a lot of output tokens reasoning, and the 1M context capability is only “cheap” if you stay under the 200K input threshold.
If you're evaluating where Claude Sonnet 4.6 fits in your architecture or deciding how to balance capability, safety, and cost in production, now is the right time to reassess your model strategy. The gains in agentic coding, computer-use automation, and prompt injection resilience meaningfully shift what’s viable at scale. If you’d like help benchmarking Sonnet 4.6 in your environment, tuning effort and long-context usage, or deploying it securely on Amazon Bedrock, our team can help you design, validate, and optimize for real production outcomes. Reach out to us today to get started.
Guille Ojeda is a Senior Innovation Architect at Caylent, a speaker, author, and content creator. He has published 2 books, over 200 blog articles, and writes a free newsletter called Simple AWS with more than 45,000 subscribers. He's spoken at multiple AWS Summits and other events, and was recognized as AWS Builder of the Year in 2025.