Explore Anthropic’s newly released Claude Sonnet 4.5, including its record-breaking benchmark performance, enhanced safety and alignment features, and significantly improved cost-efficiency.
Anthropic has just released Claude Sonnet 4.5, setting new benchmark records: 77.2% on SWE-bench Verified and 61.4% on OSWorld for computer use. The model is both smarter and more cost-effective than Claude Opus 4.1 (Anthropic's previous top model), and the safety improvements are measurable.
At Caylent, the results of our internal testing align closely with the published benchmark scores. But one of our most notable findings is the price-to-performance advantage. Sonnet 4.5 is offered at the same price as Sonnet 4, a fifth of the price of Opus 4.1, while ranking in the same league as the best models from every major provider. This is a huge opportunity for Anthropic to capture a larger share of the market, especially in coding assistants, where you often want the best available model. In this blog, we’ll analyze the benchmark data to understand Sonnet 4.5’s performance advantages and the areas where it outperforms comparable models.
Let's talk about the benchmark that matters most for coding use cases. SWE-bench Verified tests models on real GitHub issues from actual open-source projects, not toy problems or synthetic tasks. The model needs to understand the codebase, reproduce the bug, implement a fix, and pass existing tests. A 77.2% success rate is the best result any model has ever achieved on this benchmark.
Sonnet 4.5 also achieved 61.4% on OSWorld, a benchmark that evaluates computer use capabilities by measuring how well a model can control browsers and applications through screenshots and actions. This is the highest score of any model on that benchmark by a significant margin, with previous best-in-class results ranging from roughly 40% to 45% depending on the specific tasks.
Now, it’s important to add some context. These benchmarks measure task completion under specific constraints, such as time limits and retry budgets. They don’t measure how often the model produces code that works but introduces subtle bugs, or how well it maintains consistency across large codebases, or how it handles proprietary APIs without extensive documentation. The benchmarks are strong signals of capability, but they're not comprehensive measures of production readiness.
This is where the model's behavior gets genuinely interesting from a technical perspective. Sonnet 4.5 recognizes evaluation scenarios as tests and behaves unusually well during safety evaluations. The model detected test conditions 58% of the time in Anthropic's internal evaluations (compared to 22% for Opus 4.1). This creates a fascinating challenge: how do you evaluate a model that knows it's being evaluated and adjusts its behavior accordingly?
Anthropic's solution was to use a white-box interpretability technique, the first time this has been done pre-deployment for a frontier model. They trained sparse autoencoders on the model's internal representations and looked for features that activated during evaluation scenarios. Then, they ran experiments in which they artificially inhibited those features to see if the safety improvements held up.
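To make that concrete, here's a toy sketch of the mechanics involved: train a sparse autoencoder to reconstruct a model's internal activations through a sparse feature layer, then zero out ("inhibit") a chosen feature and observe how downstream behavior changes. This is written in PyTorch purely for illustration; the dimensions, loss weights, and the "evaluation awareness" feature index are made up, and none of this is Anthropic's actual tooling or scale.

```python
# Toy sparse-autoencoder sketch. Dimensions, loss weights, and the inhibited
# feature index are illustrative assumptions, not Anthropic's setup.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_features=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations, inhibit_feature=None):
        # ReLU keeps most features at zero, which is what makes them "sparse"
        features = torch.relu(self.encoder(activations))
        if inhibit_feature is not None:
            # "Inhibiting" a feature = zeroing it out before reconstruction
            features = features.clone()
            features[:, inhibit_feature] = 0.0
        return self.decoder(features), features

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
activations = torch.randn(64, 512)  # stand-in for captured model activations

# Train to reconstruct activations with an L1 penalty that encourages sparsity
for _ in range(200):
    recon, feats = sae(activations)
    loss = ((recon - activations) ** 2).mean() + 1e-3 * feats.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, you'd identify features that fire on evaluation-style prompts,
# then rerun behavioral tests with those features suppressed:
with torch.no_grad():
    recon_inhibited, _ = sae(activations, inhibit_feature=123)  # hypothetical index
```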
The key finding is that even when they suppressed the model's ability to recognize evaluations, it was still safer than previous models. The awareness of evaluation influences some behaviors, but it's not the only mechanism driving safety improvements. Multiple factors contribute to alignment, both dependent on and independent of the model knowing it's being tested.
This means the benchmarks might be slightly optimistic, but the real-world safety characteristics are still measurably better than Sonnet 4. The 99.29% harmless response rate and 0.02% over-refusal rate appear to hold up in production contexts, not just controlled evaluations.
Here's a safety improvement you can actually feel in production: Sonnet 4.5 achieves a 99.29% harmless response rate on violative requests (up from 98.22% for Sonnet 4), while simultaneously dropping the over-refusal rate to 0.02% (down from 0.15%). That second number is the more interesting one for most production use cases, because over-refusal has a significant impact on user experience.
Over-refusal is when the model refuses a legitimate request because it incorrectly flags it as potentially harmful. If you've ever had Claude refuse to help with obviously benign code that happened to involve security concepts or medical terminology, you've experienced over-refusal. It manifests as user frustration, retry logic in your application, and support tickets asking why the AI won't help with normal tasks.
A 7.5x improvement in the over-refusal rate (0.15% to 0.02%) means the model has significantly improved at distinguishing between "write code to test for SQL injection vulnerabilities" and "write code to exploit SQL injection vulnerabilities." The first is legitimate security work; the second is potentially harmful. Previous models sometimes struggled with that distinction, a failure that research has attributed to misalignment in the boundary region between legitimate and harmful prompts.
The harmless response rate improvement is also meaningful, but we're seeing diminishing returns at this level. The difference between 98.22% and 99.29% is harder to notice in day-to-day use than the over-refusal improvements. What matters more is the direction: the model is both safer and more helpful, which is the right tradeoff curve.
Anthropic also reports significant improvements in multi-turn safety evaluations, which assess a model’s susceptibility to gradual manipulation over the course of multiple messages. Most risk categories achieved failure rates below 5%, with particularly strong performance on child safety scenarios. For teams operating in regulated industries or with strict compliance requirements, these safety characteristics translate directly to reduced risk exposure.
The base cost for Sonnet 4.5 is $3 per million input tokens and $15 per million output tokens.
This pricing applies as long as your context stays under 200,000 tokens, and both the rates and the threshold are exactly the same as Sonnet 4's.
However, once you go over 200,000 tokens of context, pricing increases to $6 per million input tokens and $22.50 per million output tokens, because you're using more of the model's context window (which is more computationally expensive).
Important: Crossing that 200,000-token boundary mid-conversation does change your rate, so your application needs to track context size proactively.
Extended thinking tokens count as output at the rate of $15 per million. For example, a complex reasoning task that generates 30,000 tokens of thinking would cost $0.45, and that’s before you count the actual response the model generates. This can add up fast if you're not careful about when you enable extended thinking.
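To make the arithmetic easy to reproduce, here's a minimal cost estimator that encodes the rates discussed above (base and long-context prices, with thinking tokens billed as output). The prices are hard-coded assumptions based on published rates at the time of writing, so verify them against Anthropic's pricing page before using this for real budgeting.

```python
# Minimal Sonnet 4.5 cost sketch. Rates ($ per million tokens) are assumptions
# based on published pricing at the time of writing; verify before relying on them.
BASE = {"input": 3.00, "output": 15.00}          # requests up to 200K input tokens
LONG_CONTEXT = {"input": 6.00, "output": 22.50}  # requests above 200K input tokens

def estimate_cost(input_tokens: int, output_tokens: int, thinking_tokens: int = 0) -> float:
    """Rough per-request cost in dollars."""
    billable_output = output_tokens + thinking_tokens  # thinking is billed as output
    # Simplified assumption: crossing the 200K input boundary switches the
    # whole request to the premium rate.
    rates = LONG_CONTEXT if input_tokens > 200_000 else BASE
    return (input_tokens * rates["input"] + billable_output * rates["output"]) / 1_000_000

# The example from the text: 30,000 thinking tokens alone cost $0.45
print(estimate_cost(input_tokens=0, output_tokens=0, thinking_tokens=30_000))  # 0.45
```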
Sonnet 4.5 Detailed Pricing

- Input tokens: $3.00 per million (up to 200K context); $6.00 per million above 200K
- Output tokens (including extended thinking): $15.00 per million (up to 200K context); $22.50 per million above 200K
- Cached input reads: $0.30 per million
- Batch processing: 50% discount in exchange for asynchronous (typically 24-hour) processing
Prompt caching provides up to 90% cost savings on repeated context. Cached reads cost $0.30 per million tokens instead of $3.00. If you're running multi-turn conversations with large system prompts or processing multiple queries against the same documentation, prompt caching significantly changes the tokenomics calculation.
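If you haven't used prompt caching yet, enabling it is mostly a matter of marking the large, stable portion of your prompt as cacheable. Here's a minimal sketch using the Anthropic Python SDK's cache_control content blocks; the model alias, file name, and prompt text are placeholders, so check the current API documentation for the exact options available.

```python
# Sketch: caching a large, reusable system prompt with the Anthropic Python SDK.
# Model alias and document contents are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

big_reference_doc = open("product_docs.md").read()  # hypothetical large, stable context

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You answer questions about our product documentation."},
        {
            "type": "text",
            "text": big_reference_doc,
            # Marks this block as cacheable: later requests that reuse the same
            # prefix read it at the cached-input rate instead of full price.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "How do I rotate an API key?"}],
)
print(response.content[0].text)
```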
Batch processing offers 50% cost savings if you can tolerate delayed responses (typically processed within 24 hours). This works well for bulk analysis tasks, nightly report generation, or any workload where real-time responses aren't required. The economics are straightforward: you pay half price in exchange for asynchronous processing. Combined with prompt caching, batch processing can reduce costs by up to 70% for certain workload patterns, though the 90% caching savings and the 50% batch savings don't simply add together: caching only discounts repeated input tokens, so the combined effect depends on how your workload splits between cached input, fresh input, and output.
Compare this to other models in the Claude family. Haiku 3.5 runs at $0.80/$4 (input/output), making it 3.75x cheaper on input and output than Sonnet 4.5. Opus 4.1 costs $15/$75, 5x more expensive than Sonnet 4.5 on both input and output. The performance doesn't scale linearly with cost, so you need to think about the task requirements. In most cases, Sonnet 4.5 serves as a strong default choice, with other models reserved for specialized use cases.
Claude Model Family Comparison (Base Rates)

- Haiku 3.5: $0.80 per million input tokens / $4.00 per million output tokens
- Sonnet 4.5: $3.00 per million input tokens / $15.00 per million output tokens
- Opus 4.1: $15.00 per million input tokens / $75.00 per million output tokens
Sonnet 4.5 is already available on Amazon Bedrock. If you're using it there, you should use the global.anthropic.claude-sonnet-4-5-20250929-v1:0 inference profile. If you use us.anthropic.claude-sonnet-4-5-20250929-v1:0, you'll pay 10% more for both input and output tokens.
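For reference, here's roughly what calling that global inference profile looks like with the AWS SDK's Converse API. The region, prompt, and inference settings are placeholders; IAM permissions for Bedrock are assumed to be in place.

```python
# Sketch: invoking Sonnet 4.5 on Amazon Bedrock through the global inference profile.
# Region, prompt, and inference settings are placeholders.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    # Global profile; per the note above, the us. profile costs 10% more
    modelId="global.anthropic.claude-sonnet-4-5-20250929-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize this release announcement."}]}],
    inferenceConfig={"maxTokens": 1024, "temperature": 0.2},
)
print(response["output"]["message"]["content"][0]["text"])
```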
Curious how much an LLM would cost your organization? Explore our dynamic token cost model tool and calculate it for yourself.
The decision tree for model selection isn't "newest is best," although in this case, it's pretty close to that, at least when deciding amongst Anthropic models.
Use Sonnet 4.5 when you're building agents that need strong reasoning, complex tool use, or computer control capabilities. The 64,000 output token limit (4x the typical 8K-16K limits) also makes it the right choice for tasks that generate extensive code or long-form content in a single response.
Use Haiku 3.5 or other smaller models from different providers for high-volume classification, simple extraction, or straightforward question-answering, provided you've validated that the lighter model handles your use case effectively. At $0.80/$4, Haiku can process 3.75x more requests than Sonnet 4.5 for the same cost. For many production workloads, Haiku's performance is sufficient, and the cost savings compound quickly at scale.
Use Opus 4.1 when your prompts are already optimized for its capabilities. Otherwise, switch to Sonnet 4.5, which is generally the more effective choice while awaiting potential future releases such as Opus 4.5 (which remains unconfirmed and may never be released, similar to the unreleased Opus 3.5). This advice may sound contrarian, but the reality is that the quality gains on Sonnet 4.5 put it above Opus 4.1 for most use cases, even though Sonnet is the medium-sized model in the Claude family.
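If you want to encode that decision tree in application code, it can be as simple as a lookup keyed on task type, with Sonnet 4.5 as the default. The task categories and model aliases below are illustrative assumptions, not an official mapping; the real work is in the evaluations you run to validate each routing choice.

```python
# Illustrative model router reflecting the guidance above. Categories and model
# aliases are assumptions you'd tune against your own evaluations.
MODEL_BY_TASK = {
    "agent": "claude-sonnet-4-5",           # strong reasoning, tool use, computer use
    "coding": "claude-sonnet-4-5",          # long outputs (up to 64K tokens)
    "classification": "claude-3-5-haiku",   # high volume, simple tasks
    "extraction": "claude-3-5-haiku",
    "legacy_opus_prompts": "claude-opus-4-1",  # prompts already tuned for Opus
}

def pick_model(task_type: str) -> str:
    # Default to Sonnet 4.5, the strong general-purpose choice described above
    return MODEL_BY_TASK.get(task_type, "claude-sonnet-4-5")

print(pick_model("classification"))  # claude-3-5-haiku
```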
Computer use is a capability that sounds like science fiction until you see it in action. Sonnet 4.5 can control browsers and applications by analyzing screenshots and taking actions, such as clicking, typing, navigating, and filling out forms. The 61.4% success rate on OSWorld makes it the best model in the world at this task (that's not marketing language, it's actually the top benchmark score).
What does this enable in practice? The use cases Anthropic highlights include competitive analysis by navigating competitor websites, procurement workflows that involve interacting with vendor portals, and customer onboarding processes that span multiple systems. These are real scenarios where you can't just use an API because the system you need to interact with was built in 2003, and the only interface is a web form.
The limitation is reliability. 61.4% means the model succeeds on about 6 out of 10 tasks in the benchmark conditions. That's impressive compared to previous capabilities, but it's not reliable enough for fully autonomous operation in most production contexts. You need human oversight, approval workflows, or validation layers to catch the 38.6% of cases where something goes wrong.
There are also important safety considerations. Computer use gives the model considerable power to interact with systems, which creates risk if someone attempts to misuse it or if the model makes errors. Anthropic has built safety guardrails, but they explicitly state that you should not deploy this in production without approval workflows and careful monitoring. Never run computer use autonomously in production environments without human oversight and validation checkpoints.
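The guardrail most teams start with is a human approval gate: the model proposes an action, and nothing potentially destructive executes until a person signs off. Here's a minimal sketch of that pattern; the Action shape, the list of destructive action kinds, and the executor are hypothetical stand-ins for whatever computer-use harness you actually run.

```python
# Minimal human-in-the-loop gate for computer-use actions. The Action shape,
# DESTRUCTIVE_KINDS list, and execute() stub are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "click", "type", "navigate", "submit_form"
    target: str    # e.g. a URL or a UI element description

DESTRUCTIVE_KINDS = {"submit_form", "delete", "purchase"}  # tune per workflow

def requires_approval(action: Action) -> bool:
    # Anything potentially irreversible goes to a human; reads and navigation don't
    return action.kind in DESTRUCTIVE_KINDS

def run_with_oversight(actions: list[Action], execute) -> None:
    for action in actions:
        if requires_approval(action):
            answer = input(f"Approve {action.kind} on {action.target}? [y/N] ")
            if answer.strip().lower() != "y":
                print("Skipped:", action)
                continue
        execute(action)

# Example usage with a stubbed executor:
run_with_oversight(
    [Action("navigate", "https://vendor-portal.example.com"),
     Action("submit_form", "purchase order form")],
    execute=lambda a: print("Executing:", a),
)
```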
The more interesting question is where this capability goes in the next 12 months. If the improvement curve from computer use holds—and there's no reason to think it won't—we might see success rates in the 80-90% range by this time next year. At that point, the economic calculations for automating legacy system integration change significantly.
Sonnet 4.5 was deployed with ASL-3 protections, which is Anthropic's internal safety classification (AI Safety Level 3, for those tracking the terminology). ASL-3 means the model is deemed safe for most production use cases with standard safeguards. It's not capable of autonomous large-scale cyberattacks, can't enable individuals to create bioweapons with only a basic technical background, and doesn't have the capability to autonomously perform the work of an entry-level AI researcher.
Those last points come from Anthropic's Responsible Scaling Policy evaluations, which test models against specific thresholds for catastrophic risk. Sonnet 4.5 improved on cyber capabilities and biological knowledge compared to previous models, but remained below the thresholds that would trigger ASL-4 requirements (more stringent security and deployment controls).
For teams that need to justify AI adoption to security and compliance stakeholders, these safety assessments provide concrete talking points. The 99.29% harmless rate, the low over-refusal rate, the improved alignment metrics, and the formal ASL classification give you something to point to beyond "trust us, it's safe." You'll still need your security team to review the specifics for your use case (and they should), but the baseline safety profile is measurably better than in previous models.
The areas where gaps remain: the model can't reliably conduct sophisticated offensive cyber operations without human expertise, it can't autonomously execute research-level AI development tasks end-to-end, and it still occasionally makes mistakes on complex reasoning, even with extended thinking enabled. These limitations are good to know for setting realistic expectations about what the model can and can't do without human oversight.
For more information on the research behind these numbers, read Anthropic's Claude Sonnet 4.5 system card.
Yes. It's objectively better than Sonnet 4, based on the multiple benchmarks published by Anthropic. It's comparable to (and in some cases better than) Opus 4.1, at a fifth of the price. Moreover, in our internal tests at Caylent, we've found that Sonnet 4.5 has a higher tendency to respond “I don't know” instead of hallucinating an answer, and it's better at respecting instructions about the output, such as the required format.
For most teams building agents, coding assistants, or complex reasoning workflows, the answer is yes. The benchmark improvements are real, the safety characteristics are measurably better, and the economics work out favorably compared to previous models. It's hard to argue against a model that's slightly smarter and 5x cheaper than Opus 4.1, the previous state of the art.
The decision becomes more nuanced when running high-volume production workloads. Sonnet 4.5 is 3.75x more expensive than Haiku. It is significantly smarter, but you don't always need the smartest model for every task. Before migrating, run the numbers on your specific traffic patterns and task complexity, and conduct your own evaluations to assess which model is suitable.
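Running those numbers can be as simple as multiplying your monthly token volumes by each model's base rates. A quick sketch, using the prices quoted above and a made-up traffic profile:

```python
# Back-of-the-envelope monthly cost comparison. Rates are the base prices quoted
# above ($ per million tokens); the traffic profile is a made-up example.
RATES = {
    "claude-3-5-haiku": {"input": 0.80, "output": 4.00},
    "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
    "claude-opus-4-1": {"input": 15.00, "output": 75.00},
}

monthly_input_tokens = 500_000_000   # hypothetical: 500M input tokens per month
monthly_output_tokens = 50_000_000   # hypothetical: 50M output tokens per month

for model, rate in RATES.items():
    cost = (monthly_input_tokens * rate["input"]
            + monthly_output_tokens * rate["output"]) / 1_000_000
    print(f"{model}: ${cost:,.0f}/month")
```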
For teams on Sonnet 4, the answer is a straightforward yes. Sonnet 4 remains a strong model, and some may prefer its more concise outputs. However, 4.5 outperforms it across every benchmark and functions as a drop-in replacement. Upgrading should be your default move unless you have a good reason to remain on the previous version.
For teams using Opus 4.1, Sonnet 4.5 isn't significantly smarter, but it presents a very compelling cost reduction opportunity (it's 5x cheaper) with no real loss of capabilities. The main reason to stay on Opus is if you've validated that its particular strengths for specialized creative or analytical tasks are relevant to your use case, or if your prompts are optimized in a very specific way for Opus 4.1.
Caylent helps organizations design, implement, and scale generative AI solutions—leveraging our deep expertise in data, machine learning, and AWS technologies to turn cutting-edge models like Claude Sonnet 4.5 into real business impact.
Guille Ojeda is a Software Architect at Caylent and a content creator. He has published 2 books, over 100 blogs, and writes a free newsletter called Simple AWS, with over 45,000 subscribers. Guille has been a developer, tech lead, cloud engineer, cloud architect, and AWS Authorized Instructor and has worked with startups, SMBs, and big corporations. Now, Guille is focused on sharing that experience with others.