
From Prompt Edits to Performance Loops: Hands-On with Amazon Bedrock AgentCore Optimization

Generative AI & LLMOps

Amazon Bedrock AgentCore now gives teams a native way to generate, validate, and test changes to agent behavior using traces, evaluations, configuration versions, and gateway-based A/B experiments. Caylent evaluated the feature through private-beta access. This article presents the results of those evaluations and what they mean for teams building on Bedrock.

As we build and maintain AI agents, the bottleneck for high quality is how well we understand the impact of the changes we make, and how easily we can iterate on prompts and configurations based on that understanding.

A prompt that performs well during launch can weaken as user behavior changes, tools evolve, policies shift, and models are updated. New edge cases appear in production. Tool descriptions that looked clear in development may become ambiguous under real traffic. This results in teams inspecting traces, adjusting prompts, rerunning a few examples, and redeploying. However, they often do this with limited evidence that the change actually improves the agent across the workflows that matter. They're throwing stuff at the wall, and often even lack the ability to see what sticks.

Amazon Bedrock AgentCore addresses that operational gap with new agent quality optimization capabilities. Currently offered in public preview, these capabilities support continuous, data-driven improvement of agent performance through configuration changes. The workflow uses agent traces to generate improvements and validates them with controlled experiments, and it is built around three core capabilities: Recommendations, Configuration Bundles, and A/B testing.

Caylent evaluated AgentCore optimization capabilities through private-beta access before public preview, with very good results. These are useful additions to the AgentCore operating model, especially for teams already building production agents on AWS. The caveat is that these capabilities should not be treated as an automatic prompt optimizer, nor do they replace evaluation engineering. Their value is that they make agent improvement more structured, testable, and operational, while automatically plugging in observability data so you can evaluate improvements.

Executive Takeaways

The agent quality optimization capabilities in AgentCore help teams move from intuition-led prompt edits to an evidence-backed improvement loop: observe behavior, evaluate performance, generate candidate changes, validate offline, test under live traffic, and promote only when the data supports it.

In Caylent’s private-beta evaluation, calibrated scoring showed meaningful improvement in two workloads, and a smaller, more mixed improvement in a third workload created by Caylent for the evaluation. Evaluators were calibrated against workload-specific gold cases, which is what we recommend before teams use them as promotion gates.

The main advantage we have observed with AgentCore is that it enforces a structured optimization loop, a practice Caylent has recommended to our customers since 2022. That loop needs real data, and this is where these new AgentCore capabilities shine: for agents running on the AgentCore stack (runtime, Amazon CloudWatch observability, and evaluations), that data is plugged into the loop with next-to-zero effort and is immediately usable.

What These Agent Optimization Capabilities Add

AgentCore connects several parts of AWS's agentic AI ecosystem into a repeatable improvement loop.

The first capability is Recommendations. Recommendations use AI to generate optimized agent configurations from real session traces. Teams point the service at agent traces, specify a target evaluator as the reward signal, and receive an optimized configuration. There are two types of recommendations: system-prompt recommendations, which are meant to improve the system prompt used in the agent's calls to LLMs, and tool-description recommendations, for the descriptions of the tools an agent has access to. Note that recommendations are generated by LLMs and should be reviewed and tested before they are applied.
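
To make the workflow concrete, here is a minimal sketch of what requesting a recommendation could look like from Python. The client name, operation, and parameter names below are illustrative assumptions rather than the documented preview API, so check the AgentCore documentation for the actual call signatures.

```python
import boto3

# Hypothetical sketch: the operation and parameter names below are assumptions,
# not the documented AgentCore preview API.
control = boto3.client("bedrock-agentcore-control", region_name="us-east-1")

response = control.start_recommendation_job(               # hypothetical operation
    agentRuntimeId="my-agent-runtime-id",                   # hypothetical identifier
    recommendationType="SYSTEM_PROMPT",                     # or "TOOL_DESCRIPTION"
    traceSource={"cloudWatchLogGroup": "/agentcore/my-agent/traces"},
    rewardEvaluator="task-completion-evaluator",             # evaluator used as the reward signal
)

# The generated candidate should be reviewed as a diff and validated offline
# before it becomes a Configuration Bundle version.
print(response["recommendationJobId"])
```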

The second capability is Configuration Bundles. A Configuration Bundle is a versioned, immutable snapshot of dynamic agent configuration, including system prompts, model IDs, tool descriptions, and other key-value pairs that the agent reads at runtime. Bundles decouple configuration from code, enabling teams to change how the agent responds without redeploying application code, provided the runtime is designed to read configuration dynamically.
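
The pattern that bundles rely on is a runtime that resolves its configuration per request rather than at deploy time. Below is a minimal sketch of that pattern, assuming a hypothetical fetch_bundle helper in place of the actual AgentCore configuration lookup.

```python
def fetch_bundle(bundle_id: str, version: str | None = None) -> dict:
    """Hypothetical helper standing in for the AgentCore configuration lookup.
    Returns the bundle's key-value pairs, e.g. system prompt, model ID, tool descriptions."""
    raise NotImplementedError("replace with the actual AgentCore configuration call")

def handle_request(user_message: str, bundle_id: str) -> str:
    # Configuration is resolved per request, so a new bundle version (or a rollback
    # to an older one) changes behavior without redeploying the runtime code.
    config = fetch_bundle(bundle_id)
    system_prompt = config["system_prompt"]
    model_id = config["model_id"]
    # ... invoke model_id with system_prompt and the agent's tools here ...
    return f"(would call {model_id} with the bundled system prompt)"
```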

The third capability is A/B testing. AgentCore A/B testing splits live production traffic between two variants through the AgentCore gateway. Assignment is sticky by runtime session ID, online evaluation scores each session, and the service reports per-evaluator metrics such as mean score, absolute and percent change, p-value, confidence interval, and significance flag. Variants can be different Configuration Bundle versions within the same runtime, or different Gateway targets that point to different runtime endpoints.
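
The gateway performs the traffic split itself, so this is not code you write, but a small sketch helps clarify what "sticky by runtime session ID" means: the variant is a deterministic function of the session, so a conversation never switches arms mid-stream.

```python
import hashlib

def assign_variant(session_id: str, treatment_fraction: float = 0.5) -> str:
    """Conceptual illustration of sticky assignment; AgentCore's gateway handles this for you."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF          # deterministic value in [0, 1]
    return "treatment" if bucket < treatment_fraction else "control"

# The same session always maps to the same variant.
assert assign_variant("session-123") == assign_variant("session-123")
```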

Together, those capabilities enable optimization within an operating loop rather than through individual edits.

From Manual Edits to Evidence-Backed Changes

Most teams already have some version of an agent improvement process. They inspect failures, adjust prompts or tool descriptions, run a test set, and deploy a new version. The limitation is often operational discipline.

Common failure modes include promoting a prompt because it fixed a few memorable failures, improving one workflow while silently regressing another, optimizing to a generic score that does not reflect business quality, and changing behavior through code deployments when configuration would be cleaner.

AgentCore addresses these gaps by making the path to improvement explicit: 

  • Recommendations generate candidate changes from observed traces.
  • Configuration Bundles make prompt, tool-description, model, and runtime configuration changes versionable. 
  • Offline evaluation can test candidates before live exposure. 
  • Gateway A/B testing can compare variants under production traffic patterns.

AgentCore helps teams generate and validate candidates inside the AWS-native environment where the agent runs.

The Importance of Configuration Bundles

In many agent systems, prompts and tool descriptions live too close to application code. Changing a system prompt may require a new deployment. Testing two prompt variants may require separate runtime builds or custom routing. Rolling back behavior may require reverting code instead of reverting the configuration.

Configuration Bundles create a cleaner separation. A bundle version can hold the dynamic configuration that controls agent behavior. The deployed runtime can stay the same while different requests receive different configuration versions.

This pattern matters for three reasons. First, it makes experiments cleaner. When the change is configuration-only, a control and treatment can run through the same runtime with different bundle versions.

Second, it makes rollbacks cleaner. Because bundle versions are immutable, teams can reference a previous version if a candidate underperforms.

Third, it creates configuration lineage. Bundle versions form a history of behavior-changing configuration. That history is useful for understanding what changed and when. However, this should not be confused with full AWS CloudTrail audit coverage. At the time of writing, AWS documentation states that the preview APIs of these agent optimization capabilities in AgentCore do not support AWS CloudTrail and should not be used for workloads requiring an AWS CloudTrail audit trail until support is added.

Configuration Bundles are optional. You can validate changes by packaging them as a bundle version or by deploying a separate runtime endpoint. Using different endpoints as targets is appropriate when the change includes code changes, a framework upgrade, or an entirely different agent implementation. For changes that can be managed through configuration, we recommend decoupling them from the code and moving them into Configuration Bundles.

Prerequisites for the Agent Optimization Loop

Recommendations and A/B testing have the same agent requirements as AgentCore evaluations: an agent deployed on AgentCore runtime with observability enabled, or an agent built with a supported framework configured with AgentCore’s end-to-end traceability. The documented supported frameworks are Strands Agents and LangGraph with OpenTelemetry or OpenInference instrumentation. The prerequisites also include CloudWatch Transaction Search, telemetry in CloudWatch Logs, current SDK or CLI versions, and IAM permissions for the optimization features.

How Caylent Evaluated AgentCore Optimization Capabilities

Caylent evaluated these new optimization capabilities via private beta access before the public preview. Our goal was to answer a practical implementation question: Can this workflow generate useful candidate improvements, and what evaluation discipline is required before teams should promote those changes?

We tested across three generated evaluation workloads:

  1. A retail-support sample workload
  2. An independent, Caylent-generated retail-support holdout set
  3. A Caylent-created passenger-rail support workload

Each generated case defined expected facts, expected actions, prohibited claims, and prohibited actions. We scored paired baseline/control and treatment responses across three response replicates per condition. The mean score weighted fact coverage, action correctness, and prohibited-avoidance behavior. Confidence intervals were computed using paired bootstrap over case-level means.
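
For readers who want to reproduce the statistical approach, the sketch below shows a paired bootstrap over case-level score differences, matching the procedure described above. The number of resamples and the confidence level are illustrative choices, not the exact values used in our evaluation.

```python
import numpy as np

def paired_bootstrap_ci(control: np.ndarray, treatment: np.ndarray,
                        n_resamples: int = 10_000, alpha: float = 0.05,
                        seed: int = 0) -> tuple[float, float]:
    """control and treatment hold case-level mean scores, aligned by case (paired)."""
    rng = np.random.default_rng(seed)
    diffs = treatment - control                      # per-case lift
    resampled_means = rng.choice(diffs, size=(n_resamples, len(diffs)),
                                 replace=True).mean(axis=1)
    lower, upper = np.percentile(resampled_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)

# If the interval excludes zero, the lift is treated as conclusive; if it crosses
# zero (as with the passenger-rail pass rate), it is inconclusive.
```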

These datasets were generated for the evaluation. They are not public benchmarks, and the results should not be read as universal claims about every agent or every domain.

The Results

Across all three workloads, the treatment improved the mean score. The retail-support sample showed the strongest result: the mean score improved from 0.856 to 0.962, and the primary pass rate improved from 71.4% to 92.2%. The independent retail-support holdout confirmed the same direction: mean score improved from 0.849 to 0.926, and primary pass rate improved from 64.6% to 78.8%.

The passenger-rail workload also showed a positive mean-score movement, from 0.715 to 0.759, but the result was more mixed. Primary pass-rate lift was inconclusive because the confidence interval crossed zero, and case-level results included both improvements and regressions.

The mean-score lift was strongest in the retail-support sample, remained positive in the retail-support holdout, and was smaller in the passenger-rail workload.

Conclusions

These numbers indicate that AgentCore generated useful candidates across all tested workloads. They also show that not every recommendation should be promoted, and that not every agent and domain will improve uniformly. However, AgentCore demonstrated both the ability to generate improvements and the ability to measure them, which makes it possible to reject suggested improvements that do not hold up.

These agent quality tools in AgentCore should be adopted with a measurement plan, not treated as an automatic promotion engine. Our evaluation supports several practical conclusions:

  • AgentCore can generate useful prompt and tool-description candidates from traces.
  • Configuration Bundles can make behavior changes easier to version, test, and roll back when the runtime reads configuration dynamically.
  • Gateway A/B testing provides an AgentCore-native path for controlled live-traffic experiments.
  • Workload-specific scoring showed a strong lift in the retail-support sample and confirmed directionally positive lift in the retail-support holdout.
  • The passenger-rail workload showed that the workflow can be applied outside the sample domain, but the results were smaller and more mixed.

AgentCore doesn't remove the need for human judgment. Its value lies in giving teams a better operating loop for applying judgment.

Recommended Use Cases for AgentCore Optimization Capabilities

If a team has not yet defined what quality means for its agent, the gap is an evaluation limitation rather than a tooling one. If a team cannot define “good”, no optimization system can reliably select the right candidate.
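
One concrete way to define "good" is to write the target metric down as an explicit composite of evaluator scores. The dimensions and weights below are illustrative assumptions, not a recommendation for any specific workload.

```python
# Illustrative composite target metric; dimensions and weights are assumptions.
TARGET_WEIGHTS = {
    "task_completion": 0.5,
    "policy_adherence": 0.3,
    "tool_use_accuracy": 0.2,
}

def composite_score(evaluator_scores: dict[str, float]) -> float:
    """Combine per-evaluator scores (each in [0, 1]) into a single reward signal."""
    return sum(weight * evaluator_scores[name] for name, weight in TARGET_WEIGHTS.items())
```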

Best Practices for Optimizing Agent Quality with AgentCore

1. Define the target metric. The recommendation process needs a reward signal. Teams should decide whether they are optimizing for task completion, policy adherence, tool-use accuracy, helpfulness, escalation behavior, cost, latency, or a composite measure.

2. Define guardrail metrics. Some behaviors should not regress even if the target metric improves. Examples include refund policy compliance, avoidance of prohibited actions, escalation of high-risk requests, privacy-preserving behavior, and refusal of unsupported tasks.

3. Create or curate representative evaluation cases. Production traces are valuable, but teams also need targeted cases for workflows that are rare, high-risk, or business-critical.

4. Treat recommendations as candidate diffs. Review what changed. Check whether the recommendation narrowed or expanded the agent’s behavior in ways that match the intended policy.

5. Validate offline before live traffic. Batch evaluation and targeted test sets can catch obvious regressions before production exposure.

6. Use Gateway A/B testing where traffic supports it. AgentCore A/B testing can compare variants using live traffic, but online experiments still require sufficient volume, a meaningful metric, and a promotion threshold the team trusts.

7. Promote only when the target metric improves and guardrails hold. A treatment that improves the average score but regresses a critical category should be edited, retested, or rejected, as sketched below.
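
Here is a minimal sketch of that promotion gate, assuming both variants have already been scored against the same target metric and guardrail categories; the category names and thresholds are illustrative.

```python
def should_promote(baseline: dict[str, float], candidate: dict[str, float],
                   target: str = "composite",
                   guardrails: tuple[str, ...] = ("refund_policy", "prohibited_actions"),
                   min_lift: float = 0.02, max_regression: float = 0.0) -> bool:
    """Promote only if the target metric improves and no guardrail category regresses."""
    if candidate[target] - baseline[target] < min_lift:
        return False                                   # target metric did not improve enough
    for category in guardrails:
        if baseline[category] - candidate[category] > max_regression:
            return False                               # a guardrail category regressed
    return True
```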

What to Know During Public Preview

At the time of writing, these new AgentCore optimization capabilities are in public preview. AWS documentation states that features and APIs may change before general availability. The documentation also states that there is no separate charge for these capabilities themselves, although customers pay for the underlying AgentCore capabilities they use.

The AWS CloudTrail limitation is the main operational caveat to plan around. AWS documentation states that AgentCore optimization capabilities preview API calls do not appear in CloudTrail event history or in configured trails, and that teams should not use these features for workloads that require a CloudTrail audit trail until support is added.

That does not reduce the value of the feature for appropriate preview use cases, but it should influence workload selection and governance planning. Still, it's worth testing these new capabilities during public preview, considering that support for CloudTrail is expected to be added.

Conclusion

These new optimization capabilities in AgentCore are a meaningful step toward operationalizing improvements to production agents on AWS. Their most useful contribution is the improvement loop: traces inform recommendations, recommendations become versioned candidates, candidates can be validated offline, and live traffic can be split through the AgentCore gateway before promotion.

Caylent’s hands-on evaluation found that this loop can produce useful improvements, especially when paired with workload-specific scoring and category-level guardrails. Two of our test workloads showed strong results, and another workload showed a smaller and more mixed improvement, which is exactly the kind of result teams should expect from real evaluation work: useful signal, not blanket certainty.

For organizations building agents on AWS, the new agent optimization capabilities in AgentCore provide a practical path away from intuition-led prompt edits and toward evidence-backed behavior changes. The teams that benefit most will be the ones that bring a mature evaluation strategy to the feature: clear target metrics, calibrated cases, guardrail categories, offline validation, and disciplined promotion criteria.

Agent optimization using AgentCore should not be treated as an automatic quality engine. It should be treated as an AWS-native operating loop to improve agents using evidence and human judgment.

As an AWS Premier Tier Services Partner and AWS Partner of the Year 2025 in the Generative AI category, Caylent helps organizations design, evaluate, and operationalize production-grade agentic systems on AWS, including the evaluation strategy and platform patterns required to use AgentCore responsibly. Reach out to us today to get started. 

Guille Ojeda

Guille Ojeda is a Senior Innovation Architect at Caylent, a speaker, author, and content creator. He has published 2 books, over 200 blog articles, and writes a free newsletter called Simple AWS with more than 45,000 subscribers. He's spoken at multiple AWS Summits and other events, and was recognized as AWS Builder of the Year in 2025.

