Caylent Accelerate™

Evolving Multi-Agentic Systems

Generative AI & LLMOps

Explore how organizations can evolve their agentic AI architectures from complex multi-agent systems to streamlined, production-ready designs that deliver greater performance, reliability, and efficiency at scale.

As organizations shift from AI experiments to production, many discover that multi-agent orchestration, while ideal on paper, becomes a performance and cost challenge in the real world. What looks like a flexible, modular design in a proof-of-concept can quickly unravel when scaled to enterprise workloads.

Over the past year, our engineering team at Caylent has been deep in the trenches with agentic systems, not as POCs, but as real, production-scale systems. Early on, we embraced the prevailing wisdom: build lots of specialized agents, wire them together, and enjoy modularity and rapid iteration. 

The reality proved more nuanced. The line between “possible” and “practical” sharpened as we moved prototypes into the hands of customers with real-world throughput and reliability demands. Our approach had to evolve as we encountered significant issues with latency, operational overhead, and transparency, forcing us to rethink which patterns truly scale.

That evolution was shaped by hands-on work with customers. One of the most impactful examples is CloudZero, which provides a FinOps solution that enables enterprises to understand and optimize cloud costs in real time by connecting engineering decisions directly to business outcomes. CloudZero’s AI-driven FinOps platform uses autonomous agents to automate cost allocation, analyze unit economics, and deliver transparent insights across pricing, custom dimensions, and benchmarking data. This enables teams to design, operate, and continuously optimize cost-efficient architectures.

The Challenge: Orchestrating Multiple Agents in Production

As more companies onboarded to its platform, CloudZero faced growing challenges scaling its AI-driven cloud cost optimization and infrastructure analysis capabilities. Operations were increasingly slowed by complexity and by the high developer cognitive load required to manage and tune custom agentic AI workflows processing petabytes of cost and usage data.

Our initial implementation leveraged Amazon Bedrock Agents along with a bespoke multi-agent orchestration layer, which was AWS’s recommended best practice at the time. That the recommendation has since shifted to the Strands API and Amazon Bedrock AgentCore shows just how quickly agentic development is evolving. The architecture was intentionally modular: each agent was narrowly focused, responsible for a tight domain or a discrete workflow. In theory, this approach promised composability, reusability, and clear debugging boundaries. Amazon Bedrock Agents could hand off context, specialize in their own logic, and keep domain responsibilities cleanly separated.

This approach was considered best practice when we began. However, as the system moved closer to production, hidden costs surfaced and began to erode these gains:

  • Cumulative Latency: Each agent handoff involved new prompt construction, context serialization, and model invocation. When flows required five, ten, or more agent hops, end-user response times crept into the tens of seconds (the sketch after this list shows the pattern). Latency was no longer an implementation detail; it became an operational bottleneck.
  • Token Inflation: As agent orchestration chained together multiple prompts, total token counts soared, driving up operational costs and pressing against throughput limits. Token inflation became both a budget and a scalability concern.
  • Caching Limitations: Unlike direct model API calls, Amazon Bedrock Agents could not benefit from Amazon Bedrock’s prompt caching. Each request, even if identical to a previous invocation, incurred full model evaluation cost and time.
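
To make the cost of each hop concrete, here is a minimal sketch of what a chained flow looked like at the orchestration layer, using boto3's bedrock-agent-runtime client. The agent IDs, alias IDs, and the three-hop flow are hypothetical stand-ins, not CloudZero's production configuration; the point is that every hop pays for a full agent invocation and re-serializes a growing context.

```python
import time
import boto3

# Hypothetical agent IDs; each hop is a separate Amazon Bedrock Agent.
AGENT_HOPS = [
    ("COSTALLOC1", "TSTALIASID"),   # cost-allocation agent
    ("UNITECON99", "TSTALIASID"),   # unit-economics agent
    ("BENCHMARK7", "TSTALIASID"),   # benchmarking agent
]

client = boto3.client("bedrock-agent-runtime")

def run_chained_flow(question: str, session_id: str) -> str:
    """Pass the (growing) context through each agent in turn.

    Every hop re-serializes context into a new prompt and pays for a
    full model invocation; this is where cumulative latency and token
    inflation came from.
    """
    context = question
    for agent_id, alias_id in AGENT_HOPS:
        start = time.perf_counter()
        response = client.invoke_agent(
            agentId=agent_id,
            agentAliasId=alias_id,
            sessionId=session_id,
            inputText=context,
        )
        # invoke_agent streams chunks; collect them into one answer.
        answer = b"".join(
            event["chunk"]["bytes"]
            for event in response["completion"]
            if "chunk" in event
        ).decode("utf-8")
        print(f"{agent_id}: {time.perf_counter() - start:.1f}s")
        context = f"{context}\n\nPrevious agent output:\n{answer}"
    return context
```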

In practice, the stack turned out to be brittle and expensive. The logic held up, but the infrastructure couldn’t reliably deliver strong results.

Amazon Bedrock Agents introduce another subtle friction: they do not provide interim feedback when processing user prompts. In most agent frameworks, each key step, especially tool invocations, is accompanied by a status message, allowing users to witness real-time progress and context as the LLM orchestrates its workflow. The absence of these progressive indicators in Amazon Bedrock Agents leaves users facing a silent, often ambiguous wait, undermining transparency and eroding confidence in the system’s responsiveness. Well-designed agentic systems should cultivate trust not only through correctness but also through visible movement. Knowing what the agent is doing, and when, is increasingly recognized as a best practice for user experience in generative AI applications.
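
To illustrate the kind of interim feedback we mean, here is a small, framework-agnostic sketch; the emit_status hook and the tool are hypothetical, but the pattern of surfacing a status event around every tool invocation is what gives users visible movement.

```python
from functools import wraps
from typing import Callable

def emit_status(message: str) -> None:
    """Hypothetical hook: push a progress event to the UI,
    for example over a WebSocket or server-sent events."""
    print(f"[status] {message}")

def with_progress(tool_fn: Callable) -> Callable:
    """Wrap a tool so users can see what the agent is doing, and when."""
    @wraps(tool_fn)
    def wrapper(*args, **kwargs):
        emit_status(f"Calling {tool_fn.__name__}...")
        result = tool_fn(*args, **kwargs)
        emit_status(f"Finished {tool_fn.__name__}")
        return result
    return wrapper

@with_progress
def fetch_cost_report(account_id: str) -> dict:
    """Hypothetical tool; the real system queries cost and usage data."""
    return {"account": account_id, "spend_usd": 1234.56}

fetch_cost_report("123456789012")
```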

Best Practices: From Many Agents to Robust Singleton Agents

These growing pains taught us an important lesson: strategies that work well in prototypes can collapse under production demands. The challenge forced us to confront an emerging best practice in agentic system design: narrow agents are appealing during prototyping, but production environments demand robustness and efficiency.

Early Days: Proliferation of Specialized Agents

In the initial phase, the design mindset was to “compose” workflows by chaining together many specialized agents. Each agent had a focused responsibility, and orchestration layers grew increasingly complex as they knit these components into seamless business flows. Modularity drove velocity, but at the cost of growing latency and operational sprawl.

At first, the modular approach looked like a win, until we saw how it performed under real-world load. What seemed promising in theory quickly revealed its limitations in production, and that realization forced a rethink: could we achieve the same modularity without the performance tax?

Production Reality: Consolidation Into Robust Singleton Agents

The breakthrough came when we stopped thinking in terms of ‘more agents’ and started thinking about ‘smarter agents.’ After extensive testing, it became clear that one well-built agent did the job better than a fleet of specialized ones. It could handle bigger workflows, needed less context swapping, used fewer tokens, and was much easier to keep track of. Furthermore, development and monitoring footprints decreased, making root cause analysis and performance tuning dramatically simpler.

Evolving from an “agent zoo” to robust agent design is now a defining marker of maturity in agentic architectures.

Pivoting to Strands API + Amazon Bedrock AgentCore

That architectural rethink set the stage for a deeper transformation, not just in how agents were designed, but in the underlying infrastructure that supported them.

Faced with mounting performance and cost concerns, we re-architected the solution using two AWS-aligned technologies:

  • Strands API: An open-source SDK for multi-tool, multi-agent systems. Strands enables developers to compose complex toolchains natively in code, unlocking granular control over orchestration, error handling, and tool selection. Unlike agent middleware, Strands is engineered for production agility.
  • Amazon Bedrock AgentCore: AWS’s latest managed runtime for agentic infrastructure. Amazon Bedrock AgentCore abstracts session management, observability, identity, and storage — transforming undifferentiated engineering toil into reliable, scalable infrastructure.
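
To show how the two pieces fit together, here is a minimal deployment sketch, assuming the bedrock-agentcore Python SDK's entrypoint pattern; the handler body and payload keys are illustrative rather than a reference implementation.

```python
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from strands import Agent

app = BedrockAgentCoreApp()
agent = Agent(system_prompt="You are a FinOps analyst for cloud cost data.")

@app.entrypoint
def invoke(payload: dict) -> str:
    """AgentCore calls this handler for each request; the managed runtime
    takes care of session management, identity, and observability."""
    prompt = payload.get("prompt", "")  # payload key is illustrative
    return str(agent(prompt))

if __name__ == "__main__":
    # Local run for development; in production the same code is deployed
    # to the AgentCore-managed runtime.
    app.run()
```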

This technology stack not only addressed the immediate pain points but also reshaped our foundational approach to agent design.

Technical Advantages Delivered

The combination of Strands and Amazon Bedrock AgentCore represented a significant upgrade because it directly addressed some of the most pressing issues we faced in production. One of the biggest wins came from eliminating token bloat. By orchestrating flows directly in code with Strands, we no longer had to shuttle context back and forth between multiple agents. This collapsed what was once a chain of prompts into streamlined, deterministic flows. The result was a sharp reduction in token usage, which not only cut costs but also improved throughput.
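
As a rough illustration of what “orchestrating flows directly in code” looks like, here is a minimal sketch using the open-source Strands SDK. The tools are hypothetical stand-ins for FinOps capabilities, not CloudZero's production implementation; the point is that one agent carries the whole toolbox in a single context window, so there are no inter-agent handoffs to re-serialize.

```python
from strands import Agent, tool

@tool
def allocate_costs(account_id: str) -> dict:
    """Allocate raw spend to teams and products (hypothetical stub)."""
    return {"account": account_id, "allocated": True}

@tool
def unit_economics(product: str) -> dict:
    """Compute cost per customer for a product (hypothetical stub)."""
    return {"product": product, "cost_per_customer": 0.42}

@tool
def benchmark(product: str) -> dict:
    """Compare unit costs against benchmarking data (hypothetical stub)."""
    return {"product": product, "percentile": 63}

# One robust agent, one system prompt, one shared context window:
# the model plans across the whole toolbox with no inter-agent handoffs
# and no re-serialized context between hops.
agent = Agent(
    system_prompt="You are a FinOps analyst. Use the tools to answer.",
    tools=[allocate_costs, unit_economics, benchmark],
)

result = agent("What does Product X cost per customer, and how does it compare?")
print(result)
```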

Another breakthrough came with prompt caching. Because Strands makes direct model calls into Amazon Bedrock, requests can take advantage of Amazon Bedrock’s prompt caching, so repeated computations are served in milliseconds instead of seconds. That shift dramatically reduced latency for recurring queries and created a more seamless user experience.
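
For readers who want to see what a direct, cache-friendly model call looks like, here is a minimal sketch using the Amazon Bedrock Converse API with a cache point after the large, stable part of the prompt. The model ID and prompt contents are illustrative; the key idea is that everything before the cachePoint marker can be reused across requests.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# The large, rarely-changing part of the prompt (schemas, FinOps rules,
# formatting instructions) is what is worth caching; only the short user
# question changes between requests.
STABLE_SYSTEM_PROMPT = "You are a FinOps analyst. <long, stable instructions here>"

def ask(question: str) -> str:
    response = bedrock.converse(
        modelId="us.anthropic.claude-sonnet-4-20250514-v1:0",  # illustrative model ID
        system=[
            {"text": STABLE_SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}},  # cache everything above this marker
        ],
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```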

We also saw a step change in observability. AgentCore’s native integration with Amazon CloudWatch gave us the ability to trace requests end-to-end, collect meaningful metrics, and visualize errors at every layer of the stack. Debugging and monitoring agents in production transformed from being an ad‑hoc process into a repeatable, data-driven science.
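
As a hedged illustration of the kind of end-to-end tracing this enables at the application layer, here is a sketch using the standard OpenTelemetry Python SDK with an OTLP exporter; the endpoint, span names, and run_agent entry point are assumptions, with the exporter pointed at a collector that forwards traces to Amazon CloudWatch.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Assumed setup: a local OpenTelemetry collector (e.g. the AWS Distro for
# OpenTelemetry) forwards these traces to Amazon CloudWatch.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("finops-agent")

def run_agent(question: str) -> str:
    """Hypothetical agent entry point; stubbed here for illustration."""
    return f"(answer to: {question})"

def answer_question(question: str) -> str:
    # One span per user request; tool calls become child spans, so a slow
    # step shows up immediately in the trace view.
    with tracer.start_as_current_span("agent.invoke") as span:
        span.set_attribute("question.length", len(question))
        return run_agent(question)
```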

Reducing orchestration complexity also improved reliability. With fewer execution hops, deterministic flows, and consolidated state management, the system became notably more stable. The agents returned more accurate results, error rates declined, and overall confidence in the system rose.

The impact of these changes was immediately visible to the teams working with the platform. A UX team member said, "Now that the system is so responsive, we’ve been able to show real-time updates in the UI while the agents process complex cloud services data inquiries."

A server-side engineer said, “We got a huge improvement on the TTFT (Time To First Token). Predictable 2-4 seconds with Strands, when not hitting a cold start. With Amazon Bedrock Agents, it was unpredictable because the first token is only returned after all steps are made, and the LLM is spitting out the final answer. 2-4 seconds is the average time per step, but when using Bedrock Agents, it was around 6-10 seconds. One thing worth mentioning is that with Amazon Bedrock Agents, every tool call is also a Lambda call, which can hit cold starts. With Strands, the tool calling is just a Python function call.”
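
To make the TTFT point concrete, here is a hedged sketch of measuring time to first token with a streamed Strands response; the event keys follow our reading of the Strands streaming interface, and the prompt is illustrative.

```python
import asyncio
import time

from strands import Agent

agent = Agent(system_prompt="You are a FinOps analyst.")

async def measure_ttft(prompt: str) -> None:
    """Stream the agent's response and report Time To First Token."""
    start = time.perf_counter()
    first_token_at = None
    async for event in agent.stream_async(prompt):
        if "data" in event:  # a streamed text chunk
            if first_token_at is None:
                first_token_at = time.perf_counter()
                print(f"\nTTFT: {first_token_at - start:.2f}s")
            print(event["data"], end="", flush=True)

asyncio.run(measure_ttft("Which services drove last week's cost spike?"))
```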

Together, these advantages made Strands plus Amazon Bedrock AgentCore more than just a performance upgrade. They fundamentally reshaped how we designed, operated, and trusted agentic systems at scale.

Customer Experience and Operational Efficiency

These architectural choices translated directly into meaningful outcomes for both end users and the business:

  • Near-Instant Responses: Users interacted with the system in real time.
  • Measurably Lower Costs: Token bloat was reduced, and prompt caching saved compute cycles and money.
  • Production Monitoring: Engineering teams could diagnose performance issues, monitor workflows, and optimize resource allocation with concrete metrics and trace maps.
  • Business Agility: With fewer moving parts and more robust workflows, the customer’s internal teams could adapt business rules without fearing operational regressions.

The evolution from a proliferation of narrowly focused agents to robust, singleton agents reflects a larger industry trend. As agentic systems move mainstream, operational complexity threatens to undermine their promise. True innovation at scale comes not from building ever more complex orchestration layers, but from industrializing and abstracting the underlying agent infrastructure.

These patterns are becoming essential for teams moving agent systems from proof of concept into production-grade, enterprise use.

Conclusion: Architecting For What Matters Most

The takeaway is clear: success in production isn’t about piling on complexity, it’s about building robust, maintainable architecture. The greatest gains weren’t achieved by experimenting with complex agent frameworks, but by streamlining workflows, enhancing visibility, and applying the right automation primitives. 

The industry has overhyped agent swarms; in production, they rarely survive customer expectations. A focus on engineering for reliability, speed, and clarity lets us move at production speed without sacrificing control. The future of generative AI in the enterprise won’t be unlocked by ever-more elaborate multi-agent chains, but by building fewer, stronger agents, supported by infrastructure that makes complexity manageable rather than multiplying it.

That shift unlocks performance, reliability, and cost efficiency — and turns promising prototypes into production-grade systems.

How Caylent Can Help 

At Caylent, we help organizations move from experimentation to production with confidence. Our team has deep, hands-on experience designing and scaling agentic AI systems, leveraging our expertise in machine learning and generative AI to build solutions that perform reliably in real-world environments. By harnessing the power of AWS technologies, our dedicated Generative AI practice empowers enterprises to accelerate innovation through offerings like our Generative AI Strategy and Generative AI Knowledge Base. Whether you’re building your first agentic workflow or evolving a complex multi-agent architecture, Caylent helps you turn intelligent automation into lasting business impact.

Generative AI & LLMOps
Brian Tarbox

Brian is an AWS Community Hero, Alexa Champion, runs the Boston AWS User Group, has ten US patents and a bunch of certifications. He's also part of the New Voices mentorship program, where Heroes teach traditionally underrepresented engineers how to give presentations. He is a private pilot, a rescue scuba diver, and earned his Master's in Cognitive Psychology working with bottlenose dolphins.
