Discover what we’ve learned over two years of building agentic systems, from automating 30,000 facilities to streamlining enterprise cloud costs.
The industry conversation around agentic AI has reached a fever pitch, with new frameworks and orchestration patterns emerging weekly. Yet at Caylent, we've been building agent-based systems since 2023, well before the current wave of excitement. Through dozens of production deployments on Amazon Bedrock, we've learned a counterintuitive truth: the teams that succeed focus more on prompt engineering than orchestration complexity.
During a recent AWS Twitch stream, our CTO, Randall Hunt, and Senior Innovation Architect, Guille Ojeda, shared insights from two years of building agentic systems. As Randall put it, "I kind of get frustrated when people jump on this term agents because the way we think about it is the same thing we've been doing the whole time — making this stuff work in the business world. It's tool use."
Here's what we've learned from deploying agents that handle everything from building automation across 30,000 facilities to enterprise cloud cost optimization.
Before diving into architecture patterns, let's address the economic reality that shapes every production agent deployment — ROI depends not just on how well your agents work, but also on how many tokens you need to pay for if you want them to work well.
If you've been building cloud infrastructure for a while, your first instinct will be to attack this on the infrastructure side, and you'll quickly land on Amazon Bedrock Provisioned Throughput. The question then becomes: is Provisioned Throughput cheaper than on-demand inference?
Through extensive testing with Amazon Bedrock, we've identified a critical threshold that determines your optimal pricing strategy: 2 million tokens per minute, blended across input and output. Below this threshold, on-demand pricing with aggressive optimization delivers superior economics. Above it, Provisioned Throughput becomes cheaper. But here's what most teams miss while they're focused on Provisioned Throughput: you can dramatically reduce token consumption before ever considering provisioned capacity.
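To make that concrete, here's a rough sketch of the check we run against production traffic. The traffic numbers and function name are illustrative; the 2 million figure is the blended threshold described above.

def blended_tokens_per_minute(input_tokens: int, output_tokens: int, window_minutes: float) -> float:
    # Blend input and output tokens into a single per-minute rate.
    return (input_tokens + output_tokens) / window_minutes

# Hypothetical sustained traffic measured over an hour of production load.
rate = blended_tokens_per_minute(input_tokens=72_000_000, output_tokens=24_000_000, window_minutes=60)

THRESHOLD_TPM = 2_000_000   # the ~2M tokens/minute crossover described above

if rate < THRESHOLD_TPM:
    print(f"{rate:,.0f} tokens/min: stay on-demand and optimize prompts, caching, and batching first")
else:
    print(f"{rate:,.0f} tokens/min: price out Provisioned Throughput against on-demand")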
Amazon Bedrock's batch inference provides an immediate 50% cost reduction for asynchronous workloads. No code changes, no prompt modifications, just a different API call and foregoing synchronous responses. Additionally, Prompt Caching, available for prompts over 1,024 tokens, can reduce costs by up to 90% for cached portions while simultaneously improving latency.
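For asynchronous workloads, switching to batch really is just a different API call. Here's a hedged sketch of what that submission looks like with boto3; the bucket paths, role ARN, job name, and model ID are placeholders, and the input file is JSONL with one model request per line.

import boto3

bedrock = boto3.client("bedrock")

bedrock.create_model_invocation_job(
    jobName="nightly-telemetry-summaries",                      # placeholder job name
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",        # illustrative model ID
    roleArn="arn:aws:iam::123456789012:role/bedrock-batch-inference",
    inputDataConfig={
        "s3InputDataConfig": {"s3Uri": "s3://my-bedrock-batch/input/requests.jsonl"}
    },
    outputDataConfig={
        "s3OutputDataConfig": {"s3Uri": "s3://my-bedrock-batch/output/"}
    },
)

The job writes results back to S3 when it finishes, which is exactly the trade: you give up synchronous responses and keep everything else the same.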
Consider our work with BrainBox AI, managing HVAC systems across 30,000 buildings. Their initial agent design consumed 4,000 tokens per request when analyzing building telemetry. Through systematic prompt optimization and strategic caching, we reduced this to 1,200 tokens while actually improving response quality. That's a 70% reduction in cost before applying batch processing or any other optimizations.
The hidden costs matter too. Complex orchestration means more debugging time, longer iteration cycles, and higher operational overhead. A simple, well-optimized agent that ships in six weeks delivers more value than a complex multi-agent system still in development after six months.
"We perhaps overindex on the orchestration layer and don't index as high as we should on the actual prompts," Randall noted during our discussion. This observation, born from painful experience, accurately describes how we at Caylent currently approach agent development.
Take BrainBox AI's challenge. Their data scientists were spending 40% of their time manually writing queries to analyze HVAC performance across their massive portfolio. The intuitive response might involve building a sophisticated multi-agent system with specialized agents for different building types, geographic regions, or system components.
Instead, we built ARIA (their AI-powered virtual building assistant), starting with the simplest possible pattern that could work. As Randall explained, "Previously, their data scientists had to go in and manually write different pieces of code to query systems. And what we built was a system that could basically spin up a code execution environment to query their various downstream systems, whether that be through Athena or through some other federated query with an OLTP system."
This simple pattern handled 90% of use cases immediately. Only after validating this approach in production did we layer in specialized handling for edge cases. The lesson is clear — start with the simplest viable architecture and let complexity emerge from actual requirements, not anticipated ones.
LangGraph has become our orchestration framework of choice precisely because it doesn't impose unnecessary structure. It provides just enough scaffolding to build reliable agents without forcing premature architectural decisions. We've used variations of this pattern for everything from video analysis to infrastructure optimization, adding complexity only when measurably necessary.
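To illustrate the shape of that simplest viable pattern, here's a minimal LangGraph sketch with a single tool. The tool body, model ID, and prompt are placeholders rather than the production ARIA implementation.

from langchain_aws import ChatBedrockConverse
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

@tool
def run_athena_query(sql: str) -> str:
    """Run a read-only SQL query against the telemetry data lake and return rows as text."""
    # Placeholder: in production this would call Athena (or another federated
    # query endpoint) and return a compact, token-efficient result.
    return "query accepted (stubbed result)"

model = ChatBedrockConverse(model="anthropic.claude-3-5-sonnet-20240620-v1:0")  # illustrative model ID

agent = create_react_agent(model, tools=[run_athena_query])

result = agent.invoke(
    {"messages": [{"role": "user", "content": "Which buildings reported abnormal energy use yesterday?"}]}
)
print(result["messages"][-1].content)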
"These evals that you create are the most important part of the system," Randall explained during our conversation. "This is where you define your moat."
As Randall put it, "The evals are like the brakes on your car. They're not there to slow down. They're there so you can go faster with confidence." This means that before we write a single line of orchestration code, we build comprehensive evaluation sets that let us move quickly and confidently through optimization cycles.
Our methodology follows a proven sequence. First comes what we call the "vibe check", validating that the model can even handle the task. "After you vibe check and you prove that it's possible," Randall explained, "then you iterate and you create your evals from taking the initial prompt that worked and modifying it and building it, making it robust, making it more performant."
The evaluation set becomes your North Star. Take our work with Nature Footage on generative search over video. We started with "please describe this video" and built an evaluation set of 50 video samples with human-validated descriptions. Each prompt iteration was tested against this set, measuring accuracy, consistency, and token efficiency. "The prompts and optimizing the prompts, that's where we find the biggest improvement," Guille emphasized during our conversation. And the only way to reliably measure improvements on prompts is with evaluations.
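A stripped-down version of that evaluation loop looks something like the sketch below. The sample file, scoring function, and model ID are stand-ins for the human-validated references and task-specific metrics we actually use.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def run_prompt(prompt: str) -> tuple[str, int]:
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # illustrative model ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0},
    )
    text = response["output"]["message"]["content"][0]["text"]
    return text, response["usage"]["totalTokens"]

def passes(candidate: str, reference: str) -> bool:
    # Placeholder check; substitute embedding similarity, a rubric, or an LLM judge.
    ref_terms = set(reference.lower().split())
    return len(ref_terms & set(candidate.lower().split())) / max(len(ref_terms), 1) >= 0.5

with open("eval_set.jsonl") as f:   # one {"prompt": ..., "reference": ...} per line
    samples = [json.loads(line) for line in f]

results = [run_prompt(s["prompt"]) for s in samples]
passed = sum(passes(text, s["reference"]) for (text, _), s in zip(results, samples))
total_tokens = sum(tokens for _, tokens in results)
print(f"pass rate: {passed}/{len(samples)}, total tokens: {total_tokens}")

Tracking token usage alongside the pass rate is what lets each prompt iteration be judged on both quality and cost.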
"By having that eval set within this customer," Randall noted, "we're able to very quickly test new models and see if they're going to work or not." When Claude 3.5 Sonnet was released on Amazon Bedrock, we ran it against existing evaluation sets across all our customers. Within 48 hours, we'd identified which systems would benefit from upgrading and which should stay on their current models.
Stanford's DSP (Demonstrate-Search-Predict) framework is an excellent tool to optimize against these evaluations. "It's really, really good at it," Randall emphasized. "It really works best on text prompts... It is a really powerful framework that I'd encourage everyone to go and check out." DSP uses your evaluation set to programmatically search for optimal prompts, consistently delivering 50-80% token reductions while maintaining or improving eval scores.
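As a hedged illustration, the sketch below uses DSPy, the library that grew out of Stanford's DSP work, to compile a prompt against an evaluation set. The signature, metric, eval pairs, and model string are hypothetical, and it assumes a recent DSPy version where dspy.LM wraps LiteLLM-style model identifiers.

import dspy
from dspy.teleprompt import BootstrapFewShot

lm = dspy.LM("bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0", max_tokens=1024)
dspy.configure(lm=lm)

class DescribeClip(dspy.Signature):
    """Describe the wildlife footage clip for search indexing."""
    clip_metadata = dspy.InputField()
    description = dspy.OutputField()

program = dspy.ChainOfThought(DescribeClip)

# The evaluation set doubles as the training set for the prompt search.
trainset = [
    dspy.Example(clip_metadata=meta, description=desc).with_inputs("clip_metadata")
    for meta, desc in [
        ("underwater, reef, 12s clip", "A sea turtle glides over a coral reef in clear water."),
        # ...the remaining human-validated pairs
    ]
]

def overlaps_reference(example, prediction, trace=None):
    # Placeholder metric; the production eval used richer, task-specific scoring.
    ref_terms = set(example.description.lower().split())
    return len(ref_terms & set(prediction.description.lower().split())) / max(len(ref_terms), 1) >= 0.5

optimizer = BootstrapFewShot(metric=overlaps_reference, max_bootstrapped_demos=4)
optimized_program = optimizer.compile(program, trainset=trainset)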
The key insight is that prompt optimization isn't about adding more instructions; it's about finding the minimal set of tokens that reliably passes your evaluations, what we call the Minimum Viable Tokens. We recently reduced a customer's prompt from 3,000 tokens to 890 tokens while improving their eval pass rate from 82% to 96%. That's not just cost savings; it's faster responses that lead to a better user experience.
"You come up with some metrics that you can use to define the correctness of the system, and then you just iterate," Randall summarized. This evaluation-driven approach transforms prompt engineering from an art into a science, with measurable, reproducible results.
"The orchestration layer is almost incidental," Randall observed, challenging the industry's current obsession with complex agent topologies. Multi-agent systems have their place, but that place is much smaller than current hype suggests.
CloudZero's cost optimization platform provides an instructive example of when multi-agent architecture genuinely adds value. They needed to analyze uploaded Cost and Usage Reports, compare against industry benchmarks from similar customers, and provide specific optimization recommendations. A single agent would require an unwieldy prompt and struggle with the diverse expertise needed.
We implemented their solution using the multi-agent collaboration capability of Amazon Bedrock Agents, with a critical design principle: each agent has a narrow, well-defined responsibility. The supervisor agent doesn't need to understand cost optimization deeply; it only needs to route requests effectively. The cost analyzer focuses solely on CUR data analysis. The benchmark agent compares metrics against their database of similar customers.
This separation allows for independent optimization and testing of each component. But remember, CloudZero processes thousands of analyses daily with significant complexity. Most use cases don't require this level of sophistication.
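To make the routing principle concrete without reproducing the full Bedrock Agents setup, here's a conceptual sketch in plain Python. The specialist prompts and keyword router are illustrative, and in production the supervisor is itself model-driven rather than a keyword match.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

SPECIALISTS = {
    "cost_analyzer": "You analyze AWS Cost and Usage Report data and summarize spend drivers.",
    "benchmark": "You compare a customer's cost metrics against anonymized peer benchmarks.",
}

def route(request: str) -> str:
    # The supervisor needs no deep cost expertise; it only decides who handles the request.
    wants_comparison = any(word in request.lower() for word in ("compare", "peer", "benchmark"))
    return "benchmark" if wants_comparison else "cost_analyzer"

def handle(request: str) -> str:
    specialist = route(request)
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # illustrative model ID
        system=[{"text": SPECIALISTS[specialist]}],
        messages=[{"role": "user", "content": [{"text": request}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

print(handle("How does our EC2 spend compare to peers of our size?"))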
It’s important to note that at the time, Amazon Bedrock Agents was the state-of-the-art solution. Currently, we might consider using the newly released Amazon Bedrock AgentCore, which was announced at the AWS Summit New York keynote.
"The agent is a distributed system," Guille explained. "You have to treat it, configure it from the start like a distributed system." This insight is crucial for production deployments, especially with multi-agent architectures.
The first and most critical step is enabling Amazon Bedrock's invocation logging. As Randall emphasized, "Without Amazon Bedrock invocation logging on, unless you have some other form of logging, it can be very difficult to debug things."
import boto3

# Invocation logging is configured on the control-plane 'bedrock' client,
# not the 'bedrock-runtime' client used for inference.
bedrock = boto3.client('bedrock')

bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        # Stream request/response records to CloudWatch Logs for live debugging.
        'cloudWatchConfig': {
            'logGroupName': '/aws/bedrock/agent-invocations',
            'roleArn': 'arn:aws:iam::123456789012:role/bedrock-logging'
        },
        # Keep a durable copy of full traces in S3 for offline analysis.
        's3Config': {
            'bucketName': 'my-bedrock-logs',
            'keyPrefix': 'agent-traces/'
        },
        'textDataDeliveryEnabled': True,
        'imageDataDeliveryEnabled': True
    }
)
For multi-agent systems, trace correlation becomes critical. We typically implement OpenTelemetry tracing across all components, maintaining consistent trace IDs throughout the execution flow. "You need to persist it," Randall warned. "Tools have to know and be aware that it's coming from this OpenTelemetry ID."
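A minimal version of that tracing setup, using the OpenTelemetry Python SDK with a console exporter for illustration, looks like the sketch below. The span names and the tool function are placeholders; the point is that every tool call carries the same trace ID.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-orchestrator")

def run_tool(tool_name: str, trace_id: str) -> None:
    # Tools receive (and log) the trace ID so their output can be correlated later.
    with tracer.start_as_current_span(f"tool:{tool_name}") as span:
        span.set_attribute("agent.trace_id", trace_id)

with tracer.start_as_current_span("agent-turn") as root:
    trace_id = f"{root.get_span_context().trace_id:032x}"   # 32-hex trace ID
    run_tool("athena_query", trace_id)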
One technique that's proven invaluable for conversation continuity in agentic systems is checkpointing state to Amazon S3. When an agent needs to pause and resume, which is common in approval workflows or long-running analyses, we serialize the entire conversation state. This approach works particularly well with Firecracker VMs for code interpreter environments, allowing near-instant resumption of complex analytical sessions.
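Here's a simplified sketch of that checkpointing pattern with boto3. The bucket, key scheme, and state shape are placeholders; real state includes tool results and interpreter session details alongside the message history.

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-agent-checkpoints"   # placeholder bucket name

def save_checkpoint(session_id: str, state: dict) -> None:
    # Serialize the full conversation state so the session can resume later.
    s3.put_object(
        Bucket=BUCKET,
        Key=f"checkpoints/{session_id}.json",
        Body=json.dumps(state).encode("utf-8"),
    )

def load_checkpoint(session_id: str) -> dict:
    obj = s3.get_object(Bucket=BUCKET, Key=f"checkpoints/{session_id}.json")
    return json.loads(obj["Body"].read())

# Pause ahead of a human approval step, then resume later with full context.
save_checkpoint("session-123", {
    "messages": [{"role": "user", "content": "Summarize last week's HVAC anomalies"}],
    "pending_action": "await_approval",
})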
"We made a big bet when Anthropic released MCP, Model Context Protocol, back in November 24th of 2024," Randall recalled. "I think Anthropic got the timing right, and they got the setup right for the Model Context Protocol."
After evaluating dozens of frameworks and tools, our production stack has stabilized around a core set of technologies that consistently deliver:
LangGraph remains our primary orchestration framework. "LangGraph tends to be our orchestration layer these days," Randall confirmed, citing its balance of flexibility and structure.
DSP from Stanford has revolutionized our prompt optimization, consistently delivering 50-80% token reductions while maintaining quality.
Amazon Bedrock's Native Features provide essential capabilities without additional complexity. Knowledge Bases eliminate custom RAG implementation. Guardrails handle content filtering and PII protection. Batch inference delivers automatic 50% cost reduction for async workloads.
The Amazon Bedrock Converse API, called directly, fills the gaps when we need features not yet available in Amazon Bedrock Agents. "For system prompts and tool use, the prompt caching is really, really valuable," Randall explained. "So we tend to walk the street with just regular old converse stream or converse as opposed to the Amazon Bedrock Agents runtime."
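When we do go direct, a typical call looks roughly like the sketch below: a long, stable system prompt followed by a cache point, plus tool definitions. The model ID, system prompt, and tool schema are placeholders, and the cachePoint block assumes a model that supports Bedrock prompt caching.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

LONG_SYSTEM_PROMPT = "..."   # stands in for the stable, >1,024-token instructions you want cached

response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",   # illustrative model ID
    system=[
        {"text": LONG_SYSTEM_PROMPT},
        {"cachePoint": {"type": "default"}},   # cache everything up to this point
    ],
    messages=[
        {"role": "user", "content": [{"text": "Which instances look idle this week?"}]},
    ],
    toolConfig={
        "tools": [
            {
                "toolSpec": {
                    "name": "get_utilization",
                    "description": "Fetch CPU utilization for an instance ID.",
                    "inputSchema": {
                        "json": {
                            "type": "object",
                            "properties": {"instance_id": {"type": "string"}},
                            "required": ["instance_id"],
                        }
                    },
                }
            }
        ]
    },
)
print(response["output"]["message"])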
The key is starting simple and adding complexity only when measured benefits justify the operational overhead.
Based on our experience across dozens of deployments, here's the pragmatic path to production that actually works:
Start with the vibe check. Can Claude 4 Sonnet or another frontier model handle your use case with a basic prompt? If not, reconsider the approach. If yes, create your evaluation set immediately with 20-30 examples covering common cases and edge conditions.
Build the simplest viable agent. Use the Amazon Bedrock Converse API directly. Focus on prompt refinement and establishing baseline metrics. Enable invocation logging from day one; you'll thank yourself later.
Optimize aggressively. Apply DSP or manual prompt optimization, using evaluations to measure improvement. Implement caching for prompts over 1,024 tokens. Evaluate batch processing for asynchronous workloads. You should see 50-70% cost reduction at this stage without any architectural changes.
Only then consider complex architectures. If you have proven ROI and clear requirements that simple agents cannot meet, explore multi-agent patterns. But remember CloudZero's example. They needed multi-agent architecture only after reaching thousands of daily analyses with complex comparative requirements.
Watch for the warning signs that you're over-engineering: multi-agent designs before a single agent has proven itself in production, more effort going into orchestration than into prompts, and iteration without an evaluation set to measure against.
"We've been doing this for a while now," Randall reflected. "And one of the nice things with Amazon Bedrock is that we've been able to slot in the new models as they've come out, so the system only gets better and more performant."
The teams succeeding with agentic AI aren't the ones with the most sophisticated architectures. They're the ones who started simple, measured everything, and optimized relentlessly. They understand that agents are just tool use with better packaging, that prompts matter more than orchestration, and that production systems require distributed systems thinking from day one.
Your agents don't need complex orchestration to deliver value. They need well-crafted prompts, robust evaluation sets, and thoughtful optimization. Master these fundamentals, and you'll build systems that actually ship to production and deliver measurable business value.
The path from experimentation to production is clearer than the hype suggests. It starts with a simple prompt, grows through systematic optimization, and scales through proven patterns. That path is available to you today with Amazon Bedrock and the hard-won lessons we've shared.
Navigating the complexities of agentic AI, from initial strategy through implementation and ongoing optimization, requires more than technology; it requires experience. At Caylent, we help organizations plan, build, and optimize production-ready agentic AI systems on Amazon Bedrock. As an AWS Premier Services Partner with deep expertise in generative AI, our teams guide you through every stage of development, from designing effective agent collaboration and ensuring data security to managing costs and debugging intricate interactions. The result is agentic workflows that are not only technically sound but also strategically aligned with your business objectives: scalable, secure, and cost-effective.
You can explore our Generative AI offerings on AWS to see how we help customers harness the full potential of AI, and learn more about our strategic approach through initiatives like the AWS Generative AI Strategy Catalyst.
Guille Ojeda is a Software Architect at Caylent and a content creator. He has published 2 books, over 100 blogs, and writes a free newsletter called Simple AWS, with over 45,000 subscribers. Guille has been a developer, tech lead, cloud engineer, cloud architect, and AWS Authorized Instructor and has worked with startups, SMBs and big corporations. Now, Guille is focused on sharing that experience with others.
Randall Hunt, Chief Technology Officer at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that Javascript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski.