
POC to PROD: Hard Lessons from 200+ Enterprise Generative AI Deployments, Part 2

Generative AI & LLMOps

Discover the hard-earned lessons from over 200 enterprise GenAI deployments and what it really takes to move from POC to production at scale.


This is part 2 of a 2-part series. Part 1 contains the 6 key lessons we learned from deploying over 200 generative AI applications to production. In this part, we’ll share the specific patterns that work and the anti-patterns that consistently fail, drawn from our own experience building generative AI systems.

Our production systems are built on a relatively consistent stack of AWS services, chosen through trial and error across hundreds of deployments. At the bottom layer, we use three primary services: Amazon Bedrock, Amazon Bedrock AgentCore, and Amazon SageMaker AI. Amazon Bedrock provides managed access to foundation models from Anthropic, Meta, Mistral, and Amazon's own Nova family. Amazon Bedrock AgentCore provides a platform for AI agents, including the core runtime, security, identity and access management, memory, and observability. Amazon SageMaker AI offers more control if you need to deploy custom models, though it comes with a compute premium. For a handful of use cases, we run inference directly on Amazon ECS or Amazon EC2, especially when we need specific control over the environment.
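
For reference, here is a minimal sketch of how a Bedrock-hosted model gets called from Python with boto3. The model ID and prompt are placeholders, and it assumes AWS credentials with Bedrock access are already configured:

```python
import boto3

# A minimal Bedrock inference call. The model ID below is a placeholder;
# swap in whichever foundation model your account has access to.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

response = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
    system=[{"text": "You are a concise assistant for internal support tickets."}],
    messages=[{"role": "user", "content": [{"text": "Summarize this ticket for the on-call engineer."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.2},
)

print(response["output"]["message"]["content"][0]["text"])
print(response["usage"])  # inputTokens / outputTokens, useful for cost tracking later
```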

AWS offers two pieces of custom silicon worth understanding: AWS Trainium for training and AWS Inferentia for inference. These chips provide roughly 60% better price-performance compared to Nvidia GPUs, though with two important tradeoffs:

  1. The accelerator memory available on Inferentia instances is lower than what you get with an H100 or H200.
  2. You must use the AWS Neuron SDK, which is a similar experience to targeting TPUs through XLA in TensorFlow.

If your models fit within these constraints and you're doing high-volume inference, the cost savings are significant. Look out especially for the new Trainium3 UltraServers, which deliver 3x higher throughput per chip and 4x faster response times compared to the previous generation.
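
For teams evaluating that path, the Neuron workflow looks roughly like the sketch below: trace a PyTorch model ahead of time with torch-neuronx, then run the compiled artifact on an Inf2 instance. Treat it as a rough outline; the model and input shapes are placeholders.

```python
import torch
import torch_neuronx  # part of the AWS Neuron SDK, available on Inf2/Trn1 instances

# Placeholder model and example input; real workloads compile their own
# architecture with representative input shapes.
model = torch.nn.Sequential(torch.nn.Linear(768, 768), torch.nn.ReLU()).eval()
example_input = torch.rand(1, 768)

# Ahead-of-time compilation for NeuronCores. Shapes are fixed at trace time,
# which is one of the constraints mentioned above.
neuron_model = torch_neuronx.trace(model, example_input)
torch.jit.save(neuron_model, "model_neuron.pt")

# At inference time, load and call it like a regular TorchScript module.
loaded = torch.jit.load("model_neuron.pt")
print(loaded(example_input).shape)
```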

If you need more memory or prefer Nvidia GPUs, the newly announced P6-B300 instances, powered by Nvidia Blackwell Ultra GPUs, deliver 2.1TB of GPU memory with 6.4Tbps Elastic Fabric Adapter (EFA) networking bandwidth.

For embeddings and vector search, our preferences have evolved over time. We currently favor Postgres with pgvector for most use cases with frequent access. Combining vector similarity search with traditional relational queries in a single database simplifies the architecture and provides powerful hybrid search capabilities. OpenSearch remains a solid choice, particularly when you need advanced text search features alongside vector search. For extremely low-latency requirements, Amazon MemoryDB with vector search support is impressively fast, but the cost reflects that speed since everything must sit in RAM. Amazon S3 Vectors is a fantastic option for infrequent access, but the unit economics make it more expensive if you query the vector space on average more than 10 times per minute. If you're building a large index, budget accordingly or use disk-based alternatives such as Postgres or OpenSearch that can leverage indexes more efficiently.
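
Here is a minimal sketch of the Postgres-plus-pgvector pattern. It assumes the pgvector extension is installed and a hypothetical documents table with an embedding column and a tenant_id filter column; the DSN and query embedding are placeholders:

```python
import numpy as np
import psycopg                              # psycopg 3
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://user:pass@host/db")  # placeholder DSN
register_vector(conn)                        # lets psycopg send/read vector values

# Hypothetical schema, shown for context:
#   CREATE EXTENSION IF NOT EXISTS vector;
#   CREATE TABLE documents (id bigint, tenant_id text, title text,
#                           updated_at timestamptz, embedding vector(1024));

query_embedding = np.random.rand(1024).astype(np.float32)  # stand-in for a real embedding

# Hybrid query: vector similarity and ordinary relational filters in one statement.
rows = conn.execute(
    """
    SELECT id, title, embedding <=> %s AS distance
    FROM documents
    WHERE tenant_id = %s
      AND updated_at > now() - interval '90 days'
    ORDER BY distance
    LIMIT 10
    """,
    (query_embedding, "tenant-123"),
).fetchall()
```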

The component that actually differentiates your system from competitors is context management. Embeddings alone don't create a moat. If your competitor has the same documents embedded in the same vector database, you're functionally identical. But if you can inject contextual information that others can't, such as the user's current page, their browsing history, session state, interaction patterns, or business-specific metadata, you will deliver better results. This context is your competitive advantage. It's also why understanding your access patterns matters more than perfect embeddings. How will users actually search? What filters do they need? What faceted search capabilities are required? Embeddings can't answer these questions alone, which is why hybrid search that combines vector similarity with structured queries typically outperforms pure vector search in production.

Production Patterns That Work

Beyond individual lessons, certain architectural patterns have proven robust across multiple production deployments. These represent the actual systems we've built and operated at scale.

Multimodal Video Search Architecture

Nature Footage, a stock footage provider, needed semantic search across thousands of high-resolution videos and reached out to Caylent to find a solution. The technical challenge was making video content searchable via natural language queries while maintaining acceptable query latency and search quality. The solution involved several components working together.

Instead of attempting to process the entire video, which would burn through an enormous number of tokens, we used frame sampling to extract frames from videos at regular intervals. The sampling rate depends on video content: action-heavy videos require more frequent samples, while static shots require fewer. For each sampled frame, we generated embeddings using Amazon Titan V2 multimodal embeddings and experimented with Amazon Nova multimodal embeddings. The critical innovation was pooling these frame-level embeddings into a single video-level embedding. This pooled embedding captures the video's overall semantic content, enabling text-to-video search: users describe what they're looking for and retrieve visually relevant content.
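
The sampling-and-pooling step is sketched below with OpenCV and NumPy. The embed_frame function is a stand-in for whichever multimodal embedding call you use; mean pooling followed by L2 normalization is the simple approach shown here.

```python
import cv2
import numpy as np

def embed_frame(frame_bgr: np.ndarray) -> np.ndarray:
    """Stand-in for a multimodal embedding call (e.g. a Bedrock-hosted model)."""
    raise NotImplementedError

def video_embedding(path: str, every_n_seconds: float = 2.0) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_seconds))   # sample one frame every N seconds

    embeddings, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            embeddings.append(embed_frame(frame))
        frame_idx += 1
    cap.release()

    # Pool frame-level embeddings into one video-level embedding.
    pooled = np.mean(np.stack(embeddings), axis=0)
    return pooled / np.linalg.norm(pooled)       # L2-normalize for cosine search
```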

But embeddings alone aren't sufficient for production search. Users need to filter by resolution, duration, frame rate, licensing terms, and specific subjects. We stored embeddings alongside metadata in OpenSearch, enabling hybrid queries that combine vector similarity with structured filters. A query might be "mountain sunset" with filters for 4K resolution and specific licensing. The vector search finds semantically similar videos while the filters eliminate incompatible results, so users only receive results that are both relevant to their query and compatible with the filters they specified.
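
A sketch of such a hybrid query, assuming an OpenSearch index with a knn_vector field named embedding (using the Lucene or Faiss engine, which support filters inside the knn clause) and hypothetical metadata fields like resolution and license:

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://search-example.us-east-1.es.amazonaws.com"])  # placeholder endpoint

def search_videos(query_vector, k=10):
    body = {
        "size": k,
        "query": {
            "knn": {
                "embedding": {
                    "vector": query_vector,
                    "k": k,
                    # Filters are applied alongside the approximate nearest-neighbor
                    # search, so every hit satisfies them.
                    "filter": {
                        "bool": {
                            "must": [
                                {"term": {"resolution": "4k"}},
                                {"term": {"license": "royalty_free"}},
                            ]
                        }
                    },
                }
            }
        },
    }
    return client.search(index="videos", body=body)["hits"]["hits"]
```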

Another customer needed a similar solution to create highlights (short videos) of key plays and moments in sports events. For that use case, we extended the architecture to support real-time processing. We separated the audio and video streams, generated transcriptions from the audio, and analyzed both modalities independently. One particularly effective technique was generating an amplitude spectrogram of the audio track with FFmpeg to identify when the crowd is cheering, which correlates strongly with highlight moments. This simple signal often outperforms complex video analysis for identifying key plays. We then used Amazon Nova Pro for frame-level understanding, generated embeddings from both the transcription text and the video frames, and stored these with confidence scores for detected behaviors. Push notifications via Amazon SNS alert users when relevant highlights are detected.
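
A simplified version of that audio signal: decode the track to mono PCM with FFmpeg, compute per-second RMS energy with NumPy, and flag windows that spike well above the baseline. The threshold and window length are illustrative values you would tune per sport and venue.

```python
import subprocess
import numpy as np

def loud_moments(video_path: str, window_s: float = 1.0, threshold: float = 2.5):
    """Return timestamps (seconds) whose RMS energy exceeds `threshold` x the median."""
    rate = 16000
    # Decode the audio track to raw 16-bit mono PCM on stdout.
    pcm = subprocess.run(
        ["ffmpeg", "-i", video_path, "-ac", "1", "-ar", str(rate),
         "-f", "s16le", "-loglevel", "quiet", "-"],
        capture_output=True, check=True,
    ).stdout
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32)

    window = int(rate * window_s)
    n_windows = len(samples) // window
    rms = np.sqrt(np.mean(
        samples[: n_windows * window].reshape(n_windows, window) ** 2, axis=1))

    baseline = np.median(rms) + 1e-6
    return [i * window_s for i, v in enumerate(rms) if v > threshold * baseline]
```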

Video annotation significantly improves model accuracy, often delivering a bigger lift than far more sophisticated techniques. For basketball footage, we drew a blue line marking the three-point line on the court, then asked the model simple questions such as "Did the player cross the big blue line before shooting?" This annotation took seconds per video and dramatically improved detection accuracy. You can automate some annotations using models such as Meta's SAM 2 for segmentation. The lesson is that small amounts of targeted preprocessing can improve results more than complex model engineering.
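
Drawing the annotation itself is trivial; the sketch below overlays a line on each frame with OpenCV before the frames are sent to the model. The endpoint coordinates are placeholders; in practice you pick them once per fixed camera angle or derive them with a segmentation model such as SAM 2.

```python
import cv2

def annotate_frame(frame, pt1=(120, 620), pt2=(1180, 600)):
    """Overlay a thick blue reference line (e.g. the three-point line) on a frame."""
    annotated = frame.copy()
    cv2.line(annotated, pt1, pt2, color=(255, 0, 0), thickness=6)  # BGR: blue
    return annotated
```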

Evals That Actually Matter

Evaluation frameworks evolve through predictable stages. Understanding this progression helps teams build appropriate testing for their maturity level.

The vibe check is where everyone starts. You try a few prompts, look at the outputs, and decide whether they seem reasonable. This is subjective and non-repeatable, but it's not worthless: your initial vibe check becomes your first eval. Write down the prompts you tested manually and the outputs you judged acceptable or unacceptable, and you've just created your initial eval set.

Building an eval set means systematically varying inputs and capturing edge cases. Take your initial prompts and change the data. What happens with longer inputs? Shorter inputs? Ambiguous phrasing? Missing information? Unexpected formats? After about twenty minutes of this iteration, you have a working eval set. You don't need hundreds of examples initially; you need coverage of important cases and common failure modes.

Boolean metrics often work better than complex scoring. For each test case, define a clear success criterion. Did the system extract all required fields from the document? Yes or no. Did it route the request to the correct handler? Yes or no. This binary evaluation is easier to implement than calculating BERT scores or other semantic similarity metrics, and it's often more meaningful. A score of 0.73 requires interpretation. What does that mean? But "15 out of 20 test cases passed" is immediately clear.
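
A boolean eval harness can be as small as the sketch below. The generate callable is whatever wraps your model invocation; the test cases shown are illustrative, not from a real project.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    passed: Callable[[str], bool]   # boolean success criterion

CASES = [
    EvalCase("routes_billing_question",
             "My invoice is wrong, who do I talk to?",
             lambda out: "billing" in out.lower()),
    EvalCase("extracts_date",
             "Extract the meeting date: 'Let's sync on March 4th at 10am.'",
             lambda out: "march 4" in out.lower()),
]

def run_evals(generate: Callable[[str], str]) -> None:
    results = [(case.name, case.passed(generate(case.prompt))) for case in CASES]
    passed = sum(ok for _, ok in results)
    for name, ok in results:
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print(f"{passed} out of {len(CASES)} test cases passed")
```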

Iteration with this eval framework catches regressions and validates improvements. Before changing your prompt or upgrading models, run your eval set and record results. Make your change, rerun the evals, and compare. If scores decrease, you've found a regression. If they improve, validate that the improvement generalizes beyond your eval set. This tight feedback loop prevents the "it worked yesterday" problem, where undocumented changes break previously working functionality.

The goal isn't to achieve perfect evaluation metrics, but to have enough test coverage to ship changes confidently. Remember that evals are your tests for LLM execution paths. As your system matures, your eval set grows to cover new edge cases and failure modes discovered in production. But you can start simple and evolve systematically.

Context Management at Scale

Context management creates a competitive advantage that's difficult to replicate. While embeddings and model access are increasingly commoditized, the contextual information you can provide remains unique to your business.

Consider an e-commerce platform using AI for product recommendations. Generic product embeddings based on descriptions are table stakes; every competitor has them. But contextual information creates differentiation: the user's browsing history in this session, items currently in cart, past purchases, time spent on various category pages, click patterns, device type, time of day, seasonal context, and inventory levels for relevant products. This rich context allows personalization that generic embeddings can't match.

The technical challenge is determining the minimum viable context. More context helps up to a point; beyond that, it just adds noise and cost. Use your evaluation framework to test context variations. Try removing specific context elements while monitoring whether eval performance degrades. Often, you'll discover that some context you assumed was critical doesn't actually improve results. Conversely, the context you overlooked may substantially improve quality once you inject it.

Dynamic context injection based on user state is particularly powerful. If the user is viewing a specific product page, inject details about that product, related products, and the user's interaction history with that category. If they're in checkout, inject cart contents and past purchase patterns. If they're browsing a category, inject trending items in that category and the user's past preferences. This relevance-aware context injection provides better results than always including everything.
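
In practice this ends up as a small function that assembles context blocks from the user's current state before the prompt is built. The field names below are hypothetical; the point is that the blocks vary with where the user is, not that they are always all present.

```python
def build_context(user: dict, page: dict) -> str:
    """Assemble only the context blocks relevant to the user's current state."""
    blocks = []

    if page.get("type") == "product":
        blocks.append(f"Currently viewing: {page['product_title']} ({page['category']})")
        blocks.append(f"Recently viewed in this category: {', '.join(user['recent_in_category'])}")
    elif page.get("type") == "checkout":
        blocks.append(f"Cart contents: {', '.join(user['cart_items'])}")
        blocks.append(f"Past purchases: {', '.join(user['past_purchases'][-5:])}")
    elif page.get("type") == "category":
        blocks.append(f"Trending in {page['category']}: {', '.join(page['trending_items'])}")

    blocks.append(f"Device: {user['device']}, local time: {user['local_time']}")
    return "\n".join(blocks)
```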

Caching strategies balance context freshness against performance. Some context doesn't change often and can be heavily cached. Other context changes frequently and must be fresh. Separating these concerns in your prompt structure allows independent caching for stable versus dynamic portions. The stable prefix gets cached across requests, while dynamic context gets appended per request.
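
With Bedrock's Converse API, this split maps naturally onto prompt caching: put the stable instructions and reference material before a cache point and append the per-request context after it. A rough sketch, assuming a model that supports Bedrock prompt caching; the file path and model ID are placeholders:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

STABLE_SYSTEM = open("product_catalog_guidelines.txt").read()  # hypothetical file; changes rarely

def answer(question: str, dynamic_context: str) -> str:
    response = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # placeholder model ID
        system=[
            {"text": STABLE_SYSTEM},
            {"cachePoint": {"type": "default"}},   # everything above is the cacheable prefix
        ],
        messages=[{
            "role": "user",
            "content": [{"text": f"{dynamic_context}\n\nQuestion: {question}"}],
        }],
    )
    return response["output"]["message"]["content"][0]["text"]
```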

What Doesn't Work (And Why)

Understanding failure patterns helps avoid repeated mistakes. At Caylent, we've consistently observed these anti-patterns in struggling generative AI projects.

Undifferentiated chatbots fail because they provide no unique value. Building a generic chat interface without domain-specific context, specialized workflows, or unique capabilities means you're competing with ChatGPT, Anthropic Claude, and dozens of other generic tools. Users will just use those directly rather than your version. The business moat doesn't exist. If your competitive advantage is "we have a chatbot too," prepare for disappointment. Success requires identifying specific workflows where your domain expertise, proprietary data, or unique context creates genuine differentiation.

Ignoring end-user constraints leads to technically elegant solutions that fail in practice. We've seen this repeatedly: voice interfaces for noisy environments, high-resolution video processing for low-bandwidth users, complex multi-step workflows for time-pressed users, and sophisticated AI features that require training users don't have time for. Technical capability doesn't equal user value. Validate assumptions about user environment, constraints, and preferences before committing to an architecture.

Model selection obsession wastes time that teams should spend on fundamentals. We've watched teams spend weeks A/B testing Anthropic Claude versus ChatGPT, trying to optimize for minor differences in output quality, while their evaluation coverage remained weak and their input/output specification stayed vague. Models improve faster than evaluation cycles run. By the time you finish comparing models, new versions have launched that invalidate your results. Focus instead on prompt engineering, evaluation frameworks, and context management; these provide durable value regardless of which model you're using.

Cost blindness causes budget crises post-launch. Running your most expensive model for every request regardless of complexity, providing no unit economics tracking, and failing to monitor token usage patterns leads to surprise bills that can derail projects and kill ROI. Preventing this scenario requires instrumenting cost per request from the beginning, implementing routing to appropriately sized models, using batch processing for non-realtime workloads, and optimizing prompts for output token efficiency.
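
Instrumentation can start as simply as recording token usage and multiplying by per-model prices on every call. The prices in the sketch are placeholders; pull real ones from your pricing sheet and emit the result to whatever metrics system you already use.

```python
# Placeholder per-1K-token prices in USD; replace with current pricing for your models.
PRICES = {
    "large-model": {"input": 0.003, "output": 0.015},
    "small-model": {"input": 0.00025, "output": 0.00125},
}

def record_cost(model_key: str, usage: dict, metrics: list) -> float:
    """usage is e.g. the `usage` field returned by Bedrock's Converse API."""
    price = PRICES[model_key]
    cost = (usage["inputTokens"] / 1000) * price["input"] \
         + (usage["outputTokens"] / 1000) * price["output"]
    metrics.append({"model": model_key, "usd": round(cost, 6), **usage})
    return cost
```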

Over-engineering tool calling introduces complexity with no real benefit. Creating elaborate tool schemas for simple operations, defining tools for basic computation or data formatting, and building complex orchestration when simple prompt engineering would work all waste development time and reduce system reliability. Every tool call adds latency and potential failure points. Limit tools to operations that genuinely require external data you can't simply inject into the prompt, or actions the model can't perform through reasoning alone.

Conclusion: The Real Competitive Moats

More than 200 enterprise deployments have shaped Caylent's approach to generative AI. We now understand that success comes from solving the right problems rather than having the fanciest architecture, and that the competitive moats that actually matter are your input/output specification, context management, UX orchestration, unit economics, and evaluation coverage.

Common premature-optimization traps include chasing the perfect model, building complex tool calling, fine-tuning when prompt engineering is sufficient, and tracking theoretical performance metrics divorced from real usage.

The generative AI landscape continues to evolve rapidly, but these fundamentals remain stable. Foundation models will improve, costs will decrease, and new capabilities will emerge. But if you've clearly defined the problem you're solving, built robust evaluations, gathered unique context, optimized your economics, and designed UX that works for real users in real environments, your system will adapt to these changes rather than being disrupted by them.

The gap between POC and production isn't primarily technical; it's about discipline. The discipline to specify clearly, evaluate systematically, understand your users deeply, and optimize economics honestly. Teams that maintain this discipline, regardless of which models or tools they choose, build successful production systems. Teams that skip these fundamentals to chase the latest model releases end up wondering why their beautiful demos never scale.

Reach out to us to learn how our experts can help you get started on your generative AI project. 

Randall Hunt

Randall Hunt, Chief Technology Officer at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that Javascript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski.

Guille Ojeda

Guille Ojeda is a Senior Innovation Architect at Caylent, a speaker, author, and content creator. He has published 2 books, over 200 blog articles, and writes a free newsletter called Simple AWS with more than 45,000 subscribers. He's spoken at multiple AWS Summits and other events, and was recognized as AWS Builder of the Year in 2025.

