2025 GenAI Whitepaper

Prompt Caching: Saving Time and Money in LLM Applications

Generative AI & LLMOps

Explore how to use prompt caching with Large Language Models (LLMs) such as Anthropic Claude on Amazon Bedrock to reduce costs and improve latency.

If you've been working with large language models (LLMs) for any amount of time, you know that two major pain points always come up: cost and latency. Every token processed costs money, and if you're sending the same context repeatedly (like that massive document your users keep asking questions about), you're essentially throwing money away on redundant computations. And let's not even get started on latency: nobody wants to wait seconds (or worse, tens of seconds) for an LLM to respond in an interactive application.

That's why prompt caching is such an important technique to understand. It can dramatically cut costs and speed up responses when working with LLMs like Claude on Amazon Bedrock. In this article, I'll dive deep into what prompt caching is, how it works under the hood, and most importantly, how to implement it to reduce your AWS bill while improving your user experience.

What is Prompt Caching?

Prompt caching is a technique for improving the efficiency of LLM queries by storing and reusing frequently used prompt content. In simple terms, it lets you save repeated prompt prefixes (like system instructions or reference documents) so the model doesn't have to reprocess them on subsequent requests.

Think of it like this: normally when you send a prompt to an LLM, the model reads through the entire text from start to finish, token by token, building up its internal representation of the context. If you send the same content multiple times, the model repeats this work each time, unnecessarily burning compute and your money. With prompt caching, you can tell the model "remember this part, we'll reuse it later," and on subsequent requests, it can skip directly to processing just the new content.

Amazon Bedrock's implementation of prompt caching allows you to designate specific points in your prompts as "cache checkpoints." Once a checkpoint is set, everything before it is cached and can be retrieved in later requests without being reprocessed. But how exactly does the model "remember" this content? The actual mechanism involves preserving the model's internal state, the attention patterns and hidden state vectors that represent the processed tokens, so they can be loaded instead of recalculated on subsequent requests.

Want to get really technical? One compute-intensive calculation in generative pretrained transformer models is the attention calculation, which determines how a model "attends" to tokens in the input and how those tokens relate to each other. The attention calculation depends on three tensors: Q (query), K (key), and V (value). Once the K and V tensors are calculated for a particular input, they don't change, so a common technique to speed up inference is to cache them. That's what we're really caching here: not the input itself (i.e., the 50 pages of plain text), but the numerical values of the K and V tensors that capture the neural network's state after it has processed that input.
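If you want to see that idea in miniature, here's a toy single-head attention sketch in plain NumPy (random weights, no causal masking or batching, and nothing to do with Bedrock's actual implementation). It only illustrates why the K and V values for a fixed prefix can be computed once and reused when new tokens arrive:

```python
import numpy as np

d = 64                                   # embedding / head dimension
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

prefix = rng.standard_normal((1000, d))  # embeddings of the (cached) prefix tokens
K_cache = prefix @ W_k                   # computed once...
V_cache = prefix @ W_v                   # ...and kept around: this is "the cache"

def attend(new_tokens):
    """Attention for new tokens over (cached prefix + new tokens). Masking omitted."""
    K = np.vstack([K_cache, new_tokens @ W_k])   # reuse the prefix K, don't recompute it
    V = np.vstack([V_cache, new_tokens @ W_v])   # same for V
    Q = new_tokens @ W_q                         # only the new tokens need queries
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

out = attend(rng.standard_normal((5, d)))  # only 5 new tokens are processed from scratch
```

A real model does this for every attention head in every layer, so the saved state is large, which is part of why (as we'll see) it's only held for a short time.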

Benefits of Prompt Caching

The benefits of prompt caching are substantial and directly address the two pain points I mentioned earlier:

Decreased latency: By avoiding redundant processing of identical prompt segments, response times can improve dramatically. Amazon Bedrock reports up to 85% faster responses for cached content on supported models. For interactive applications like chatbots, this can mean the difference between a fluid conversation and a frustratingly laggy experience. And if you're building agentic applications where the model might make multiple API calls in sequence, that latency reduction compounds with each step.

Reduced costs: This is where things get really interesting. When you retrieve tokens from cache instead of reprocessing them, you pay significantly less: cached tokens typically cost only about 10% of the price of regular input tokens. That means a potential 90% cost reduction for the cached portion of your prompts. For applications that repeatedly use large chunks of context (think document QA with a 100-page manual), this translates into substantial savings. We'll do the math later to show exactly how much you can save.

Improved user experience: Faster responses and the ability to maintain more context for the same cost means you can build more responsive, more contextually aware applications. Your users won't necessarily know you're using prompt caching, but they'll certainly notice the snappier responses and more coherent conversations. Note: Caching doesn't increase the max context window, but in my experience you end up using more context because it's a lot cheaper.

How Prompt Caching Works

Let's dive into the mechanics of how prompt caching actually functions in Amazon Bedrock. Understanding these details will help you implement caching effectively and squeeze the maximum benefit from it.

When prompt caching is enabled, Bedrock creates "cache checkpoints" at specific points in your prompts. A checkpoint marks the point at which the entire preceding prompt prefix (all tokens up to that point) is saved in the cache. On subsequent requests, if your prompt reuses that same prefix, the model loads the cached state instead of recomputing it.

Technically speaking, what's happening is quite fascinating. The model doesn't just store the raw text; it preserves its entire internal state after processing the cached portion. Modern LLMs like Claude use transformer architectures with many stacked layers of attention mechanisms. When processing text, each layer generates activation patterns and state vectors that represent the semantic content and relationships in the text. These layers normally need to be calculated sequentially from scratch for every prompt.

With caching, the model saves all these complicated attention patterns and hidden states (specifically, the K and V tensors) after processing the cached portion. When you send a subsequent request with the same cached prefix, instead of running all those calculations again, it loads the saved state and picks up from there. It's like giving the model a photographic memory of its own thinking process: it can return to a half-finished thought and continue from that point, instead of starting the whole thought over from scratch.

The cache in Bedrock is ephemeral with a default Time To Live (TTL) of 5 minutes. Each time a cached prefix is reused, this timer resets, so as long as you're actively using the cache, it stays alive. If no request hits the cache within 5 minutes, it expires and the saved context is discarded. You can always start a new cache by re-sending the content, but for optimal efficiency, you'll want to structure your application to reuse cached content within that window. And no, there's no way to extend this TTL beyond 5 minutes; I've looked into it, and it's a hard limit in the current implementation. I'm not sure exactly why 5 minutes specifically, but the cached state lives in GPU memory rather than in something comparatively cheap like disk. That's why the cache is so limited compared to the caches we're used to, and why cache writes cost 25% more for a TTL of just 5 minutes.

There are a few important technical constraints to be aware of:

Token thresholds: Cache checkpoints are subject to a minimum token requirement. You can only create a checkpoint after the prompt (plus any model-generated response so far) reaches a certain length, which varies by model. For example, Anthropic's Claude 3.5 requires around 1,024 tokens of combined conversation before the first checkpoint can be set. Subsequent checkpoints can be created at further intervals up to a model-defined maximum number of checkpoints (often up to 4 for large models).

For Claude 3.5 models specifically, the thresholds work like this:

  • First checkpoint: Minimum 1,024 tokens
  • Second checkpoint: Minimum 2,048 tokens
  • Third checkpoint: Minimum 4,096 tokens
  • Fourth checkpoint: Minimum 8,192 tokens

The Amazon Nova models have no minimum token count for caching, though they're limited to a single checkpoint.

If you attempt to insert a checkpoint too early (before the minimum tokens), the request will still succeed but that checkpoint is simply not added. This is important to understand because it means your caching strategy might not be working as expected if you're placing checkpoints too early in the prompt.

Cache limits: Different models support different numbers of cache checkpoints. For instance, Claude 3.5 allows up to 4 checkpoints, while Nova allows just one (intended for the system prompt). The number of checkpoints matters because it determines how many distinct segments of your prompt can be cached independently.

Model support: Not every model supports caching in the same way. During the preview phase, Amazon Bedrock enabled caching on specific models, such as Claude 3.5 versions and Amazon's Nova family, and in certain regions. Always check the latest Bedrock documentation for your model to understand its caching capabilities.

Prompt caching works with multiple Amazon Bedrock inference APIs:

  1. Converse / ConverseStream API: For multi-turn conversations (chat-style interactions), you can designate cache checkpoints within the conversation messages. This allows the model to carry forward cached context across turns.
  2. InvokeModel / InvokeModelWithResponseStream API: For single-turn prompt completions, you can also enable prompt caching and specify which part of the prompt to cache. Bedrock will cache that portion for reuse in subsequent invocations.
  3. Bedrock Agents: If you're using Amazon Bedrock Agents for higher-level task orchestration, you can simply turn on prompt caching when creating or updating an agent. The agent will then automatically handle caching behavior without additional coding.

Prompt Caching Examples

To illustrate how to use prompt caching with Amazon Bedrock, let's look at a few concrete examples. I'll show both the conceptual structure and actual code implementation.

Example 1: Multi-turn Conversation with Document QA

Imagine you're building a document QA system where users can upload a document and then ask multiple questions about it. Without caching, you'd need to send the entire document with each new question, incurring full costs every time. With caching, you can send it once and reuse it.

Here's how you might implement this using the Bedrock Converse API:
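Below is a minimal sketch using boto3's Converse API. Treat the model ID, the manual.txt file, and the exact shape of the cachePoint content block as assumptions to verify against the current Bedrock documentation for your model and region:

```python
import logging

import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("prompt-caching-demo")

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed; use a caching-enabled model

with open("manual.txt") as f:  # the large document users will ask questions about
    document_text = f.read()

# First turn: document first, then a cache checkpoint, then the actual question.
# Everything before the cachePoint block is eligible for caching.
messages = [
    {
        "role": "user",
        "content": [
            {"text": f"Reference document:\n{document_text}"},
            {"cachePoint": {"type": "default"}},
            {"text": "What does chapter 2 of this manual cover?"},
        ],
    }
]

def converse(messages):
    try:
        response = bedrock_runtime.converse(
            modelId=MODEL_ID,
            messages=messages,
            inferenceConfig={"maxTokens": 1024},
        )
    except ClientError as err:
        logger.error("Bedrock call failed: %s", err)
        raise
    # Log the usage block so we can confirm the checkpoint is actually being used.
    logger.info("usage: %s", response["usage"])
    return response

first = converse(messages)  # expect cacheWriteInputTokens > 0 on this call

# Second turn: keep the history so the cached prefix stays byte-for-byte identical,
# then append the new question.
messages.append(first["output"]["message"])
messages.append({"role": "user", "content": [{"text": "Summarize the safety warnings."}]})

second = converse(messages)  # expect cacheReadInputTokens > 0 on this call
print(second["output"]["message"]["content"][0]["text"])
```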

In this example, the first call to converse processes and caches the document. For the second question, we include the document again in the conversation history, but because it was cached, Bedrock will retrieve it from cache instead of reprocessing it. This significantly speeds up the response and reduces cost.

I've added basic error handling and logging to the snippet above to verify that caching is working as expected. That might seem like overkill for a code sample, but in production you'll want to know whether your caching strategy is actually effective.

If you're curious about whether the caching is working, you can check the usage metrics in the response:
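For example, building on the sketch above (the usage field names are the ones Bedrock returns for cached requests):

```python
def report_cache_usage(response):
    usage = response["usage"]
    print("Input tokens:      ", usage.get("inputTokens", 0))
    print("Cache write tokens:", usage.get("cacheWriteInputTokens", 0))
    print("Cache read tokens: ", usage.get("cacheReadInputTokens", 0))
    print("Output tokens:     ", usage.get("outputTokens", 0))

report_cache_usage(first)   # first call: expect a large cacheWriteInputTokens value
report_cache_usage(second)  # second call: the same tokens show up as cacheReadInputTokens
```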

In the first call, you'll see a large number in cacheWriteInputTokens (the document's tokens). In the second call, those same tokens will show up in cacheReadInputTokens, indicating they were pulled from cache rather than being processed again.

Example 2: Single-turn Prompt with Reusable Instructions

For applications that use single-turn completions with lengthy instructions, you can use the InvokeModel API with caching. Here's a more complete example with Claude:
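Here's a rough sketch using boto3's invoke_model with Anthropic's native Messages format. The model ID, the style_guide.txt file, and the cache_control field are assumptions based on Anthropic's documented format at the time of writing (some model versions also required an additional beta flag), so check the current docs before relying on it:

```python
import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed; must support caching

# A long, static instruction block we want to cache. It has to exceed the model's
# minimum checkpoint threshold (roughly 1,024 tokens for Claude 3.5).
with open("style_guide.txt") as f:
    LONG_INSTRUCTIONS = f.read()

def review(text: str) -> str:
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        # Mark the static system prompt as cacheable. The first request writes
        # the cache; later requests with the identical prefix read from it.
        "system": [
            {
                "type": "text",
                "text": LONG_INSTRUCTIONS,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ],
    }
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=json.dumps(body))
    payload = json.loads(response["body"].read())
    return payload["content"][0]["text"]

print(review("Please review this paragraph against the style guide: ..."))
print(review("Now review this other paragraph: ..."))  # instructions come from cache
```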

On the second request, even though we include the same instructions, they'll be retrieved from cache instead of being reprocessed. This saves both time and money.

The Cost Savings of Prompt Caching

Now let's talk about what probably interests you most: the money. Just how much can prompt caching save you? The answer is potentially a lot, depending on your usage patterns.

Here's how the cost structure works:

  1. Cache writes: The first time you send a prompt segment and cache it (a cache write), you pay a higher price, usually 25% higher.
  2. Cache reads: Subsequent uses of that cached segment are cache reads, each of which costs only about 10% of the normal input token rate. This is where the major savings happen.
  3. No storage fees: There's no separate charge for keeping data in the cache, you only pay the read/write token fees. The cached context remains available for the 5-minute TTL at no extra cost.

Let's work through an example to see the cost implications with real numbers:

Imagine you have a 20,000-token document that users frequently ask questions about. Without caching, if users ask 10 questions in a session, you'd process that document 10 times, for a total of 200,000 input tokens.

With prompt caching:

  • First question: 20,000 tokens processed and cached at full price
  • Next 9 questions: 20,000 tokens retrieved from cache each time, billed at 10% of normal rate

Let's break down the math with Claude 3.5 Sonnet's pricing:

Input tokens:

  • $0.003 per 1K tokens without caching
  • $0.00375 per 1K tokens for a cache write
  • $0.0003 per 1K tokens for a cache read

Output tokens: $0.015 per 1K tokens (remain unchanged with caching)

Without caching:

  • 10 requests × 20,000 input tokens = 200,000 input tokens
  • 200,000 input tokens × $0.003 per 1K tokens = $0.60

With caching:

  • First request: 20,000 input tokens × $0.00375 per 1K tokens = $0.075
  • Next 9 requests: 9 × 20,000 input tokens × $0.0003 per 1K tokens = $0.054
  • Total: $0.075 + $0.054 = $0.129

Total savings: $0.60 - $0.129 = $0.471, or about 78%

But wait, let's scale this up to a more realistic scenario. Say you have an enterprise document QA system handling 1,000 queries per day against 10 different documents (each 20,000 tokens). Assuming each document gets 100 queries, arriving close enough together that each document's cache only needs to be written once per day:

Without caching:

  • 1,000 requests × 20,000 input tokens = 20,000,000 input tokens
  • 20,000,000 input tokens × $0.003 per 1K tokens = $60.00 per day
  • Monthly cost: $60.00 × 30 = $1,800

With caching:

  • 10 documents × 1 initial request × 20,000 tokens × $0.00375 per 1K tokens = $0.75
  • 10 documents × 99 cached requests × 20,000 tokens × $0.0003 per 1K tokens = $5.94
  • Total daily cost: $0.75 + $5.94 = $6.69
  • Monthly cost: $6.69 × 30 = $200.70

Monthly savings: $1,800 - $200.70 = $1,599.30, or about 89%

That's nearly $20,000 in savings per year just from implementing prompt caching! And this doesn't even account for the latency improvements and better user experience.

The math gets even more compelling with larger documents or more frequent reuse within the 5-minute window. For applications that handle many similar requests, the savings can add up to thousands of dollars per month.

It's worth noting that even with the cache write overhead, prompt caching is almost always more cost-effective as long as you reuse the cached content at least once: a cache write costs 1.25x the normal input rate, and a single cache read at 0.1x already beats paying the full 1x rate again. The break-even point is that low, making this a no-brainer optimization for most LLM applications.
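If you want to sanity-check these numbers against your own workload, the arithmetic is easy to wrap in a few lines. The prices below are the Claude 3.5 Sonnet figures used above; swap in your model's current pricing:

```python
# Per-1K-token prices from the worked examples above (Claude 3.5 Sonnet input pricing).
PRICE_INPUT = 0.003
PRICE_CACHE_WRITE = 0.00375
PRICE_CACHE_READ = 0.0003

def input_cost(prefix_tokens: int, requests: int, cached: bool) -> float:
    """Input-token cost of sending the same prefix `requests` times."""
    if not cached:
        return requests * prefix_tokens / 1000 * PRICE_INPUT
    # One cache write, then cache reads for every subsequent request.
    return (prefix_tokens / 1000) * (
        PRICE_CACHE_WRITE + (requests - 1) * PRICE_CACHE_READ
    )

# Reproduces the single-document example: $0.60 without caching vs ~$0.129 with it.
print(input_cost(20_000, 10, cached=False), input_cost(20_000, 10, cached=True))

# Reproduces the 10-document, 100-queries-each scenario: $60.00 vs ~$6.69 per day.
print(10 * input_cost(20_000, 100, cached=False),
      10 * input_cost(20_000, 100, cached=True))
```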

Use Cases for Prompt Caching

Now that we understand how prompt caching works and the cost benefits it offers, let's explore some practical use cases where it shines particularly bright.

Conversational agents

Chatbots and virtual assistants benefit tremendously from prompt caching, especially in extended conversations where consistent background context is needed.

In conversational AI, you typically supply a fixed "system" prompt that defines the assistant's persona, capabilities, and behavioral guidelines. Without caching, the model processes this same block of instructions on every user turn, adding latency and cost each time.

With Bedrock's prompt caching, you can cache these instructions after the first turn. The model will reuse the cached instructions for subsequent user messages, effectively remembering its "personality" without re-reading it from scratch.

This is particularly valuable for assistants that maintain state over a conversation, like a customer service bot that has access to account details or order history. You can cache this user-specific context once and handle many Q&A turns quickly and cheaply.

For example, if I'm building a travel assistant that helps users plan trips, I might have:

  • A lengthy system prompt describing the assistant's capabilities (2,000 tokens)
  • The user's travel preferences and constraints (1,000 tokens)
  • Previously discussed destinations and itineraries (3,000 tokens)

Without caching, each user message would require reprocessing all 6,000 tokens. With caching, subsequent messages only need to process the new user input, potentially saving thousands of tokens per turn. And in a typical conversation with 10-20 turns, these savings add up quickly.
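As a sketch of what that could look like in practice, here's one way to cache the static persona and user context via the Converse API's system field. The model ID, file names, and token counts are illustrative assumptions; check your model's documentation for the exact cachePoint support:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed caching-enabled model

# Static persona/instructions plus user-specific context. Everything above the
# cachePoint is written to cache on the first turn and read from cache afterwards.
with open("travel_assistant_system_prompt.txt") as f:
    SYSTEM_PROMPT = f.read()   # ~2,000 tokens of persona and guidelines
with open("user_preferences_and_history.txt") as f:
    USER_CONTEXT = f.read()    # ~4,000 tokens of preferences and prior itineraries

system = [
    {"text": SYSTEM_PROMPT},
    {"text": USER_CONTEXT},
    {"cachePoint": {"type": "default"}},  # cache everything above this point
]

messages = []

def chat(user_message: str) -> str:
    messages.append({"role": "user", "content": [{"text": user_message}]})
    response = bedrock_runtime.converse(
        modelId=MODEL_ID,
        system=system,
        messages=messages,
        inferenceConfig={"maxTokens": 1024},
    )
    messages.append(response["output"]["message"])
    return response["output"]["message"]["content"][0]["text"]

print(chat("Plan me a week-long itinerary in Portugal for May."))
print(chat("Can you swap Porto for the Algarve?"))  # persona + context come from cache
```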

Coding assistants

AI coding assistants often need to include relevant code context in their prompts. This could be a summary of the project, the content of certain files, or the last N lines of code being edited.

Prompt caching can significantly optimize AI-driven coding help by keeping that context readily available. A coding assistant might cache a large snippet of code once, then reuse it for multiple queries about that code.

For instance, if a developer uploads a module and asks "Explain how this function works," followed by "How could I optimize this loop?", the assistant can pull the cached representation of the code instead of reprocessing it. This leads to faster suggestions and lower token usage.

The assistant effectively "remembers" the code it has seen, which makes interactions more fluid and natural. Bedrock's caching allows the assistant to maintain state over multiple interactions without repeated cost.

For programming environments, the caching can be particularly valuable when dealing with large files or complex projects. A developer might want to ask multiple questions about the same file or function, and with caching, the assistant doesn't need to reanalyze the entire codebase for each query. This can lead to significantly faster response times and a more seamless development workflow.

Large document processing

One of the most compelling use cases is any scenario involving large documents or texts that the model must analyze or answer questions about. Examples include:

  • Document QA systems (asking questions about PDFs)
  • Summarization or analysis of research papers
  • "Chat with a book" style services
  • Contract analysis tools

These often involve feeding very long text (tens of thousands of tokens) into the model. Prompt caching lets you embed the entire document in the prompt once, cache it, and then ask multiple questions without reprocessing the full text every time.

In Amazon Bedrock, you would send the document with a cachePoint after it on the first request. The model will cache the document's representation. Subsequent requests can all reuse the cached content.

This drastically cuts down response time since the model is essentially doing a lookup for the document content from cache, and it reduces cost because those thousands of tokens are billed at the cheap cache-read rate.

In practical terms, caching enables real-time interactive querying of large texts. Users can get answers almost immediately even if the document is huge, because the expensive step of reading the document was done only once.

Consider a legal contract review application: A typical contract might be 30-50 pages, or roughly 15,000-25,000 tokens. Without caching, each question about the contract would require sending all those tokens again. With caching, the first question might take a few seconds to process, but subsequent questions could return answers in under a second, at a fraction of the cost. This makes the difference between a clunky, expensive tool and a responsive, cost-effective one.

Limitations and edge cases

While prompt caching offers significant benefits, it's important to understand its limitations and potential edge cases:

5-minute TTL constraint: The 5-minute time-to-live for cached content is a hard limit. If your application has longer periods of inactivity between related requests, the cache will expire, and you'll need to reprocess the full prompt. This can be particularly challenging for user-facing applications where users might take breaks or get distracted. To mitigate this, you might implement a background process that periodically "refreshes" important caches with dummy requests.
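As a rough sketch of that mitigation (all names here are hypothetical, and whether the extra cache-read spend is worth it depends on how likely the user is to come back within a few minutes):

```python
import threading
import time

CACHE_TTL_SECONDS = 5 * 60
REFRESH_MARGIN = 60  # refresh roughly a minute before the TTL would lapse

class CacheKeepAlive:
    """Re-issues a tiny request that reuses the cached prefix before the TTL expires."""

    def __init__(self, send_cached_request):
        # send_cached_request is any callable that hits Bedrock with the cached
        # prefix plus a trivial question (hypothetical; wire up your own client call).
        self._send = send_cached_request
        self._last_hit = time.monotonic()
        self._stop = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()

    def record_hit(self):
        self._last_hit = time.monotonic()  # call this whenever a real request lands

    def stop(self):
        self._stop.set()

    def _run(self):
        while not self._stop.wait(15):  # check every 15 seconds
            idle = time.monotonic() - self._last_hit
            if idle >= CACHE_TTL_SECONDS - REFRESH_MARGIN:
                try:
                    self._send("Reply with OK.")  # tiny dummy turn, mostly cheap cache reads
                    self.record_hit()
                except Exception:
                    pass  # if the refresh fails, let the cache lapse and re-prime on demand
```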

Minimum token thresholds: As mentioned earlier, cache checkpoints can only be set after meeting model-specific minimum token counts. If your prompt is shorter than this threshold, caching won't work. This means caching is less beneficial for very short prompts or the beginning of conversations.

Identical prefix requirement: Caching only works if the exact same prefix is reused. Even minor changes or edits to the cached content will result in a cache miss. This can be problematic for applications that need to make small updates to otherwise stable context.

Debugging complexity: When prompt caching isn't working as expected, debugging can be challenging. The cache mechanisms are largely opaque, and you're limited to the usage metrics in the response to determine if caching is working. This can make troubleshooting production issues more difficult.

Regional availability: Prompt caching might not be available in all AWS regions where Bedrock is offered, which could affect your application's architecture if you need multi-region deployment.

Cross-session limitations: Cache is not shared across different sessions or users. Each conversation or session has its own independent cache, so you can't benefit from caching across different users even if they're accessing the same content.

Despite these limitations, the benefits of prompt caching typically outweigh the drawbacks for most use cases. Being aware of these constraints will help you design more robust applications that can handle edge cases gracefully.

Architecture considerations

When implementing prompt caching in production systems, there are several architectural considerations to keep in mind:

Integration with existing AWS services: Prompt caching works well within the broader AWS ecosystem. You can trigger Bedrock calls with cached prompts from Lambda functions, Step Functions workflows, or EC2 instances. For high-throughput applications, consider using SQS to queue requests and process them with appropriate caching strategies.

Cache warming strategies: For applications with predictable usage patterns, you might implement "cache warming" by proactively sending requests that establish caches for commonly used content just before they're needed. For example, if you know users typically query certain documents during business hours, you could warm those caches at the start of the day.

Handling cache misses: Your architecture should gracefully handle cache misses, whether due to TTL expiration or first-time access. This might involve monitoring response times and token costs, and dynamically adjusting your application flow based on whether a cache hit occurred. Sometimes there's nothing you can do about a miss, but planning for that possibility is itself an architecture consideration.

Session management: Since caches are session-specific, your architecture needs to maintain session context effectively. This might involve storing conversation state in DynamoDB or another database and ensuring that related requests use the same session identifiers.

Redundancy planning: Since caching is ephemeral and can fail, your architecture should never depend critically on caching working perfectly. Always design with fallbacks that can handle uncached requests, even if they're slower or more expensive.

Monitoring and optimization: Implement Amazon CloudWatch metrics to track cache hit rates, cost savings, and latency improvements. This data can help you refine your caching strategy over time and identify opportunities for further optimization.
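As a sketch, you could publish the usage numbers from each response as custom metrics with the standard put_metric_data call (the namespace and metric names here are made up for illustration):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def publish_cache_metrics(usage: dict, model_id: str) -> None:
    """Push cache read/write token counts from a Bedrock response as custom metrics."""
    dimensions = [{"Name": "ModelId", "Value": model_id}]
    cloudwatch.put_metric_data(
        Namespace="MyApp/PromptCaching",  # hypothetical namespace
        MetricData=[
            {
                "MetricName": "CacheReadInputTokens",
                "Dimensions": dimensions,
                "Value": usage.get("cacheReadInputTokens", 0),
                "Unit": "Count",
            },
            {
                "MetricName": "CacheWriteInputTokens",
                "Dimensions": dimensions,
                "Value": usage.get("cacheWriteInputTokens", 0),
                "Unit": "Count",
            },
            {
                "MetricName": "UncachedInputTokens",
                "Dimensions": dimensions,
                "Value": usage.get("inputTokens", 0),
                "Unit": "Count",
            },
        ],
    )
```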

A well-designed architecture that leverages prompt caching effectively can lead to significant cost savings while maintaining or improving performance. The key is to integrate caching into your application flow in a way that's resilient to cache misses and expiration.

Implementing prompt caching: best practices

Now that we've covered the what, why, and how of prompt caching, let's talk about some best practices to get the most out of it:

Structure your prompts with caching in mind: Place static content (like instructions, reference material, or documents) early in the prompt, followed by a cache checkpoint, then the dynamic content (like the specific question). This makes the static portion cacheable. For example, in a document QA system, place the document text first, followed by a cache checkpoint, then the user's question.

Place checkpoints strategically: Put cache checkpoints at logical boundaries (right after large reference material or system prompts) once the minimum token count is met. For multi-turn conversations, insert checkpoints after turns where significant context was added. Remember that different models have different minimum token requirements for checkpoints, so design accordingly.

Monitor your cache usage: Check the cache read/write metrics to verify your cache checkpoints are working as expected. If you're not seeing tokens in the cache read counter, your cache might be expiring or not being set correctly. Add logging in your application to track these metrics over time and alert on unexpected patterns.

Remember the 5-minute TTL: If user interactions might pause for longer than 5 minutes, you'll need to re-send the context to re-prime the cache. Consider building a mechanism to detect when cache refreshes are needed, perhaps by tracking the timestamp of the last cache hit and proactively refreshing if approaching the TTL limit.

Test different cache placements: Experiment with where you place cache checkpoints and monitor the effects on token usage and latency. The optimal configuration may vary depending on your specific use case. A/B testing different caching strategies can help identify the most effective approach for your application.

Use multiple checkpoints when appropriate: For complex prompts with multiple distinct sections, consider using multiple cache checkpoints. This can be particularly useful for conversations that evolve over time, allowing you to cache different segments independently. For example, in a document QA system, you might have one checkpoint after the document and another after a summary or analysis. This also makes conversational branching easy to implement: spinning off several conversations from a specific point in an existing one. Save each branching point as a checkpoint, and you won't need to reprocess the shared context that precedes the branch.

Consider model-specific limitations: Different models support different numbers of cache checkpoints and have different minimum token requirements. Design your caching strategy with your specific model's constraints in mind. For instance, if you're using a model that only supports a single checkpoint, prioritize caching the largest or most expensive part of your prompt.

By following these practices, you can maximize the cost savings and performance benefits of prompt caching while avoiding common pitfalls.

Next steps

If you're ready to implement prompt caching in your own applications, here are some next steps to consider:

  1. Check if your models support prompt caching on Amazon Bedrock. Not all models have this capability yet, and availability may vary by region. Here is the official documentation.
  2. Analyze your current prompts to identify where caching would provide the most benefit. Look for repeated context or large prompt segments that are used across multiple requests.
  3. Implement a proof-of-concept using the examples from this article and measure the performance and cost improvements. Start with a simple use case and expand as you gain confidence in the technique.
  4. Integrate prompt caching into your broader observability and monitoring systems to track its effectiveness over time. Monitor metrics like cache hit rates, response times, and token costs to ensure your implementation is working as expected.
  5. Stay updated on the latest developments in Amazon Bedrock and prompt caching. The capabilities and pricing may evolve over time, potentially offering even more optimization opportunities.

For a more comprehensive approach to optimizing LLM applications with Claude and Amazon Bedrock, check out our LLMOps for Claude guide. It covers not just prompt caching but a full suite of best practices for building, deploying, and optimizing LLM-powered applications.

Prompt caching represents one of those rare optimizations that improves both cost and performance simultaneously. By understanding how it works and implementing it correctly, you can build more responsive, more cost-effective AI experiences for your users. The technical implementation isn't particularly complex, but the impact on your application's economics and user experience can be substantial. If you're using LLMs in production, especially with repeated context or large prompts, prompt caching should definitely be in your optimization toolkit.

Generative AI & LLMOps
Guille Ojeda

Guille Ojeda is a Software Architect at Caylent and a content creator. He has published 2 books, over 100 blogs, and writes a free newsletter called Simple AWS, with over 45,000 subscribers. Guille has been a developer, tech lead, cloud engineer, cloud architect, and AWS Authorized Instructor and has worked with startups, SMBs and big corporations. Now, Guille is focused on sharing that experience with others.
