2025 GenAI Whitepaper

Amazon Bedrock Pricing Explained

Generative AI & LLMOps

Explore Amazon Bedrock's intricate pricing, covering on-demand usage, provisioned throughput, fine-tuning, and custom model hosting to help leaders forecast and optimize costs.

Why Pricing Clarity Matters

Conversations about Amazon Bedrock often emphasize how quickly you can integrate generative AI into your applications. Yet behind the convenience lies a pricing structure that can surprise companies that jump in without a thorough understanding. Simply relying on guesswork about tokens, compute hours, or storage can produce AWS bills higher than anticipated.

That’s where a deeper understanding of the pricing comes in. Bedrock’s pricing structure spans on-demand usage, dedicated capacity, fine-tuning, custom model hosting, and more, each with its own rates. If you’re an engineering manager or architect tasked with controlling costs, you need a detailed picture of how these charges accrue in order to plan effectively.

In the sections that follow, we’ll explore every major facet of Bedrock’s pricing, referencing real numbers from the latest (February 2025) official AWS documentation. Rather than glossing over token rates or speaking in vague cost statements, we’ll show exactly what your usage might look like in practice. By the end of this article, you’ll see how to forecast your monthly bill, whether you’re prototyping a chatbot, spinning up fine-tuned image models, or orchestrating a multi-agent solution with knowledge bases and custom model imports.

High-Level Overview of Bedrock Pricing Components

At its core, Amazon Bedrock offers two major approaches to inference (the process of running prompts against a foundation model): on-demand (which includes batch) and provisioned throughput. The first charges you only for the tokens or images you process, while the second reserves a certain capacity (model units) for a fixed hourly price. On top of that, you’ll see additional charges if you customize models, either by training them on your own data or by importing a model that you’ve trained elsewhere.

On-Demand suits teams that can’t accurately predict usage or that expect sporadic surges, because there’s no contract or hourly minimum. By contrast, Provisioned Throughput is a better fit for workloads that are both steady and substantial, offering guaranteed performance and predictable costs.

Yet there’s more to it. Various extra charges may appear, such as data transfer fees, storage for your custom models, or evaluation tasks that run on your behalf. This overview sets the stage for a more in-depth breakdown of each cost category.

On-Demand Usage: Tokens, Images, and Batch

On-demand pricing lets you pay strictly for what you process, whether that means tokens of text or generated images. If your usage is modest or unpredictable, on-demand can keep you from overcommitting resources and paying for capacity you don’t need. But “pay as you go” can add up quickly if you run frequent or large-scale inference jobs. Let’s walk through the specific on-demand charges.

Text and Embedding Models

When you invoke a text-based foundation model, Bedrock measures both your input tokens (the text prompt) and your output tokens (the model’s generated response). A thousand tokens might correspond to a few paragraphs of text, though the exact ratio depends on the model’s tokenizer and the content itself. Different providers charge different rates, so you should match the model’s cost structure to your specific application.

For example, Amazon Nova Micro (a smaller text generation model) charges $0.000035 per 1,000 input tokens and $0.00014 per 1,000 output tokens when used on-demand. If you cached a portion of the prompt for five minutes, the cached tokens cost $0.00000875 per 1,000 for input, which can produce significant savings for repeated context or instructions. Meanwhile, AI21 Labs’ Jamba 1.5 Large sets a rate of $0.002 per 1,000 input tokens and $0.008 per 1,000 output tokens. If you’re dealing with thousands of requests per day, these small figures accumulate quickly.
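
As a rough sketch of how these per-token rates turn into dollars, the helper below multiplies token counts by the Nova Micro rates quoted above. The daily token volumes are made-up assumptions, and the rates should always be rechecked against current AWS pricing.

```python
# Minimal sketch: estimating on-demand text generation cost from per-1K-token rates.
# The rates are the Amazon Nova Micro figures quoted above; the token volumes
# are illustrative assumptions.

def on_demand_cost(input_tokens: int, output_tokens: int,
                   input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    """Return the USD cost of processing the given token counts on demand."""
    return (input_tokens / 1000) * input_rate_per_1k \
         + (output_tokens / 1000) * output_rate_per_1k

daily_cost = on_demand_cost(
    input_tokens=10_000_000,        # assumed 10M prompt tokens per day
    output_tokens=2_000_000,        # assumed 2M generated tokens per day
    input_rate_per_1k=0.000035,     # Nova Micro input rate
    output_rate_per_1k=0.00014,     # Nova Micro output rate
)
print(f"Estimated daily cost: ${daily_cost:.2f}")   # ~$0.63
```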

For embedding models, the cost typically applies only to input tokens because you’re passing text in for vector representation, not generating new text out. For instance, Amazon Titan Text Embeddings costs $0.0001 per 1,000 input tokens in on-demand mode, while Amazon Titan Multimodal Embeddings runs $0.0008 per 1,000 input tokens or $0.00006 per input image. If you switch to batch mode for the Titan embeddings, the cost drops to $0.0004 per 1,000 input tokens and $0.00003 per input image.

Image and Video Generation

Bedrock also supports generative image and video models, each priced by the piece (image) or by the second (video). For example, Amazon Nova Canvas generates up to 1024×1024 images at a rate of $0.04 per Standard-quality image or $0.06 per Premium-quality image. If you need 2048×2048 images, each generation will cost $0.06 (Standard) or $0.08 (Premium). For video, Amazon Nova Reel at 720p resolution costs $0.08 per second of generated footage.

If you’re using Stability AI’s Stable Diffusion 3.5 Large (on-demand), you’re looking at $0.08 per image. Some older versions (like SDXL 1.0) might cost $0.04 for a smaller resolution image at standard quality. Pay attention to any step count or resolution thresholds because exceeding them may shift you into a higher price tier.

Batch Processing and Discounts

When you know you’ll process large volumes of prompts or images in a single, scheduled run, batch inference can lower your per-token or per-image rate. Amazon Nova Micro text generation, for example, charges $0.0000175 per 1,000 input tokens and $0.00007 per 1,000 output tokens in batch mode, roughly half its on-demand cost.

Anthropic’s Claude 3.5 Haiku also offers batch discounts. If you process input tokens in large batches, your rate is $0.0005 per 1,000 input tokens (down from $0.0008 in pure on-demand). That means you save money when you group together tasks, such as summarizing an entire day’s worth of customer support transcripts. The trade-off is that real-time latencies are higher because batch jobs typically run asynchronously.

Provisioned Throughput (Dedicated Capacity)

Teams that handle steady or high-volume inference often opt for Provisioned Throughput. Here, you purchase model units for a specific base or custom model. Each model unit guarantees a certain token throughput per minute (or images/second, depending on the model). You’re billed hourly, whether you fully utilize that capacity or not.

No-Commit vs. 1-Month vs. 6-Month Commitments

For some models, you can provision capacity on a no-commit basis, paying an hourly rate that you can stop at any time. For example, Amazon Titan Text Lite (base or custom) charges $7.10 per hour per model with no commitment. If you commit to one month, the rate drops to $6.40 per hour, and a six-month commitment lowers it further to $5.10. Anthropic’s Claude Instant runs $44.00 per hour with no commitment, while a six-month commitment can go as low as $22.00 per hour per model unit.
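
To see what these hourly rates mean over a full month, here is a minimal sketch comparing the commitment tiers for a single model unit running around the clock. The rates are the Titan Text Lite figures quoted above and will differ for other models and Regions.

```python
# Minimal sketch: monthly cost of one provisioned model unit under each
# commitment tier. Hourly rates are the Titan Text Lite figures quoted above.

HOURS_PER_MONTH = 24 * 30   # assumes the unit stays provisioned all month

titan_text_lite_hourly = {   # USD per hour per model unit
    "no commitment": 7.10,
    "1-month commitment": 6.40,
    "6-month commitment": 5.10,
}

for plan, rate in titan_text_lite_hourly.items():
    print(f"{plan}: ${rate * HOURS_PER_MONTH:,.2f} per month")
# no commitment: $5,112.00, 1-month: $4,608.00, 6-month: $3,672.00
```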

Selecting the right plan depends on usage stability. If you have a new product with uncertain usage, you might start with a no-commit plan. But for an established application with predictable daily traffic, a monthly or half-year commitment can yield significant savings.

Mandatory for Fine-Tuned or Custom Models

Bedrock’s multi-tenant environment hosts popular base models in an on-demand pool. However, once you fine-tune a model or import a custom one, you typically need dedicated capacity. For instance, the official pricing for a customized Amazon Titan Image Generator v2 is $23.40 per hour per model with no commitment, $21.00 per hour with a one-month term, and $16.85 per hour with a six-month term. Because custom models are unique to your account, AWS can’t share them with other users, so you pay for the private capacity allocated to run them.

Fine-Tuning: Costs, Storage, and Throughput

Training or fine-tuning a model is a separate cost line. For text-based models, you’ll pay a per-1,000-tokens-trained rate. For image-based or multimodal models, it might be per image processed. You’ll also pay a monthly storage fee for the resulting custom model, plus the hourly rate for any inference you run once it’s deployed.

Token-Based Example

Take Amazon Titan Text Lite as an example: the cost to train 1,000 tokens is $0.0004, and storing the custom model each month is $1.95. Once you’ve fine-tuned it, you can’t run it in on-demand mode; you must spin it up under the provisioned throughput plan, starting at $7.10 per hour for no commitment. If your training corpus has 1 million tokens and you run five epochs, you’d process 5 million tokens total. At $0.0004 per 1,000 tokens, that training job would cost around $2 (5 million tokens / 1,000 × 0.0004 = 2).
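
The training-cost arithmetic above generalizes to any token-based fine-tuning job. The sketch below uses the Titan Text Lite rates just quoted; the corpus size and epoch count are the same illustrative assumptions.

```python
# Minimal sketch: one-time fine-tuning cost plus monthly storage.
# Rates are the Titan Text Lite figures quoted above; swap in the rates
# for whichever model you actually customize.

def fine_tuning_cost(corpus_tokens: int, epochs: int, rate_per_1k_tokens: float) -> float:
    """Return the one-time training cost in USD (tokens trained = corpus x epochs)."""
    tokens_trained = corpus_tokens * epochs
    return (tokens_trained / 1000) * rate_per_1k_tokens

training = fine_tuning_cost(corpus_tokens=1_000_000, epochs=5, rate_per_1k_tokens=0.0004)
storage_per_month = 1.95   # custom model storage fee quoted above
print(f"Training: ${training:.2f} one-time, storage: ${storage_per_month:.2f}/month")
# Training: $2.00 one-time, storage: $1.95/month
```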

Although $2 for training sounds low, you might find that a large-scale dataset or more complex model raises the cost significantly. For instance, Amazon Nova Pro charges $0.008 per 1,000 tokens. That same 5 million-token job would be $40 to train. The monthly storage for the custom model remains $1.95, but the bigger expense usually comes from the subsequent inference hours once you run that model in production.

Image-Based or Embeddings Fine-Tuning

If you customize the Amazon Titan Image Generator, each image used in training is $0.005, plus the same $1.95 monthly storage. Inference for that custom version is $23.40 per hour at no commitment. If you commit to a month or six months, your hourly rate will drop. It’s important to estimate how many images you plan to train on and how large your concurrency might be once your model goes live.

Distillation

Bedrock Model Distillation can lower inference costs in the long run by producing smaller student models, but you pay for the synthetic data generation from a teacher model and the fine-tuning step. If you plan to run a huge teacher model at on-demand rates for many tokens to create synthetic data, that can be expensive. However, the final student model might cost less per hour to run once you place it on provisioned throughput.

Custom Model Import: BYO Models and On-Demand Copies

Another path to a custom model is Custom Model Import, where you bring your own pre-trained weights (from open-source, internal projects, or external ML pipelines) into Bedrock. There’s no fee to import the model, but once it’s active, your costs come from how many copies of that model you spin up and for how long. Bedrock measures this in Custom Model Units (CMUs), priced at $0.0785 per CMU per minute in 5-minute increments.

If your model requires, say, 2 CMUs, running one copy of it for five minutes would cost $0.0785 × 2 CMUs × 5 minutes = $0.785. It might seem small on paper, but keep in mind that concurrency can drive multiple copies. Also, large parameter counts or big context windows might require more CMUs per copy. For instance, a Llama 3.1 70B model with a 128K context might need 8 CMUs. That can multiply your per-minute rate significantly.
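
Here is a minimal sketch of that billing model, assuming the $0.0785 per-CMU-minute rate and the 8-CMU sizing for a 70B model quoted above, with usage rounded up to 5-minute increments.

```python
# Minimal sketch: Custom Model Import inference cost (per CMU, per minute,
# billed in 5-minute increments). Rate and CMU counts are the figures quoted above.
import math

CMU_RATE_PER_MINUTE = 0.0785   # USD per Custom Model Unit per minute

def import_inference_cost(cmus_per_copy: int, copies: int, active_minutes: float) -> float:
    billed_minutes = math.ceil(active_minutes / 5) * 5   # round up to a 5-minute increment
    return cmus_per_copy * copies * billed_minutes * CMU_RATE_PER_MINUTE

# One copy of a 2-CMU model active for five minutes
print(f"${import_inference_cost(cmus_per_copy=2, copies=1, active_minutes=5):.3f}")   # $0.785
# Two copies of an 8-CMU model (e.g. Llama 3.1 70B) active for one hour
print(f"${import_inference_cost(cmus_per_copy=8, copies=2, active_minutes=60):.2f}")  # $75.36
```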

When the model sees no traffic for five minutes, Bedrock automatically scales your copies down to zero, stopping charges. The main trade-off is that your first request after a period of inactivity can experience a “cold start.” If you need guaranteed sub-second latency, you might keep at least one copy always running, which incurs a steady cost.

Other Potential Charges (Storage, Data, Rerank, Knowledge Bases)

In addition to inference, customization, and import fees, you might see a few more line items on your monthly statement.

Data Transfer: If your application continuously pulls data from multiple AWS Regions, you could accrue data egress charges. On the other hand, if you’re set up in the same Region as your data sources, you avoid these fees.

Storage: Beyond storing your fine-tuned models, some usage patterns rely on Knowledge Bases (for RAG) or big volumes of text or vector indexes. If you store those embeddings in OpenSearch or rely on Bedrock’s vector store, you’ll pay for that capacity.

Rerank: Certain models (like Cohere Rerank 3.5 or Amazon-rerank-v1.0) charge by query. Amazon-rerank-v1.0 is $1.00 per 1,000 queries, where each query can cover up to 100 document chunks. If you exceed that, each additional set of up to 100 chunks counts as another query, compounding the cost (see the sketch after these items).

Long Context Windows: Some models support longer context windows. Processing bigger prompts might raise your token consumption drastically, so watch for usage spikes that come from large input contexts.
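
As a minimal sketch of the rerank charges described above, the helper below rounds each request up to the number of 100-chunk queries it is billed as; the request volumes are illustrative assumptions.

```python
# Minimal sketch: Amazon-rerank-v1.0 billing ($1.00 per 1,000 queries, where a
# request spanning more than 100 document chunks is billed as multiple queries).
import math

RERANK_PRICE_PER_1K_QUERIES = 1.00

def rerank_cost(requests: int, chunks_per_request: int) -> float:
    billed_queries = requests * math.ceil(chunks_per_request / 100)
    return (billed_queries / 1000) * RERANK_PRICE_PER_1K_QUERIES

print(rerank_cost(requests=50_000, chunks_per_request=100))   # $50.00
print(rerank_cost(requests=50_000, chunks_per_request=250))   # 3 queries each -> $150.00
```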

Edge Cases and Special Considerations

Engineers often overlook a few unique details that influence cost:

Latency-Optimized Inference

Anthropic’s Claude 3.5 Haiku, for example, can run in a latency-optimized mode at $0.001 per 1,000 input tokens and $0.005 per 1,000 output tokens in certain Regions like US East (N. Virginia). While you gain faster responses, the token cost is higher than standard. If you run chatbots with real-time SLAs, it might be worthwhile. However, if your application tolerates slower responses, standard pricing could save a considerable sum.

Multi-Region Batch

You can theoretically run batch jobs in multiple Regions to reduce overall completion time or route requests closer to data sources. Each Region charges its own rate. If you store your data in US West (Oregon) but process it in US East (N. Virginia), you will incur cross-region data transfer plus the local inference costs.

Higher Parameter Models

Large models will likely cost more per token or per hour. For example, Claude 3 Opus on-demand is $0.015 per 1,000 input tokens and $0.075 per 1,000 output tokens, noticeably higher than smaller variants. If you need advanced language reasoning from huge models, keep an eye on how quickly token usage ramps up.

Model Distillation Overhead

If you plan to distill your model frequently or generate synthetic data from a high-end “teacher” model, you might pay a premium on the teacher’s token rates, even if your final distilled model runs cheaply. Plan that pipeline carefully to avoid repeated training cycles that inflate your monthly bill.

A Complete Bedrock Pricing Example

To illustrate how costs stack up in an end-to-end solution, let’s imagine a fictional organization called “Skyline Analytics.” Skyline builds a generative AI assistant that handles support tickets, produces marketing images, and offers real-time data retrieval through an internal knowledge base. They want to integrate everything within Amazon Bedrock.

1. Text Chat On-Demand

  • Skyline’s assistant uses Amazon Nova Lite for text generation in real time. On a typical day, the application processes 100,000 user queries, each with a 300-token prompt and a 200-token response.
  • Input tokens per day: 100,000 × 300 = 30,000,000 tokens (30,000 thousand-token units).
  • Output tokens per day: 100,000 × 200 = 20,000,000 tokens (20,000 thousand-token units).
  • Amazon Nova Lite on-demand: $0.00006 per 1,000 input tokens and $0.00024 per 1,000 output tokens.
  • Daily text generation cost:

Input: (30,000 × 0.00006) = $1.80

Output: (20,000 × 0.00024) = $4.80

Total: $6.60 per day, or ~$200 per month for text chat alone.

2. Daily Batch Summaries

  • Overnight, Skyline processes a large set of transcripts (150 million tokens input, 5,000,000 tokens output) to summarize customer interactions. They use the batch mode of the same Amazon Nova Lite.
  • Nova Lite batch rate: $0.00003 per 1,000 input tokens, $0.00012 per 1,000 output tokens.
  • Batch cost:

Input: (150,000 × 0.00003) = $4.50

Output: (5,000 × 0.00012) = $0.60

Total: $5.10 per night, or about $153.00 monthly.

3. Image Generation for Marketing

  • Skyline uses Amazon Titan Image Generator v2 at standard resolution (512×512 or smaller) to create product marketing images. They generate around 10,000 images monthly, mostly standard quality.
  • On-demand rate: $0.008 per image (standard).
  • Monthly cost: 10,000 × $0.008 = $80.00

4. Agents and Knowledge Bases

  • Their AI assistant uses Bedrock Agents for a multi-step workflow that retrieves relevant data from a Knowledge Base. The Knowledge Base indexes about 100,000 documents of unstructured text (several million tokens), stored in an OpenSearch vector index. They pay the underlying cost for that index, but let’s assume it’s negligible for now.
  • Each Agent call is effectively an invocation of the text model plus a Knowledge Base retrieval step. The text model cost is already accounted for in the daily usage. However, if they add rerank steps or advanced retrieval, additional fees apply, such as Amazon-rerank-v1.0 at $1.00 per 1,000 queries. Suppose the system does 50,000 rerank queries per month; that’s an additional $50 monthly.
  • This cost is overshadowed by text tokens, but it’s a line item worth noting.

5. Custom Fine-Tuned Model

  • Skyline decides to fine-tune Amazon Titan Text Express with proprietary data. They have a training corpus of 20 million tokens, repeated across 3 epochs, for a total of 60 million tokens trained. Titan Text Express charges $0.008 per 1,000 tokens for training, storing the custom model at $1.95 monthly.
  • Training cost: (60,000,000 / 1,000) × 0.008 = $480
  • Provisioned Throughput: After fine-tuning, Skyline invests in a single model unit, no commitment, at $20.50 per hour. If they run it continuously for 30 days, that’s 720 hours × $20.50 = $14,760 per month. This is far higher than the on-demand text usage.
  • They realize that for consistent usage, a monthly commitment at $18.40 per hour (saving $2.10/hr) might be better. Over 720 hours it’s $13,248 a month, still not trivial. If usage remains stable and they want an even bigger discount, the six-month rate is $14.80 per hour, for $10,656 per month.

6. Custom Model Import

  • They also plan to import a specialized open-source Llama 3.1 Instruct (70B) model. According to documentation, that model might need 8 CMUs. Suppose Skyline typically keeps 2 copies running to handle concurrency. Each CMU is $0.0785 per minute. Two copies means 16 CMUs total. Over an hour, that’s 16 × $0.0785 × 60 = $75.36.
  • If Skyline only needs those copies active for 6 hours per day (peak usage), the daily cost is $75.36 × 6 = $452.16. Over a 30-day month, that’s $13,564.80. If no requests come in, the system auto-scales down after 5 minutes of inactivity. But if traffic remains steady, they'll see persistent charges.

When you tally these examples, Skyline’s final bill can range from a modest $400-$500 monthly (if they rely primarily on on-demand for small text or image tasks) to well over $10,000 once they incorporate large custom or fine-tuned models on dedicated throughput. Those stark differences highlight the importance of understanding Bedrock's pricing structure and carefully planning usage patterns, concurrency, and the actual need for specific models.
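
To make the tally concrete, here is a minimal sketch that adds up the Skyline components above. Every figure is the illustrative rate used in this example rather than a quote of current AWS pricing, and the one-time fine-tuning charge is kept separate from the recurring monthly costs.

```python
# Minimal sketch: tallying Skyline's illustrative monthly Bedrock bill.
# All figures come from the worked example above and are assumptions.

monthly_costs = {
    "real-time text chat (Nova Lite)":       6.60 * 30,                      # ~$198
    "nightly batch summaries":               5.10 * 30,                      # $153
    "marketing images (Titan v2)":           10_000 * 0.008,                 # $80
    "rerank queries":                        (50_000 / 1000) * 1.00,         # $50
    "fine-tuned model hosting (no commit)":  720 * 20.50,                    # $14,760
    "custom import hosting (6 h/day)":       8 * 2 * 0.0785 * 60 * 6 * 30,   # $13,564.80
    "custom model storage":                  1.95,
}
one_time_costs = {"fine-tuning job (60M tokens)": 480.00}

on_demand_only = sum(v for k, v in monthly_costs.items()
                     if "hosting" not in k and "storage" not in k)
print(f"On-demand components only: ${on_demand_only:,.2f}/month")              # ~$481
print(f"Full stack:                ${sum(monthly_costs.values()):,.2f}/month")  # ~$28,808
print(f"One-time training:         ${sum(one_time_costs.values()):,.2f}")
```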

Strategies for Cost Optimization

To keep costs in line, you have a few straightforward levers:

1. Carefully Evaluate Model Size and Type

A smaller model like Nova Micro or Titan Text Lite might provide enough accuracy for many tasks at a fraction of the cost. Resist the temptation to always default to the largest parameter count or “most powerful” model.

2. Use Batch Mode When Possible

If your use case tolerates offline processing, consider collecting prompts into a single job to take advantage of batch discounts. For most models, this roughly halves your per-token rate for text or embeddings.

3. Take Advantage of Prompt Caching

For repeated instructions or context that remain stable, caching can cut your input token cost by up to 90%. This is particularly beneficial if your application includes large static prompts in every request.
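
As a minimal sketch of the savings, the helper below blends the standard and cached input rates quoted earlier for Nova Micro; the cache-hit fraction is an assumption, and the size of the cached-token discount varies by model.

```python
# Minimal sketch: effective input-token rate when part of each prompt is cached.
# Rates are the Nova Micro figures quoted earlier; the cached fraction is an assumption.

def blended_input_rate(cached_fraction: float,
                       standard_rate_per_1k: float = 0.000035,
                       cached_rate_per_1k: float = 0.00000875) -> float:
    return cached_fraction * cached_rate_per_1k + (1 - cached_fraction) * standard_rate_per_1k

baseline = blended_input_rate(0.0)
with_cache = blended_input_rate(0.8)   # assume 80% of each prompt is a stable, cached prefix
print(f"Input-cost reduction: {1 - with_cache / baseline:.0%}")   # 60%
```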

4. Assess Your Provisioned Throughput Commitments

Once you find a stable traffic pattern, consider 1-month or 6-month commitments. For example, dropping from $20.50 to $14.80 an hour for Titan Text Express saves thousands over a month.

5. Use Distillation Strategically

Generating synthetic data from a teacher model can be expensive, but the resulting smaller student model might run at a cheaper inference rate. Evaluate whether the upfront training cost justifies the lower runtime bill, and consider the time to ROI on this investment.

Common Mistakes and Pitfalls

Across many deployments, a few patterns repeatedly lead to overspending:

  • Overprovisioning: Reserving more model units than your traffic genuinely requires will rack up unused capacity costs.
  • Forgetting Data Transfer: If you do cross-region calls or stream large data sets, egress charges can quickly dwarf token usage expenses.
  • Excessive Fine-Tuning: Some teams fine-tune large models several times a month. If you aren’t seeing proportional gains in accuracy or user satisfaction, that’s wasted budget.
  • Neglecting Cache Opportunities: Repeatedly including the same prompt context in every request, without using caching, leads to inflated token counts.

Keep an eye out for these situations and you can save thousands of dollars on your AWS bill.

Conclusion

Amazon Bedrock’s pricing landscape is complex. We split it into the following categories: on-demand tokens, provisioned throughput, fine-tuning, and custom imports. This should give you a clearer picture of where your costs come from, but the key to reducing costs is still finding the right configuration for your use case. Small or unpredictable workloads often favor on-demand, while large, stable workloads are cheaper under provisioned throughput. For organizations that need advanced domain adaptation, fine-tuned or custom models can deliver better performance, but make sure those gains justify the additional cost.

No single configuration fits every use case, but thorough planning, robust testing, and ongoing monitoring of usage patterns are the cornerstones of a sustainable and cost-effective AI deployment. The better you understand these pricing levers, the more control you have over your costs.

Guille Ojeda

Guille Ojeda is a Software Architect at Caylent and a content creator. He has published 2 books, over 100 blogs, and writes a free newsletter called Simple AWS, with over 45,000 subscribers. Guille has been a developer, tech lead, cloud engineer, cloud architect, and AWS Authorized Instructor and has worked with startups, SMBs and big corporations. Now, Guille is focused on sharing that experience with others.
