Explore Amazon Bedrock's intricate pricing, covering on-demand usage, provisioned throughput, fine-tuning, and custom model hosting to help leaders forecast and optimize costs.
Conversations about Amazon Bedrock often emphasize how quickly you can integrate generative AI into your applications. Yet behind the convenience lies a pricing structure that can surprise companies that jump in without a thorough understanding. Simply relying on guesswork about tokens, compute hours, or storage can produce AWS bills higher than anticipated.
That’s where a deeper level of pricing comprehension comes in. Bedrock’s pricing structure spans on-demand usage, dedicated capacity, fine-tuning, custom model hosting, and more, each with its own rates. If you’re an engineering manager or architect tasked with controlling costs, you need detail-oriented insight into how to plan effectively.
In the sections that follow, we’ll explore every major facet of Bedrock’s pricing, referencing real numbers from the latest (February 2025) official AWS documentation. Rather than glossing over token rates or speaking in vague cost statements, we’ll show exactly what your usage might look like in practice. By the end of this article, you’ll see how to forecast your monthly bill, whether you’re prototyping a chatbot, spinning up fine-tuned image models, or orchestrating a multi-agent solution with knowledge bases and custom model imports.
At its core, Amazon Bedrock offers two major approaches to inference (the process of running prompts against a foundation model): on-demand (which includes batch) and provisioned throughput. The first charges you only for the tokens or images you process, while the second reserves a certain capacity (model units) at a fixed hourly price. On top of that, you’ll see additional charges if you customize models, either by training them on your own data or by importing a model you’ve trained elsewhere.
On-Demand suits teams that can’t accurately predict usage or that expect sporadic surges, because there’s no contract or hourly minimum. By contrast, Provisioned Throughput is a better fit for workloads that are both steady and substantial, offering guaranteed performance and predictable costs.
Yet there’s more to it. Various extra charges may appear, such as data transfer fees, storage for your custom models, or evaluation tasks that run on your behalf. This overview sets the stage for a more in-depth breakdown of each cost category.
On-demand pricing lets you pay strictly for what you process, whether that means tokens of text or generated images. If your usage is modest or unpredictable, on-demand can keep you from overcommitting resources and paying for capacity you don’t need. But “pay as you go” can add up quickly if you run frequent or large-scale inference jobs. Let’s look at the specific on-demand charges.
When you invoke a text-based foundation model, Bedrock measures both your input tokens (the text prompt) and your output tokens (the model’s generated response). A thousand tokens might correspond to a few paragraphs, though the exact relationship depends on factors like whitespace and punctuation. Different providers charge different rates, so you should match the model’s cost structure to your specific application.
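If you want to observe those token counts directly, Bedrock reports them with each response. Below is a minimal sketch using the boto3 Converse API; the model ID, Region, and per-1,000-token rates are illustrative assumptions you’d swap for your own.

```python
import boto3

# Assumes AWS credentials are configured and the chosen model is enabled in your account.
# Depending on the Region, Nova models may require an inference profile ID instead.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="amazon.nova-micro-v1:0",  # illustrative; use any text model you have access to
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy in two sentences."}]}],
)

usage = response["usage"]  # Bedrock returns inputTokens, outputTokens, totalTokens
input_rate, output_rate = 0.000035, 0.00014  # per-1,000-token rates for your model
cost = usage["inputTokens"] / 1000 * input_rate + usage["outputTokens"] / 1000 * output_rate
print(f"{usage['inputTokens']} in / {usage['outputTokens']} out, approx ${cost:.6f} for this call")
```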
For example, Amazon Nova Micro (a smaller text generation model) charges $0.000035 per 1,000 input tokens and $0.00014 per 1,000 output tokens when used on-demand. If you cached a portion of the prompt for five minutes, the cached tokens cost $0.00000875 per 1,000 for input, which can produce significant savings for repeated context or instructions. Meanwhile, AI21 Labs’ Jamba 1.5 Large sets a rate of $0.002 per 1,000 input tokens and $0.008 per 1,000 output tokens. If you’re dealing with thousands of requests per day, these small figures accumulate quickly.
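To see how those per-1,000-token rates add up, here’s a rough back-of-the-envelope projection using the on-demand prices quoted above. The daily request volume and token counts per request are made-up assumptions you’d replace with your own traffic estimates.

```python
# On-demand rates per 1,000 tokens quoted above (USD)
models = {
    "Amazon Nova Micro": {"input": 0.000035, "output": 0.00014},
    "AI21 Jamba 1.5 Large": {"input": 0.002, "output": 0.008},
}

# Assumed traffic: 5,000 requests/day, 1,200 input tokens and 400 output tokens each
requests_per_day, input_tokens, output_tokens = 5_000, 1_200, 400

for name, rates in models.items():
    per_request = input_tokens / 1000 * rates["input"] + output_tokens / 1000 * rates["output"]
    daily = per_request * requests_per_day
    print(f"{name}: ${daily:.2f}/day, ~${daily * 30:,.2f}/month")
```

The same workload that costs well under a dollar a day on Nova Micro runs to roughly $28 a day on Jamba 1.5 Large, which is why model selection is a cost decision, not just a quality one.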
For embedding models, the cost typically applies only to input tokens because you’re passing text in for vector representation, not generating new text out. For instance, Amazon Titan Text Embeddings costs $0.0001 per 1,000 input tokens in on-demand mode, while Amazon Titan Multimodal Embeddings runs $0.0008 per 1,000 input tokens or $0.00006 per input image. If you switch to batch mode for the Titan embeddings, the cost drops to $0.0004 per 1,000 input tokens and $0.00003 per input image.
Bedrock also supports generative image and video models, each priced by the piece (image) or by the second (video). For example, Amazon Nova Canvas generates up to 1024×1024 images at a rate of $0.04 per Standard-quality image or $0.06 per Premium-quality image. If you need 2048×2048 images, each generation will cost $0.06 (Standard) or $0.08 (Premium). For video, Amazon Nova Reel at 720p resolution costs $0.08 per second of generated footage.
If you’re using Stability AI’s Stable Diffusion 3.5 Large (on-demand), you’re looking at $0.08 per image. Some older versions (like SDXL 1.0) might cost $0.04 for a smaller resolution image at standard quality. Pay attention to any step count or resolution thresholds because exceeding them may shift you into a higher price tier.
When you know you’ll process large volumes of prompts or images in a single, scheduled run, batch inference can lower your per-token or per-image rate. Amazon Nova Micro text generation, for example, charges $0.0000175 per 1,000 input tokens and $0.00007 per 1,000 output tokens in batch mode, roughly half its on-demand cost.
Anthropic’s Claude 3.5 Haiku also offers batch discounts. If you process input tokens in large batches, your rate is $0.0005 per 1,000 input tokens (down from $0.0008 in pure on-demand). That means you save money when you group together tasks, such as summarizing an entire day’s worth of customer support transcripts. The trade-off is that real-time latencies are higher because batch jobs typically run asynchronously.
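As a concrete comparison, here’s a sketch contrasting the on-demand and batch rates quoted earlier for Amazon Nova Micro, applied to an assumed nightly summarization job; the token volumes are placeholders for your own workload.

```python
# Amazon Nova Micro rates per 1,000 tokens quoted earlier (USD)
on_demand = {"input": 0.000035, "output": 0.00014}
batch = {"input": 0.0000175, "output": 0.00007}

# Assumed nightly job: 50M input tokens of transcripts, 5M output tokens of summaries
input_tokens, output_tokens = 50_000_000, 5_000_000

def job_cost(rates):
    return input_tokens / 1000 * rates["input"] + output_tokens / 1000 * rates["output"]

print(f"On-demand: ${job_cost(on_demand):.2f}/night")
print(f"Batch:     ${job_cost(batch):.2f}/night")
print(f"Monthly savings from batching: ${(job_cost(on_demand) - job_cost(batch)) * 30:.2f}")
```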
Teams that handle steady or high-volume inference often opt for Provisioned Throughput. Here, you purchase model units for a specific base or custom model. Each model unit guarantees a certain token throughput per minute (or images/second, depending on the model). You’re billed hourly, whether you fully utilize that capacity or not.
For some models, you can provision capacity on a no-commit basis, paying an hourly rate that you can stop at any time. For example, Amazon Titan Text Lite (base or custom) charges $7.10 per hour per model with no commitment. If you commit to one month, the rate drops to $6.40 per hour, and a six-month commitment lowers it further to $5.10. Anthropic’s Claude Instant runs $44.00 per hour with no commitment, while a six-month commitment can go as low as $22.00 per hour per model unit.
Selecting the right plan depends on usage stability. If you have a new product with uncertain usage, you might start with a no-commit plan. But for an established application with predictable daily traffic, a monthly or half-year commitment can yield significant savings.
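To put those tiers in dollar terms, here’s a quick sketch comparing the Titan Text Lite hourly rates quoted above over a month of continuous use. The 730-hour month and single model unit are simplifying assumptions.

```python
HOURS_PER_MONTH = 730  # average hours in a month

# Titan Text Lite hourly rates per model unit quoted above (USD)
plans = {"no commitment": 7.10, "1-month commitment": 6.40, "6-month commitment": 5.10}

baseline = plans["no commitment"] * HOURS_PER_MONTH
for name, hourly in plans.items():
    monthly = hourly * HOURS_PER_MONTH
    print(f"{name}: ${monthly:,.2f}/month (saves ${baseline - monthly:,.2f} vs. no commitment)")
```

Remember that a commitment bills for the full term whether or not traffic keeps flowing, so the savings only materialize if your usage really is steady.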
Bedrock’s multi-tenant environment hosts popular base models in an on-demand pool. However, once you fine-tune a model or import a custom one, you typically need dedicated capacity. For example, AWS’s official pricing lists a customized Titan Image Generator v2 at $23.40 per hour per model with no commitment, $21.00 with a monthly term, or $16.85 with a six-month term. Because custom models are unique to your account, AWS can’t share them with other users, so you pay for the private capacity allocated to run them.
Training or fine-tuning a model is a separate cost line. For text-based models, you’ll pay a per-1,000-tokens-trained rate. For image-based or multimodal models, it might be per image processed. You’ll also pay a monthly storage fee for the resulting custom model, plus the hourly rate for any inference you run once it’s deployed.
Take Amazon Titan Text Lite as an example: the cost to train 1,000 tokens is $0.0004, and storing the custom model each month is $1.95. Once you’ve fine-tuned it, you can’t run it in on-demand mode; you must spin it up under the provisioned throughput plan, starting at $7.10 per hour for no commitment. If your training corpus has 1 million tokens and you run five epochs, you’d process 5 million tokens total. At $0.0004 per 1,000 tokens, that training job would cost around $2 (5 million tokens / 1,000 × 0.0004 = 2).
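That arithmetic generalizes to any token-priced customization job. The helper below is a sketch built on the Titan Text Lite figures from this example; the corpus size, epochs, and hours deployed are assumptions to swap for your own plan.

```python
def fine_tune_estimate(corpus_tokens, epochs, train_rate_per_1k,
                       storage_per_month, hourly_inference, hours_deployed):
    """Rough custom-model cost: training tokens + monthly storage + provisioned inference."""
    training = corpus_tokens * epochs / 1000 * train_rate_per_1k
    inference = hourly_inference * hours_deployed
    return training, storage_per_month, inference

training, storage, inference = fine_tune_estimate(
    corpus_tokens=1_000_000,   # assumed 1M-token training corpus
    epochs=5,
    train_rate_per_1k=0.0004,  # Titan Text Lite training rate
    storage_per_month=1.95,    # custom model storage per month
    hourly_inference=7.10,     # no-commit provisioned throughput rate
    hours_deployed=100,        # assumed hours the custom model runs this month
)
print(f"Training: ${training:.2f}, storage: ${storage:.2f}/month, inference: ${inference:.2f}")
```

Even in this small scenario, the provisioned inference hours ($710 for 100 hours) dwarf the $2 training run, which is the pattern described next.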
Although $2 for training sounds low, you might find that a large-scale dataset or more complex model raises the cost significantly. For instance, Amazon Nova Pro charges $0.008 per 1,000 tokens. That same 5 million-token job would be $40 to train. The monthly storage for the custom model remains $1.95, but the bigger expense usually comes from the subsequent inference hours once you run that model in production.
If you customize the Amazon Titan Image Generator, each image used in training is $0.005, plus the same $1.95 monthly storage. Inference for that custom version is $23.40 per hour at no commitment. If you commit to a month or six months, your hourly rate will drop. It’s important to estimate how many images you plan to train on and how large your concurrency might be once your model goes live.
Bedrock Model Distillation can lower inference costs in the long run by producing smaller student models, but you pay for the synthetic data generation from a teacher model and the fine-tuning step. If you plan to run a huge teacher model at on-demand rates for many tokens to create synthetic data, that can be expensive. However, the final student model might cost less per hour to run once you place it on provisioned throughput.
Another path to a custom model is Custom Model Import, where you bring your own pre-trained weights (from open-source, internal projects, or external ML pipelines) into Bedrock. There’s no fee to import the model, but once it’s active, your costs come from how many copies of that model you spin up and for how long. Bedrock measures this in Custom Model Units (CMUs), priced at $0.0785 per CMU per minute in 5-minute increments.
If your model requires, say, 2 CMUs, running one copy for a single 5-minute window costs $0.0785 × 2 CMUs × 5 minutes = $0.785. That might seem small on paper, but keep in mind that concurrency can drive multiple copies. Also, large parameter counts or big context windows might require more CMUs per copy. For instance, a Llama 3.1 70B model with a 128K context might need 8 CMUs, which multiplies your per-minute rate significantly.
When the model sees no traffic for five minutes, Bedrock automatically scales your copies down to zero, stopping charges. The main trade-off is that your first request after a period of inactivity can experience a “cold start.” If you need guaranteed sub-second latency, you might keep at least one copy always running, which incurs a steady cost.
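Given the per-minute CMU rate, a short sketch makes that trade-off concrete. The CMU counts mirror the examples above, while the traffic pattern (how many minutes per day copies stay warm) is an assumption.

```python
CMU_RATE_PER_MINUTE = 0.0785  # USD per Custom Model Unit per minute, billed in 5-minute windows

def import_cost(cmus_per_copy, copies, active_minutes_per_day, days=30):
    # Simplification: round daily active time up to a multiple of the 5-minute billing window
    billed_minutes = -(-active_minutes_per_day // 5) * 5
    return CMU_RATE_PER_MINUTE * cmus_per_copy * copies * billed_minutes * days

print(f"2 CMUs, 1 copy, ~4 hours warm per day: ${import_cost(2, 1, 240):,.2f}/month")
print(f"8 CMUs, 1 copy, always on: ${import_cost(8, 1, 24 * 60):,.2f}/month")
```

Keeping a large imported model warm around the clock quickly climbs into five figures per month, so scale-to-zero plus an acceptable cold start is often the cheaper posture.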
In addition to inference, customization, and import fees, you might see a few more line items on your monthly statement.
Data Transfer: If your application continuously pulls data from multiple AWS Regions, you could accrue data egress charges. On the other hand, if you’re set up in the same Region as your data sources, you avoid these fees.
Storage: Beyond storing your fine-tuned models, some usage patterns rely on Knowledge Bases (for RAG) or big volumes of text or vector indexes. If you store those embeddings in OpenSearch or rely on Bedrock’s vector store, you’ll pay for that capacity.
Rerank: Certain reranking models (like Cohere Rerank 3.5 or Amazon-rerank-v1.0) charge by query. Amazon-rerank-v1.0 is $1.00 per 1,000 queries, where each query can cover up to 100 document chunks. If you exceed that, each additional set of 100 document chunks counts as another query, compounding the cost (see the sketch after this list).
Long Context Windows: Some models support longer context windows. Processing bigger prompts might raise your token consumption drastically, so watch for usage spikes that come from large input contexts.
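To illustrate the rerank math above, here’s a tiny sketch that converts retrieved document chunks into billable queries at the Amazon-rerank-v1.0 rate; the query volume and chunk counts are assumptions.

```python
import math

PRICE_PER_1K_QUERIES = 1.00  # Amazon-rerank-v1.0 rate quoted above (USD)
CHUNKS_PER_QUERY = 100       # each billable query covers up to 100 document chunks

def monthly_rerank_cost(user_queries_per_day, chunks_per_request, days=30):
    # Every additional set of 100 chunks counts as another billable query
    billable_per_request = math.ceil(chunks_per_request / CHUNKS_PER_QUERY)
    billable_queries = user_queries_per_day * billable_per_request * days
    return billable_queries / 1000 * PRICE_PER_1K_QUERIES

# Assumed RAG workload: 10,000 user queries/day, each reranking 250 retrieved chunks
print(f"Monthly rerank cost: ${monthly_rerank_cost(10_000, 250):,.2f}")
```

Here, 250 chunks per request bills as three queries, so the workload costs about $900 a month rather than the $300 you might expect from the raw query count.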
Engineers often overlook a few unique details that influence cost:
Latency-Optimized Inference
Anthropic’s Claude 3.5 Haiku, for example, can run in a latency-optimized mode at $0.001 per 1,000 input tokens and $0.005 per 1,000 output tokens in certain Regions like US East (N. Virginia). While you gain faster responses, the token cost is higher than standard. If you run chatbots with real-time SLAs, it might be worthwhile. However, if your application tolerates slower responses, standard pricing could save a considerable sum.
Multi-Region Batch
You can theoretically run batch jobs in multiple Regions to reduce overall completion time or route requests closer to data sources. Each Region charges its own rate. If you store your data in US West (Oregon) but process it in US East (N. Virginia), you will incur cross-region data transfer plus the local inference costs.
Higher Parameter Models
Large models will likely cost more per token or per hour. For example, Claude 3 Opus on-demand is $0.015 per 1,000 input tokens and $0.075 per 1,000 output tokens, noticeably higher than smaller variants. If you need advanced language reasoning from huge models, keep an eye on how quickly token usage ramps up.
Model Distillation Overhead
If you plan to distill your model frequently or generate synthetic data from a high-end “teacher” model, you might pay a premium on the teacher’s token rates, even if your final distilled model runs cheaply. Plan that pipeline carefully to avoid repeated training cycles that inflate your monthly bill.
To illustrate how costs stack up in an end-to-end solution, let’s imagine a fictional organization called “Skyline Analytics.” Skyline builds a generative AI assistant that handles support tickets, produces marketing images, and offers real-time data retrieval through an internal knowledge base. They want to integrate everything within Amazon Bedrock.
1. Text Chat On-Demand
Input: (30,000 × $0.00006) = $1.80
Output: (20,000 × $0.00024) = $4.80
Total: $6.60 per day, or ~$200 per month for text chat alone.
2. Daily Batch Summaries
Input: (150,000 × $0.00003) = $4.50
Output: (5,000 × $0.00012) = $0.60
Total: $5.10 per night, or about $153.00 monthly.
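If you want to sanity-check these line items or plug in your own volumes, a quick sketch like this reproduces the two calculations above (reading the volume figures as thousand-token blocks to match the per-1,000-token rates).

```python
# (volume in 1,000-token blocks, rate per 1,000 tokens) for each line item above
line_items = {
    "chat input":   (30_000, 0.00006),
    "chat output":  (20_000, 0.00024),
    "batch input":  (150_000, 0.00003),
    "batch output": (5_000, 0.00012),
}

daily = {name: blocks * rate for name, (blocks, rate) in line_items.items()}
chat = daily["chat input"] + daily["chat output"]
batch = daily["batch input"] + daily["batch output"]
print(f"Text chat: ${chat:.2f}/day (~${chat * 30:.0f}/month)")
print(f"Batch summaries: ${batch:.2f}/day (~${batch * 30:.0f}/month)")
```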
3. Image Generation for Marketing
4. Agents and Knowledge Bases
5. Custom Fine-Tuned Model
6. Custom Model Import
When you tally these examples, Skyline’s final bill can range from a modest $400-$500 monthly (if they rely primarily on on-demand for small text or image tasks) to well over $10,000 once they incorporate large custom or fine-tuned models on dedicated throughput. Those stark differences highlight the importance of understanding Bedrock's pricing structure and carefully planning usage patterns, concurrency, and the actual need for specific models.
To keep costs in line, you have a few straightforward levers:
1. Carefully Evaluate Model Size and Type
A smaller model like Nova Micro or Titan Text Lite might provide enough accuracy for many tasks at a fraction of the cost. Resist the temptation to always default to the largest parameter count or “most powerful” model.
2. Use Batch Mode When Possible
If your use case tolerates offline processing, consider collecting prompts into a single job and leveraging the batch discounts. This can roughly halve your per-token rate for text or embeddings.
3. Take Advantage of Prompt Caching
For repeated instructions or context that remain stable, caching can cut your input token cost by up to 90%. This is particularly beneficial if your application includes large static prompts in every request.
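To put a number on that, here’s a sketch using the Nova Micro input rates quoted earlier (regular vs. cache-read); the request volume and the size of the static prompt are assumptions, and cache-write charges on the first request are ignored for simplicity.

```python
# Nova Micro per-1,000-token input rates quoted earlier (USD)
REGULAR_INPUT = 0.000035
CACHED_INPUT = 0.00000875  # cache-read rate

# Assumed traffic: 100,000 requests/day, each carrying a 3,000-token static system prompt
# plus 500 tokens of user-specific input
requests, static_tokens, dynamic_tokens = 100_000, 3_000, 500

without_cache = requests * (static_tokens + dynamic_tokens) / 1000 * REGULAR_INPUT
with_cache = requests * (static_tokens / 1000 * CACHED_INPUT + dynamic_tokens / 1000 * REGULAR_INPUT)
print(f"Input cost without caching: ${without_cache:.2f}/day")
print(f"Input cost with caching:    ${with_cache:.2f}/day")
```

At these rates, cached tokens bill at a quarter of the regular input price, cutting the daily input bill from about $12.25 to about $4.38 in this scenario.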
4. Assess Your Provisioned Throughput Commitments
Once you find a stable traffic pattern, consider 1-month or 6-month commitments. For example, dropping from $20.50 to $14.80 an hour for Titan Text Express saves about $4,160 over a 730-hour month ($5.70 × 730).
5. Use Distillation Strategically
Generating synthetic data from a teacher model can be expensive, but the resulting smaller student model might run at a cheaper inference rate. Evaluate whether the upfront training cost justifies the lower runtime bill, and consider the time to ROI on this investment.
Across many deployments, a few recurring patterns lead to overspending. Keep an eye out for them, and you can save thousands of dollars on your AWS bill.
Amazon Bedrock’s pricing landscape is complex. We split it into the following categories: on-demand tokens, provisioned throughput, fine-tuning, and custom imports. This should give you a clearer picture of where your costs come from, but the key to reducing costs is still finding the right configuration for your use case. Small or unpredictable workloads often favor on-demand, while large, stable workloads will be cheaper under provisioned throughput. For organizations that need advanced domain adaptation, fine-tuned or custom models can deliver better performance, but make sure the gains justify the additional cost.
No single configuration fits every use case, but thorough planning, robust testing, and ongoing monitoring of usage patterns are the cornerstones of a sustainable and cost-effective AI deployment. The better you understand these pricing levers, the more control you have over your costs.
Guille Ojeda is a Software Architect at Caylent and a content creator. He has published 2 books, over 100 blogs, and writes a free newsletter called Simple AWS, with over 45,000 subscribers. Guille has been a developer, tech lead, cloud engineer, cloud architect, and AWS Authorized Instructor and has worked with startups, SMBs and big corporations. Now, Guille is focused on sharing that experience with others.