Learn everything you need to know about Amazon Bedrock, from how to use it, common inference parameters across models, and prompt engineering, to pricing, governance, security, and benchmarks.
Today, AWS made Amazon Bedrock generally available. Several notable features remain in limited preview: Agents, Knowledge Bases, and certain models. At Caylent, we have been working with Bedrock for several months. In this post, I will walk through Amazon Bedrock usage, share what we have learned, and offer our guidance for effectively adding Bedrock to your applications.
First, you must request access to the various models. Some models are approved immediately; others take longer. You must perform these steps in each region where you will be accessing Bedrock. Do this now.
1. Navigate to the Amazon Bedrock console.
2. Use the left navigation bar to select “Model access”.
3. Click “Edit” in the top-right.
4. Click the checkbox at the top left to select all of the models.
5. Click “Save changes” on the bottom right.
Next, you’ll need to update your AWS SDK to have the latest service definitions. Here’s how to do that for Python:
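A minimal version looks like this (exact package versions will vary, and us-east-1 is just an example region):

```python
# Upgrade the SDK so the latest Bedrock service definitions are available:
#   pip install --upgrade boto3 botocore

import boto3

# Create a Bedrock client in a region where you have requested model access
# (us-east-1 here is just an example).
bedrock = boto3.client("bedrock", region_name="us-east-1")
```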
Now, we have the bedrock API available. We can run a simple command to verify:
```python
bedrock.list_foundation_models()
```
Here, you can see each model, the modalities it supports, the customizations it supports, and some other useful information.
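For example, a quick way to skim the response (a sketch assuming the GA response format, where models are returned under modelSummaries):

```python
# Print a quick summary of each foundation model available in this region.
models = bedrock.list_foundation_models()["modelSummaries"]
for model in models:
    print(
        model["modelId"],
        model.get("inputModalities"),
        model.get("outputModalities"),
        model.get("customizationsSupported"),
    )
```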
Now, let’s dive into using Bedrock.
To perform ad-hoc experiments with various prompts and settings, I recommend using the Bedrock console. There are several examples available throughout the console. You can see an example of the chat playground below.
However, the majority of the real-world usage will happen through the APIs and SDKs, so that is what we will focus on in this post.
There are two primary APIs in boto3, the Python SDK: invoke_model and invoke_model_with_response_stream. invoke_model returns the entire inference from the model at once, and the latency can be high for longer responses. invoke_model_with_response_stream returns the response as tokens arrive from the model, which can be desirable for interactive applications (a rough sketch of consuming the stream is shown below). For these examples, we’ll use the invoke_model API.
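A sketch of the streaming variant, assuming a Claude-style request body where each chunk carries a completion field (the prompt and region here are placeholders):

```python
import json
import boto3

# In the GA SDK the invoke APIs live on the "bedrock-runtime" client;
# earlier preview SDKs exposed them on the single "bedrock" client.
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: Write a haiku about hoof beats.\n\nAssistant:",
    "max_tokens_to_sample": 200,
})

response = bedrock_runtime.invoke_model_with_response_stream(
    modelId="anthropic.claude-v2",
    body=body,
)

# Each event carries a chunk of bytes; for Claude, each chunk is a JSON object
# with a "completion" field containing the newly generated text.
for event in response["body"]:
    chunk = event.get("chunk")
    if chunk:
        print(json.loads(chunk["bytes"])["completion"], end="", flush=True)
print()
```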
Invoke Model has several inference parameters that we’ll walk through below, but for now, let's look at the basic API. Here, we’ll use Anthropic’s Claude V2 model to translate a phrase from English to French.
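A minimal version of that call looks something like this (the prompt and region are my own placeholders; in the GA SDK the invoke APIs live on the bedrock-runtime client):

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Claude uses a Human/Assistant prompt format and a max_tokens_to_sample limit.
body = json.dumps({
    "prompt": "\n\nHuman: Please translate the following sentence into French: "
              "\"Do you speak French?\"\n\nAssistant:",
    "max_tokens_to_sample": 200,
    "temperature": 0.5,
    "top_k": 250,
    "top_p": 1,
})

response = bedrock_runtime.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=body,
)

print(json.loads(response["body"].read())["completion"])
```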
Do you speak French? How does that look? Of course, translation is a very straightforward task, and these models are capable of much, much more. That’s for another post, though.
You may have noticed a few parameters in the request here that are worth talking through. Just keep in mind that it’s best to start with the defaults for these parameters and focus more on prompt engineering before changing these values.
In most large language models, three main parameters guide the construction of word sequences: Temperature, Top K, and Top P.
These parameters influence whether a language model like Claude V2 produces a common word sequence ("horses") or a more unusual one ("unicorns") in response to a prompt such as "I hear the hoof beats of". A low Temperature and a low Top K or Top P value favor the more common outcome; raising them makes the unusual one more likely. When prompt tuning, it is recommended that you only modify one of these values at a time. The Amazon documentation has an excellent description of these values.
Additionally, some models support things like a repetition penalty, stop sequences, length penalties, and more. Let's look at each of the currently available models.
The Amazon Titan models accept the following shape for their input:
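(A sketch based on the Bedrock documentation at the time; double-check the current docs for exact field names.)

```python
# Amazon Titan Text request body (sketch)
titan_body = {
    "inputText": "Please translate the following sentence into French: \"Do you speak French?\"",
    "textGenerationConfig": {
        "maxTokenCount": 512,
        "temperature": 0.5,
        "topP": 0.9,
        "stopSequences": [],
    },
}
```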
The response from Amazon Titan is shaped like this:
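(Sketch; the exact fields may have changed since.)

```python
# Amazon Titan Text response body (sketch)
titan_response = {
    "inputTextTokenCount": 12,
    "results": [
        {
            "tokenCount": 8,
            "outputText": "...",           # the generated text
            "completionReason": "FINISH",  # or LENGTH if maxTokenCount was reached
        }
    ],
}
```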
Keep in mind that this model is in preview and subject to change.
The Anthropic Claude models accept the following shape for their input:
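(Sketch; Claude expects the Human/Assistant prompt format shown in the example above.)

```python
# Anthropic Claude request body (sketch)
claude_body = {
    "prompt": "\n\nHuman: ...\n\nAssistant:",
    "max_tokens_to_sample": 300,   # bounded by the model's context length
    "temperature": 0.5,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman:"],
}
```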
Note that max_tokens_to_sample is bounded by the model’s context length, which means the combination of the input prompt and the generated response is limited to the model's context window.
Claude’s response objects are shaped like this:
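(Sketch.)

```python
# Anthropic Claude response body (sketch)
claude_response = {
    "completion": " Oui, je parle français.",
    "stop_reason": "stop_sequence",  # or "max_tokens" if the limit was reached
}
```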
The AI21 Labs Jurassic models accept the following shape for their input:
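(Sketch; the penalty objects accept additional tuning fields described in the documentation.)

```python
# AI21 Jurassic-2 request body (sketch)
jurassic_body = {
    "prompt": "Please translate the following sentence into French: \"Do you speak French?\"",
    "maxTokens": 200,
    "temperature": 0.5,
    "topP": 0.9,
    "stopSequences": [],
    "countPenalty": {"scale": 0},
    "presencePenalty": {"scale": 0},
    "frequencyPenalty": {"scale": 0},
}
```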
See the Amazon documentation here for detailed descriptions of each of these parameters.
The output shape of the Jurassic models looks like this:
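(Abbreviated sketch; the real response carries more per-token detail than shown here.)

```python
# AI21 Jurassic-2 response body (abbreviated sketch)
jurassic_response = {
    "id": 1234,
    "prompt": {
        "text": "...",
        "tokens": [
            # one entry per prompt token, including log probabilities
            {"generatedToken": {"token": "...", "logprob": -1.23}},
        ],
    },
    "completions": [
        {
            "data": {
                "text": "...",  # the generated text
                "tokens": [],   # per-token detail for the completion as well
            },
            "finishReason": {"reason": "endoftext"},
        }
    ],
}
```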
Notice the detailed information returned about each token. This level of detail is currently returned only by the J2 and Cohere models and could be used to create very rich UIs. Unfortunately, in this example it did not return the correct response.
Cohere takes the following input shape:
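(Sketch; Cohere uses p and k rather than top_p and top_k.)

```python
# Cohere Command request body (sketch)
cohere_body = {
    "prompt": "Please translate the following sentence into French: \"Do you speak French?\"",
    "max_tokens": 200,
    "temperature": 0.5,
    "p": 0.9,
    "k": 0,
    "stop_sequences": [],
    "return_likelihoods": "GENERATION",  # request per-token likelihoods
}
```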
The responses are shaped like this:
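(Abbreviated sketch; token_likelihoods appears when return_likelihoods is requested.)

```python
# Cohere Command response body (abbreviated sketch)
cohere_response = {
    "id": "...",
    "prompt": "...",
    "generations": [
        {
            "id": "...",
            "text": "...",               # the generated text
            "finish_reason": "COMPLETE",
            "token_likelihoods": [
                {"token": "...", "likelihood": -0.12},
            ],
        }
    ],
}
```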
There are also image models available from Stability AI, but we will cover the usage of those models in a future post.
Much of Amazon Bedrock’s integration into applications will happen through SDKs like LangChain, GripTape, and more. We’ll cover the usage of these libraries in a future post.
When we think about how to construct our prompts, we need to consider the prompt itself, the context we provide, the model we choose, and that model’s capability and performance on the task we’re asking it to perform. We’ll dive deeper into these concepts in the next section on pricing, but keep this in mind for now.
Each model will have different prompt engineering techniques for the best results. For now, I’ll link to each provider's page on prompt engineering and reference our own guide to prompt engineering here.
When thinking about generative model pricing, it is important to think along three dimensions: the per-token cost of the model, the performance (latency and throughput) you need, and the capability the task requires.
It’s best to think about this through an example. I would like to ask Anthropic’s Claude V2 to translate natural language into a SQL query. Claude V2 has a higher per-token cost than Claude Instant but is able to respond correctly with fewer examples in the prompt.
There is a threshold at which it makes sense to use a higher per-token cost model because the total number of tokens required is lower or the capability is higher. Keep in mind the cost of the developer’s time to engineer effective prompts.
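As a rough illustration, here is a back-of-the-envelope break-even comparison; every price and token count below is a placeholder for illustration, not actual Bedrock pricing:

```python
# Compare the per-request cost of two models when one needs a much longer
# few-shot prompt than the other. All numbers are placeholders.
def cost_per_request(input_tokens, output_tokens, price_per_1k_in, price_per_1k_out):
    return (input_tokens / 1000) * price_per_1k_in + (output_tokens / 1000) * price_per_1k_out

# Hypothetical: the cheaper model needs a 3,000-token few-shot prompt to produce
# correct SQL, while the more capable model gets there with a 600-token prompt.
cheaper_model = cost_per_request(3000, 200, price_per_1k_in=0.0008, price_per_1k_out=0.0024)
capable_model = cost_per_request(600, 200, price_per_1k_in=0.008, price_per_1k_out=0.024)

print(f"Cheaper model, long prompt: ${cheaper_model:.5f} per request")
print(f"More capable, short prompt: ${capable_model:.5f} per request")
```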
In addition to serverless per-inference pricing, you can also purchase “Provisioned Throughput” (hourly) for some models. The pricing for these is available on the AWS Marketplace. If you’re using a certain sustained number of tokens per hour, then it makes sense to use provisioned throughput. That usage depends on the current model performance as well. There are additional discounts available for “sustained” use at 1-month and 6-month intervals. This is highly workload-dependent, and I expect this to evolve over time.
Bedrock integrates with IAM, CloudTrail, and CloudWatch by default. It emits useful metrics to the AWS/Bedrock namespace in CloudWatch, which allows you to use CloudWatch metric math to generate useful statistics. The metrics can be broken down per model and include InputTokenCount, OutputTokenCount, InvocationLatency, and total number of Invocations. For Caylent, I put an alarm on the total number of input tokens in an hour just to make sure we aren’t going wild with our usage.
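For example, a sketch of that kind of alarm with boto3 (the model ID, threshold, and region are placeholders; to alarm on the total across all models, combine the per-model metrics with CloudWatch metric math instead):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm if a single model consumes more than 1,000,000 input tokens in an hour.
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-hourly-input-tokens",
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-v2"}],
    Statistic="Sum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=1_000_000,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```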
All data remains in the region where the API calls happen. Data at rest is AES-256 encrypted with AWS-managed KMS keys or customer managed keys (CMKs). All API calls use a minimum of TLS 1.2 encryption.
You can use PrivateLink VPC endpoints to access Bedrock inside of subnets that don’t contain routes to the internet.
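A sketch of creating such an endpoint with boto3; the VPC, subnet, and security group IDs are placeholders, and the service name should be verified against the current documentation for your region:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an interface endpoint for the Bedrock runtime APIs.
ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.bedrock-runtime",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
```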
Since Bedrock only came out today, we’ve only tested in production for about 8 hours now. I’d like to share our initial results in this table.
These graphs represent a snapshot of the cost and performance axes of the Pricing diagram above at the time this article was written. The capability axis is very dependent on the business outcomes you're targeting, so we suggest starting your model choice by settling on the desired performance and lowest cost, and then iteratively experimenting with models to achieve your target capability.
Check out the following references for more information: Documentation, Antje’s AWS Blog Post, Bedrock Workshop.
In future posts, we’ll explore Fine-Tuning, Knowledge Bases, Agents, and other Bedrock features like embeddings.
I’ve been very excited for Bedrock to be generally available. In the last several months of usage and service improvements, I’ve become convinced that this is a game-changing set of tools for developers and product operators. You owe it to yourself to explore this service now. This is an exciting time to work in our industry!
If you’re interested in integrating Bedrock into your product, our Generative AI Flight Plan Catalyst will accelerate your time to proof-of-concept. To quickly implement a custom AI chatbot powered by your data and Anthropic Claude V2 on Bedrock, check out our Generative AI Knowledge Base Catalyst. We have deep experience in this space and work with it every day. If you’re interested in working on interesting projects across a wide variety of customers, I’d encourage you to apply for a role at Caylent by visiting our careers page.