Learn everything you need to know about Amazon Bedrock, from how to use it, common inference parameters across models, and prompt engineering, to pricing, governance, security, and benchmarks.
Today, AWS made Amazon Bedrock generally available. Several notable features remain in limited preview: Agents, Knowledge Base, and certain models. At Caylent, we have been working with Bedrock for several months. In this post, I will walk through Amazon Bedrock usage, share what we have learned, and offer guidance for effectively adding Bedrock to your applications.
Enablement
First, you must request access to the various models. Some model access is approved immediately, while other requests take longer. You must perform these steps in each region where you will be accessing Bedrock. Do this now.
1. Navigate to the Amazon Bedrock console.
2. Use the left navigation bar to select “Model access”.
3. Click “Edit” in the top-right.
4. Click the checkbox above all of the models on the top left.
5. Click “Save changes” on the bottom right.
Next, you’ll need to update your AWS SDK to have the latest service definitions. Here’s how to do that for Python:
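With pip, the upgrade is typically pip install --upgrade boto3 botocore. As a minimal sketch of a follow-up sanity check, the snippet below asks the installed SDK whether it ships the two Bedrock service definitions. The helper name bedrock_clients_available is illustrative and not part of boto3.

```python
import importlib.util


def bedrock_clients_available():
    """Report whether the installed boto3 ships the Bedrock service definitions.

    Returns None when boto3 is not installed at all.
    """
    if importlib.util.find_spec("boto3") is None:
        return None
    import boto3  # imported lazily so the check itself never raises ImportError

    services = boto3.Session().get_available_services()
    # "bedrock" is the control-plane API; "bedrock-runtime" serves inference.
    return "bedrock" in services and "bedrock-runtime" in services


print(bedrock_clients_available())
```

A result of False suggests the SDK predates Bedrock and still needs upgrading.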
Now the Bedrock API is available. We can run a simple command to verify:
import boto3
boto3.client("bedrock").list_foundation_models()
Here, you can see each model, the modalities it supports, the customizations it supports, and some other useful information.
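To make that output easier to scan, here is a small sketch that flattens the response into one row per model. It assumes the response carries a modelSummaries list with modelId, providerName, inputModalities, outputModalities, and customizationsSupported keys; the sample data is hand-written for illustration, and summarize_models is a hypothetical helper.

```python
def summarize_models(response):
    """Flatten a list_foundation_models response into one row per model."""
    rows = []
    for model in response.get("modelSummaries", []):
        rows.append(
            {
                "model_id": model["modelId"],
                "provider": model.get("providerName", ""),
                "input": model.get("inputModalities", []),
                "output": model.get("outputModalities", []),
                "customizations": model.get("customizationsSupported", []),
            }
        )
    return rows


# Hand-written sample in the assumed response shape; values are illustrative only.
sample_response = {
    "modelSummaries": [
        {
            "modelId": "anthropic.claude-v2",
            "providerName": "Anthropic",
            "inputModalities": ["TEXT"],
            "outputModalities": ["TEXT"],
            "customizationsSupported": [],
        }
    ]
}

for row in summarize_models(sample_response):
    print(row["model_id"], row["provider"], row["input"], "->", row["output"])
```

With real credentials, the same helper could be fed the result of boto3.client("bedrock").list_foundation_models().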
Now, let’s dive into using Bedrock.
Usage
To perform ad-hoc experiments with various prompts and settings, I recommend using the Bedrock console. There are several examples available throughout the console. You can see an example of the chat playground below.
However, the majority of the real-world usage will happen through the APIs and SDKs, so that is what we will focus on in this post.
The two primary APIs in boto3, the Python SDK, are invoke_model and invoke_model_with_response_stream. invoke_model returns the entire inference from the model at once, and the latency can be high for longer responses. invoke_model_with_response_stream returns the response as the tokens arrive from the model, which can be desirable for interactive applications. For these examples, we’ll use the invoke_model API.
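Although the rest of this post sticks to invoke_model, consuming the streaming variant can be sketched as follows. This assumes each streamed event looks like {"chunk": {"bytes": ...}} and that the decoded JSON carries a completion key, which is the Claude convention; other providers use different keys. The fake_stream list stands in for the real response body so the sketch runs offline.

```python
import json


def iter_completion_text(event_stream):
    """Yield text fragments from an invoke_model_with_response_stream body.

    Each event is expected to look like {"chunk": {"bytes": b"<json>"}}.
    The "completion" key is what Claude models use; other providers differ.
    """
    for event in event_stream:
        chunk = event.get("chunk")
        if chunk is None:
            continue  # this sketch ignores non-chunk events such as errors
        payload = json.loads(chunk["bytes"])
        text = payload.get("completion", "")
        if text:
            yield text


# Stand-in for response["body"]; real events come from the bedrock-runtime client.
fake_stream = [
    {"chunk": {"bytes": json.dumps({"completion": " Bon"}).encode()}},
    {"chunk": {"bytes": json.dumps({"completion": "jour"}).encode()}},
]
print("".join(iter_completion_text(fake_stream)))
```

In an interactive application, each fragment would be written to the UI as it arrives rather than joined at the end.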
Invoke Model has several inference parameters that we’ll walk through below, but for now, let's look at the basic API. Here, we’ll use Anthropic’s Claude V2 model to translate a phrase from English to French.
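A minimal sketch of such a call follows. It assumes the model ID anthropic.claude-v2, the Human/Assistant prompt format Claude expects, and a client created with boto3.client("bedrock-runtime"). FakeRuntimeClient and translate_to_french are hypothetical names; the fake client and its canned reply exist only so the sketch runs without AWS credentials.

```python
import io
import json


def translate_to_french(client, text, model_id="anthropic.claude-v2"):
    """Ask a Claude model on Bedrock to translate text and return the completion."""
    body = json.dumps(
        {
            # Claude's text-completion format expects Human/Assistant turns.
            "prompt": f"\n\nHuman: Translate to French: {text}\n\nAssistant:",
            "max_tokens_to_sample": 200,
            "temperature": 0.5,
            "top_k": 250,
            "top_p": 1,
        }
    )
    response = client.invoke_model(
        modelId=model_id,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    # response["body"] is a byte stream holding the model's JSON reply.
    return json.loads(response["body"].read())["completion"]


class FakeRuntimeClient:
    """Offline stand-in for the bedrock-runtime client (canned reply)."""

    def invoke_model(self, **kwargs):
        reply = {"completion": " Bonjour le monde", "stop_reason": "stop_sequence"}
        return {"body": io.BytesIO(json.dumps(reply).encode())}


print(translate_to_french(FakeRuntimeClient(), "Hello world"))
```

Against the real service, the first argument would be boto3.client("bedrock-runtime") in a region where model access has been granted.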
Do you speak French? How does that look? Of course, translation is a very straightforward task, and these models are capable of much, much more. That’s for another post, though.
You may have noticed a few parameters in the request here that are worth talking through. Just keep in mind that it’s best to start with the defaults for these parameters and focus more on prompt engineering before changing these values.
Common Inference Parameters
In most large language models, three main parameters guide the construction of word sequences: Temperature, Top K, and Top P.
- Temperature modifies the probability distribution used to choose the next word in a sequence. Close to zero, the model favors higher-probability words; further from zero, lower-probability words become more likely to be selected. Technically, it rescales the probability distribution over the next tokens: lower values result in more deterministic responses, and higher values give more random responses.
- Top K defines the cut-off for the number of words (tokens) for each completion to choose from, ordered by their probabilities. A lower Top K value reduces the chance of an unusual word being selected.
- Top P works similarly to Top K but determines the cut-off based on the sum of probabilities rather than the total number of word choices.
These parameters influence whether a language model like Claude V2 would produce 'common' word sequences ("horses") over more unusual ones ("unicorns") in response to a prompt, for example, "I hear the hoof beats of". A low Temperature and a low Top K or Top P value would favor more common outcomes. When prompt tuning, modify only one of these values at a time. The Amazon documentation has an excellent description of these values.
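To make the interaction concrete, here is a toy sketch of how the three parameters could reshape a next-token distribution. The scores are invented, and real models may apply the filters in a different order or with different details, so treat this as an illustration rather than any provider's actual implementation.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, Top K, and Top P filtering.

    logits maps candidate tokens to raw scores. top_k=0 disables the Top K cut-off.
    """
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: all mass on the best token.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature rescales the scores before softmax: <1 sharpens, >1 flattens.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens

    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break  # smallest set whose probabilities sum to at least top_p

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


# Made-up scores for the prompt "I hear the hoof beats of".
logits = {"horses": 3.0, "zebras": 1.5, "unicorns": 0.5}
print(next_token_distribution(logits, temperature=0.5))
print(next_token_distribution(logits, temperature=2.0))
print(next_token_distribution(logits, temperature=1.0, top_k=2))
```

Lowering the temperature concentrates probability on "horses", while Top K of 2 removes "unicorns" from consideration entirely.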
Additionally, some models support things like a repetition penalty, stop sequences, length penalties, and more. Let's look at each of the currently available models.
Amazon Titan
The Titan models accept the following shape for their input:
- temperature is a float with a minimum of 0, maximum of 1, and default of 0
- topP is a float with a minimum of 0, maximum of 1, and default of 1
- maxTokenCount is an integer with a minimum of 0, maximum of 8000, and default of 512
The response from Amazon Titan is shaped like this:
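As a rough sketch of both shapes, assuming the field names inputText, textGenerationConfig, stopSequences, results, and outputText (these may change while the model is in preview), a request builder and response reader might look like this. The helper names and the sample response are illustrative.

```python
import json


def titan_request(prompt, temperature=0.0, top_p=1.0, max_token_count=512):
    """Build a Titan text request body, enforcing the documented ranges."""
    if not 0 <= temperature <= 1:
        raise ValueError("temperature must be between 0 and 1")
    if not 0 <= top_p <= 1:
        raise ValueError("topP must be between 0 and 1")
    if not 0 <= max_token_count <= 8000:
        raise ValueError("maxTokenCount must be between 0 and 8000")
    return json.dumps(
        {
            "inputText": prompt,
            "textGenerationConfig": {
                "temperature": temperature,
                "topP": top_p,
                "maxTokenCount": max_token_count,
                "stopSequences": [],
            },
        }
    )


def titan_output_text(response_body):
    """Join the generated text from a decoded Titan response."""
    return "".join(result["outputText"] for result in response_body["results"])


# Illustrative response; token counts and text are invented.
sample = {
    "inputTextTokenCount": 7,
    "results": [
        {"tokenCount": 3, "outputText": "Bonjour le monde", "completionReason": "FINISH"}
    ],
}
print(titan_request("Translate to French: Hello world"))
print(titan_output_text(sample))
```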
Keep in mind that this model is in preview and subject to change.
Anthropic Claude Models
The Anthropic Claude models accept the following shape for their input:
- temperature is a float with a minimum of 0, maximum of 1, and default of 0.5
- top_p is a float with a minimum of 0, maximum of 1, and default of 1
- top_k is an int with a minimum of 0, maximum of 500, and default of 250
- max_tokens_to_sample is an int with a minimum of 0, a maximum of the model’s context length, and a default of 200
Because max_tokens_to_sample is bounded by the model’s context length, the input prompt and the generated response together cannot exceed that context length.
Claude’s response objects are shaped like this:
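A hedged sketch of the request and response handling, assuming the response carries completion and stop_reason keys and that a stop_reason of max_tokens signals a truncated answer. The helper names and the CLAUDE_LIMITS table are illustrative and simply mirror the ranges listed above.

```python
import json

# Ranges copied from the parameter list above.
CLAUDE_LIMITS = {
    "temperature": (0, 1),
    "top_p": (0, 1),
    "top_k": (0, 500),
}


def claude_request(prompt, max_tokens_to_sample=200, **sampling):
    """Build a Claude request body, rejecting values outside the listed ranges."""
    for name, value in sampling.items():
        if name not in CLAUDE_LIMITS:
            raise ValueError(f"unsupported parameter: {name}")
        low, high = CLAUDE_LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    body = {
        "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
        "max_tokens_to_sample": max_tokens_to_sample,
    }
    body.update(sampling)
    return json.dumps(body)


def claude_completion(response_body):
    """Return (text, truncated) from a decoded Claude response."""
    truncated = response_body.get("stop_reason") == "max_tokens"
    return response_body["completion"], truncated


# Invented response for illustration.
sample = {"completion": " Bonjour le monde", "stop_reason": "stop_sequence"}
print(claude_request("Translate to French: Hello world", temperature=0.5))
print(claude_completion(sample))
```

Checking for truncation is useful because a response cut off at max_tokens_to_sample can otherwise look like a complete answer.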
AI21 Labs Jurassic Models
The AI21 Labs Jurassic models accept the following shape for their input:
See the Amazon documentation here for detailed descriptions of each of these parameters.
The output shape of the Jurassic models looks like this:
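As an illustration only, the sketch below assumes the request uses the keys prompt, maxTokens, temperature, topP, and stopSequences plus penalty objects with a scale field, and that each returned token carries a generatedToken entry with token and logprob. The sample response is invented and heavily abbreviated; token_confidences is a hypothetical helper.

```python
import json
import math


def jurassic_request(prompt, max_tokens=200, temperature=0.7, top_p=1.0):
    """Build a Jurassic-2 request body using the assumed key names."""
    return json.dumps(
        {
            "prompt": prompt,
            "maxTokens": max_tokens,
            "temperature": temperature,
            "topP": top_p,
            "stopSequences": [],
            "countPenalty": {"scale": 0},
            "presencePenalty": {"scale": 0},
            "frequencyPenalty": {"scale": 0},
        }
    )


def token_confidences(response_body):
    """Return (token, probability) pairs for the first completion."""
    tokens = response_body["completions"][0]["data"]["tokens"]
    return [
        (t["generatedToken"]["token"], math.exp(t["generatedToken"]["logprob"]))
        for t in tokens
    ]


# Invented, abbreviated response in the assumed shape.
sample = {
    "completions": [
        {
            "data": {
                "text": " Bonjour le monde",
                "tokens": [
                    {"generatedToken": {"token": "Bonjour", "logprob": 0.0}},
                    {"generatedToken": {"token": "le monde", "logprob": math.log(0.5)}},
                ],
            },
            "finishReason": {"reason": "endoftext"},
        }
    ]
}

for token, probability in token_confidences(sample):
    print(f"{token!r}: {probability:.2f}")
```

Converting log probabilities to probabilities like this is one way a UI could highlight the tokens the model was least sure about.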
Notice the detailed information returned about each token. This is currently available only from the J2 models and Cohere, and it could be used to create very rich UIs. Unfortunately, in our example run the model did not return the correct response.
Cohere Command Model
Cohere takes the following input shape:
- temperature is a float with a minimum of 0, a maximum of 5, and a default of 0.9
- p is a float with a minimum of 0, a maximum of 1, and a default of 0.75
- k is a float with a minimum of 0, a maximum of 500, and a default of 0
- max_tokens is an int with a minimum of 1, a maximum of 4096, and a default of 20
- stop_sequences can take up to 4 strings
- return_likelihoods defaults to NONE but can be set to GENERATION or ALL to return the probability of each token
- num_generations has a minimum of 1, maximum of 5, and default of 1
The responses are shaped like this:
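A minimal sketch of a request builder and response reader, using the parameter names and limits listed above and assuming the response carries a generations list whose entries have a text field. The helper names and the sample response are illustrative.

```python
import json


def cohere_request(
    prompt,
    max_tokens=20,
    temperature=0.9,
    p=0.75,
    k=0,
    stop_sequences=(),
    return_likelihoods="NONE",
    num_generations=1,
):
    """Build a Cohere Command request body, checking the listed limits."""
    if not 0 <= temperature <= 5:
        raise ValueError("temperature must be between 0 and 5")
    if not 1 <= max_tokens <= 4096:
        raise ValueError("max_tokens must be between 1 and 4096")
    if len(stop_sequences) > 4:
        raise ValueError("at most 4 stop sequences are allowed")
    if return_likelihoods not in ("NONE", "GENERATION", "ALL"):
        raise ValueError("return_likelihoods must be NONE, GENERATION, or ALL")
    if not 1 <= num_generations <= 5:
        raise ValueError("num_generations must be between 1 and 5")
    return json.dumps(
        {
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": temperature,
            "p": p,
            "k": k,
            "stop_sequences": list(stop_sequences),
            "return_likelihoods": return_likelihoods,
            "num_generations": num_generations,
        }
    )


def cohere_texts(response_body):
    """Return the text of every generation in a decoded Cohere response."""
    return [generation["text"] for generation in response_body["generations"]]


# Invented response for illustration.
sample = {"generations": [{"text": " Bonjour le monde", "finish_reason": "COMPLETE"}]}
print(cohere_request("Translate to French: Hello world", num_generations=2))
print(cohere_texts(sample))
```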
There are also image models available from Stability AI, but we will cover the usage of those models in a future post.
Much of Amazon Bedrock’s integration into applications will happen through libraries like LangChain, Griptape, and more. We’ll cover the usage of these libraries in a future post.
Prompt Engineering
When we think about how to construct our prompts, we need to consider the prompt itself, the context we supply, the model, the model’s capability, and the performance we need for the task we’re asking the model to perform. We’ll dive deeper into these concepts in the next section on pricing, but keep this in mind for now.
Each model will have different prompt engineering techniques for the best results. For now, I’ll link to each provider's page on prompt engineering and reference our own guide to prompt engineering here.
Pricing
When thinking about generative model pricing, it is important to think along three dimensions:
- Capability - the model’s ability to correctly infer results with the minimum number of tokens
- Performance - the model’s throughput and latency
- Cost - the per token cost
It’s best to think about this through an example. I would like to ask Anthropic’s Claude V2 to translate natural language into a SQL query. Claude V2 has a higher per-token cost than Claude Instant but is able to respond correctly with fewer examples in the prompt.
There is a threshold at which it makes sense to use a higher per-token cost model because the total number of tokens required is lower or the capability is higher. Keep in mind the cost of the developer’s time to engineer effective prompts.
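The threshold can be estimated with simple arithmetic. The sketch below uses made-up per-1,000-token prices and prompt sizes, not actual Bedrock pricing, to find the prompt length at which a cheaper model that needs more examples costs as much as a pricier model with a short prompt.

```python
def request_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost in dollars of one request at the given per-1,000-token prices."""
    return (input_tokens * input_price_per_1k + output_tokens * output_price_per_1k) / 1000


def break_even_input_tokens(large_input_tokens, output_tokens, small_prices, large_prices):
    """Prompt size at which the cheaper model costs as much as the pricier one."""
    large_cost = request_cost(large_input_tokens, output_tokens, *large_prices)
    small_output_cost = request_cost(0, output_tokens, *small_prices)
    return (large_cost - small_output_cost) * 1000 / small_prices[0]


# Hypothetical (input, output) prices per 1,000 tokens; not actual Bedrock pricing.
SMALL_MODEL = (0.002, 0.006)
LARGE_MODEL = (0.010, 0.030)

# Assume the larger model needs a 500-token prompt and both return 100 tokens.
threshold = break_even_input_tokens(500, 100, SMALL_MODEL, LARGE_MODEL)
print(f"The smaller model is cheaper below roughly {threshold:.0f} prompt tokens")
```

If the few-shot examples the smaller model needs push its prompt past that size, the larger model becomes the cheaper choice per request, before accounting for developer time.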
In addition to serverless per-inference pricing, you can also purchase “Provisioned Throughput” (hourly) for some models. The pricing for these is available on the AWS Marketplace. If you’re using a certain sustained number of tokens per hour, then it makes sense to use provisioned throughput. That usage depends on the current model performance as well. There are additional discounts available for “sustained” use at 1-month and 6-month intervals. This is highly workload-dependent, and I expect this to evolve over time.
Governance, Monitoring, Security, and Privacy
Bedrock integrates with IAM, CloudTrail, and CloudWatch by default. It emits useful metrics to the AWS/Bedrock namespace in CloudWatch, which allows you to use CloudWatch metric math to generate useful statistics. The metrics can be broken down per model and include InputTokenCount, OutputTokenCount, InvocationLatency, and total number of Invocations. For Caylent, I put an alarm on the total number of input tokens in an hour just to make sure we aren’t going wild with our usage.
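An alarm like the one described could be sketched as follows. The function only builds the keyword arguments for CloudWatch's put_metric_alarm so that it runs offline; the alarm name, threshold, and the ModelId dimension are assumptions for illustration, and the helper name is hypothetical.

```python
def input_token_alarm(threshold, model_id=None, sns_topic_arn=None):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    params = {
        "AlarmName": "bedrock-hourly-input-tokens",  # illustrative name
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Statistic": "Sum",
        "Period": 3600,  # one hour, in seconds
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # no traffic should not raise the alarm
    }
    if model_id:
        # Assumed dimension name for per-model breakdowns.
        params["Dimensions"] = [{"Name": "ModelId", "Value": model_id}]
    if sns_topic_arn:
        params["AlarmActions"] = [sns_topic_arn]
    return params


print(input_token_alarm(1_000_000, model_id="anthropic.claude-v2"))
```

With credentials in place, the result would be passed along as boto3.client("cloudwatch").put_metric_alarm(**input_token_alarm(1_000_000)).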
All data remains in the region where the API calls happen. Data at rest is AES-256 encrypted via KMS-managed keys or even CMKs. All API calls use a minimum of TLS 1.2 encryption.
You can use PrivateLink VPC endpoints to access Bedrock inside of subnets that don’t contain routes to the internet.
Initial Benchmarks
Since Bedrock only came out today, we’ve only tested in production for about 8 hours now. I’d like to share our initial results in this table.
These graphs represent a snapshot of the cost and performance axes of the Pricing diagram above at the time this article was written. The capability axis is very dependent on the business outcomes you're targeting, so we suggest starting your model choice by settling on the desired performance and lowest cost, and then iteratively experimenting with models to achieve your target capability.
What’s Next?
Check out the following references for more information: Documentation, Antje’s AWS Blog Post, Bedrock Workshop.
In future posts, we’ll explore Fine-Tuning, Knowledge Bases, Agents, and other Bedrock features like embeddings.
I’ve been very excited for Bedrock to be generally available. In the last several months of usage and service improvements, I’ve become convinced that this is a game-changing set of tools for developers and product operators. You owe it to yourself to explore this service now. This is an exciting time to work in our industry!
If you’re interested in integrating Bedrock into your product, our Generative AI Flight Plan Catalyst will accelerate your time to proof-of-concept. To quickly implement a custom AI chatbot powered by your data and Anthropic Claude V2 on Bedrock, check out our Generative AI Knowledge Base Catalyst. We have deep experience in this space and work with it every day. If you’re interested in working on interesting projects across a wide variety of customers, I’d encourage you to apply for a role at Caylent by visiting our careers page.
Randall Hunt
Randall Hunt, Chief Technology Officer at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that Javascript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski.