Model Repository
For our example, we have included the following LLMs from Bedrock:
Amazon:
- Amazon Titan Large
- Amazon Titan Express
Anthropic:
- Anthropic Claude Instant
- Anthropic Claude V2
Cohere:
- Cohere Command
AI21:
- Jurassic Mid
- Jurassic Ultra
We will extend this list as new LLMs become available on Bedrock.
Prompt Repository
We need to store all the prompts used for benchmarking the LLMs, and DynamoDB is a great option for cataloging and versioning prompts for repeatability. For our example, we have leveraged the following categories/datasets adapted from here: knowledge, code, instruct, creativity, and reflexion. Below is a snapshot of the DynamoDB table that contains our curated prompts.
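For illustration, here is a minimal sketch of how a prompt item could be cataloged and retrieved; the table name (prompt_catalog) and attribute names are assumptions for this example rather than the exact schema of our table.

```python
import boto3
from boto3.dynamodb.conditions import Attr

# Hypothetical prompt catalog: table name and attribute names are illustrative,
# not the exact schema used in this solution.
dynamodb = boto3.resource("dynamodb")
prompt_table = dynamodb.Table("prompt_catalog")

# Catalog a versioned prompt; bumping "version" keeps older wordings for repeatability.
prompt_table.put_item(Item={
    "prompt_id": "knowledge-001",   # partition key
    "version": 1,                   # sort key
    "category": "knowledge",        # knowledge | code | instruct | creativity | reflexion
    "prompt_text": "What is the capital of France?",
})

# Fetch every prompt in a category for a benchmarking run (a scan keeps the example
# simple; a GSI on "category" would scale better for larger catalogs).
prompts = prompt_table.scan(FilterExpression=Attr("category").eq("knowledge"))["Items"]
```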
Prompt/Response Workflow Orchestration
We use AWS Step Functions to orchestrate several AWS Lambda functions in parallel, each calling the Bedrock API for one of the selected LLMs and storing the responses in the DynamoDB table. Below, you can find the state machine definition and the resulting graph after a successful run.
Step Functions Definition
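For orientation, here is an abridged sketch of what such a definition can look like, expressed as a Python dict that serializes to Amazon States Language; the state names and Lambda ARNs are placeholders rather than the actual resources used in this solution.

```python
import json

# Abridged, illustrative state machine: a Parallel state with one branch per model
# provider. State names and Lambda ARNs are placeholders.
state_machine_definition = {
    "Comment": "Invoke each Bedrock LLM Lambda in parallel and record the responses",
    "StartAt": "InvokeLLMs",
    "States": {
        "InvokeLLMs": {
            "Type": "Parallel",
            "End": True,
            "Branches": [
                {
                    "StartAt": f"Invoke{name}",
                    "States": {
                        f"Invoke{name}": {
                            "Type": "Task",
                            "Resource": f"arn:aws:lambda:<region>:<account-id>:function:bedrock_{name.lower()}",
                            "End": True,
                        }
                    },
                }
                for name in ["Titan", "Claude", "Cohere", "AI21"]
            ],
        }
    },
}

print(json.dumps(state_machine_definition, indent=2))
```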
Lambda Functions:
We have developed a Lambda function template and supplied LLM-specific prompt configuration and parameters to build four Lambda functions, one for each LLM provider (Amazon, Anthropic, Cohere, and AI21).
The outputs of these Lambda functions capture the LLM response, the token size (taken from the model response when it reports an output token count, otherwise the number of characters), the latency, and the model parameters. The results are recorded in another DynamoDB table called “bedrock_benchmark”. The Step Functions state machine is scheduled to trigger the Lambda functions daily. If the latest response from an LLM in any category differs from its previous response, a record is added to this table so that drift can be detected and tracked.
Below you can see the Lambda function code.
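A simplified sketch of one of these functions (Amazon Titan shown) is included here for reference; the model ID, environment variable names, event shape, and the key schema of the benchmark table are assumptions for illustration rather than our exact production code, and every run is stored with a drift flag to keep the sketch short.

```python
import json
import os
import time
from decimal import Decimal

import boto3
from boto3.dynamodb.conditions import Key

# Illustrative per-LLM Lambda (Amazon Titan shown). Model ID, environment variables,
# event shape, and the benchmark table key schema are assumptions for this sketch.
bedrock = boto3.client("bedrock-runtime")
benchmark_table = boto3.resource("dynamodb").Table("bedrock_benchmark")

MODEL_ID = os.environ.get("MODEL_ID", "amazon.titan-text-express-v1")
MODEL_PARAMS = {
    "maxTokenCount": int(os.environ.get("MAX_TOKEN_COUNT", 512)),
    "temperature": float(os.environ.get("TEMPERATURE", 0.0)),
    "topP": float(os.environ.get("TOP_P", 1.0)),
}


def lambda_handler(event, context):
    for prompt in event["prompts"]:  # prompts passed in by the Step Functions state
        body = json.dumps({
            "inputText": prompt["prompt_text"],
            "textGenerationConfig": MODEL_PARAMS,
        })

        # Time the invocation ourselves (see the note below on the newer built-in metrics).
        start = time.time()
        response = bedrock.invoke_model(modelId=MODEL_ID, body=body)
        latency_ms = (time.time() - start) * 1000

        result = json.loads(response["body"].read())["results"][0]
        output_text = result["outputText"]
        # Use the token count when the model returns one; otherwise fall back to characters.
        token_size = result.get("tokenCount", len(output_text))

        # Compare against the most recent stored response to flag drift
        # (assumed keys: partition key "record_id", sort key "run_timestamp").
        record_id = f"{MODEL_ID}#{prompt['prompt_id']}"
        previous = benchmark_table.query(
            KeyConditionExpression=Key("record_id").eq(record_id),
            ScanIndexForward=False,
            Limit=1,
        )["Items"]
        drift_detected = bool(previous) and previous[0]["response"] != output_text

        benchmark_table.put_item(Item={
            "record_id": record_id,
            "run_timestamp": int(time.time()),
            "model_id": MODEL_ID,
            "prompt_id": prompt["prompt_id"],
            "prompt_text": prompt["prompt_text"],
            "category": prompt["category"],
            "response": output_text,
            "previous_response": previous[0]["response"] if previous else "",
            "token_size": token_size,
            "latency_ms": Decimal(str(round(latency_ms, 2))),
            "model_params": json.dumps(MODEL_PARAMS),
            "drift_detected": drift_detected,
        })

    return {"statusCode": 200, "processed": len(event["prompts"])}
```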
When we built this solution, we had to time the invocation latency and calculate the output token size in our own code. While writing this blog post, however, Bedrock invocation metrics, including InputTokenCount, OutputTokenCount, and InvocationLatency, were added to CloudWatch Metrics and are also returned as part of the invocation response. Thanks to AWS's continuous improvement, you can update the code above to retrieve these metrics and simplify it.
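As a rough illustration, assuming the metrics are surfaced as HTTP headers on the InvokeModel response (header names as we understand them; verify against the current Bedrock documentation), retrieving them could look like this:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

body = json.dumps({"inputText": "What is the capital of France?",
                   "textGenerationConfig": {"maxTokenCount": 256}})
response = bedrock.invoke_model(modelId="amazon.titan-text-express-v1", body=body)

# Invocation metrics returned by Bedrock alongside the response payload
# (header names are our assumption; check the Bedrock documentation).
headers = response["ResponseMetadata"]["HTTPHeaders"]
input_tokens = int(headers["x-amzn-bedrock-input-token-count"])
output_tokens = int(headers["x-amzn-bedrock-output-token-count"])
latency_ms = int(headers["x-amzn-bedrock-invocation-latency"])
```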
As you can see in the screenshots below, each LLM on Bedrock has a different input configuration, which is provided to the corresponding Lambda function as environment variables. You may tweak the parameters that influence the LLM response, including temperature, top P, max token count, and top K. Parameter naming conventions differ across models, which has made the Lambda function lengthy in order to accommodate all the variations; a sketch of the per-model request shapes follows the variable listings below. If you implement this workflow for a real application, adjust the parameters to fit your requirements.
Amazon Titan Lambda Function Variables:
Anthropic Claude Lambda Function Variables:
Cohere Lambda Function Variables:
AI21 Lambda Function Variables:
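To give a concrete picture of how these environment variables end up in the request, the sketch below maps the common tuning knobs onto each provider's request body. The parameter names reflect the Bedrock model APIs as we understand them at the time of writing and should be checked against the current documentation; the helper function itself is purely illustrative.

```python
# Illustrative mapping of common tuning knobs to each provider's request body on Bedrock.
# Parameter names are to the best of our knowledge; verify against the model documentation.
def build_request_body(provider: str, prompt: str, temperature: float = 0.0,
                       top_p: float = 1.0, max_tokens: int = 512, top_k: int = 250) -> dict:
    if provider == "amazon":        # Titan
        return {"inputText": prompt,
                "textGenerationConfig": {"temperature": temperature, "topP": top_p,
                                         "maxTokenCount": max_tokens}}
    if provider == "anthropic":     # Claude (completions-style API)
        return {"prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
                "temperature": temperature, "top_p": top_p, "top_k": top_k,
                "max_tokens_to_sample": max_tokens}
    if provider == "cohere":        # Command
        return {"prompt": prompt, "temperature": temperature,
                "p": top_p, "k": top_k, "max_tokens": max_tokens}
    if provider == "ai21":          # Jurassic
        return {"prompt": prompt, "temperature": temperature,
                "topP": top_p, "maxTokens": max_tokens}
    raise ValueError(f"Unknown provider: {provider}")
```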
User Interface - Feedback Collection
We used Streamlit to build a simple but effective UI that presents the cases where an LLM's response in any category differs from its previous one and collects feedback from a human evaluator. The user can choose 0 to mark a response as “INCORRECT”, 1 for “CORRECT”, or 2 for “EXCELLENT” to guide improvements.
Our Streamlit app first fetches all of the curated prompts, along with each LLM's most recent response and the earlier response it differs from. It then iterates through them, presenting each identified case to the user, collecting the user's feedback, and finally storing that feedback in the DynamoDB table. We deployed our Streamlit app in SageMaker Studio using a Terminal instance, and here is how it looks with an example.
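A condensed sketch of the app is shown below; the feedback table name, the drift_detected flag, and the attribute names are assumptions for this illustration, and the drift comparison is assumed to have been done upstream by the Lambda functions.

```python
import boto3
import streamlit as st
from boto3.dynamodb.conditions import Attr

# Simplified sketch of the feedback UI. Table names, the drift_detected flag, and the
# attribute names are assumptions; drift comparison happens upstream in the Lambdas.
dynamodb = boto3.resource("dynamodb")
benchmark_table = dynamodb.Table("bedrock_benchmark")
feedback_table = dynamodb.Table("bedrock_feedback")  # hypothetical feedback table

st.title("Bedrock LLM drift review")

# Pull only the records flagged as differing from the previous response.
drifted = benchmark_table.scan(FilterExpression=Attr("drift_detected").eq(True))["Items"]

for item in drifted:
    st.subheader(f"{item['model_id']} / {item['category']}")
    st.markdown(f"**Prompt:** {item['prompt_text']}")
    st.markdown(f"**Previous response:** {item['previous_response']}")
    st.markdown(f"**Latest response:** {item['response']}")

    score = st.radio(
        "Rate the latest response",
        options=[0, 1, 2],
        format_func=lambda v: {0: "INCORRECT", 1: "CORRECT", 2: "EXCELLENT"}[v],
        key=f"score_{item['record_id']}_{item['run_timestamp']}",
    )
    if st.button("Submit feedback", key=f"submit_{item['record_id']}_{item['run_timestamp']}"):
        feedback_table.put_item(Item={
            "record_id": item["record_id"],
            "run_timestamp": item["run_timestamp"],
            "score": score,
        })
        st.success("Feedback recorded")
```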