Evaluating LLM Performance: A Benchmarking Framework on Amazon Bedrock

Artificial Intelligence & MLOps

Generative AI (GenAI) creates new opportunities for automated benchmarking by adding output variability and model cost dimensions to traditional performance metrics. In this blog, we share a framework for monitoring alignment and drift across several Large Language Models (LLMs) hosted on Amazon Bedrock.


New LLMs are released weekly, if not daily, and practitioners need to evaluate their performance to identify the most suitable models for specific tasks. To that end, different benchmark and evaluation tools and datasets have been developed depending on the task an LLM is designed to tackle, whether that is coding, chat, summarization, or instruction following. Popular benchmark datasets and leaderboards include HumanEval (code generation), MT-Bench (instruction-following ability), MBPP (Python code generation), and Chatbot Arena (chatbot assistants).

Assessing LLM performance through automated benchmarking is a considerable challenge because of the open-ended nature of the tasks, which makes it difficult to write a program that evaluates response quality automatically. Human-in-the-loop (HITL) evaluation, where human evaluators provide feedback on the quality of the text generated by the LLMs for different tasks, is therefore a viable solution. Amazon Comprehend's trust and safety features for LLMs offer an automated approach that can complement human review. AWS also launched the "Model Evaluation on Amazon Bedrock" feature at re:Invent 2023, still in preview at the time of writing, for creating and viewing model evaluation jobs. It provides three options for evaluating LLMs:

  • automatically evaluate a single model using recommended metrics for the built-in task types (text summarization, question and answer, text classification, and open-ended text generation);
  • evaluate up to two models using a work team of your choice to provide feedback; or
  • customize the number of models to evaluate using a work team designated by AWS.

If you choose one of the human evaluation options, you can also define a custom task on top of the built-in tasks.

In this post, we walk you through an example of a HITL framework for evaluating the performance of the LLMs available on Bedrock against your own curated datasets and specific tasks. You can find the code in this GitHub repository.

Overall Solution Architecture

From a high-level perspective, our solution consists of the following sections:

  • Model Repository
  • Prompt Repository
  • Prompt/Response Workflow Orchestration
  • UI to collect human evaluator feedback

We will use Bedrock as the model repository for the LLMs and Amazon DynamoDB as a prompt catalog to store the datasets/prompts used later in the process. To orchestrate the Bedrock API calls, parse the responses, and record the desired outputs, we use AWS Step Functions. Finally, we use Streamlit to build a quick UI for collecting user feedback. The following diagram shows how these components are connected to form our LLM benchmarking solution.

Model Repository

For our example, we have included the following LLMs from Bedrock:

Amazon:

  • Amazon Titan Large
  • Amazon Titan Express

Anthropic:

  • Anthropic Claude Instant
  • Anthropic Claude V2

Cohere:

  • Cohere Command

AI21:

  • Jurassic Mid
  • Jurassic Ultra

We will extend this list as new LLMs become available on Bedrock.
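Because this list will change as Bedrock adds models, it helps to discover what is currently available programmatically. Here is a minimal sketch using the boto3 bedrock control-plane client (the region is an assumption):

```python
import boto3

# The "bedrock" control-plane client exposes model discovery;
# the "bedrock-runtime" client is used later for invocations.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# List the text-generation models currently available in this region so that
# new releases can be added to the benchmark as they appear.
response = bedrock.list_foundation_models(byOutputModality="TEXT")
for model in response["modelSummaries"]:
    print(model["providerName"], "-", model["modelId"])
```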

Prompt Repository

We need to store all the prompts used to benchmark the LLMs, and DynamoDB is a great option for cataloging and versioning prompts for repeatability. For our example, we have used the following categories/datasets adapted from here: knowledge, code, instruct, creativity, and reflexion. Below is a snapshot of the DynamoDB table that holds our curated prompts.
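As a rough sketch of how a prompt might be written to the catalog (a minimal example using the boto3 DynamoDB resource; the table name "prompt_catalog" and its key schema are illustrative, not prescribed by the framework):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("prompt_catalog")  # illustrative table name

# Each item stores the prompt text, its category, and a version number so that
# prompts can be reproduced exactly across benchmark runs.
table.put_item(
    Item={
        "prompt_id": "code-001",  # partition key (assumed)
        "version": 1,             # sort key (assumed), for repeatability
        "category": "code",
        "prompt": "Write a Python function that reverses a linked list.",
    }
)
```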

Prompt/Response Workflow Orchestration

We use a Step Functions state machine to orchestrate several AWS Lambda functions in parallel, calling the Bedrock API for each selected LLM and storing the responses in the DynamoDB table. Below, you can find the state machine definition and the resulting graph after a successful run.

Step Function Definition
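The full definition lives in the repository; the core idea is a Parallel state that fans out to one Lambda function per model provider. The following is a minimal sketch using boto3, with placeholder account IDs, ARNs, and function names:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# A Parallel state fans out to one Lambda per LLM provider so all models are
# prompted concurrently. The ARNs and names below are placeholders.
definition = {
    "StartAt": "InvokeAllModels",
    "States": {
        "InvokeAllModels": {
            "Type": "Parallel",
            "End": True,
            "Branches": [
                {
                    "StartAt": name,
                    "States": {
                        name: {
                            "Type": "Task",
                            "Resource": f"arn:aws:lambda:us-east-1:123456789012:function:{name}",
                            "End": True,
                        }
                    },
                }
                for name in ["titan", "claude", "cohere", "ai21"]
            ],
        }
    },
}

sfn.create_state_machine(
    name="bedrock-benchmark",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/bedrock-benchmark-sfn-role",
)
```

A daily schedule (for example, an Amazon EventBridge rule) can then start executions of this state machine, as described below.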
Lambda Functions:

We developed a Lambda function template and supplied LLM-specific prompt configuration and parameters to build four Lambda functions, one for each model provider.

The outputs of these Lambda functions capture the LLM response, the output token count (when it is part of the model response; otherwise, the number of characters), the latency, and the model parameters. The results are recorded in another DynamoDB table called "bedrock_benchmark". The Step Function is scheduled to trigger the Lambda functions daily. If an LLM's latest response in any category differs from its previous response, a record is added to this table so drift can be detected and tracked.

The full Lambda function code is available in the repository.
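As a rough illustration of its structure, here is a minimal sketch assuming a Titan-style request body; the event shape, field names, and drift check are simplified for illustration, and the repository version handles all four providers:

```python
import json
import os
import time
from datetime import datetime, timezone

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
dynamodb = boto3.resource("dynamodb")
results_table = dynamodb.Table("bedrock_benchmark")

# The model ID is supplied via an environment variable so the same handler
# template can be reused for each provider ("MODEL_ID" is an illustrative name).
MODEL_ID = os.environ["MODEL_ID"]


def lambda_handler(event, context):
    prompt = event["prompt"]
    category = event["category"]

    # Titan-style body; other providers use different field names (see below).
    body = json.dumps({"inputText": prompt})

    start = time.time()
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=body)
    latency_ms = int((time.time() - start) * 1000)

    payload = json.loads(response["body"].read())
    result = payload.get("results", [{}])[0]
    completion = result.get("outputText", "")

    # Fall back to character count when the response does not report a token count.
    token_count = result.get("tokenCount", len(completion))

    # Only write a record when the response differs from the previous one,
    # so the table doubles as a drift log (drift check simplified here).
    if completion != event.get("previous_response"):
        results_table.put_item(
            Item={
                "model_id": MODEL_ID,
                "run_at": datetime.now(timezone.utc).isoformat(),
                "category": category,
                "prompt": prompt,
                "response": completion,
                "latency_ms": latency_ms,
                "output_tokens": token_count,
            }
        )

    return {"model_id": MODEL_ID, "latency_ms": latency_ms, "response": completion}
```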

We originally had to time the invocation latency and compute the output token size in the code, but while writing this blog post, Bedrock invocation metrics, including InputTokenCount, OutputTokenCount, and InvocationLatency, were added to CloudWatch Metrics and to the invocation response itself. Thanks to AWS's continuous improvement, you can update the code above to read these metrics directly and simplify it.
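For example, the metrics are surfaced as HTTP response headers on invoke_model, so the manual timing and token counting can be dropped (a sketch; the header names are as documented at the time of writing, so verify them against the current API):

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=json.dumps({"inputText": "Explain Amazon Bedrock in one sentence."}),
)

# Invocation metrics are returned as HTTP response headers, removing the need
# for manual timing and token counting in the Lambda function.
headers = response["ResponseMetadata"]["HTTPHeaders"]
input_tokens = int(headers["x-amzn-bedrock-input-token-count"])
output_tokens = int(headers["x-amzn-bedrock-output-token-count"])
latency_ms = int(headers["x-amzn-bedrock-invocation-latency"])
print(input_tokens, output_tokens, latency_ms)
```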

As you can see in the screenshots below, each LLM on Bedrock has a different input configuration, which is provided to the corresponding Lambda function as environment variables. You can tweak the parameters that influence the LLM response, including temperature, top P, max token count, and top K. The naming conventions differ across models, which makes the Lambda function lengthy in order to accommodate all the variations; a sketch of how these variables map onto each provider's request body follows the listings below. If you implement this workflow for a real application, adjust the parameters to fit your requirements.

Amazon Titan Lambda Function Variables:
Anthropic Claude Lambda Function Variables:
Cohere Lambda Function Variables:
AI21 Lambda Function Variables:
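The following sketch shows how these environment variables might be mapped onto each provider's request schema (parameter names follow the Bedrock model documentation at the time of writing; the environment variable names and defaults are illustrative):

```python
import json
import os

# Each Lambda receives its model ID and sampling parameters as environment
# variables; the variable names and defaults here are illustrative only.
MODEL_ID = os.environ.get("MODEL_ID", "anthropic.claude-v2")
TEMPERATURE = float(os.environ.get("TEMPERATURE", "0.5"))
TOP_P = float(os.environ.get("TOP_P", "0.9"))
TOP_K = int(os.environ.get("TOP_K", "250"))
MAX_TOKENS = int(os.environ.get("MAX_TOKENS", "512"))


def build_body(prompt: str) -> str:
    """Map the shared parameters onto each provider's request schema."""
    if MODEL_ID.startswith("amazon.titan"):
        return json.dumps({
            "inputText": prompt,
            "textGenerationConfig": {
                "temperature": TEMPERATURE,
                "topP": TOP_P,
                "maxTokenCount": MAX_TOKENS,
            },
        })
    if MODEL_ID.startswith("anthropic.claude"):
        return json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "temperature": TEMPERATURE,
            "top_p": TOP_P,
            "top_k": TOP_K,
            "max_tokens_to_sample": MAX_TOKENS,
        })
    if MODEL_ID.startswith("cohere.command"):
        return json.dumps({
            "prompt": prompt,
            "temperature": TEMPERATURE,
            "p": TOP_P,
            "k": TOP_K,
            "max_tokens": MAX_TOKENS,
        })
    # AI21 Jurassic models do not take a top-K parameter in this schema.
    return json.dumps({
        "prompt": prompt,
        "temperature": TEMPERATURE,
        "topP": TOP_P,
        "maxTokens": MAX_TOKENS,
    })
```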

User Interface for Feedback Collection

We used Streamlit to build a simple but effective UI that presents the cases where an LLM's response in any category differs from its previous one and collects feedback from a human evaluator. The user can choose 0 to mark a response "INCORRECT", 1 for "CORRECT", or 2 for "EXCELLENT" to guide improvements.

The Streamlit app first fetches all of the curated prompts along with each LLM's most recent response and its previous, differing response. It then iterates through the identified cases, presents each one to the user, collects the feedback, and stores it in the DynamoDB table. We deployed our Streamlit app in SageMaker Studio from a terminal instance, and here is how it looks with an example.
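A minimal sketch of such an app is shown below; the table names, item attributes, and scan-based fetch are assumptions that match the earlier sketches rather than the exact repository code:

```python
import boto3
import streamlit as st

dynamodb = boto3.resource("dynamodb")
results_table = dynamodb.Table("bedrock_benchmark")    # results table from the workflow
feedback_table = dynamodb.Table("benchmark_feedback")  # illustrative feedback table name

st.title("Bedrock LLM Benchmark - Human Review")

# Fetch the recorded runs; in practice only prompts whose latest response
# differs from the previous one would be presented for review.
items = results_table.scan()["Items"]

for item in items:
    st.subheader(f"{item['model_id']} - {item['category']}")
    st.markdown(f"**Prompt:** {item['prompt']}")
    st.markdown(f"**Latest response:** {item['response']}")
    score = st.radio(
        "Score this response",
        options=[0, 1, 2],
        format_func=lambda s: {0: "INCORRECT", 1: "CORRECT", 2: "EXCELLENT"}[s],
        key=f"score-{item['model_id']}-{item['run_at']}",
    )
    if st.button("Save feedback", key=f"save-{item['model_id']}-{item['run_at']}"):
        feedback_table.put_item(
            Item={
                "model_id": item["model_id"],
                "run_at": item["run_at"],
                "category": item["category"],
                "score": score,
            }
        )
        st.success("Feedback saved")
```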

If you keep running this workflow and collecting user feedback, you can quickly create informative radar charts that show each LLM's capabilities across the different tasks. You can also monitor the trend over time to gauge how stable a given model is.
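As a sketch of what such a chart might look like, the snippet below plots average feedback scores per category for a single model with matplotlib; the scores are made-up placeholder values, not real benchmark results:

```python
import numpy as np
import matplotlib.pyplot as plt

# Average human-feedback scores (0-2) per category for one model.
# These numbers are placeholders purely to illustrate the chart.
categories = ["knowledge", "code", "instruct", "creativity", "reflexion"]
scores = [1.6, 1.2, 1.8, 1.4, 1.0]

# Close the polygon by repeating the first point.
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
scores += scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, scores, linewidth=2)
ax.fill(angles, scores, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 2)
ax.set_title("Example: average feedback score by category")
plt.show()
```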

Enhancing Model Monitoring

Going beyond the examples shared here, Caylent is extending this framework to include response performance and cost so that the price/performance of alternative models can also be compared to evaluate return on investment. We'll also be testing Model Evaluation on Amazon Bedrock during preview to take advantage of the new service capabilities. As LLMs continue to evolve rapidly, we recommend benchmarking regularly to ensure you're using the most appropriate model for your needs.

Conclusion

In this quick guide, we presented a simple solution for building your curated dataset and a human-in-the-loop workflow with AWS services and Streamlit, so you can benchmark the LLMs available on Bedrock and choose the best option for your specific tasks.

Is your company trying to figure out where to go with generative AI? Consider finding a partner who can help you get there. At Caylent, we have a full suite of generative AI offerings. Starting with our Generative AI Strategy Caylent Catalyst, we can start the ideation process and guide you through the art of the possible for your business. Using these new ideas, we can implement our Generative AI Knowledge Base Caylent Catalyst to build a quick, out-of-the-box solution integrated with your company's data to enable powerful search capabilities using natural language queries. Caylent’s AWS Generative AI Proof of Value Caylent Catalyst can help you build an AI roadmap for your company and demonstrate how generative AI will play a part. As part of these Catalysts, our teams will help you understand your custom roadmap for generative AI and how Caylent can help lead the way. For companies ready to take their generative AI initiatives beyond the scope of Caylent's prebuilt offerings, we can tailor an engagement exactly to your requirements.

Ali Arabi

Ali Arabi is a Senior Machine Learning Architect at Caylent with extensive experience in solving business problems by building and operationalizing end-to-end cloud-based Machine Learning and Deep Learning solutions and pipelines using Amazon SageMaker. He holds an MBA and an MSc in Data Science & Analytics and is an AWS Certified Machine Learning professional.

Randall Hunt

Randall Hunt, VP of Cloud Strategy and Innovation at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that Javascript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski.
