Evaluating LLM Performance: A Benchmarking Framework on Amazon Bedrock

Artificial Intelligence & MLOps

Generative AI (GenAI) creates new opportunities for automated benchmarking by adding output variability and model cost dimensions to traditional performance metrics. In this blog, we share a framework for monitoring alignment and drift across several Large Language Models (LLMs) hosted on Amazon Bedrock.


New LLMs are released weekly, if not daily, and practitioners need to evaluate their performance to identify the most suitable models for specific tasks. To that end, different benchmark and evaluation tools and datasets have been developed depending on the task an LLM is designed to tackle, whether that is coding, chat, summarization, or instruction following. Popular benchmark datasets and leaderboards include HumanEval (code generation), MT-Bench (instruction-following ability), MBPP (Python code generation), and Chatbot Arena (chatbot assistants).

Assessing LLM performance through automated benchmarking is a considerable challenge because of the open-ended nature of the tasks, which makes it difficult to write a program that evaluates response quality automatically. Human-in-the-loop (HITL) evaluation, where human evaluators provide feedback on the quality of the text generated by the LLMs for different tasks, is therefore a viable solution. Amazon Comprehend's trust and safety features for LLMs offer an automated approach that can complement human review. AWS also launched the "Model Evaluation on Amazon Bedrock" feature at re:Invent 2023, still in preview at the time of writing, for creating and viewing model evaluation jobs. It provides three options for evaluating LLMs:

  • automatically evaluate a single model using recommended metrics for the built-in task types (text summarization, question and answer, text classification, and open-ended text generation);
  • evaluate up to two models using a work team of your choice to provide feedback; or
  • customize the number of models to evaluate using a work team designated by AWS.

If you choose one of the human evaluation options, you can also define a custom task on top of the built-in tasks.

In this post, we walk you through an example of a HITL framework for evaluating the performance of the LLMs available on Bedrock against your own curated datasets and specific tasks. You can find the code in this GitHub repository.

Overall Solution Architecture

From a high-level perspective, our solution consists of the following sections:

  • Model Repository
  • Prompt Repository
  • Prompt/Response Workflow Orchestration
  • UI to collect human evaluator feedback

We will use Bedrock as the model repository for the LLMs and Amazon DynamoDB as a prompt catalog to store the datasets/prompts used later in the process. To orchestrate the Bedrock API calls, parse the responses, and record the desired outputs, we use AWS Step Functions. Finally, we use Streamlit to build a quick UI for collecting user feedback. The following diagram shows how these components are connected to form our LLM benchmarking solution.

Model Repository

For our example, we have included the following LLMs from Bedrock:

Amazon:

  • Amazon Titan Large
  • Amazon Titan Express

Anthropic:

  • Anthropic Claude Instant
  • Anthropic Claude V2

Cohere:

  • Cohere Command

AI21:

  • Jurassic Mid
  • Jurassic Ultra

We will extend this list as new LLMs become available on Bedrock.
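Because this list will change as Bedrock adds models, it helps to discover what is currently available programmatically. Here is a minimal sketch using the boto3 bedrock control-plane client (the region is an assumption):

```python
import boto3

# The "bedrock" control-plane client exposes model discovery;
# the "bedrock-runtime" client is used later for invocations.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# List the text-generation models currently available in this region so that
# new releases can be added to the benchmark as they appear.
response = bedrock.list_foundation_models(byOutputModality="TEXT")
for model in response["modelSummaries"]:
    print(model["providerName"], "-", model["modelId"])
```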

Prompt Repository

We need to store all the prompts used to benchmark the LLMs, and DynamoDB is a great option for cataloging and versioning prompts for repeatability. For our example, we have used the following categories/datasets adapted from here: knowledge, code, instruct, creativity, and reflexion. Below is a snapshot of the DynamoDB table that holds our curated prompts.
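As a rough sketch of how a prompt might be written to the catalog (a minimal example using the boto3 DynamoDB resource; the table name "prompt_catalog" and its key schema are illustrative, not prescribed by the framework):

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("prompt_catalog")  # illustrative table name

# Each item stores the prompt text, its category, and a version number so that
# prompts can be reproduced exactly across benchmark runs.
table.put_item(
    Item={
        "prompt_id": "code-001",  # partition key (assumed)
        "version": 1,             # sort key (assumed), for repeatability
        "category": "code",
        "prompt": "Write a Python function that reverses a linked list.",
    }
)
```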

Prompt/Response Workflow Orchestration

We use a Step Functions state machine to orchestrate several AWS Lambda functions in parallel, calling the Bedrock API for each selected LLM and storing the responses in the DynamoDB table. Below, you can find the state machine definition and the resulting graph after a successful run.

Step Function Definition
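The full definition lives in the repository; the core idea is a Parallel state that fans out to one Lambda function per model provider. The following is a minimal sketch using boto3, with placeholder account IDs, ARNs, and function names:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# A Parallel state fans out to one Lambda per LLM provider so all models are
# prompted concurrently. The ARNs and names below are placeholders.
definition = {
    "StartAt": "InvokeAllModels",
    "States": {
        "InvokeAllModels": {
            "Type": "Parallel",
            "End": True,
            "Branches": [
                {
                    "StartAt": name,
                    "States": {
                        name: {
                            "Type": "Task",
                            "Resource": f"arn:aws:lambda:us-east-1:123456789012:function:{name}",
                            "End": True,
                        }
                    },
                }
                for name in ["titan", "claude", "cohere", "ai21"]
            ],
        }
    },
}

sfn.create_state_machine(
    name="bedrock-benchmark",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/bedrock-benchmark-sfn-role",
)
```

A daily schedule (for example, an Amazon EventBridge rule) can then start executions of this state machine, as described below.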
Lambda Functions:

We developed a Lambda function template and supplied LLM-specific prompt configuration and parameters to build four Lambda functions, one for each model provider.

The outputs of these Lambda functions capture the LLM response, the output token count (when it is part of the model response; otherwise, the number of characters), the latency, and the model parameters. The results are recorded in another DynamoDB table called "bedrock_benchmark". The Step Function is scheduled to trigger the Lambda functions daily. If an LLM's latest response in any category differs from its previous response, a record is added to this table so drift can be detected and tracked.

The full Lambda function code is available in the repository.
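As a rough illustration of its structure, here is a minimal sketch assuming a Titan-style request body; the event shape, field names, and drift check are simplified for illustration, and the repository version handles all four providers:

```python
import json
import os
import time
from datetime import datetime, timezone

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
dynamodb = boto3.resource("dynamodb")
results_table = dynamodb.Table("bedrock_benchmark")

# The model ID is supplied via an environment variable so the same handler
# template can be reused for each provider ("MODEL_ID" is an illustrative name).
MODEL_ID = os.environ["MODEL_ID"]


def lambda_handler(event, context):
    prompt = event["prompt"]
    category = event["category"]

    # Titan-style body; other providers use different field names (see below).
    body = json.dumps({"inputText": prompt})

    start = time.time()
    response = bedrock_runtime.invoke_model(modelId=MODEL_ID, body=body)
    latency_ms = int((time.time() - start) * 1000)

    payload = json.loads(response["body"].read())
    result = payload.get("results", [{}])[0]
    completion = result.get("outputText", "")

    # Fall back to character count when the response does not report a token count.
    token_count = result.get("tokenCount", len(completion))

    # Only write a record when the response differs from the previous one,
    # so the table doubles as a drift log (drift check simplified here).
    if completion != event.get("previous_response"):
        results_table.put_item(
            Item={
                "model_id": MODEL_ID,
                "run_at": datetime.now(timezone.utc).isoformat(),
                "category": category,
                "prompt": prompt,
                "response": completion,
                "latency_ms": latency_ms,
                "output_tokens": token_count,
            }
        )

    return {"model_id": MODEL_ID, "latency_ms": latency_ms, "response": completion}
```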

We originally had to time the invocation latency and compute the output token size in the code, but while writing this blog post, Bedrock invocation metrics, including InputTokenCount, OutputTokenCount, and InvocationLatency, were added to CloudWatch Metrics and to the invocation response itself. Thanks to AWS's continuous improvement, you can update the code above to read these metrics directly and simplify it.
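For example, the metrics are surfaced as HTTP response headers on invoke_model, so the manual timing and token counting can be dropped (a sketch; the header names are as documented at the time of writing, so verify them against the current API):

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-text-express-v1",
    body=json.dumps({"inputText": "Explain Amazon Bedrock in one sentence."}),
)

# Invocation metrics are returned as HTTP response headers, removing the need
# for manual timing and token counting in the Lambda function.
headers = response["ResponseMetadata"]["HTTPHeaders"]
input_tokens = int(headers["x-amzn-bedrock-input-token-count"])
output_tokens = int(headers["x-amzn-bedrock-output-token-count"])
latency_ms = int(headers["x-amzn-bedrock-invocation-latency"])
print(input_tokens, output_tokens, latency_ms)
```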

As you can see in the screenshots below, each LLM on Bedrock has a different input configuration, which is provided to the corresponding Lambda function as environment variables. You can tweak the parameters that influence the LLM response, including temperature, top P, max token count, and top K. The naming conventions differ across models, which makes the Lambda function lengthy in order to accommodate all the variations; a sketch of how these variables map onto each provider's request body follows the listings below. If you implement this workflow for a real application, adjust the parameters to fit your requirements.

Amazon Titan Lambda Function Variables:
Anthropic Claude Lambda Function Variables:
Cohere Lambda Function Variables:
AI21 Lambda Function Variables:
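The following sketch shows how these environment variables might be mapped onto each provider's request schema (parameter names follow the Bedrock model documentation at the time of writing; the environment variable names and defaults are illustrative):

```python
import json
import os

# Each Lambda receives its model ID and sampling parameters as environment
# variables; the variable names and defaults here are illustrative only.
MODEL_ID = os.environ.get("MODEL_ID", "anthropic.claude-v2")
TEMPERATURE = float(os.environ.get("TEMPERATURE", "0.5"))
TOP_P = float(os.environ.get("TOP_P", "0.9"))
TOP_K = int(os.environ.get("TOP_K", "250"))
MAX_TOKENS = int(os.environ.get("MAX_TOKENS", "512"))


def build_body(prompt: str) -> str:
    """Map the shared parameters onto each provider's request schema."""
    if MODEL_ID.startswith("amazon.titan"):
        return json.dumps({
            "inputText": prompt,
            "textGenerationConfig": {
                "temperature": TEMPERATURE,
                "topP": TOP_P,
                "maxTokenCount": MAX_TOKENS,
            },
        })
    if MODEL_ID.startswith("anthropic.claude"):
        return json.dumps({
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "temperature": TEMPERATURE,
            "top_p": TOP_P,
            "top_k": TOP_K,
            "max_tokens_to_sample": MAX_TOKENS,
        })
    if MODEL_ID.startswith("cohere.command"):
        return json.dumps({
            "prompt": prompt,
            "temperature": TEMPERATURE,
            "p": TOP_P,
            "k": TOP_K,
            "max_tokens": MAX_TOKENS,
        })
    # AI21 Jurassic models do not take a top-K parameter in this schema.
    return json.dumps({
        "prompt": prompt,
        "temperature": TEMPERATURE,
        "topP": TOP_P,
        "maxTokens": MAX_TOKENS,
    })
```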

User Interface for Feedback Collection

We used Streamlit to build a simple but effective UI that presents the cases where an LLM's response in any category differs from its previous one and collects feedback from a human evaluator. The user can choose 0 to mark a response "INCORRECT", 1 for "CORRECT", or 2 for "EXCELLENT" to guide improvements.

The Streamlit app first fetches all of the curated prompts along with each LLM's most recent response and its previous, differing response. It then iterates through the identified cases, presents each one to the user, collects the feedback, and stores it in the DynamoDB table. We deployed our Streamlit app in SageMaker Studio from a terminal instance, and here is how it looks with an example.
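A minimal sketch of such an app is shown below; the table names, item attributes, and scan-based fetch are assumptions that match the earlier sketches rather than the exact repository code:

```python
import boto3
import streamlit as st

dynamodb = boto3.resource("dynamodb")
results_table = dynamodb.Table("bedrock_benchmark")    # results table from the workflow
feedback_table = dynamodb.Table("benchmark_feedback")  # illustrative feedback table name

st.title("Bedrock LLM Benchmark - Human Review")

# Fetch the recorded runs; in practice only prompts whose latest response
# differs from the previous one would be presented for review.
items = results_table.scan()["Items"]

for item in items:
    st.subheader(f"{item['model_id']} - {item['category']}")
    st.markdown(f"**Prompt:** {item['prompt']}")
    st.markdown(f"**Latest response:** {item['response']}")
    score = st.radio(
        "Score this response",
        options=[0, 1, 2],
        format_func=lambda s: {0: "INCORRECT", 1: "CORRECT", 2: "EXCELLENT"}[s],
        key=f"score-{item['model_id']}-{item['run_at']}",
    )
    if st.button("Save feedback", key=f"save-{item['model_id']}-{item['run_at']}"):
        feedback_table.put_item(
            Item={
                "model_id": item["model_id"],
                "run_at": item["run_at"],
                "category": item["category"],
                "score": score,
            }
        )
        st.success("Feedback saved")
```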

If you keep running this workflow and collecting user feedback, you can quickly create informative radar charts that show each LLM's capabilities across the different tasks. You can also monitor the trend over time to gauge how stable a given model is.
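As a sketch of what such a chart might look like, the snippet below plots average feedback scores per category for a single model with matplotlib; the scores are made-up placeholder values, not real benchmark results:

```python
import numpy as np
import matplotlib.pyplot as plt

# Average human-feedback scores (0-2) per category for one model.
# These numbers are placeholders purely to illustrate the chart.
categories = ["knowledge", "code", "instruct", "creativity", "reflexion"]
scores = [1.6, 1.2, 1.8, 1.4, 1.0]

# Close the polygon by repeating the first point.
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
scores += scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, scores, linewidth=2)
ax.fill(angles, scores, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 2)
ax.set_title("Example: average feedback score by category")
plt.show()
```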

Enhancing Model Monitoring

Going beyond the examples shared here, Caylent is extending this framework to include response performance and cost so that the price/performance of alternative models can also be compared to evaluate return on investment. We'll also be testing Model Evaluation on Amazon Bedrock during preview to take advantage of the new service capabilities. As LLMs continue to evolve rapidly, we recommend benchmarking regularly to ensure you're using the most appropriate model for your needs.

Conclusion

In this quick guide, we presented a simple solution for building your curated dataset and a human-in-the-loop workflow with AWS services and Streamlit, so you can benchmark the LLMs available on Bedrock and choose the best option for your specific tasks.

Is your company trying to figure out where to go with generative AI? Consider finding a partner who can help you get there. At Caylent, we have a full suite of generative AI offerings. Starting with our Generative AI Strategy Caylent Catalyst, we can start the ideation process and guide you through the art of the possible for your business. Using these new ideas, we can implement our Generative AI Knowledge Base Caylent Catalyst to build a quick, out-of-the-box solution integrated with your company's data to enable powerful search capabilities using natural language queries. Caylent’s AWS Generative AI Proof of Value Caylent Catalyst can help you build an AI roadmap for your company and demonstrate how generative AI will play a part. As part of these Catalysts, our teams will help you understand your custom roadmap for generative AI and how Caylent can help lead the way. For companies ready to take their generative AI initiatives beyond the scope of Caylent's prebuilt offerings, we can tailor an engagement exactly to your requirements.

Ali Arabi

Ali Arabi is a Senior Machine Learning Architect at Caylent with extensive experience in solving business problems by building and operationalizing end-to-end cloud-based Machine Learning and Deep Learning solutions and pipelines using Amazon SageMaker. He holds an MBA and an MSc in Data Science & Analytics and is an AWS Certified Machine Learning professional.

Randall Hunt

Randall Hunt, VP of Cloud Strategy and Innovation at Caylent, is a technology leader, investor, and hands-on-keyboard coder based in Los Angeles, CA. Previously, Randall led software and developer relations teams at Facebook, SpaceX, AWS, MongoDB, and NASA. Randall spends most of his time listening to customers, building demos, writing blog posts, and mentoring junior engineers. Python and C++ are his favorite programming languages, but he begrudgingly admits that Javascript rules the world. Outside of work, Randall loves to read science fiction, advise startups, travel, and ski.
