2025 GenAI Whitepaper

LLMOps on AWS: Reference Architecture

Generative AI & LLMOps

Discover how to build an LLMOps platform on AWS. Learn about data preparation, prompt engineering, model fine-tuning, and more to operationalize LLMs on AWS.

In my last post, I presented a reference architecture for building end-to-end MLOps solutions on AWS based on the SageMaker ecosystem. With the emergence of Large Language Models (LLMs), enterprises are eager to adopt the latest AI advances to transform their business.

To succeed with generative AI, it is crucial to design a cost-effective, scalable, differentiated, and well-governed solution to manage the model lifecycle and account for the rapid pace of change. Enterprises need to leverage their substantial private data to tailor these applications for their specialized domain and competitive advantages. Seamlessly and efficiently integrating all of these concerns can be a complex undertaking, but this post will give the reader actionable advice to begin building their LLMOps platform.

What is LLMOps?

LLMOps, or Large Language Model Operations, refers to the set of practices, tools, and workflows required to effectively build, deploy, monitor, and maintain large language models (LLMs) in production environments. It encompasses the entire lifecycle of LLMs, from data preparation and model training to deployment, monitoring, and iterative improvement. LLMOps aims to ensure that these complex models are robust, scalable, and aligned with business objectives, while also adhering to ethical and governance standards.

LLMOps involves a series of best practices designed to streamline the workflow associated with the development and upkeep of large language models. These practices include automated data pipeline creation, prompt engineering, model fine-tuning, and continuous monitoring. By standardizing these processes, LLMOps aims to streamline model training, enhance model performance, and ensure compliance with organizational policies and external regulations.

LLMOps vs MLOps

While both LLMOps and MLOps focus on the operationalization of machine learning models, they differ in scope and complexity due to the unique characteristics of large language models.

  • Scale and Complexity: LLMs are typically orders of magnitude larger than traditional ML models, requiring more computational resources for training and inference. LLMOps must therefore incorporate specialized techniques for handling large-scale data and models, such as distributed training and inference optimization.
  • Prompt Engineering: Unlike traditional ML models, LLMs often rely heavily on prompt engineering to achieve specific tasks. LLMOps includes practices for creating, versioning, and monitoring prompts, which is less prevalent in general MLOps.
  • Evaluation Metrics: The performance metrics for LLMs often differ from those used in traditional ML. LLMOps includes specialized evaluation techniques like BLEU, METEOR, ROUGE, and human-in-the-loop assessments to gauge the quality of generated text.
  • Retrieval Augmented Generation (RAG): LLMOps frequently employs RAG to incorporate external, proprietary data into model responses, adding another layer of complexity not commonly found in traditional ML workflows.
  • Hosting and Optimization: Due to the size of LLMs, LLMOps requires advanced hosting solutions and inference optimization techniques, such as model parallelism and tensor parallelism, which may not be necessary for smaller ML models.

In summary, LLMOps extends the principles of MLOps to address the unique challenges posed by large language models, ensuring they are effectively integrated into production environments.

Adding LLM components to MLOps AWS Architecture

Today, I am proposing an extension of MLOps for the operationalization of Large Language Model (LLM) solutions, denoted as LLMOps, specifically on the Amazon Web Services (AWS) platform. The emphasis will be on the modifications and augmentations necessary to tailor our MLOps framework to accommodate LLM projects. Keep in mind that this is a rapidly evolving field, and our recommendations will evolve as AWS releases additional products in this space.

Data Preparation

LLMs are typically trained on a huge amount of text data to learn how to generate the next token. To adapt them for a downstream task such as text classification, you must prepare the relevant text and corresponding labels/annotations.

On AWS, there are several options, among others, for preparing data to train or tune LLMs, such as Amazon Textract and Amazon Comprehend. These AI services from AWS extract text and entities from images or documents and can classify documents. If you need to implement your own logic, you can leverage AWS Lambda functions. Finally, Amazon SageMaker Ground Truth enables you to annotate or label text data using human workforces or by setting up an active learning workflow.
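As a rough sketch of this step, the snippet below uses Amazon Textract to pull text out of a scanned document in S3 and Amazon Comprehend to attach a coarse label before emitting a training record. The bucket, key, and labeling choice are hypothetical; in practice a custom Comprehend classifier or Ground Truth labels would usually replace the sentiment shortcut.

```python
import json
import boto3

textract = boto3.client("textract")
comprehend = boto3.client("comprehend")

def build_training_record(bucket: str, key: str) -> dict:
    """Extract raw text from a scanned document and attach a coarse label."""
    # Synchronous text detection on a document stored in S3
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    text = " ".join(lines)

    # Use Comprehend to derive a simple label (dominant sentiment here;
    # a custom classifier endpoint could be used instead)
    sentiment = comprehend.detect_sentiment(Text=text[:5000], LanguageCode="en")

    return {"text": text, "label": sentiment["Sentiment"]}

# Hypothetical bucket/key; one JSON Lines record per document is a common
# input format for downstream fine-tuning jobs
record = build_training_record("my-llm-data-bucket", "contracts/doc-001.png")
print(json.dumps(record))
```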

Image 1: Data Preparation for LLM

Prompt Engineering and Prompt Catalogs

The simplest and cheapest approach to harness the capabilities of an LLM in your GenAI application is to use “In-context learning”. It provides instructions at inference time using techniques like “zero-shot”, “one-shot”, and “few-shot” learning depending on how many demonstrations are provided, as proposed in the paper called “Language Models are Few-Shot Learners”.
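For example, a few-shot prompt simply prepends a handful of labeled demonstrations to the user input. Below is a minimal sketch against a text model on Amazon Bedrock; the model ID and demonstrations are placeholders, and any Bedrock text model your account can access would work the same way.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

# Few-shot: the prompt carries a handful of labeled demonstrations
FEW_SHOT_PROMPT = """Classify the sentiment of each review as Positive or Negative.

Review: "The checkout process was fast and painless."
Sentiment: Positive

Review: "Support never answered my ticket."
Sentiment: Negative

Review: "{review}"
Sentiment:"""

def classify(review: str) -> str:
    body = {
        "inputText": FEW_SHOT_PROMPT.format(review=review),
        "textGenerationConfig": {"maxTokenCount": 5, "temperature": 0},
    }
    # Placeholder model ID; swap in whichever Bedrock text model you use
    response = bedrock.invoke_model(
        modelId="amazon.titan-text-express-v1", body=json.dumps(body)
    )
    result = json.loads(response["body"].read())
    return result["results"][0]["outputText"].strip()

print(classify("Delivery took three weeks and the box was damaged."))
```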

Prompt Engineering

There are also more advanced techniques, such as chain-of-thought prompting, where you provide intermediate reasoning steps in the prompt so the language model can perform more sophisticated tasks. As the number of projects utilizing LLMs in an organization grows, it is recommended that all engineered prompts be cataloged and versioned for repeatability. This makes it simple to reuse prompts for additional projects and, more importantly, to monitor them for any drift over time.
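There is no single prescribed service for the catalog itself; as one lightweight illustration, versioned prompt templates could be kept in AWS Systems Manager Parameter Store, which records a version history automatically. The parameter names and template below are hypothetical.

```python
import boto3

ssm = boto3.client("ssm")

def publish_prompt(name: str, template: str) -> int:
    """Store a prompt template; Parameter Store bumps the version on overwrite."""
    response = ssm.put_parameter(
        Name=f"/prompt-catalog/{name}",
        Value=template,
        Type="String",
        Overwrite=True,
    )
    return response["Version"]

def get_prompt(name: str, version: int | None = None) -> str:
    """Fetch the latest prompt, or pin a specific version for repeatability."""
    selector = f"/prompt-catalog/{name}" + (f":{version}" if version else "")
    return ssm.get_parameter(Name=selector)["Parameter"]["Value"]

version = publish_prompt(
    "support-summary",
    "Summarize the following support ticket in two sentences:\n{ticket}",
)
template = get_prompt("support-summary", version)  # pinned for reproducibility
```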

Image 2: Prompt Catalog

Model Fine-Tuning

Although prompt engineering is the most straightforward approach to implement, it can only provide initial instructions to the model. Moreover, all LLMs are constrained by the number of tokens you can pass to them at inference time. Also, using a long prompt can increase the total cost of ownership (TCO) if you use a paid, per-token service. The next logical step is to fine-tune an LLM on a curated dataset to make it more domain-specific.

Depending on the availability of labeled data and of computational power, you have two options: full fine-tuning or parameter-efficient fine-tuning (PEFT). Fully fine-tuning the foundation model (FM) requires thousands of examples and high computational power to tune all the weights and biases of the selected model. PEFT, on the other hand, can be performed with tens of examples and far less computational power, making it a more cost-effective option. You can use LLMs available from leading AI startups through APIs on Amazon Bedrock, LLMs maintained on SageMaker JumpStart, or any open-source or proprietary LLM on SageMaker AI to train or fine-tune on custom datasets. If you prefer to keep full control of the training pipeline and do not need the out-of-the-box capabilities provided by SageMaker AI, such as data capture, model monitoring, and auto scaling, you may choose to run the workload on Amazon EKS or on Amazon EC2 DL1 instances, a low cost-to-train option for deep learning models.
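As a sketch of the PEFT path, the snippet below attaches a LoRA adapter to an open-source causal language model with the Hugging Face peft library; the base model ID and hyperparameters are placeholders, and the same script could run inside a SageMaker training job.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder base model; substitute the FM you actually have access to
BASE_MODEL = "tiiuae/falcon-7b"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# LoRA trains small low-rank adapter matrices instead of all model weights
lora_config = LoraConfig(
    r=8,              # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full model

# From here, train with the standard transformers Trainer on your curated
# domain dataset, then save only the adapter weights with model.save_pretrained(...)
```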

Image 3: LLM Tuning Pipeline

Retrieval Augmented Generation (RAG)

If you need the LLM to create responses that leverage additional knowledge present in your proprietary data, then RAG is the solution. This technique uses semantic search to retrieve relevant and timely context to augment the LLM's responses for more accurate results. It typically involves creating embeddings and setting up a vector store.


On AWS there are several options to choose from for the vector store, including pgvector on Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, the Amazon OpenSearch Serverless vector store, and Amazon Kendra. The benefit of using Amazon Kendra, though, is that it handles both embedding and semantic search for you, so you are not dependent on a separate FM for embeddings. This makes your GenAI application design simpler and more modular.

It is also worth noting that, compared to fine-tuning, RAG offers higher flexibility and is more cost-effective. Many solutions use a combination of fine-tuning and RAG.
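A bare-bones sketch of the retrieval step is shown below, assuming an Amazon Titan embedding model on Bedrock and a hypothetical pgvector-enabled PostgreSQL table named documents that already holds pre-embedded chunks.

```python
import json
import boto3
import psycopg2

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> list[float]:
    """Create an embedding for the query with a Bedrock embedding model."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

def retrieve_context(question: str, top_k: int = 3) -> list[str]:
    """Semantic search over a pgvector table of pre-embedded document chunks."""
    query_vector = embed(question)
    # Hypothetical connection details; credentials come from your usual config
    with psycopg2.connect(host="my-aurora-cluster", dbname="rag", user="app") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT chunk_text FROM documents "
                "ORDER BY embedding <=> %s::vector LIMIT %s",
                (json.dumps(query_vector), top_k),
            )
            return [row[0] for row in cur.fetchall()]

# The retrieved chunks are then placed into the prompt that is sent to the LLM
context = "\n".join(retrieve_context("What is our refund policy for enterprise plans?"))
```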

Image 4: Retrieval Augmented Generation (RAG)

Processing Power

Due to the large size of LLMs and of the datasets required to train or fine-tune them, GPUs are typically required to fit the model or the data in instance memory. AWS offers additional options beyond GPUs, such as AWS Trainium, a machine learning (ML) accelerator purpose-built for deep learning training of 100B+ parameter models. You can further leverage AWS's custom silicon for inference with AWS Inferentia2, which provides high performance at low cost for deep learning inference.

In many cases, distributing the model or the data across several GPUs is required. SageMaker AI supports distributed training on single and multiple instances out of the box. It also supports other frameworks and packages, such as PyTorch DistributedDataParallel (DDP), for which all you need to do is add additional configuration to your training estimator, as illustrated in the code snippet below.

Image 5: Distributed Training on SageMaker
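As a minimal sketch of that extra configuration, the estimator below enables the SageMaker distributed data parallel library through the distribution argument of the SageMaker Python SDK; the entry point script, role ARN, S3 path, and instance choices are placeholders.

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",            # your training script (placeholder)
    source_dir="src",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    framework_version="2.0",
    py_version="py310",
    instance_type="ml.p4d.24xlarge",   # 8 GPUs per instance
    instance_count=2,                  # 16 GPUs in total
    # The extra configuration that turns on distributed data parallelism
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    hyperparameters={"epochs": 3, "per_device_batch_size": 4},
)

estimator.fit({"train": "s3://my-bucket/llm-finetune/train/"})
```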

When using the data or model parallelism features in SageMaker AI, it is recommended to use instances with high-throughput, low-latency networking.

Evaluation

If you use LLMs for traditional ML problems such as sentiment analysis or classification, you can still use traditional ML performance metrics such as accuracy, precision, and recall. However, evaluating an LLM's performance depends on the task it is used for, and different metrics apply. Some of the well-known ones are listed below, followed by a short scoring sketch:

  • BLEU for machine translation, which compares the precision of n-grams between the reference text and the generated output
  • METEOR for machine translation, which is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision
  • ROUGE for summarization, which considers precision, recall, and F1 score over a sequence
  • Perplexity for text generation, which measures how well the trained model has learned the distribution of the text
  • BERTScore for text generation, which measures embedding similarity
  • CIDEr, which measures the similarity between a generated caption for an image and the reference captions
  • SPICE, which evaluates the caption generated for an image by placing importance on capturing details about objects, attributes, and relationships (semantic propositions)
  • Using larger models with a bigger number of parameters to evaluate responses generated by smaller language models
  • Universal benchmarks covering a range of tasks such as machine translation, summarization, natural language generation, and question answering. Well-known examples include The Pile, GLUE, SuperGLUE, MMLU, LAMBADA, and BIG-bench
  • Code-specific benchmarks such as HumanEval and Mostly Basic Python Programming (MBPP)
  • Human in the loop (HITL), where human evaluators provide feedback on the quality of the text generated by the model
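As a small example of automated scoring, the Hugging Face evaluate library can compute ROUGE for a summarization task; the predictions and references below are invented for illustration.

```python
import evaluate

rouge = evaluate.load("rouge")

# Model outputs and human-written references for the same inputs
predictions = [
    "The customer asked for a refund because the device stopped charging.",
]
references = [
    "Customer requests a refund after the device no longer charges.",
]

scores = rouge.compute(predictions=predictions, references=references)
# Returns F-measure style scores such as rouge1, rouge2, and rougeL
print(scores)
```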

Hosting Options

Traditional hosting solutions used for smaller models do not provide the optimization functionality required to host LLMs with optimal inference latency and/or throughput. With SageMaker AI you can use large model inference (LMI) container images with Deep Java Library (DJL) Serving. SageMaker AI supports model parallelism and inference optimization libraries, including:

  • DeepSpeed, an open-source inference optimization library
  • Hugging Face Accelerate, an open-source model-parallel inference library
  • FasterTransformer, an NVIDIA open-source library for efficiently running transformer-based neural network inference

Additionally, you can host LLMs on SageMaker AI using the Hugging Face LLM Deep Learning Container (DLC), enabling high-performance text generation through tensor parallelism, dynamic batching, and model quantization.
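A sketch of that deployment path with the SageMaker Python SDK is shown below; the model ID, container version, role ARN, and instance type are placeholders to adapt to your account and region.

```python
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Text Generation Inference (TGI) container image for the chosen region
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    env={
        "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # placeholder model
        "SM_NUM_GPUS": "4",            # tensor parallel degree
        "MAX_INPUT_LENGTH": "1024",
        "MAX_TOTAL_TOKENS": "2048",
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
)

print(predictor.predict({"inputs": "Explain LLMOps in one sentence."}))
```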

Monitoring

You should closely monitor prompts, generated responses, and RAG performance in production, along with data quality, model quality, infrastructure utilization, endpoint latency, and throughput.

It is also important to compare the model's outputs against a baseline over time to catch drift or eventual degradation in the quality of the generated responses.

Monitoring metrics

Some of the proposed metrics for monitoring prompts and responses are context, relevance to the prompt, repetitive patterns, repeatability, readability, token size, jailbreaking, prompt injection, refusals, sentiment, toxicity, and hallucination. For RAG, it is recommended to monitor chunk size, the produced embeddings, embedding speed, and semantic search results. All of these aspects need to be monitored over time to facilitate timely remediation, for example through retraining.
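As a simple illustration of a few of these per-response signals, the sketch below computes token size, repetitiveness, a refusal flag, and sentiment via Amazon Comprehend; toxicity and hallucination checks would typically come from a dedicated evaluator model or guardrail service, and the refusal heuristic here is purely illustrative.

```python
import boto3

comprehend = boto3.client("comprehend")

def response_metrics(prompt: str, response: str) -> dict:
    """Compute a few lightweight monitoring signals for a single LLM response."""
    tokens = response.split()
    repetitiveness = 1 - len(set(tokens)) / max(len(tokens), 1)
    sentiment = comprehend.detect_sentiment(
        Text=response[:5000], LanguageCode="en"
    )["Sentiment"]
    return {
        "prompt_length": len(prompt.split()),
        "response_length": len(tokens),     # rough token-size proxy
        "repetitiveness": round(repetitiveness, 3),
        "sentiment": sentiment,
        "refusal": response.lower().startswith(("i cannot", "i'm sorry")),
    }

# These records can be emitted to CloudWatch or captured by SageMaker Model Monitor
print(response_metrics("Summarize the ticket.", "I'm sorry, I can't help with that."))
```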

Testing strategies

It is also possible to implement a shadow testing or A/B testing pipeline to compare different prompts, embeddings, and/or LLMs over time and choose the best one for each use case. SageMaker AI provides these features out of the box, and you can also combine them with open-source tools such as MLflow to get the best of all of them.
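For example, an A/B split between two candidate LLM variants behind a single SageMaker endpoint can be expressed with weighted production variants; the model names, instance types, and weights below are hypothetical.

```python
import boto3

sm = boto3.client("sagemaker")

# Two already-registered models served behind a single endpoint,
# with an 80% / 20% traffic split between them
sm.create_endpoint_config(
    EndpointConfigName="llm-ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "llm-v1",
            "ModelName": "llm-model-v1",
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.8,
        },
        {
            "VariantName": "llm-v2",
            "ModelName": "llm-model-v2",
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,
        },
    ],
)

sm.create_endpoint(EndpointName="llm-ab-test", EndpointConfigName="llm-ab-test-config")
# Per-variant invocation metrics in CloudWatch show which variant performs better,
# and weights can later be shifted with update_endpoint_weights_and_capacities.
```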

SageMaker A2I

In many cases, where automated monitoring is not possible or a relevant benchmark is not available for your specific domain, you can use the human-in-the-loop capabilities of SageMaker AI through Amazon SageMaker Augmented AI (A2I) to implement human review of LLM predictions in production.

Image 6: Human in the loop - SageMaker A2I
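A sketch of triggering a human review loop at inference time is shown below, assuming a SageMaker A2I human review workflow (flow definition) has already been created; the flow definition ARN and confidence threshold are placeholders.

```python
import json
import uuid
import boto3

a2i = boto3.client("sagemaker-a2i-runtime")

# Placeholder ARN of an existing A2I flow definition
FLOW_DEFINITION_ARN = (
    "arn:aws:sagemaker:us-east-1:123456789012:flow-definition/llm-review"
)

def maybe_send_for_review(prompt: str, response: str, confidence: float) -> None:
    """Route low-confidence LLM responses to human reviewers."""
    if confidence >= 0.8:   # hypothetical threshold
        return
    a2i.start_human_loop(
        HumanLoopName=f"llm-review-{uuid.uuid4().hex[:8]}",
        FlowDefinitionArn=FLOW_DEFINITION_ARN,
        HumanLoopInput={
            "InputContent": json.dumps({"prompt": prompt, "response": response})
        },
    )

maybe_send_for_review(
    "Summarize the contract clause.",
    "The clause limits liability to direct damages only.",
    confidence=0.55,
)
```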

Final LLMOps Solution Architecture

If we add all of the above components to our existing MLOps solution, it turns into the following high-level architecture, which serves as an end-to-end solution to automate and operationalize LLMs. Of course, AWS is constantly releasing new services, so you should expect this architecture to change slightly over time.

Image 7: LLMOps Reference Architecture on AWS

The Caylent approach to Generative AI

If you’d like to turn your generative AI vision into reality with strategy workshops, rapid prototyping, and production-ready projects leveraging Caylent’s deep AWS MLOps and GenAI experience, get in touch with our Artificial Intelligence & MLOps experts to see how we can help you design and implement scalable and future-proof solutions. For some examples, take a look at our Generative AI offerings.

Ali Arabi

Ali Arabi is a Senior Machine Learning Architect at Caylent with extensive experience in solving business problems by building and operationalizing end-to-end cloud-based machine learning and deep learning solutions and pipelines using Amazon SageMaker AI. He holds an MBA and an MSc in Data Science & Analytics and is an AWS Certified Machine Learning professional.
