LLMOps Reference Architecture - MLOps for Large Language Models on AWS

Artificial Intelligence & MLOps

Begin building an LLMOps platform that can automate and operationalize LLMs, from data preparation to prompt engineering and model fine-tuning.

In my last post, I presented a reference architecture for building end-to-end MLOps solutions on AWS based on the SageMaker ecosystem. With the emergence of Large Language Models (LLMs), enterprises are eager to adopt the latest AI advances to transform their business.

To succeed with generative AI, it is crucial to design a cost-effective, scalable, differentiated, and well-governed solution to manage the model lifecycle and account for the rapid pace of change. Enterprises need to leverage their substantial private data to tailor these applications for their specialized domain and competitive advantages. Seamlessly and efficiently integrating all of these concerns can be a complex undertaking, but this post will give the reader actionable advice to begin building their LLMOps platform.

Today I am proposing an extension of MLOps for the operationalization of Large Language Model (LLM) solutions, denoted as LLMOps, specifically on the Amazon Web Services (AWS) platform. The emphasis will be on the modifications and additions needed to adapt our MLOps framework to LLM projects. Keep in mind that this is a rapidly evolving field and our recommendations will evolve as AWS releases additional products in this field.

Data Preparation:

LLMs are typically trained on a huge amount of text data to learn how to generate the next token. To adapt them for a downstream task such as text classification, you need to prepare the relevant text and the corresponding labels or annotations. AWS offers several options for preparing data to train or tune LLMs. Amazon Textract and Amazon Comprehend, both part of the AWS AI Services, can extract text and entities from images or documents and classify documents. If you need to implement your own logic, you can use an AWS Lambda function. Finally, Amazon SageMaker Ground Truth lets you annotate or label text data with human labelers or by setting up an active learning workflow.

Image 1: Data Preparation for LLM
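
The custom-logic option in the data preparation step could look something like the sketch below. It is only an illustration: the event shape (a `documents` list with `text` and `label` keys), the function names, and the JSON Lines output format are all assumptions, not something AWS or this architecture prescribes.

```python
import json
import re


def clean_text(raw: str) -> str:
    """Replace control characters with spaces, then collapse all whitespace."""
    no_control = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", " ", raw)
    return re.sub(r"\s+", " ", no_control).strip()


def to_training_record(text: str, label: str) -> str:
    """Serialize one labeled example as a single JSON Lines record."""
    return json.dumps({"text": clean_text(text), "label": label})


def lambda_handler(event, context):
    """Hypothetical Lambda entry point; event["documents"] is an assumed shape."""
    records = [
        to_training_record(doc["text"], doc["label"])
        for doc in event.get("documents", [])
        if doc.get("text", "").strip()  # skip documents with no usable text
    ]
    return {"count": len(records), "jsonl": "\n".join(records)}
```

In a real pipeline the text would more likely come from Amazon Textract output in Amazon S3 and the records would be written back to S3 for the tuning job, but those integration details depend on your setup.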

Prompt Engineering and Prompt Catalogs:

The simplest and cheapest way to harness the capabilities of an LLM in your Gen AI application is “in-context learning”, which provides instructions at inference time. Depending on how many demonstrations are provided, the technique is called “zero-shot”, “one-shot”, or “few-shot” learning, as proposed in the paper “Language Models are Few-Shot Learners”. There are also more advanced techniques such as chain-of-thought prompting, in which reasoning steps are included in the prompt so that the model can perform sophisticated tasks that require reasoning. As the number of projects using LLMs in an organization grows, it is recommended to catalog and version all engineered prompts for repeatability. This makes it simple to reuse prompts in additional projects and, more importantly, to monitor them for drift over time.

Image 2: Prompt Catalog
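
To make the two ideas above concrete, here is a minimal sketch of a few-shot prompt builder and an in-memory, versioned prompt catalog. The `Input:`/`Output:` layout, the `PromptCatalog` class, and its method names are assumptions made for illustration; a production catalog would normally live in a database or a version-controlled repository.

```python
import hashlib
from dataclasses import dataclass, field


def build_few_shot_prompt(instruction, demonstrations, query):
    """No demonstrations gives a zero-shot prompt, one gives one-shot, and so on."""
    lines = [instruction, ""]
    for text, label in demonstrations:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


@dataclass
class PromptCatalog:
    """Hypothetical catalog: each name maps to an append-only list of versions."""
    _entries: dict = field(default_factory=dict)

    def register(self, name, template):
        """Store a template and return its 1-based version number."""
        versions = self._entries.setdefault(name, [])
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        # Re-registering identical text does not create a new version.
        if versions and versions[-1]["sha"] == digest:
            return len(versions)
        versions.append({"sha": digest, "template": template})
        return len(versions)

    def get(self, name, version=None):
        """Return the latest template, or a specific 1-based version."""
        versions = self._entries[name]
        index = len(versions) - 1 if version is None else version - 1
        return versions[index]["template"]
```

Pinning a project to an explicit version (for example `catalog.get("sentiment", version=1)`) is what makes results repeatable when the prompt is later revised.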

Model Fine-Tuning:

Although prompt engineering is comparatively the easiest approach to implement, it can only provide initial instructions to the model. Moreover, every LLM is constrained by the number of tokens that can be passed to it at inference time, and long prompts increase the total cost of ownership (TCO) when a paid, per-token service is used. The next logical step is to fine-tune an LLM on a curated dataset to make it more domain specific. Depending on the availability of labeled data and computational power, you have two options: full fine-tuning or parameter-efficient fine-tuning (PEFT). Full fine-tuning of the foundation model (FM) requires thousands of examples and high computational power to tune all the weights and biases of the selected model. PEFT, on the other hand, can be performed with as few as tens of examples and far less computational power, making it a more cost-effective option. Some of the common PEFT techniques include:

  • Low-Rank Adaptation (LoRA), which reduces the number of trainable parameters for downstream tasks by freezing the pre-trained model weights and training small low-rank matrices that are added alongside selected weight matrices in the transformer layers.
  • QLoRA, which extends LoRA by quantizing the frozen base model weights from higher-precision data types, such as Float32, to 4-bit data types (such as NF4), resulting in reduced memory demand.
  • Prompt-tuning / Prefix-tuning, which attach trainable soft prompt vectors to the input embeddings (prompt-tuning) or to the transformer layers (prefix-tuning) and train only those added parameters while keeping the pre-trained LLM frozen.
  • LLaMA-Adapter, a modified version of prefix-tuning in which soft prompts are added at the N top-most transformer layers, with a gate on the added attention that is initialized to zero instead of at random to avoid “corruption” of LLaMA’s original knowledge.
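
To see why LoRA is so much cheaper than full fine-tuning, the sketch below counts the trainable parameters for one weight matrix and shows the adapted forward pass, W x + (alpha / r) · B (A x), in plain Python. The 4096 × 4096 layer size and rank 8 are illustrative assumptions rather than values from any particular model, and real implementations use a tensor library instead of nested lists.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(w_frozen, a, b, x, alpha, rank):
    """Compute W x + (alpha / r) * B (A x); only A and B would be trained."""
    base = matvec(w_frozen, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


full = 4096 * 4096  # every weight in one square projection matrix
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"full: {full:,}  LoRA r=8: {lora:,}  ratio: {lora / full:.4%}")
```

Because B is typically initialized to zero, the adapted layer starts out identical to the frozen one and only drifts from it as A and B are trained.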

You can use LLMs available from leading AI startups through APIs on Amazon Bedrock, LLMs maintained on SageMaker JumpStart, or any open-source or proprietary LLM on SageMaker to train or fine-tune on custom datasets. If you prefer to keep full control of the training pipeline and do not need the out-of-the-box capabilities provided by SageMaker, such as data capture, model monitoring, and auto scaling, you may choose to run the workload on Amazon EKS or on Amazon EC2 DL1 instances, a low cost-to-train option for deep learning models.

Image 3: LLM Tuning Pipeline

Retrieval Augmented Generation (RAG):

If you need an LLM to produce responses that leverage additional context from proprietary data, then RAG is the solution. This technique uses semantic search to retrieve relevant and timely context to augment the LLM’s responses for more accurate results. It typically involves creating embeddings and setting up vector stores. AWS offers several options for the vector store, including pgvector in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, the Amazon OpenSearch Serverless vector store, and Amazon Kendra. The benefit of Amazon Kendra is that it can both create embeddings and perform semantic search, so you do not depend on another foundation model for embedding. This makes your Gen AI application design much simpler and more modular. It is also worth noting that, compared to fine-tuning, RAG is more flexible and more cost effective. Many solutions will use a combination of fine-tuning and RAG.

Image 4: Retrieval Augmented Generation (RAG)
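
The retrieve-then-augment flow can be sketched in a few lines. Note the loud assumption: `embed` below is a toy bag-of-words counter standing in for a real embedding model, so the search is lexical rather than truly semantic, and the in-memory list stands in for a vector store such as pgvector or OpenSearch Serverless. The function names and prompt wording are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words term count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, chunks, top_k=2):
    """Rank chunks by similarity to the query and return the best top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_rag_prompt(query, chunks, top_k=2):
    """Augment the question with the retrieved context before calling the LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Swapping `embed` for a real embedding model and `retrieve` for a vector store query changes the quality of the results, but not the overall shape of the pipeline.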

Processing Power:

Because of the large size of LLMs and of the datasets required to train or fine-tune them, GPUs are typically required so that the model or the data fits in the memory of the instance. AWS has additional options beyond GPUs, such as AWS Trainium, a machine learning (ML) accelerator purpose-built for deep learning training of 100B+ parameter models. You can further leverage AWS’s custom silicon for inference with AWS Inferentia2, which provides high performance at the lowest cost for deep learning inference.

In many cases, the model or the data must be distributed across several GPUs. SageMaker supports distributed training on single and multiple instances out of the box. It also supports other frameworks and packages such as PyTorch DistributedDataParallel (DDP), for which all you need to do is add some configuration to your training estimator, as illustrated in the code snippet below.

Image 5: Distributed Training on SageMaker
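
A minimal sketch of such an estimator configuration, assuming the SageMaker Python SDK and its `distribution` argument with the `pytorchddp` option. The script name, role ARN, bucket, framework version, and instance type are placeholders chosen for illustration; check them against the versions and instance types supported in your account. The helper only builds the keyword arguments, so it runs without the SDK installed.

```python
def ddp_estimator_kwargs(role_arn, instance_count=2):
    """Build keyword arguments for sagemaker.pytorch.PyTorch with DDP enabled."""
    return {
        "entry_point": "train.py",  # hypothetical training script
        "role": role_arn,
        "framework_version": "1.13.1",  # assumed version, adjust as needed
        "py_version": "py39",
        "instance_type": "ml.p4d.24xlarge",  # assumed multi-GPU instance type
        "instance_count": instance_count,
        # The additional configuration that turns on PyTorch DDP.
        "distribution": {"pytorchddp": {"enabled": True}},
    }


# Placeholder role ARN, not a real account.
kwargs = ddp_estimator_kwargs("arn:aws:iam::111122223333:role/ExampleSageMakerRole")

# With the SageMaker Python SDK installed and AWS credentials configured:
#   from sagemaker.pytorch import PyTorch
#   PyTorch(**kwargs).fit({"training": "s3://example-bucket/train/"})
```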

When using the data or model parallelism features in SageMaker, it is recommended to use instances with high-throughput, low-latency networking.

Model Evaluation:


If you are using LLMs for traditional ML problems such as sentiment analysis, or classification in general, you can still use traditional ML performance metrics such as accuracy, precision, and recall. For other tasks, however, different metrics are used to evaluate the LLM's performance, depending on the task. Some of the well-known ones are:

  • BLEU for machine translation, which compares the precision of n-grams between the generated output and the reference translation.
  • METEOR for machine translation, which is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
  • ROUGE for summarization, which considers precision, recall, and F1-score over overlapping n-grams or subsequences.
  • Perplexity for text generation, which measures how well the trained model has learned the distribution of the text.
  • BERTScore for text generation, which measures embedding similarity between the generated and reference text.
  • CIDEr, which measures the similarity between a generated caption for an image and the reference captions.
  • SPICE, which evaluates a caption generated for an image by placing importance on capturing details about objects, attributes, and relationships (semantic propositions).
  • Using larger models, with a larger number of parameters, to evaluate responses generated by smaller language models.
  • Universal benchmarks for a range of tasks: machine translation, summarization, natural language generation, question answering, and more. Some of the famous benchmarks are the Pile, GLUE, SuperGLUE, MMLU, LAMBADA, and BIG-bench.
  • Code-specific benchmarks: HumanEval and Mostly Basic Python Problems (MBPP).
  • Human in the loop (HITL), where human evaluators provide feedback on the quality of the text generated by the model.
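
A few of these metrics are simple enough to sketch directly. The functions below are simplified illustrations, not full implementations: `clipped_precision` is only the modified n-gram precision component of BLEU (single reference, no brevity penalty, no averaging over n), `rouge_n` is a bare ROUGE-N, and `perplexity` assumes you already have per-token natural-log probabilities from the model. Inputs are pre-tokenized lists, and the function names are our own.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """BLEU-style modified n-gram precision (one reference, no brevity penalty)."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    total = sum(cand.values())
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total if total else 0.0


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 from overlapping n-gram counts."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def perplexity(token_log_probs):
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

For reported results, established metric libraries are a safer choice than hand-rolled versions, since details such as tokenization and smoothing affect the scores.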

Hosting Options:

Traditional hosting solutions used for smaller models do not provide the optimizations required to host LLMs with optimal inference latency and/or throughput. With SageMaker, you can use large model inference container images with Deep Java Library (DJL) Serving, which support model parallelism and inference optimization libraries.

Additionally, you can host LLMs on SageMaker using the Hugging Face LLM Deep Learning Container (DLC), which enables high-performance text generation through tensor parallelism, dynamic batching, and model quantization.

Monitoring:


Prompts, generated responses, and RAG performance need to be closely monitored in production, along with data quality, model quality, infrastructure utilization, endpoint latency, and throughput.

It is also important to compare these measurements to a baseline over time to detect any drift or eventual degradation in the quality of the generated responses.

Some of the proposed metrics to monitor prompts/responses are: context, relevance to prompt, existence of pattern (repetitiveness), repeatability, readability, token size, jailbreaking, injection, refusal, sentiment, toxicity, response hallucination. For RAG it is recommended to monitor chunk size, produced embeddings, embedding speed, and semantic search results. All of these aspects need to be monitored over time to facilitate timely remediation through retraining.
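
Several of the prompt/response signals listed above can be approximated with cheap heuristics before reaching for heavier tooling. The sketch below computes token size, repetitiveness, a rough relevance-to-prompt proxy, and a refusal flag. The refusal phrase list, the word-level tokenization, and the metric names are illustrative assumptions; signals such as toxicity or hallucination need dedicated models and are not covered here.

```python
import re
from collections import Counter

# Illustrative phrases only; a real deployment would tune or replace this list.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def repetition_ratio(tokens, n=3):
    """Share of n-grams that repeat an earlier n-gram; 0.0 means no repetition."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    repeated = sum(count - 1 for count in Counter(grams).values())
    return repeated / len(grams)


def response_metrics(prompt, response):
    """Compute lightweight per-response signals suitable for logging over time."""
    tokens = re.findall(r"\w+", response.lower())
    prompt_terms = set(re.findall(r"\w+", prompt.lower()))
    shared = prompt_terms & set(tokens)
    return {
        "token_count": len(tokens),  # word-level count, not model tokens
        "repetition": repetition_ratio(tokens),
        "prompt_overlap": len(shared) / len(prompt_terms) if prompt_terms else 0.0,
        "refusal": any(marker in response.lower() for marker in REFUSAL_MARKERS),
    }
```

Logging these values per request and comparing their distributions against a baseline window is one simple way to surface drift.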

It is also possible to implement a shadow testing or A/B testing pipeline to compare different prompts, embeddings, and/or LLMs over time and choose the best one for each use case. SageMaker provides these features out of the box, and they can also be combined with open-source tools such as MLflow to get the best of both.
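
When the thing being compared is a prompt or an embedding rather than a hosted model, the traffic split can be done in application code. The sketch below assigns each user to a variant deterministically by hashing an identifier, so the same user always sees the same variant. The function name, the variant names, and the weights are hypothetical, and this does not replace SageMaker's own endpoint-level variant features.

```python
import hashlib


def assign_variant(user_id, weights):
    """Map a user to a variant deterministically, in proportion to the weights."""
    if not weights:
        raise ValueError("weights must contain at least one variant")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to the unit interval.
    point = int.from_bytes(digest[:8], "big") / 2**64
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if point < cumulative:
            return name
    return name  # guards against floating-point rounding at the upper edge
```

For example, `assign_variant("user-42", {"prompt_v1": 0.9, "prompt_v2": 0.1})` sends roughly 10% of users to the second prompt; logging the chosen variant next to the monitoring metrics is what makes the comparison possible.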

In cases where automated monitoring is not possible, or where no relevant benchmark is available for your specific domain, you can use the human-in-the-loop capabilities of SageMaker through Amazon Augmented AI (A2I) to implement human review of LLM predictions in production.

Image 6: Human in the loop - SageMaker A2I

Final LLMOps Solution Architecture:

If we add all of the above components to our existing MLOps solution, it becomes the following high-level architecture, which serves as an end-to-end solution to automate and operationalize LLMs. Of course, AWS is always releasing new services, so you should expect this architecture to change slightly after re:Invent 2023.

Image 7: LLMOps Reference Architecture on AWS

If you’d like to turn your generative AI vision into reality with strategy workshops, rapid prototyping, and production-ready projects leveraging Caylent’s deep AWS MLOps and GenAI experience, get in touch with our Artificial Intelligence & MLOps experts to see how we can help you design and implement scalable and future-proof solutions. For some examples, take a look at our Generative AI offerings.

Ali Arabi

Ali Arabi is a Machine Learning Architect at Caylent with extensive experience in solving business problems by building and operationalizing end-to-end cloud-based Machine Learning and Deep Learning solutions and pipelines using AWS SageMaker. He holds an MBA and an MSc in Data Science & Analytics and is an AWS Certified Machine Learning professional.
