Discover how to build an LLMOps platform on AWS. Learn about data preparation, prompt engineering, model fine-tuning, and more to operationalize LLMs on AWS.
Ali Arabi is a Senior Machine Learning Architect at Caylent with extensive experience in solving business problems by building and operationalizing end-to-end cloud-based Machine Learning and Deep Learning solutions and pipelines using Amazon SageMaker AI. He holds an MBA and an MSc in Data Science & Analytics and is an AWS Certified Machine Learning professional.
In my last post, I presented a reference architecture for building end-to-end MLOps solutions on AWS based on the SageMaker ecosystem. With the emergence of Large Language Models (LLMs), enterprises are eager to adopt the latest AI advances to transform their business.
To succeed with generative AI, it is crucial to design a cost-effective, scalable, differentiated, and well-governed solution to manage the model lifecycle and account for the rapid pace of change. Enterprises need to leverage their substantial private data to tailor these applications for their specialized domain and competitive advantages. Seamlessly and efficiently integrating all of these concerns can be a complex undertaking, but this post will give the reader actionable advice to begin building their LLMOps platform.
LLMOps, or Large Language Model Operations, refers to the set of practices, tools, and workflows required to effectively build, deploy, monitor, and maintain large language models (LLMs) in production environments. It encompasses the entire lifecycle of LLMs, from data preparation and model training to deployment, monitoring, and iterative improvement. LLMOps aims to ensure that these complex models are robust, scalable, and aligned with business objectives, while also adhering to ethical and governance standards.
LLMOps involves a series of best practices designed to streamline the workflow associated with the development and upkeep of large language models. These practices include automated data pipeline creation, prompt engineering, model fine-tuning, and continuous monitoring. By standardizing these processes, LLMOps aims to streamline model training, enhance model performance, and ensure compliance with organizational policies and external regulations.
While both LLMOps and MLOps focus on the operationalization of machine learning models, they differ in scope and complexity due to the unique characteristics of large language models.
In summary, LLMOps extends the principles of MLOps to address the unique challenges posed by large language models, ensuring they are effectively integrated into production environments.
Today, I am proposing an extension of MLOps for the operationalization of Large Language Model (LLM) solutions, denoted as LLMOps, specifically on the Amazon Web Services (AWS) platform. The emphasis will be on the modifications and augmentations necessary to tailor our MLOps framework to LLM projects. Keep in mind that this is a rapidly evolving field, and our recommendations will evolve as AWS releases additional products in this space.
LLMs are typically trained on a huge amount of text data to learn how to generate the next token. To adapt them for a downstream task such as text classification, you must prepare the relevant text and corresponding labels/annotations.
In AWS, there are several options for preparing data to train or tune LLMs, including Amazon Textract and Amazon Comprehend. These are AWS AI services that extract text and entities from images or documents, or classify documents. If you need to implement your own logic, you can use an AWS Lambda function. Finally, Amazon SageMaker Ground Truth enables you to annotate or label text data using human annotators or by setting up an active learning workflow.
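As a minimal sketch of the entity-extraction step, the function below turns an Amazon Comprehend `DetectEntities` response into a labeled record that could feed a tuning dataset. The record layout, the `min_score` threshold, and the helper names are assumptions made for illustration; the client is passed in so the logic can be exercised without AWS credentials.

```python
import json
from typing import Any, Dict, List


def entities_to_record(text: str, comprehend_client: Any, min_score: float = 0.8) -> Dict[str, Any]:
    """Annotate `text` with the entities Comprehend detects above `min_score`.

    `comprehend_client` is expected to behave like boto3.client("comprehend").
    The output record layout is a hypothetical one chosen for this example.
    """
    response = comprehend_client.detect_entities(Text=text, LanguageCode="en")
    annotations = [
        {
            "label": entity["Type"],
            "start": entity["BeginOffset"],
            "end": entity["EndOffset"],
            "text": entity["Text"],
        }
        for entity in response["Entities"]
        if entity["Score"] >= min_score
    ]
    return {"text": text, "annotations": annotations}


def to_jsonl(records: List[Dict[str, Any]]) -> str:
    """Serialize records as JSON Lines, one labeled example per line."""
    return "\n".join(json.dumps(record) for record in records)
```

With real credentials you would pass `boto3.client("comprehend")` as `comprehend_client`; low-confidence entities can then be routed to human labeling instead of being dropped.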
The simplest and cheapest approach to harness the capabilities of an LLM in your GenAI application is “in-context learning”. You provide instructions at inference time using techniques like “zero-shot”, “one-shot”, and “few-shot” learning, depending on how many demonstrations are included in the prompt, as proposed in the paper “Language Models are Few-Shot Learners”.
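To make the zero-, one-, and few-shot distinction concrete, here is a small sketch that assembles a prompt from an instruction, an optional list of demonstrations, and the query. The `Input:`/`Output:` layout and the function name are assumptions for illustration, not a format required by any particular model.

```python
from typing import List, Tuple


def build_prompt(instruction: str, demonstrations: List[Tuple[str, str]], query: str) -> str:
    """Build an in-context learning prompt.

    Zero demonstrations gives a zero-shot prompt, one gives one-shot,
    and several give few-shot.
    """
    lines = [instruction.strip(), ""]
    for example_input, example_output in demonstrations:
        lines += [f"Input: {example_input}", f"Output: {example_output}", ""]
    # The prompt ends with an open "Output:" for the model to complete.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


zero_shot = build_prompt("Classify the sentiment as positive or negative.", [], "Great service!")
few_shot = build_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved it", "positive"), ("Terrible experience", "negative")],
    "Great service!",
)
```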
There are also more advanced techniques, such as chain-of-thought, where you provide reasoning steps when prompting the language model so that it can perform more sophisticated tasks. As the number of projects using LLMs in an organization grows, it is recommended that all engineered prompts be cataloged and versioned for repeatability. This makes it simple to reuse prompts in additional projects and, more importantly, to monitor them for drift over time.
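One way to picture a prompt catalog is an append-only store that assigns a new version whenever a template's content changes. The sketch below is an in-memory illustration only; the class and field names are hypothetical, and a real catalog would persist entries in a database or object store.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str
    sha256: str  # content hash, useful for detecting silent edits


class PromptCatalog:
    """Hypothetical in-memory prompt catalog with simple integer versioning."""

    def __init__(self) -> None:
        self._history: Dict[str, List[PromptVersion]] = {}

    def register(self, name: str, template: str) -> PromptVersion:
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
        history = self._history.setdefault(name, [])
        if history and history[-1].sha256 == digest:
            return history[-1]  # unchanged template: reuse the latest version
        entry = PromptVersion(name, len(history) + 1, template, digest)
        history.append(entry)
        return entry

    def get(self, name: str, version: Optional[int] = None) -> PromptVersion:
        history = self._history[name]
        return history[-1] if version is None else history[version - 1]
```

Pinning a project to `catalog.get("summarize", version=1)` keeps its behavior repeatable even after the template is revised for other projects.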
Although prompt engineering is comparatively the most straightforward approach to implement, it can only provide initial instructions to the model. Moreover, all LLMs are constrained by the number of tokens that you can pass to them at inference time. A long prompt can also increase the total cost of ownership (TCO) if you use a service that charges per token. The next logical step is to fine-tune an LLM on a curated dataset to make it more domain-specific.
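The effect of prompt length on cost is simple arithmetic, sketched below. The per-1,000-token prices used in the example are made-up placeholders, not the pricing of any actual service; substitute the rates of the model you use.

```python
def monthly_token_cost(
    prompt_tokens: int,
    completion_tokens: int,
    requests_per_month: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Estimate monthly spend for a per-token priced LLM service."""
    per_request = (
        prompt_tokens / 1000 * input_price_per_1k
        + completion_tokens / 1000 * output_price_per_1k
    )
    return per_request * requests_per_month


# Hypothetical prices: 0.01 per 1k input tokens, 0.03 per 1k output tokens.
short_prompt = monthly_token_cost(500, 500, 1000, 0.01, 0.03)
long_prompt = monthly_token_cost(2000, 500, 1000, 0.01, 0.03)
```

Under these assumed prices, growing the prompt from 500 to 2,000 tokens raises the monthly estimate from about 20 to about 35 for the same traffic, which is the kind of comparison that motivates fine-tuning or retrieval over ever longer prompts.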
Depending on the availability of labeled data and computational power, you have two options: full fine-tuning or parameter-efficient fine-tuning (PEFT). Full fine-tuning of the foundation model (FM) requires thousands of examples and substantial computational power to tune all the weights and biases of the selected model. PEFT, on the other hand, can still be performed with tens of examples and far less computational power, making it a more cost-effective option. You can use LLMs available from leading AI startups through APIs on Amazon Bedrock, LLMs maintained on SageMaker JumpStart, or any open-source or proprietary LLM on SageMaker AI to train or fine-tune on custom datasets. If you prefer to keep full control of the training pipeline and do not need the out-of-the-box capabilities provided by SageMaker AI, such as data capture, model monitoring, and auto scaling, you may choose to run the workload on EKS or on Amazon EC2 DL1 instances, a low cost-to-train option for deep learning models.
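To give a sense of why PEFT needs so much less compute, the sketch below compares trainable parameter counts for a single weight matrix under full fine-tuning and under LoRA. LoRA is one PEFT method picked here purely as an illustration (the article does not name a specific method), and the 4096 x 4096 layer size and rank 8 are assumed example values.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Full fine-tuning updates every entry of the d_out x d_in weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA trains two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    return lora_params(d_in, d_out, rank) / full_finetune_params(d_in, d_out)


# Assumed example: one 4096 x 4096 projection matrix adapted with rank 8.
fraction = trainable_fraction(4096, 4096, 8)
```

For this assumed layer, the adapter trains 65,536 parameters instead of roughly 16.8 million, which is under half a percent of the matrix.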
If you need the LLM to generate responses that draw on additional information present in your proprietary data, then Retrieval Augmented Generation (RAG) is the solution. This technique uses semantic search to retrieve relevant and timely context to augment the LLM’s responses for more accurate results. It typically involves creating embeddings and setting up vector stores.
In AWS there are several options to choose from for the vector store, including pgvector in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, the Amazon OpenSearch Serverless vector store, and Amazon Kendra. The benefit of using Amazon Kendra, though, is that you can use it both to create embeddings and to perform semantic search, so you do not depend on another FM for embeddings. This makes your GenAI application design much simpler and more modular.
It is also worth noting that, compared to fine-tuning, RAG offers more flexibility and is more cost-effective. Many solutions will use a combination of fine-tuning and RAG.
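The retrieval step can be sketched in a few lines: embed the query, rank stored passages by cosine similarity, and prepend the best matches to the prompt. The three-dimensional vectors below are hand-written stand-ins for real embeddings, and the in-memory list stands in for a vector store; both are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float], store: List[Tuple[str, Sequence[float]]], k: int = 2) -> List[str]:
    """Return the k passages whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def augment(question: str, passages: List[str]) -> str:
    """Build a prompt that grounds the answer in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy store: (passage, made-up embedding).
store = [
    ("Refunds are issued within 14 days.", [1.0, 0.0, 0.0]),
    ("Shipping takes 3 to 5 business days.", [0.0, 1.0, 0.0]),
    ("Returns must include the receipt.", [0.9, 0.1, 0.0]),
]
top = retrieve([1.0, 0.0, 0.0], store, k=2)
prompt = augment("How long do refunds take?", top)
```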
Because of the size of large language models and of the datasets required to train or fine-tune them, GPUs are typically required to fit the model or the data in the memory of the instance. AWS has additional options beyond GPUs, such as AWS Trainium, a machine learning (ML) accelerator purpose-built for deep learning training of 100B+ parameter models. You can further leverage AWS’s custom silicon for inference with AWS Inferentia2, which provides high performance at the lowest cost for deep learning inference.
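A quick back-of-the-envelope calculation shows why memory becomes the constraint. The sketch below estimates the memory needed for model weights alone at a given precision; it deliberately ignores activations, gradients, and optimizer state, which add substantially more during training. The 80 GiB per-device figure is an assumed example, not a property of a specific instance type.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weights_gib(num_params: int, dtype: str) -> float:
    """Memory for the weights only, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def min_accelerators(num_params: int, dtype: str, memory_per_device_gib: float) -> int:
    """Lower bound on devices needed just to hold the weights."""
    return math.ceil(weights_gib(num_params, dtype) / memory_per_device_gib)


# A 100B-parameter model in bf16 needs roughly 186 GiB for weights alone,
# so it cannot fit on a single device with an assumed 80 GiB of memory.
devices = min_accelerators(100_000_000_000, "bf16", 80)
```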
In many cases, the model or the data must be distributed across several GPUs. SageMaker AI supports distributed training on single and multiple instances out of the box. It also supports other frameworks and packages, such as PyTorch DistributedDataParallel (DDP), for which all you need to do is add some configuration to your training estimator (see Image 5).
When using the data or model parallelism features in SageMaker AI, it is recommended to use instances with high-throughput, low-latency networking.
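As a rough sketch of the estimator configuration described above, the snippet below builds the keyword arguments for a SageMaker `PyTorch` estimator with DDP enabled through the `distribution` parameter. The entry point name, framework and Python versions, instance type, and instance count are assumptions chosen for illustration; check them against the containers and quotas available in your account.

```python
from typing import Any, Dict


def ddp_estimator_kwargs(role_arn: str, instance_count: int = 2) -> Dict[str, Any]:
    """Keyword arguments for sagemaker.pytorch.PyTorch with PyTorch DDP enabled."""
    return {
        "entry_point": "train.py",            # hypothetical training script
        "role": role_arn,
        "framework_version": "2.0.1",         # assumed PyTorch container version
        "py_version": "py310",
        "instance_type": "ml.p4d.24xlarge",   # assumed multi-GPU instance type
        "instance_count": instance_count,
        # The extra configuration that turns on PyTorch DDP:
        "distribution": {"pytorchddp": {"enabled": True}},
    }


def build_estimator(role_arn: str):
    """Create the estimator; requires the sagemaker SDK and AWS credentials."""
    from sagemaker.pytorch import PyTorch  # imported lazily so the sketch loads without the SDK

    return PyTorch(**ddp_estimator_kwargs(role_arn))
```

Calling `build_estimator(role).fit({"training": s3_uri})` would then launch the job, where `role` and `s3_uri` are your own IAM role ARN and training data location.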
If you use LLMs for traditional ML problems such as sentiment analysis or classification in general, you can still use traditional ML performance metrics such as accuracy, precision, and recall. However, depending on the task, evaluating an LLM's performance may call for different metrics. Some of the well-known ones are:
Traditional hosting solutions used for smaller models do not provide the optimizations required to host LLMs with optimal inference latency and/or throughput. With SageMaker AI you can use large model inference container images with Deep Java Library (DJL) Serving. SageMaker AI supports model parallelism and inference optimization libraries, including:
Additionally, you can host LLMs on SageMaker AI using Hugging Face LLM Deep Learning Container (DLC), enabling high-performance text generation through tensor parallelism, dynamic batching, and model quantization.
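The sketch below shows how such a deployment is commonly configured through container environment variables: `SM_NUM_GPUS` for tensor parallelism and `HF_MODEL_QUANTIZE` for quantization. The model ID, token limits, and instance type are placeholders assumed for illustration, and the environment variable names reflect the Hugging Face LLM container as I understand it, so verify them against the container version you deploy.

```python
import json
from typing import Dict, Optional


def tgi_env(
    model_id: str,
    num_gpus: int,
    max_input_tokens: int = 1024,
    max_total_tokens: int = 2048,
    quantize: Optional[str] = None,
) -> Dict[str, str]:
    """Environment for the Hugging Face LLM container (all values must be strings)."""
    if max_input_tokens >= max_total_tokens:
        raise ValueError("max_input_tokens must be smaller than max_total_tokens")
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),  # number of GPUs to shard the model across
        "MAX_INPUT_LENGTH": json.dumps(max_input_tokens),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }
    if quantize:
        env["HF_MODEL_QUANTIZE"] = quantize  # e.g. "bitsandbytes"
    return env


def deploy(role_arn: str, env: Dict[str, str], instance_type: str = "ml.g5.12xlarge"):
    """Deploy an endpoint; requires the sagemaker SDK and AWS credentials."""
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    model = HuggingFaceModel(
        role=role_arn,
        image_uri=get_huggingface_llm_image_uri("huggingface"),
        env=env,
    )
    return model.deploy(initial_instance_count=1, instance_type=instance_type)


# "my-org/my-model" is a hypothetical model ID.
env = tgi_env("my-org/my-model", num_gpus=4)
```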
You should closely monitor prompts, generated responses, and RAG performance in production, along with data quality, model quality, infrastructure utilization, and endpoint latency and throughput.
It is also important to compare these measurements to a baseline over time to detect any drift or eventual degradation in the quality of the generated responses.
Some of the proposed metrics to monitor prompts/responses are context, relevance to prompt, existence of pattern (repetitiveness), repeatability, readability, token size, jailbreaking, injection, refusal, sentiment, toxicity, and response hallucination. For RAG it is recommended to monitor chunk size, produced embeddings, embedding speed, and semantic search results. All of these aspects need to be monitored over time to facilitate timely remediation through retraining.
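A few of these signals can be computed with very little code. The sketch below derives a word count (a crude proxy for token size, not the model's tokenizer), a repetitiveness ratio based on repeated n-grams, and a keyword-based refusal flag, then checks a value against a baseline. The marker list, the n-gram size, and the tolerance are illustrative assumptions; production monitoring would typically use richer detectors.

```python
import re
from collections import Counter
from typing import Dict, Union

# Illustrative phrases only; a real refusal detector needs a broader approach.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def response_metrics(response: str, n: int = 3) -> Dict[str, Union[int, float, bool]]:
    lowered = response.lower()
    tokens = re.findall(r"\w+", lowered)
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Every occurrence of an n-gram beyond its first counts as a repeat.
    repeats = sum(count - 1 for count in counts.values())
    return {
        "token_count": len(tokens),
        "repetition_ratio": repeats / len(ngrams) if ngrams else 0.0,
        "refusal": any(marker in lowered for marker in REFUSAL_MARKERS),
    }


def drifted(current: float, baseline: float, tolerance: float = 0.1) -> bool:
    """Flag a metric that has moved away from its baseline by more than `tolerance`."""
    return abs(current - baseline) > tolerance


looping = response_metrics("the cat sat the cat sat the cat sat")
refused = response_metrics("I cannot help with that.")
```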
It is also possible to implement a shadow testing or A/B testing pipeline to compare different prompts, embeddings, and/or LLMs over time and choose the best one for each use case. SageMaker AI provides these features out of the box, and you can also combine them with open source tools such as MLflow to get the best of both.
In many cases where automated monitoring is not possible, or a relevant benchmark is not available for your specific domain, you can use the human-in-the-loop capabilities of SageMaker AI through SageMaker Augmented AI (A2I) to implement human review of LLM predictions in production.
If we add all the above components to our existing MLOps solution, we arrive at the following high-level architecture, which serves as an end-to-end solution to automate and operationalize LLMs. Of course, AWS is always releasing new services, so you should expect this architecture to change slightly after new releases.
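For the A/B testing case, a SageMaker endpoint configuration can split traffic between two models through weighted production variants. The sketch below only builds the request payload; the model names, endpoint configuration name, instance type, and 90/10 split are assumptions for illustration.

```python
from typing import Any, Dict


def ab_endpoint_config(
    config_name: str,
    model_a: str,
    model_b: str,
    weight_a: float = 0.9,
    instance_type: str = "ml.g5.2xlarge",
) -> Dict[str, Any]:
    """Payload for the SageMaker CreateEndpointConfig API with two weighted variants."""

    def variant(name: str, model_name: str, weight: float) -> Dict[str, Any]:
        return {
            "VariantName": name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "InitialVariantWeight": weight,  # share of traffic relative to other variants
        }

    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            variant("VariantA", model_a, weight_a),
            variant("VariantB", model_b, round(1.0 - weight_a, 4)),
        ],
    }


# Hypothetical model names registered in SageMaker beforehand.
config = ab_endpoint_config("llm-ab-test", "llm-baseline", "llm-candidate")
```

With credentials in place, the payload could be submitted with `boto3.client("sagemaker").create_endpoint_config(**config)`. Shadow testing follows a similar shape, with the candidate listed under `ShadowProductionVariants` so that it receives a copy of traffic without serving responses.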
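A common pattern is to send only low-confidence outputs to reviewers. The sketch below starts an A2I human loop when an automated quality score falls under a threshold. The score, the threshold, the loop naming scheme, and the input payload fields are assumptions for illustration; the flow definition must already exist, and the client is injected so the logic can be tested without AWS access.

```python
import json
import uuid
from typing import Any, Optional


def maybe_start_human_review(
    a2i_client: Any,
    flow_definition_arn: str,
    prompt: str,
    response: str,
    score: float,
    threshold: float = 0.7,
) -> Optional[str]:
    """Start a human review loop for low-scoring responses.

    `a2i_client` is expected to behave like boto3.client("sagemaker-a2i-runtime").
    Returns the human loop name, or None when no review is needed.
    """
    if score >= threshold:
        return None
    loop_name = f"llm-review-{uuid.uuid4().hex[:16]}"
    a2i_client.start_human_loop(
        HumanLoopName=loop_name,
        FlowDefinitionArn=flow_definition_arn,
        HumanLoopInput={
            # Hypothetical payload; the fields must match your worker task template.
            "InputContent": json.dumps({"prompt": prompt, "response": response, "score": score})
        },
    )
    return loop_name
```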
If you’d like to turn your generative AI vision into reality with strategy workshops, rapid prototyping, and production-ready projects leveraging Caylent’s deep AWS MLOps and GenAI experience, get in touch with our Artificial Intelligence & MLOps experts to see how we can help you design and implement scalable and future-proof solutions. For some examples, take a look at our Generative AI offerings.
Image 1: Data Preparation for LLM
Image 2: Prompt Catalog
Image 3: LLM Tuning Pipeline
Image 4: Retrieval Augmented Generation (RAG)
Image 5: Distributed Training on SageMaker
Image 6: Human in the loop - SageMaker A2I
Image 7: LLMOps Reference Architecture on AWS