Prompt Engineering and Prompt Catalogs
The simplest and cheapest way to harness the capabilities of an LLM in your GenAI application is in-context learning: providing instructions and demonstrations at inference time. Depending on how many demonstrations are included, this is called "zero-shot", "one-shot", or "few-shot" learning, as proposed in the paper "Language Models are Few-Shot Learners".
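As a minimal illustration, a few-shot prompt simply embeds a handful of demonstrations in the input itself; the reviews and labels below are hypothetical:

```python
# A minimal few-shot prompt: the model infers the task from the
# embedded demonstrations. Reviews and labels are hypothetical.
few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: "The battery lasts all day and charges quickly."
Sentiment: Positive

Review: "The screen cracked after a week of normal use."
Sentiment: Negative

Review: "Setup was effortless and the sound quality is superb."
Sentiment:"""
# Send `few_shot_prompt` to your LLM of choice; the completion
# should be the label for the final, unlabeled review.
```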
There are also more advanced techniques, such as chain-of-thought prompting, where you provide intermediate reasoning steps in the prompt so the language model can perform more sophisticated tasks. As the number of projects utilizing LLMs in an organization grows, we recommend cataloging and versioning all engineered prompts for repeatability. This makes it simple to reuse prompts across projects and, more importantly, to monitor them for any drift over time.
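A prompt catalog can be as simple as versioned, parameterized templates kept under source control. The schema below is one possible convention, not a prescribed format:

```python
# A hypothetical prompt-catalog entry: versioned, parameterized templates
# stored under source control so every project reuses the same tested prompt.
PROMPT_CATALOG = {
    "support-ticket-summary": {
        "version": "1.2.0",
        "template": (
            "You are a support analyst. Summarize the ticket below in "
            "{max_sentences} sentences, preserving error codes verbatim.\n\n"
            "Ticket:\n{ticket_text}"
        ),
        "owner": "ml-platform-team",
    }
}

def render_prompt(name: str, **kwargs) -> str:
    """Fetch a cataloged template by name and fill in its parameters."""
    return PROMPT_CATALOG[name]["template"].format(**kwargs)
```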
Model Fine-Tuning
Although prompt engineering is the most straightforward approach to implement, it can only provide initial instructions to the model. Moreover, all LLMs are constrained by the number of tokens you can pass to them at inference time, and long prompts increase the total cost of ownership (TCO) if you use a paid per-token service. The next logical step is to fine-tune an LLM on a curated dataset to make it more domain-specific.
Depending on the availability of labeled data and computational power, you have two options: full fine-tuning or parameter-efficient fine-tuning (PEFT). Fully fine-tuning the foundation model (FM) requires thousands of examples and high computational power to tune all the weights and biases of the selected model. PEFT, on the other hand, can be performed with tens of examples and far less computational power, making it a more cost-effective option. You can use LLMs available from leading AI startups through APIs on Amazon Bedrock, LLMs maintained on SageMaker JumpStart, or any open-source or proprietary LLM on SageMaker AI to train or fine-tune on custom datasets. If you prefer to keep full control of the training pipeline and do not need the out-of-the-box capabilities provided by SageMaker AI, such as data capture, model monitoring, and auto scaling, you may choose to run the workload on Amazon EKS or Amazon EC2 DL1 instances, a low cost-to-train option for deep learning models.
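To make the PEFT idea concrete, here is a sketch using the Hugging Face `peft` and `transformers` libraries to attach LoRA adapters to a small model; the model name, target modules, and hyperparameters are illustrative, not recommendations:

```python
# Sketch of parameter-efficient fine-tuning (LoRA) with Hugging Face
# `peft` and `transformers`; model and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in FM
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the updates
    target_modules=["c_attn"],  # modules receiving adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
# Train `model` with your usual training loop on the curated dataset.
```

Only the small adapter matrices are trained while the base weights stay frozen, which is what keeps the compute and data requirements low.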
Retrieval Augmented Generation (RAG)
If you need the LLM to generate responses that leverage additional knowledge present in your proprietary data, then RAG is the solution. This technique uses semantic search to retrieve relevant and timely context that augments the LLM's responses for more accurate results. It typically involves creating embeddings and setting up a vector store.
On AWS there are several options for the vector store, including pgvector in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, Amazon OpenSearch Serverless vector store, and Amazon Kendra. The benefit of using Amazon Kendra is that it can create embeddings as well as perform semantic search, so you are not dependent on a separate FM for embedding. This makes your GenAI application design much simpler and more modular.
It is also worth noting that, compared to fine-tuning, RAG offers higher flexibility and is more cost-effective. Many solutions use a combination of fine-tuning and RAG.
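The core retrieve-then-generate loop is small. Below is a minimal sketch using cosine similarity over precomputed chunk embeddings; the `embed` and `generate` callables are hypothetical stand-ins for your embedding model and LLM endpoint (for example, Bedrock or a SageMaker endpoint):

```python
# Minimal RAG sketch: embed the query, retrieve the closest document
# chunks, and prepend them to the prompt. `embed` and `generate` are
# stand-ins for your embedding model and LLM endpoint.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query: str, chunks: list[str], chunk_vecs: list[np.ndarray],
             embed, top_k: int = 3) -> list[str]:
    """Return the top_k chunks most semantically similar to the query."""
    q = embed(query)
    scored = sorted(zip(chunks, chunk_vecs),
                    key=lambda cv: cosine(q, cv[1]), reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]

def answer(query: str, chunks, chunk_vecs, embed, generate) -> str:
    context = "\n\n".join(retrieve(query, chunks, chunk_vecs, embed))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

In production, the brute-force similarity scan would be replaced by a query against one of the vector stores listed above.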
Processing Power
Due to the size of large language models and the large datasets required to train or fine-tune them, GPUs are typically required to fit the model or the data in instance memory. AWS offers options beyond GPUs, such as AWS Trainium, a machine learning (ML) accelerator purpose-built for deep learning training of 100B+ parameter models. You can further leverage AWS custom silicon for inference with AWS Inferentia2, which provides high performance at the lowest cost for deep learning inference.
In many cases, distributing the model or data across several GPUs is required. SageMaker AI supports distributed training on single and multiple instances out of the box. It also supports other frameworks and packages, such as PyTorch DistributedDataParallel (DDP), for which all you need to do is add extra configuration to your training estimator, as illustrated in the code snippet below.
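A sketch of that configuration with the SageMaker Python SDK follows; the entry point, IAM role, instance type, and S3 URI are placeholders, and the `distribution` argument is the only addition versus a single-GPU job:

```python
# Sketch of enabling PyTorch DDP in a SageMaker training job.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",          # your existing training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    framework_version="2.0",
    py_version="py310",
    instance_count=2,                # data parallelism across two instances
    instance_type="ml.p4d.24xlarge",
    distribution={"pytorchddp": {"enabled": True}},  # turn on the DDP launcher
)
estimator.fit({"training": "s3://my-bucket/train/"})  # placeholder S3 URI
```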
When using the data or model parallelism features of SageMaker AI, it is recommended to use instances with high-throughput, low-latency networking between nodes.
Evaluation
If you use LLMs for traditional ML problems such as sentiment analysis or classification in general, you can still use traditional ML performance metrics such as accuracy, precision, and recall. However, evaluating an LLM's generative performance calls for different metrics depending on the task. Some of the well-known ones are listed below (a short scoring sketch follows the list):
- BLEU for machine translation, which compares the precision of n-grams between the reference and the generated output.
- METEOR for machine translation, which is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
- ROUGE for summarization, which considers precision, recall, and F1-score over a sequence.
- Perplexity for text generation, which measures how well the trained model has learned the distribution of the text.
- BERTScore for text generation, which measures embedding similarity.
- CIDEr, which measures the similarity between a generated caption for an image and the reference captions.
- SPICE, which evaluates a generated image caption by placing importance on capturing details about objects, attributes, and relationships (semantic propositions).
- Using larger models with more parameters to evaluate responses generated by smaller language models (LLM-as-a-judge).
- Universal benchmarks covering a range of tasks: machine translation, summarization, natural language generation, question answering, and more. Some of the well-known benchmarks are Pile, GLUE, SuperGLUE, MMLU, LAMBADA, and the BIG-bench collaboration benchmarks.
- Code-specific benchmarks, such as HumanEval and Mostly Basic Python Programming (MBPP).
- Human in the loop (HITL), where human evaluators provide feedback on the quality of the text generated by the model.
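Several of these metrics are available off the shelf. As one example, here is a sketch of scoring a generated summary with ROUGE via the Hugging Face `evaluate` library (which requires the `evaluate` and `rouge_score` packages); the texts are toy examples:

```python
# Scoring a generated summary against a reference with ROUGE.
import evaluate

rouge = evaluate.load("rouge")
predictions = ["The cat sat on the mat all afternoon."]
references = ["A cat was sitting on the mat for the whole afternoon."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```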
Hosting Options
Traditional hosting solutions used for smaller models do not provide the optimization functionality required to host LLMs with optimal inference latency and/or throughput. With SageMaker AI you can use large model inference container images with Deep Java Library (DJL) Serving. SageMaker AI supports model parallelism and inference optimization libraries, including:
- DeepSpeed, an open-source inference optimization library
- Hugging Face Accelerate, an open-source model parallel inference library
- FasterTransformer, an NVIDIA open-source library for efficiently running transformer-based neural network inference
Additionally, you can host LLMs on SageMaker AI using the Hugging Face LLM Deep Learning Container (DLC), which enables high-performance text generation through tensor parallelism, dynamic batching, and model quantization.
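A deployment with the Hugging Face LLM DLC can look like the sketch below; the model ID, instance type, and IAM role are illustrative placeholders:

```python
# Sketch of deploying an open LLM with the Hugging Face LLM DLC on SageMaker.
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

llm_image = get_huggingface_llm_image_uri("huggingface")  # resolves the LLM container

model = HuggingFaceModel(
    image_uri=llm_image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    env={
        "HF_MODEL_ID": "tiiuae/falcon-7b-instruct",  # example model ID
        "SM_NUM_GPUS": "1",                          # tensor parallel degree
    },
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
print(predictor.predict({"inputs": "What is LLMOps?"}))
```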
Monitoring
You should closely monitor prompts, generated responses, and RAG performance in production, along with data quality, model quality, infrastructure utilization, and endpoint latency and throughput.
It is also important to compare these capabilities to a baseline over time to catch any drift or gradual degradation in the quality of the generated responses.
Monitoring metrics
Some of the proposed metrics for monitoring prompts and responses are context, relevance to the prompt, pattern existence (repetitiveness), repeatability, readability, token size, jailbreaking, injection, refusal, sentiment, toxicity, and hallucination. For RAG, it is recommended to monitor chunk size, the produced embeddings, embedding speed, and semantic search results. All of these aspects need to be monitored over time to enable timely remediation through retraining.
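Several of these signals can be computed cheaply per request. Here is a lightweight sketch that derives a few of them and emits a JSON record for your observability stack; the refusal pattern and token approximation are illustrative stand-ins for production-grade detectors:

```python
# Lightweight per-response monitoring: compute simple prompt/response
# metrics and emit them as JSON. The refusal regex is a toy heuristic.
import json
import re
import time

REFUSAL_PATTERN = re.compile(r"\b(I can't|I cannot|as an AI)\b", re.IGNORECASE)

def response_metrics(prompt: str, response: str) -> dict:
    words = response.split()
    return {
        "timestamp": time.time(),
        "prompt_tokens_approx": len(prompt.split()),   # rough token proxy
        "response_tokens_approx": len(words),
        "repetitiveness": 1 - len(set(words)) / max(len(words), 1),
        "refusal": bool(REFUSAL_PATTERN.search(response)),
    }

print(json.dumps(response_metrics("Summarize...", "I cannot help with that.")))
```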
Testing strategies
It is also possible to implement a shadow testing or A/B testing pipeline to compare different prompts, embeddings, and/or LLMs over time and choose the best one for each use case. SageMaker AI provides these features out of the box, and you can combine them with open-source tools such as MLflow to get the best of both.
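As one way to set this up, an A/B split between two model variants can be expressed in a SageMaker endpoint configuration; the names, model references, and weights below are placeholders:

```python
# Sketch of an A/B traffic split between two variants on one endpoint.
import boto3

sm = boto3.client("sagemaker")
sm.create_endpoint_config(
    EndpointConfigName="llm-ab-test-config",
    ProductionVariants=[
        {"VariantName": "variant-a", "ModelName": "llm-model-a",
         "InstanceType": "ml.g5.2xlarge", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.8},   # 80% of traffic
        {"VariantName": "variant-b", "ModelName": "llm-model-b",
         "InstanceType": "ml.g5.2xlarge", "InitialInstanceCount": 1,
         "InitialVariantWeight": 0.2},   # 20% of traffic
    ],
)
```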
SageMaker A2I
In many cases where automated monitoring is not possible, or no relevant benchmark is available for your specific domain, you can use the human-in-the-loop capabilities of SageMaker AI through Amazon SageMaker Augmented AI (A2I) to implement human review of LLM predictions in production.
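Assuming you have already created an A2I flow definition, routing an LLM response to human reviewers is a single API call; the flow definition ARN and payload schema below are placeholders:

```python
# Sketch of sending an LLM response to human reviewers via SageMaker A2I.
import json
import uuid

import boto3

a2i = boto3.client("sagemaker-a2i-runtime")
a2i.start_human_loop(
    HumanLoopName=f"llm-review-{uuid.uuid4().hex[:8]}",
    FlowDefinitionArn=(
        "arn:aws:sagemaker:us-east-1:123456789012:"
        "flow-definition/llm-review"  # placeholder flow definition
    ),
    HumanLoopInput={"InputContent": json.dumps({
        "prompt": "Summarize the contract...",
        "response": "The contract states...",
    })},
)
```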
Final LLMOps Solution Architecture
If we add all of the above components to our existing MLOps solution, it turns into the following high-level architecture, which serves as an end-to-end solution for automating and operationalizing LLMs. Of course, AWS is always releasing new services, so you should expect this architecture to change slightly after each release.
The Caylent approach to Generative AI
If you’d like to turn your generative AI vision into reality with strategy workshops, rapid prototyping, and production-ready projects leveraging Caylent’s deep AWS MLOps and GenAI experience, get in touch with our Artificial Intelligence & MLOps experts to see how we can help you design and implement scalable and future-proof solutions. For some examples, take a look at our Generative AI offerings.