Prompt Engineering and Prompt Catalogs
The simplest and cheapest way to harness the capabilities of an LLM in your Gen AI application is “in-context learning”, which provides instructions to the model at inference time. Depending on how many demonstrations are included in the prompt, the technique is called “zero-shot”, “one-shot”, or “few-shot” learning, as proposed in the paper “Language Models are Few-Shot Learners”. There are also more advanced techniques such as chain-of-thought prompting, in which intermediate reasoning steps are included in the prompt so that the model can perform sophisticated tasks that require reasoning. As the number of projects using LLMs in an organization grows, it is recommended to catalog and version all engineered prompts for repeatability. This makes it simple to reuse prompts in additional projects and, more importantly, to monitor them for drift over time.
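As a rough illustration of few-shot prompting combined with a versioned prompt catalog, the sketch below keeps prompt templates in memory and fills one with demonstrations. The class names (`PromptVersion`, `PromptCatalog`), the helper `build_few_shot_prompt`, and the sentiment template are all hypothetical and chosen for this example only; a production catalog would persist its entries in a database or object store.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PromptVersion:
    """One immutable, versioned prompt template (hypothetical structure)."""
    name: str
    version: int
    template: str  # str.format placeholders: {examples} and {text}


@dataclass
class PromptCatalog:
    """In-memory catalog; every registration creates a new version."""
    _store: Dict[str, List[PromptVersion]] = field(default_factory=dict)

    def register(self, name: str, template: str) -> PromptVersion:
        versions = self._store.setdefault(name, [])
        entry = PromptVersion(name, len(versions) + 1, template)
        versions.append(entry)
        return entry

    def get(self, name: str, version: Optional[int] = None) -> PromptVersion:
        versions = self._store[name]
        # Latest version by default, otherwise the requested 1-based version.
        return versions[-1] if version is None else versions[version - 1]


def build_few_shot_prompt(entry: PromptVersion,
                          examples: List[Tuple[str, str]],
                          text: str) -> str:
    """Zero examples gives a zero-shot prompt, one gives one-shot, more give few-shot."""
    shots = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)
    return entry.template.format(examples=shots, text=text)


catalog = PromptCatalog()
catalog.register(
    "sentiment",
    "Classify the sentiment of each review as positive or negative.\n"
    "{examples}\nReview: {text}\nSentiment:",
)
prompt = build_few_shot_prompt(
    catalog.get("sentiment"),
    [("Great battery life", "positive"), ("Stopped working in a week", "negative")],
    "Fast shipping and works well",
)
print(prompt)
```

Because each registration creates a new version instead of overwriting the old one, a response observed in production can be traced back to the exact prompt text that produced it.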
Model Fine-Tuning
Although prompt engineering is comparatively the easiest approach to implement, it can only provide initial instructions to the model. Moreover, every LLM limits the number of tokens that can be passed to it at inference time, and long prompts increase the total cost of ownership (TCO) when a paid, per-token service is used. The next logical step is to fine-tune an LLM on a curated dataset to make it more domain specific. Depending on the availability of labeled data and computational power, you have two options: full fine-tuning or parameter-efficient fine-tuning (PEFT). Full fine-tuning of the foundation model (FM) requires thousands of examples and substantial computational power to tune all the weights and biases of the selected model. PEFT, on the other hand, can be performed with as few as tens of examples and far less computational power, making it a more cost-effective option. Some of the common PEFT techniques include:
- Low-Rank Adaptation (LoRA), which reduces the number of trainable parameters for downstream tasks by freezing the pre-trained model weights and training only small low-rank decomposition matrices injected alongside the frozen weight matrices of the transformer layers.
- QLoRA, which extends LoRA by quantizing the frozen pre-trained weights from high-resolution data types such as Float32 to lower-resolution 4-bit data types, reducing memory demand.
- Prompt-tuning / prefix-tuning, which prepend trainable soft-prompt vectors to the input embeddings (prompt-tuning) or to the hidden states of the transformer layers (prefix-tuning) and train only the added prompt tokens while keeping the pre-trained LLM frozen.
- LLaMA-Adapter, a modified version of prefix-tuning in which soft prompts are added at the N top-most transformer layers, and the parameters that gate these prompts in the attention mechanism are initialized to zero instead of at random to avoid “corruption” of LLaMA’s original knowledge.
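To make the parameter savings of LoRA concrete, here is a minimal pure-Python sketch of a single linear layer with a low-rank update. It is not taken from any library: the class `LoRALinear`, the layer sizes, and the random stand-in weights are assumptions for illustration. The frozen weight `W` stays fixed, and only the two small factors `A` and `B` would be trained.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Linear layer y = (W + (alpha / rank) * B @ A) x with W frozen (illustrative)."""

    def __init__(self, d_in, d_out, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        # Frozen pre-trained weight; random values stand in for real weights.
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # Trainable low-rank factors. B starts at zero, so the adapted layer
        # initially behaves exactly like the frozen layer.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scaling = alpha / rank

    def effective_weight(self):
        delta = matmul(self.B, self.A)  # shape: d_out x d_in
        return [[w + self.scaling * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.W, delta)]

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_parameters(self):
        return len(self.W) * len(self.W[0])


layer = LoRALinear(d_in=64, d_out=64, rank=4)
print(layer.trainable_parameters(), "trainable vs", layer.frozen_parameters(), "frozen")
```

For a 64 x 64 layer at rank 4, the two factors hold 4 x 64 + 64 x 4 = 512 values against 4,096 frozen ones, and the gap widens as the layer grows because the factor sizes scale linearly with the layer dimensions rather than quadratically.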
You can use LLMs from leading AI startups through APIs on Amazon Bedrock, LLMs maintained in SageMaker JumpStart, or any open-source or proprietary LLM on SageMaker to train or fine-tune on custom datasets. If you prefer to keep full control of the training pipeline and do not need the out-of-the-box capabilities provided by SageMaker, such as data capture, model monitoring, and auto scaling, you may choose to run the workload on Amazon EKS or on Amazon EC2 DL1 instances, a low cost-to-train option for deep learning models.
Retrieval Augmented Generation (RAG)
If you need an LLM to create responses that leverage additional context from proprietary data, then RAG is the solution. This technique uses semantic search to retrieve relevant and timely context, which is added to the prompt so that the LLM produces more accurate responses. It typically involves creating embeddings and setting up vector stores. On AWS there are several options for the vector store, including pgvector in Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL, the Amazon OpenSearch Serverless vector store, and Amazon Kendra. The benefit of Amazon Kendra is that it can both create embeddings and perform semantic search, so you do not depend on a separate FM for embedding. This makes your Gen AI application design much simpler and more modular. Compared to fine-tuning, RAG is also more flexible and more cost effective, because the knowledge source can be updated without retraining the model. Many solutions use a combination of fine-tuning and RAG.
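The retrieve-then-augment flow can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `VectorStore` is a hypothetical in-memory class and not one of the AWS services named above, and the documents are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase bag-of-words counts (stand-in for an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class VectorStore:
    """Hypothetical in-memory store holding (chunk, embedding) pairs."""

    def __init__(self):
        self.items = []

    def add(self, chunk):
        self.items.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_rag_prompt(question, store, k=2):
    """Augment the question with the k most similar chunks before calling the LLM."""
    context = "\n".join(f"- {chunk}" for chunk in store.search(question, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


store = VectorStore()
store.add("refunds are processed within five business days")
store.add("the warehouse ships orders every weekday morning")
store.add("support is available by email and chat")
print(build_rag_prompt("how long do refunds take", store, k=1))
```

Swapping `embed` for a real embedding model and `VectorStore` for a managed vector store leaves `build_rag_prompt` unchanged, which is the modularity the paragraph above refers to.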
Processing Power
Because of the size of large language models and the large datasets required to train or fine-tune them, GPUs are typically required to fit the model or the data in the memory of the instance. AWS offers additional options beyond GPUs, such as AWS Trainium, a machine learning (ML) accelerator purpose-built for deep learning training of 100B+ parameter models. You can further leverage AWS’s custom silicon for inference with AWS Inferentia2, which provides high performance at the lowest cost for deep learning inference.
In many cases the model or the data must be distributed across several GPUs. SageMaker supports distributed training on single and multiple instances out of the box. It also supports other frameworks and packages such as PyTorch DistributedDataParallel (DDP), for which all you need to do is add a distribution configuration to your training estimator.
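A minimal sketch of such an estimator configuration follows. The `distribution={"pytorchddp": {"enabled": True}}` setting is the option the SageMaker Python SDK exposes for PyTorch DDP. Everything else (script name, role placeholder, framework version, instance type and count) is an assumption chosen for illustration and should be replaced with your own values. The dictionary is built without importing the SDK so the snippet runs anywhere; the commented lines show how it would be passed to the estimator.

```python
# Keyword arguments for a SageMaker PyTorch estimator that launches the
# training script with PyTorch DDP across all GPUs of all instances.
estimator_kwargs = {
    "entry_point": "train.py",                      # hypothetical training script
    "role": "<your-sagemaker-execution-role-arn>",  # placeholder, not a real ARN
    "framework_version": "2.0.1",                   # assumed PyTorch version
    "py_version": "py310",                          # assumed Python version
    "instance_type": "ml.p4d.24xlarge",             # assumed instance type with 8 GPUs
    "instance_count": 2,                            # assumed cluster size
    # The additional configuration that enables PyTorch DDP.
    "distribution": {"pytorchddp": {"enabled": True}},
}


def total_gpu_workers(instance_count, gpus_per_instance):
    """Number of DDP processes (one per GPU) the job would start."""
    return instance_count * gpus_per_instance


print(total_gpu_workers(estimator_kwargs["instance_count"], 8), "DDP workers")

# With the SageMaker Python SDK installed and AWS credentials configured:
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(**estimator_kwargs)
#   estimator.fit({"training": "s3://<your-bucket>/<prefix>"})
```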
When using the data or model parallelism features in SageMaker, it is recommended to use instances with high-throughput, low-latency networking.
Evaluation
If you are using LLMs for traditional ML problems such as sentiment analysis, or classification in general, you can still use traditional ML performance metrics such as accuracy, precision, and recall. For other tasks, however, the appropriate metric depends on what the LLM is used for. Some of the well-known ones are:
- BLEU for machine translation, which measures the precision of n-grams in the generated output against the reference translations.
- METEOR for machine translation, which is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision.
- ROUGE for summarization, which considers precision, recall, and F1-score of n-gram and subsequence overlap with the reference summaries.
- Perplexity for text generation, which measures how well the trained model has learned the distribution of the text (lower is better).
- BERTScore for text generation, which compares generated and reference text by the similarity of their embeddings.
- CIDEr, which measures the similarity between a generated caption for an image and the reference captions.
- SPICE, which evaluates a caption generated for an image by placing importance on capturing details about objects, attributes, and relationships (semantic propositions).
- Using larger models with more parameters to evaluate responses generated by smaller language models.
- Universal benchmarks for a range of tasks: machine translation, summarization, natural language generation, question answering, and more. Some of the well-known benchmarks are the Pile, GLUE, SuperGLUE, MMLU, LAMBADA, and BIG-bench.
- Code-specific benchmarks: HumanEval and Mostly Basic Python Problems (MBPP).
- Human in the loop (HITL), where human evaluators provide feedback on the quality of the text generated by the model.
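Two of the metrics above are simple enough to sketch directly. The first function computes clipped n-gram precision, the building block of BLEU. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, which this sketch omits. The second computes perplexity from the probabilities a model assigned to each observed token. The function names and the whitespace tokenization are simplifying assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n=1):
    """Share of candidate n-grams found in the reference, with each n-gram
    credited at most as often as it appears in the reference."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def perplexity(token_probabilities):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(nll)


print(clipped_ngram_precision("the the the cat", "the cat sat"))
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Clipping is what keeps a degenerate output such as a repeated common word from scoring well: in the first call only one of the three occurrences of "the" is credited.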
Hosting Options
Traditional hosting solutions used for smaller models do not provide the optimizations required to host LLMs with optimal inference latency and/or throughput. With SageMaker you can use large model inference container images with Deep Java Library (DJL) Serving. SageMaker supports model parallelism and inference optimization libraries in these containers.
Additionally, you can host LLMs on SageMaker using the Hugging Face LLM Deep Learning Container (DLC), which enables high-performance text generation through tensor parallelism, dynamic batching, and model quantization.
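As a hedged sketch of how the Hugging Face LLM DLC is typically configured, the helper below assembles the container environment variables that control the model, the number of GPUs used for tensor parallelism, the token limits, and optional quantization. The helper name `build_llm_container_env`, the model ID placeholder, and the numeric values are assumptions for this example. The dictionary is built without the SageMaker SDK so the snippet runs on its own; the commented lines show where it would be used.

```python
import json


def build_llm_container_env(model_id, num_gpus, max_input_length,
                            max_total_tokens, quantize=None):
    """Environment variables for the Hugging Face LLM container (values are strings)."""
    if max_input_length >= max_total_tokens:
        raise ValueError("max_input_length must be smaller than max_total_tokens")
    env = {
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": json.dumps(num_gpus),            # GPUs used for tensor parallelism
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),
    }
    if quantize is not None:
        env["HF_MODEL_QUANTIZE"] = quantize             # for example "bitsandbytes"
    return env


env = build_llm_container_env(
    model_id="<hub-organization>/<model-name>",  # placeholder model ID
    num_gpus=4,
    max_input_length=1024,
    max_total_tokens=2048,
)
print(env)

# With the SageMaker Python SDK installed and AWS credentials configured:
#   from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
#   image_uri = get_huggingface_llm_image_uri("huggingface")
#   model = HuggingFaceModel(role="<your-role-arn>", image_uri=image_uri, env=env)
#   predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.12xlarge")
```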
Monitoring
Prompts, generated responses, and RAG performance need to be closely monitored in production, along with data quality, model quality, infrastructure utilization, and endpoint latency and throughput.
It is also important to compare these measurements to a baseline over time to detect any drift or eventual degradation in the quality of the generated responses.
Some of the proposed metrics for monitoring prompts and responses are: context, relevance to prompt, existence of patterns (repetitiveness), repeatability, readability, token size, jailbreaking, injection, refusal, sentiment, toxicity, and response hallucination. For RAG, it is recommended to monitor chunk size, produced embeddings, embedding speed, and semantic search results. All of these aspects need to be monitored over time to facilitate timely remediation through retraining.
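A few of these response metrics can be approximated with simple heuristics, as in the sketch below. The function names, the refusal phrases, and the use of whitespace-separated words as a proxy for token size are all assumptions made for illustration. A production monitor would use the model's own tokenizer and dedicated classifiers for signals such as toxicity or hallucination.

```python
from collections import Counter
from statistics import mean

# Assumed phrases that signal a refusal; tune these for your own model.
REFUSAL_MARKERS = ("i cannot", "i can't", "i am unable", "i'm unable")


def response_metrics(response):
    """Cheap per-response signals: size, repetitiveness, and refusal."""
    lowered = response.lower()
    tokens = lowered.split()  # whitespace words as a proxy for tokens
    distinct = len(Counter(tokens))
    return {
        "token_size": len(tokens),
        # Fraction of tokens that repeat an earlier token in the same response.
        "repetitiveness": 1 - distinct / len(tokens) if tokens else 0.0,
        "refusal": any(marker in lowered for marker in REFUSAL_MARKERS),
    }


def relative_drift(baseline, current, metric):
    """Relative change of a metric's mean between a baseline and a current window."""
    base = mean(m[metric] for m in baseline)
    now = mean(m[metric] for m in current)
    if base == 0:
        return 0.0 if now == 0 else float("inf")
    return (now - base) / base


baseline = [response_metrics(r) for r in ("the order ships on monday", "your refund was issued today")]
current = [response_metrics(r) for r in ("ok ok ok ok ok ok ok ok ok ok", "I cannot help with that")]
print(relative_drift(baseline, current, "token_size"))
print(relative_drift(baseline, current, "repetitiveness"))
```

Comparing window means against a stored baseline is the simplest form of the drift check described above; an alert threshold on the returned value is left to the operator.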
It is also possible to implement a shadow testing or A/B testing pipeline to compare different prompts, embeddings, and/or LLMs over time and choose the best one for each use case. SageMaker provides these features out of the box, and they can be combined with open source tools such as MLflow to get the best of each.
In many cases where automated monitoring is not possible or a relevant benchmark is not available for your specific domain, you can use the human-in-the-loop capabilities of SageMaker through Amazon Augmented AI (Amazon A2I) to implement human review of LLM predictions in production.