Retrieval-Augmented Generation (RAG) is a great solution when you want your responses to include supplemental data that wasn't part of the LLM's original training data set, or when you want to include data that changes rapidly. One of the most common RAG use cases is incorporating an internal corporate knowledge base so the LLM can generate more targeted, customized responses.
If you are unfamiliar with GenAI, please view our earlier blog post that covers some of the basic terminology associated with this rapidly evolving field.
What is RAG?
Retrieval-Augmented Generation (RAG) is an AI framework that combines the power of large language models (LLMs) with external knowledge retrieval. RAG enhances the capabilities of traditional LLMs by allowing them to access and utilize up-to-date or domain-specific information that may not be part of their original training data.
Unlike traditional models that rely solely on their pre-trained knowledge, RAG systems can dynamically retrieve relevant information from external sources before generating responses. This approach significantly improves the accuracy, relevance, and timeliness of the model's outputs.
RAG is typically used for tasks that require access to specific, current, or proprietary information, such as:
- Question-answering systems with access to company knowledge bases
- Chatbots that can provide up-to-date product information
- Content generation tools that incorporate the latest industry trends
- Personalized recommendation systems that consider user-specific data
How does RAG work?
RAG operates by integrating a retrieval mechanism with a generative language model. The process typically involves three main steps: indexing, retrieval, and generation.
Indexing
Indexing is the first step in the RAG process. It involves collecting and organizing relevant documents or data sources, then breaking them down into smaller, manageable chunks. These chunks are then transformed into vector representations (embeddings) using techniques like word embeddings or sentence encoders. Finally, these embeddings are stored in a vector database or search engine for efficient retrieval. This step ensures that the external knowledge is properly structured and can be quickly accessed when needed, laying the groundwork for effective information retrieval in the later stages of the RAG process.
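The sketch below shows roughly what this step could look like with LangChain. The loader path, chunk sizes, OpenSearch endpoint, and index name are placeholders, and it assumes Amazon Bedrock's Titan embedding model; authentication to the collection is omitted for brevity.

```python
# Minimal indexing sketch: load documents, chunk them, embed the chunks,
# and store the embeddings in an OpenSearch Serverless index.
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_aws import BedrockEmbeddings
from langchain_community.vectorstores import OpenSearchVectorSearch

# Load the raw documents (path is a placeholder).
docs = DirectoryLoader("./knowledge_base", glob="**/*.txt").load()

# Break the documents into smaller, overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# Create vector representations (embeddings) for each chunk.
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

# Store the embeddings in the vector database for later retrieval.
# (SigV4 auth settings for the AOSS collection are omitted here.)
vector_store = OpenSearchVectorSearch.from_documents(
    chunks,
    embeddings,
    opensearch_url="https://<collection-id>.us-east-1.aoss.amazonaws.com",
    index_name="rag-demo-index",
)
```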
Retrieval
The retrieval phase occurs when a query or prompt is input into the system. During this stage, the input query is converted into a vector representation using the same embedding technique used for indexing. The system then performs a similarity search in the vector database to find the most relevant document chunks. A set number of the most similar chunks are retrieved, which will serve as additional context for the generation step. This process allows the system to dynamically pull relevant information based on the specific input, rather than relying solely on the model's pre-trained knowledge, ensuring that the most up-to-date and pertinent information is used in generating the response.
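Continuing the sketch above, retrieval is a similarity search against the same vector store; the example query and the choice of k (how many chunks to return) are illustrative.

```python
# Minimal retrieval sketch: embed the incoming query and pull back the
# most similar chunks from the vector store built during indexing.
query = "What is our company's parental leave policy?"

# The query is embedded with the same model used for indexing, then
# compared against the stored chunk embeddings.
relevant_chunks = vector_store.similarity_search(query, k=4)

for chunk in relevant_chunks:
    print(chunk.page_content[:200])
```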
Generation
The generation phase is where the LLM produces the final output. In this step, the original input query is combined with the retrieved relevant chunks, and this combined information is formatted into a prompt that the LLM can understand. This prompt is then passed to the LLM, which generates a response based on both its pre-trained knowledge and the additional context provided. Optionally, the generated response may undergo post-processing to ensure coherence and relevance. This step allows the model to produce more informed, accurate, and up-to-date responses by leveraging both its inherent knowledge and the retrieved information, resulting in outputs that are more contextually appropriate and factually current.
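Here is a minimal generation sketch building on the retrieval example. It assumes an Anthropic Claude model on Amazon Bedrock as the LLM, and the prompt wording is just one way to frame the retrieved context for the model.

```python
# Minimal generation sketch: stuff the retrieved chunks and the original
# question into a prompt and ask the LLM to answer from that context.
from langchain_aws import ChatBedrock
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0")

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# Combine the retrieved chunks into a single context string.
context = "\n\n".join(chunk.page_content for chunk in relevant_chunks)

# Chain the prompt, model, and output parser together and invoke it.
chain = prompt | llm | StrOutputParser()
answer = chain.invoke({"context": context, "question": query})
print(answer)
```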
What is LangChain?
LangChain is an open-source framework designed to simplify the development of applications using large language models (LLMs). It provides a set of tools and abstractions that make it easier to build complex AI applications, including those that use Retrieval-Augmented Generation (RAG).
LangChain is popular because it simplifies AI development and makes it more flexible. By abstracting away much of the complexity of working with LLMs and other AI tools, it helps developers focus on building applications. It also integrates easily with different LLMs, databases, and other tools, allowing developers to switch between components without rewriting large parts of their code. Additionally, LangChain provides a standardized approach that makes collaboration easier. With built-in features like prompt templating, chain of thought reasoning, and agent-based systems, it offers powerful tools to streamline AI workflows.
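As a quick taste of what that looks like, here is a small, hypothetical snippet that combines a prompt template with an LLM; swapping in a different provider only changes the line that constructs the model.

```python
# A small example of LangChain's prompt templating and chaining:
# the same chain works with any supported LLM provider.
from langchain_core.prompts import ChatPromptTemplate
from langchain_aws import ChatBedrock

prompt = ChatPromptTemplate.from_template("Summarize this in one sentence: {text}")
llm = ChatBedrock(model_id="anthropic.claude-3-sonnet-20240229-v1:0")

# Pipe the prompt into the model to form a reusable chain.
chain = prompt | llm
print(chain.invoke({"text": "LangChain abstracts away much of the LLM plumbing."}).content)
```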
By using LangChain, developers can more quickly and easily build sophisticated AI applications, including those that leverage RAG techniques.
How to build a RAG with LangChain on AWS
In this post, we will walk through an example showing how to build a RAG-based GenAI bot using OpenSearch Serverless as the vector store. Starting from the beginning, we will see how to index data into OpenSearch, how to query that data from OpenSearch, and how to pass the results to an LLM for a plain-text response.
Step 0 - Create an Amazon OpenSearch Serverless (AOSS) collection
For this example, we will use SAM/CloudFormation; below you will find a very basic template to create the OpenSearch Serverless collection. To create the collection, you need to create policies for data access, encryption, and network access. For simplicity, we are allowing our SSO user to access the collection from the internet and are using an AWS-supplied key for encryption.
Be sure to update the template below to match your naming convention and to grant access to the roles in your specific account.