
Using LangChain and OpenSearch for RAG on AWS

Generative AI & LLMOps

Learn how to build a RAG-based GenAI bot on AWS using LangChain, through our step-by-step example.

Retrieval Augmented Generation (RAG) is a great solution when you want your responses to include supplemental data that wasn’t part of the original LLM training data set, or when you want to include data that changes rapidly. Some of the most common RAG use cases involve bringing internal corporate knowledge bases into the LLM's context to generate more targeted, customized responses.

If you are unfamiliar with GenAI, please view our earlier blog post that covers some of the basic terminology associated with the rapidly evolving field. 

What is RAG?

Retrieval-Augmented Generation (RAG) is an AI framework that combines the power of large language models (LLMs) with external knowledge retrieval. RAG enhances the capabilities of traditional LLMs by allowing them to access and utilize up-to-date or domain-specific information that may not be part of their original training data.

Unlike traditional models that rely solely on their pre-trained knowledge, RAG systems can dynamically retrieve relevant information from external sources before generating responses. This approach significantly improves the accuracy, relevance, and timeliness of the model's outputs.

RAG systems are typically used for tasks that require access to specific, current, or proprietary information, such as:

  1. Question-answering systems with access to company knowledge bases
  2. Chatbots that can provide up-to-date product information
  3. Content generation tools that incorporate the latest industry trends
  4. Personalized recommendation systems that consider user-specific data

How does RAG work?

RAG operates by integrating a retrieval mechanism with a generative language model. The process typically involves three main steps: indexing, retrieval, and generation.

Indexing

Indexing is the first step in the RAG process. It involves collecting and organizing relevant documents or data sources, then breaking them down into smaller, manageable chunks. These chunks are then transformed into vector representations (embeddings) using techniques like word embeddings or sentence encoders. Finally, these embeddings are stored in a vector database or search engine for efficient retrieval. This step ensures that the external knowledge is properly structured and can be quickly accessed when needed, laying the groundwork for effective information retrieval in the later stages of the RAG process.
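
To make this concrete, here is a deliberately toy sketch of the indexing step in Python; the letter-frequency embed function and in-memory list are stand-ins for a real embedding model and vector database:

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model (e.g. text-embedding-ada-002):
    # a normalized 26-dimensional letter-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    # Overlapping chunks preserve context that would be lost at hard boundaries.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]


documents = ["...full text of a source document...", "...another document..."]
# The "vector database" here is just a list of (embedding, chunk) pairs;
# in the walkthrough below it will be an OpenSearch Serverless index.
vector_store = [(embed(c), c) for doc in documents for c in chunk_text(doc)]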

Retrieval

The retrieval phase occurs when a query or prompt is input into the system. During this stage, the input query is converted into a vector representation using the same embedding technique used for indexing. The system then performs a similarity search in the vector database to find the most relevant document chunks. A set number of the most similar chunks are retrieved, which will serve as additional context for the generation step. This process allows the system to dynamically pull relevant information based on the specific input, rather than relying solely on the model's pre-trained knowledge, ensuring that the most up-to-date and pertinent information is used in generating the response.
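
Continuing the toy sketch from above (reusing embed and vector_store), retrieval amounts to ranking every stored chunk by its similarity to the query vector:

def retrieve(query: str, k: int = 3) -> list[str]:
    # Embed the query with the *same* model used at indexing time, then rank
    # chunks by cosine similarity (a plain dot product here, since the toy
    # vectors are already normalized).
    q = embed(query)
    ranked = sorted(
        vector_store,
        key=lambda pair: sum(a * b for a, b in zip(q, pair[0])),
        reverse=True,
    )
    return [chunk for _, chunk in ranked[:k]]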

Generation

The generation phase is where the LLM produces the final output. In this step, the original input query is combined with the retrieved relevant chunks, and this combined information is formatted into a prompt that the LLM can understand. This prompt is then passed to the LLM, which generates a response based on both its pre-trained knowledge and the additional context provided. Optionally, the generated response may undergo post-processing to ensure coherence and relevance. This step allows the model to produce more informed, accurate, and up-to-date responses by leveraging both its inherent knowledge and the retrieved information, resulting in outputs that are more contextually appropriate and factually current.
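
Finishing the toy sketch, the generation step stitches the retrieved chunks into the prompt; call_llm below is a placeholder for whichever model API you actually use:

question = "What is our fall protection policy?"
context = "\n\n".join(retrieve(question))
prompt = (
    "Using the context below, answer the question.\n\n"
    f"context:\n{context}\n\n"
    f"question: {question}"
)
# answer = call_llm(prompt)  # placeholder for OpenAI, Bedrock, etc.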

What is LangChain?

LangChain is an open-source framework designed to simplify the development of applications using large language models (LLMs). It provides a set of tools and abstractions that make it easier to build complex AI applications, including those that use Retrieval-Augmented Generation (RAG).

LangChain is popular because it simplifies AI development and makes it more flexible. By abstracting away much of the complexity of working with LLMs and other AI tools, it helps developers focus on building applications. It also integrates easily with different LLMs, databases, and other tools, allowing developers to switch between components without rewriting large parts of their code. Additionally, LangChain provides a standardized approach that makes collaboration easier. With built-in features like prompt templating, chain of thought reasoning, and agent-based systems, it offers powerful tools to streamline AI workflows.

By using LangChain, developers can more quickly and easily build sophisticated AI applications, including those that leverage RAG techniques.
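
As a small illustration of that flexibility, the same chain can be pointed at a different model provider by changing a single argument. A minimal sketch using the classic LangChain API this post is written against (newer releases have renamed some of these classes):

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain


prompt = PromptTemplate(
    input_variables=["topic"],
    template="Explain {topic} in one sentence.",
)

# The chain is defined once against one provider...
chain = LLMChain(llm=OpenAI(), prompt=prompt)
print(chain.run({"topic": "vector databases"}))

# ...and pointing it at another is a one-line change, for example:
# from langchain.llms import Bedrock
# chain = LLMChain(llm=Bedrock(model_id="anthropic.claude-v2"), prompt=prompt)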

What is OpenSearch?

OpenSearch is a search and analytics database used to store and manage data. It serves as a powerful vector database for RAG implementations on AWS. OpenSearch efficiently manages vector embeddings from documents, enabling semantic search beyond simple keyword matching.

In applications using RAG, OpenSearch handles the critical storage and retrieval functions. It indexes vector representations of documents for fast similarity searches. OpenSearch Serverless automatically scales with your workload, eliminating the need for manual provisioning.

OpenSearch works seamlessly with LangChain through the OpenSearchVectorSearch class. LangChain uses OpenSearch to store embeddings and retrieve relevant document chunks during queries. Together, they form the backbone of robust, scalable RAG solutions on AWS.

Cost optimization and scaling with OpenSearch

OpenSearch Serverless automatically scales with your workload, with indexing and search capacity configured separately. This eliminates the need to provision specific instance types. However, monitor your usage to avoid unexpected costs as your document collection grows.

When scaling with OpenSearch, consider implementing caching for frequent queries. This reduces both embedding generation and vector search costs. Popular questions can be served faster while reducing overall API calls.
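
A minimal sketch of that idea, memoizing the retrieval step in-process with functools.lru_cache (vector here is the LangChain vector store connection built later in this post; a shared cache such as Redis or DynamoDB would play the same role across instances):

from functools import lru_cache


@lru_cache(maxsize=1024)
def cached_context(question: str) -> str:
    # Normalizing the query raises the hit rate for near-duplicate questions.
    docs = vector.similarity_search(
        question.strip().lower(), vector_field="osha_vector"
    )
    return "\n\n".join(doc.page_content for doc in docs)

Because the query embedding is computed inside similarity_search, a cache hit skips both the embedding call and the vector search.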

How to build a RAG with LangChain on AWS

In this post, we will walk through an example of building a RAG-based GenAI bot using OpenSearch Serverless as the vector store. Starting from the beginning, we will see how to index data into OpenSearch, how to query that data, and how to pass it all to an LLM for a plain-text response.

Step 0 - Create an Amazon OpenSearch Serverless (AOSS) collection

For this example, we will use SAM/CloudFormation. Below is a very basic template to create the OpenSearch Serverless collection; along with the collection itself, you need policies for data access, encryption, and network access. For simplicity, we are allowing our SSO user to access the collection from the internet and are using an AWS-owned key for encryption.

Be sure to update the template below to match your naming convention and to grant access to the roles in your specific account.

rAOSS:
  Type: 'AWS::OpenSearchServerless::Collection'
  Properties:
    Name: aoss-example
    Type: VECTORSEARCH
    Description: Vector search demo
  DependsOn: rEncryptionPolicy
rDataAccessPolicy:
  Type: 'AWS::OpenSearchServerless::AccessPolicy'
  Properties:
    Name: aoss-example-access-policy
    Type: data
    Description: Access policy for aoss-example collection
    Policy: >-
      [{"Description":"Access for SSO user","Rules":[{"ResourceType":"index","Resource":["index/*/*"],"Permission":["aoss:*"]},{"ResourceType":"collection","Resource":["collection/aoss-example"],"Permission":["aoss:*"]}],"Principal":["arn:aws:sts::123456789012:assumed-role/AWSReservedSSO_AWSAdministratorAccess_abcdefg12345/clayton@claytondavis.dev"]}]
rNetworkPolicy:
  Type: 'AWS::OpenSearchServerless::SecurityPolicy'
  Properties:
    Name: aoss-example-network-policy
    Type: network
    Description: Network policy for aoss-example collection
    Policy: >-
      [{"Rules":[{"ResourceType":"collection","Resource":["collection/aoss-example"]},{"ResourceType":"dashboard","Resource":["collection/aoss-example"]}],"AllowFromPublic":true}]
rEncryptionPolicy:
  Type: 'AWS::OpenSearchServerless::SecurityPolicy'
  Properties:
    Name: aoss-example-security-policy
    Type: encryption
    Description: Encryption policy for aoss-example collection
    Policy: >-
      {"Rules":[{"ResourceType":"collection","Resource":["collection/aoss-example"]}],"AWSOwnedKey":true}

Step 1 - Connect to OpenSearch

Before we can do anything else, we need to connect to the collection using our AWS credentials. One simple method is to copy the temporary credentials from AWS SSO into your terminal session before running this script, then let boto3 build the authentication object. If you aren’t using AWS SSO, this step may vary.

from requests_aws4auth import AWS4Auth
from opensearchpy import OpenSearch, RequestsHttpConnection
import boto3


region = 'us-east-1'
service = 'aoss'
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)


vector = OpenSearch(
   hosts = [{'host': 'aoss-example-1234.us-east-1.aoss.amazonaws.com', 'port': 443}],
   http_auth = awsauth,
   use_ssl = True,
   verify_certs = True,
   http_compress = True,
   connection_class = RequestsHttpConnection
)

Step 2 - Create the index

Once we have the connection, we can create our vector index.

index_body = {
  'settings': {
    "index.knn": True
  },
  "mappings": {
    "properties": {
      "osha_vector": {
        "type": "knn_vector",
        "dimension": 1536,
        "method": {
          "engine": "faiss",
          "name": "hnsw",
          "space_type": "l2"
        }
      }
    }
  }
}


response = vector.indices.create('aoss-index', body=index_body)

There are a few settings in this index, explained briefly below:

  • Type - knn_vector is the field type that enables k-nearest neighbor (k-NN) searches on your data. This is required for vector searches.
  • Dimension - this needs to match your embedding model. The OpenAI embedding defaults to text-embedding-ada-002 as of this writing, which produces 1536 dimensions. More dimensions give you more points to search on, but they also directly increase the amount of data you store and the time queries take.
  • Engine - the approximate k-NN library to use for indexing and search. Facebook AI Similarity Search (FAISS) is the engine we are using here. As of this writing, Amazon OpenSearch Serverless collections support only Hierarchical Navigable Small World (HNSW, below) and FAISS (see the limitations section of the Amazon docs). Other options in OpenSearch include Non-Metric Space Library (nmslib) and Apache Lucene.
  • Name - the identifier for the nearest neighbor method we are using. Hierarchical Navigable Small World (HNSW), as mentioned above, is the only method supported by Amazon OpenSearch Serverless today. Other options include Inverted File System (IVF) and Inverted File System with Product Quantization (IVFPQ). While HNSW is generally faster, it achieves that speed at the cost of higher memory consumption. To learn more, take a look at this blog post by AWS on choosing the right algorithm.
  • Space type - the space type used to calculate the distance/similarity between vectors. Here, we use l2. Other options include innerproduct, cosinesimil, l1, and linf; the differences get deep into the math of how vectors relate to each other.

Depending on what options you choose above, you may also have additional settings that you can set to help tune your index for performance, resource consumption, or to optimize for the particular type of data.
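
Before loading any data, it can be worth reading the mapping back to confirm the vector field was created the way you intended. A quick sanity check with the same opensearch-py client from Step 1:

# Fetch the index mapping and inspect the knn_vector field configuration.
mapping = vector.indices.get(index='aoss-index')
print(mapping['aoss-index']['mappings']['properties']['osha_vector'])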

Step 3 - Index documents

For a good set of sample documents, we’ll use a large batch of publicly accessible OSHA documents copied into an S3 bucket. For your real-world use case, you might use internal company data, knowledge bases, PDF reports, etc. The possibilities here are endless. In this example, we’ll use LangChain’s S3DocumentLoader to help break up and index the documents, but they also have document loaders for 100+ different sources that we can use to replicate a similar process.

One of the first pieces that we need to configure is the embeddings. Here we are using OpenAI embeddings.

from langchain.embeddings.openai import OpenAIEmbeddings


embeddings = OpenAIEmbeddings()

We also need to connect to the OpenSearch collection index we created, this time using LangChain's OpenSearchVectorSearch class. Notice that we specify the embeddings in this connection so that documents are indexed with those embeddings as they are uploaded.

from langchain.vectorstores import OpenSearchVectorSearch


vector = OpenSearchVectorSearch(
  embedding_function = embeddings,
  index_name = 'aoss-index',
  http_auth = awsauth,
  use_ssl = True,
  verify_certs = True,
  http_compress = True, # enables gzip compression for request bodies
  connection_class = RequestsHttpConnection,
  opensearch_url = "https://aoss-example-1234.us-east-1.aoss.amazonaws.com"
)

Next, we dynamically list all of the documents in the S3 bucket.

import boto3


s3_resource = boto3.resource('s3', region_name="us-east-1")
s3_bucket = s3_resource.Bucket("my-sample-osha-bucket")
s3_bucket_files = []
for s3_object in s3_bucket.objects.all():
    # Collect every object key in the bucket for processing below.
    s3_bucket_files.append(s3_object.key)

Then, for each S3 key, we use the LangChain S3FileLoader to load the file, split it into chunks, run the chunks through the embeddings, and load them into the vector store. For a single file, it looks something like this (we’ll wrap it in a loop over all of the keys below).

from langchain.document_loaders import S3FileLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# bucket_name and file_key come from the S3 listing above
loader = S3FileLoader(bucket_name, file_key)
text_splitter = RecursiveCharacterTextSplitter(
  chunk_size = 1000,
  chunk_overlap = 20,
  length_function = len,
  is_separator_regex = False,
)
pages = loader.load_and_split(text_splitter=text_splitter)


vector.add_documents(
  documents = pages,
  vector_field = "osha_vector"
)

This results in all of our chunks of documents getting uploaded with the chunk of text, the embeddings for determining similarity, and metadata about which document the text was taken from.
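
Putting the pieces together, a sketch of the full ingestion pass over every key we collected earlier (reusing the text_splitter and vector store connection from above):

for file_key in s3_bucket_files:
    loader = S3FileLoader("my-sample-osha-bucket", file_key)
    pages = loader.load_and_split(text_splitter=text_splitter)
    vector.add_documents(documents=pages, vector_field="osha_vector")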

At this point, we’ve set up the environment and we can perform queries on it as many times as we want.

Step 4 - Query documents with similarity search

This step is triggered when someone asks a question of your system. The question is run through the same embeddings you used to load your documents, and the system then searches your vector store for similar chunks of text. LangChain abstracts most of this behind a single function.

First things first, make sure you have a connection to your collection and index using the same embeddings as in step 3; it can be exactly the same connection.

Once you have that connection, you can pass your question in. You also indicate which field stores your vector embeddings, which field contains your text, and which field holds your metadata. The last two make up your return data.

question = "What is the standard height where fall protection is required?"
docs = vector.similarity_search(
  question,
  vector_field="osha_vector",
  text_field="text",
  metadata_field="metadata",
)


docs_dict = [{"page_content": doc.page_content, "metadata": doc.metadata} for doc in docs]
data = ""
for doc in docs_dict:
  data += doc['page_content'] + "\n\n"

Step 5 - Send data to the LLM

Finally, now that we have our question and additional context from our vector store, we can package up and send all of this to the LLM for a response.

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain


llm = OpenAI()
prompt = PromptTemplate(
  input_variables=["question", "data"],
  template="""Using the data below, answer the question provided.
  question: {question}
  data: {data}
  """,
)


chain = LLMChain(llm=llm, prompt=prompt)
llm_return_data = chain.run({'question': question, 'data': data})

The LLM will return a plain text response. We can couple that response with the metadata we pulled from our vector store to not only provide our user with an answer but also link to documents where additional data can be found.
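
For example, LangChain document loaders typically record where each chunk came from under the source metadata key (verify what your loader version stores). A simple way to append citations, reusing docs_dict from Step 4:

# Collect the distinct source documents behind the retrieved chunks.
sources = {doc["metadata"].get("source", "unknown") for doc in docs_dict}
answer = llm_return_data + "\n\nSources:\n" + "\n".join(sorted(sources))
print(answer)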

Using Amazon Bedrock instead of OpenAI

Amazon Bedrock offers powerful capabilities for RAG implementations. You can use Amazon Titan Embeddings via Bedrock in place of OpenAI embeddings to vectorize your text, giving you a fully AWS-native RAG pipeline.

For the generation phase, models like Claude 4 or Amazon Nova can replace OpenAI. Simply update your LLM connection in LangChain to use these models. AWS credentials can authenticate to both OpenSearch and Bedrock services seamlessly.
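
A sketch of those swaps using LangChain's Bedrock integrations (class names reflect the classic LangChain API this post uses, and the model IDs are illustrative; check the Bedrock console for the models enabled in your account):

from langchain.embeddings import BedrockEmbeddings
from langchain.llms import Bedrock


# Titan Embeddings replace OpenAIEmbeddings for indexing and querying.
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v1")

# A Bedrock-hosted model replaces OpenAI() in the generation chain.
llm = Bedrock(model_id="anthropic.claude-v2")

Note that your index's dimension setting must match whichever embedding model you choose.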

Advanced prompt engineering techniques

Effective prompt engineering is crucial for quality RAG responses. Your prompt template should clearly instruct the LLM to use retrieved context for answering. To avoid hallucinations caused by data missing from your knowledge base, consider including specific instructions about handling cases when context is insufficient.

Dynamic prompts can adapt based on the quality of retrieved documents. For example, you might use different templates depending on similarity scores, as shown in the sketch below. This ensures your system gracefully handles both information-rich and information-poor scenarios. As a starting point, the basic template from Step 5 can be extended with an explicit fallback instruction:

prompt = PromptTemplate(
  input_variables=["question", "data"],
  template="""Using the data below, answer the question provided.
  If the data does not contain the information needed to answer,
  say you don't know rather than guessing.
  question: {question}
  data: {data}
  """,
)
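
And a sketch of the score-based selection mentioned above, using similarity_search_with_score (available in recent LangChain versions) to expose relevance scores. The threshold value and the two template names are assumptions you would tune and define for your own data:

docs_with_scores = vector.similarity_search_with_score(
    question, vector_field="osha_vector"
)

SCORE_THRESHOLD = 0.7  # assumed cutoff; tune against your own data
best_score = docs_with_scores[0][1] if docs_with_scores else 0.0

# rich_context_template / fallback_template are hypothetical PromptTemplates:
# one answers from the data, the other hedges or asks the user to rephrase.
template = rich_context_template if best_score >= SCORE_THRESHOLD else fallback_template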

Security concerns with OpenSearch 

Protecting proprietary information is essential in any information system. To avoid unauthorized access to OpenSearch, use AWS authentication and encryption policies to secure your OpenSearch collection. The CloudFormation templates used in this article demonstrate setting proper network and data access restrictions.

In addition to information security measures, consider carefully what data you're indexing. Sensitive information requires appropriate handling within your embedding pipeline. AWS provides compliance frameworks that integrate with the services mentioned.

The CloudFormation template in Step 0 shows the encryption policy configuration. Setting AWSOwnedKey to true uses an AWS-owned key for encryption. For stricter requirements, configure customer-managed keys instead.

The Caylent approach to Generative AI

Is your company trying to figure out where to go with generative AI? Consider finding a partner who can help you get there.

At Caylent, we have a full suite of generative AI offerings. Starting with our Generative AI Strategy Catalyst, we can start the ideation process and guide you through the art of the possible for your business. Using these new ideas we can implement our Generative AI Knowledge Base Catalyst to build a quick, out-of-the-box solution integrated with your company's data to enable powerful search capabilities using natural language queries.

Finally, Caylent’s Generative AI Flight Plan Catalyst will help you build an AI roadmap for your company and demonstrate how generative AI will play a part. As part of these Catalysts, our teams will help you understand your custom roadmap for generative AI and how Caylent can help lead the way.

FAQs about using LangChain and OpenSearch for RAG on AWS

What is RAG?

Retrieval-Augmented Generation (RAG) is an AI framework that combines large language models with external knowledge retrieval. Unlike traditional models that rely solely on pre-trained knowledge, RAG dynamically retrieves relevant information from external sources before generating responses. 

This approach significantly improves accuracy, relevance, and timeliness of outputs, making it ideal for tasks that require specific, current, or proprietary information like company knowledge bases or up-to-date product information.

How does RAG work?

RAG operates through three main steps: indexing, retrieval, and generation. During indexing, relevant documents are collected, broken into chunks, and transformed into vector embeddings stored in a database. The retrieval phase occurs when a query is input—the system converts it into a vector and performs a similarity search to find relevant document chunks. Finally, in the generation phase, the original query and retrieved chunks are combined into a prompt for the LLM, which produces a response based on both its pre-trained knowledge and the additional context.

What is LangChain?

LangChain is an open-source framework designed to simplify development of applications using large language models. It provides tools and abstractions that make building complex AI applications easier, including those using Retrieval-Augmented Generation. LangChain is popular because it abstracts away complexity, integrates easily with different LLMs and databases, and offers standardized approaches with built-in features like prompt templating and agent-based systems that streamline AI workflows.

Why choose LangChain and OpenSearch for implementing RAG?

LangChain simplifies integration with various AI models and OpenSearch efficiently manages vector data, enabling robust, scalable RAG solutions on AWS. LangChain provides built-in document loaders for 100+ different sources, making data ingestion straightforward. OpenSearch Serverless further reduces operational overhead with its automatic scaling capabilities.

How does LangChain interact with AWS OpenSearch?

LangChain uses AWS OpenSearch as a vector store to efficiently index, retrieve, and provide relevant documents during query handling. The OpenSearchVectorSearch class handles authentication and facilitates similarity searches with vector embeddings. This integration enables semantic search capabilities that go beyond simple keyword matching.

Can I use Amazon Bedrock models within a LangChain-OpenSearch solution?

Yes, LangChain easily integrates with Amazon Bedrock, enabling you to leverage powerful models like Claude or Llama 2 alongside OpenSearch. This creates a fully AWS-native solution for your entire RAG pipeline. You can use Bedrock's Titan Embeddings for vector creation and Bedrock's LLMs for the generation phase of your application.

How do you use OpenSearch to build a RAG chatbot on AWS?

Building a RAG chatbot on AWS involves several steps, beginning with creating an Amazon OpenSearch Serverless collection for vector storage. You then connect to OpenSearch using AWS credentials, create a vector index with appropriate settings (like dimension size and search algorithm), and index your documents using LangChain's document loaders. 

When a query comes in, you process it through the same embeddings used for indexing, perform a similarity search to find relevant document chunks, and finally combine the question and retrieved context into a prompt for the LLM to generate a response.

Clayton Davis

Clayton Davis is the Director of the Cloud Native Applications practice at Caylent. His passion is partnering with potential clients and helping them realize how cloud-native technologies can help their businesses deliver more value to their customers. His background spans the landscape of AWS and IT, having spent most of the last decade consulting clients across a plethora of industries. His technical background includes application development, large scale migrations, DevOps, networking and technical product management. Clayton currently lives in Milwaukee, WI and as such enjoys craft beer, cheese, and sausage.

