RAG is broken down in the diagram above. It typically uses a vector store (see vector stores) to index all of the documents. Indexing works by breaking the text into consumable chunks - think roughly 100-1,000 words each - then running each chunk through an embedding model (see embeddings) and saving the resulting vectors as a searchable index.
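As a rough illustration, the indexing step might look like the minimal sketch below. The embed_text() helper and the vector_store client are hypothetical placeholders for whatever embedding model and vector store you actually use.
```python
# Minimal RAG indexing sketch. embed_text() and vector_store are hypothetical
# placeholders for your actual embedding model and vector store client.

def chunk_text(document, chunk_size=500):
    # Naive chunking by word count; real pipelines often split on paragraphs
    # or sentences and add some overlap between chunks.
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def index_document(document, source, vector_store):
    for chunk in chunk_text(document):
        vector = embed_text(chunk)            # hypothetical embedding call
        vector_store.add(                     # hypothetical vector store client
            vector=vector,
            text=chunk,
            metadata={"source": source},      # kept for source attribution later
        )
```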
When a prompt is passed in, the first step is to run it through the same embedding model, search the vector store for chunks of text with similar semantic meaning (via the index mentioned above), and then pass those chunks to the LLM as additional context. A simple augmented prompt might look like the sample below.
```
Using the additional context below, answer the question
Question: How many days of PTO do I get after being here for 5 years?
Context: <insert the 3-6 chunks of text returned from the vector store>
```
This method uses many more tokens, and is therefore more expensive, but it gives the LLM the additional context it needs to provide the best answer to the question. Alongside the answer from the LLM, the response can also include any metadata stored in the vector store - the source document, URL, Slack channel, Google Drive folder, and so on - to enable source attribution and further investigation if needed.
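Putting the retrieval side together, a minimal sketch might look like the following. Again, embed_text(), vector_store.search(), and llm_complete() are hypothetical placeholders, not any specific library's API.
```python
# Minimal RAG retrieval sketch. embed_text(), vector_store.search(), and
# llm_complete() are hypothetical placeholders for your actual tooling.

def answer_with_rag(question, vector_store, top_k=4):
    # 1. Embed the question with the SAME model used when indexing the documents.
    query_vector = embed_text(question)

    # 2. Find the chunks whose embeddings are semantically closest to the question.
    chunks = vector_store.search(query_vector, limit=top_k)

    # 3. Assemble the augmented prompt: instruction, question, and retrieved context.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Using the additional context below, answer the question\n"
        f"Question: {question}\n"
        f"Context: {context}"
    )

    # 4. Ask the LLM and return the source metadata alongside the answer.
    answer = llm_complete(prompt)
    sources = [chunk.metadata for chunk in chunks]  # e.g. document, URL, Slack channel
    return answer, sources
```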
Embeddings
Embeddings are a way to encode a word or word part based on its meaning. This is important because words with similar meanings end up near each other in the embedding space. The encoding is represented numerically as a vector and is generally stored in a vector database.
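The toy example below shows why this matters: once meanings are represented as vectors, "near each other" can be measured with something like cosine similarity. The three-dimensional vectors here are made up purely for illustration; real embeddings have hundreds or thousands of dimensions.
```python
import numpy as np

# Made-up, tiny "embeddings" - real ones come from an embedding model and are
# much higher-dimensional, but the idea is the same: similar meanings produce
# nearby vectors.
vacation = np.array([0.9, 0.1, 0.3])
pto      = np.array([0.8, 0.2, 0.3])
invoice  = np.array([0.1, 0.9, 0.7])

def cosine_similarity(a, b):
    # Close to 1.0 means very similar direction (similar meaning); lower means less related.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vacation, pto))      # high - related meanings
print(cosine_similarity(vacation, invoice))  # lower - unrelated meanings
```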
Tokens
Tokens are how inputs and outputs are measured for LLMs. Tokenization differs from model to model, but the easiest way to think about it is that a token is roughly equivalent to a word. A long or compound word might be split into word parts, each of which becomes its own token, while several short words (single- or double-letter words, for example) might be combined into one token. Once again, the exact behavior depends on the LLM you use; the sketch after the list below shows how to inspect tokens for one family of models.
Some common uses and things to keep in mind concerning tokens:
- LLM inputs and outputs always have a token limit, but the limits vary per model
- Tokens are often the unit of measurement for billing purposes
- Tokens generated per second is a common measure of an LLM's speed and performance
- Prompt optimization generally focuses on getting the best results for the fewest tokens (and therefore the lowest cost)
- LLM optimization focuses on the speed at which tokens can be generated
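For a concrete look at how text maps to tokens, here is a minimal sketch using OpenAI's open-source tiktoken tokenizer. Other model families use different tokenizers, so the exact splits and counts will vary.
```python
# Minimal token-counting sketch using the tiktoken library. The splits shown
# are specific to this encoding; other models tokenize differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by several OpenAI models

for text in ["cat", "unbelievably", "How many days of PTO do I get?"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{text!r} -> {len(token_ids)} tokens: {pieces}")
```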
Vector Stores
Vector stores are specialized databases that let you search data in a way that is particularly useful for GenAI. The search relies on some complex math (nearest-neighbor search over vectors), but the gist of it is finding items with similar meaning. The meaning of an item is captured by its embedding (see embeddings). Those embeddings are stored in a specialized database index and then queried.
Another way to think about a vector store is as a standard database with standard fields, except that it also has a specialized index (created from embeddings), and that index is the main way you query for data in this type of database. The other columns are simply metadata.
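The toy, in-memory stand-in below shows the idea: each record pairs an embedding (the specialized index) with ordinary metadata columns, and queries rank records by similarity to a query vector. Real vector stores such as pgvector, OpenSearch, or Pinecone do the same thing at scale with approximate nearest-neighbor indexes; the vectors here are made up for illustration.
```python
import numpy as np

# A toy stand-in for a vector store: an embedding per record plus plain
# metadata columns. The vectors are illustrative, not real embeddings.
records = [
    {"vector": np.array([0.9, 0.1, 0.3]), "text": "PTO policy chunk",       "source": "hr-handbook.pdf"},
    {"vector": np.array([0.1, 0.9, 0.7]), "text": "Invoice process chunk",  "source": "finance-wiki"},
]

def search(query_vector, top_k=1):
    # Rank records by cosine similarity to the query vector and keep the top matches.
    def score(rec):
        v = rec["vector"]
        return float(np.dot(query_vector, v) / (np.linalg.norm(query_vector) * np.linalg.norm(v)))
    return sorted(records, key=score, reverse=True)[:top_k]

# In practice the query vector comes from the same embedding model used at indexing time.
query = np.array([0.8, 0.2, 0.3])  # made-up embedding of "how much vacation do I get?"
for match in search(query):
    print(match["text"], "-", match["source"])
```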
Prompt Engineering
Prompt engineering is the science (though it is arguably more of an art) of structuring your input, also known as a prompt, to give the LLM (see LLM) the most information in the fewest tokens while still getting a correct response. Often, when you don't get back exactly what you wanted, it is because your prompt needed additional tokens (or context).
Some simple ways to do this are below, followed by an example prompt that puts them together.
- Give the prompt a persona - “As a developer…”, “As a SQL Server administrator…”, or “As a children's book author…”
- Give the prompt tokens about what is expected out of the response - “Respond in one word…”, “Using a 1st-grade reading level…”, “In the style of Shakespeare…”
- Give the prompt examples of entire questions and answers to help guide its responses
- Try not to use negatives - instead, tell it what you want it to do
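Putting those tips together, an example prompt might look like the sketch below; the wording is illustrative and not specific to any particular LLM.
```
As an HR assistant, answer employee benefits questions.
Respond in two sentences or fewer, using a 1st-grade reading level.

Example question: <insert a sample question>
Example answer: <insert an answer written exactly how you want responses to look>

Question: How many days of PTO do I get after being here for 5 years?
Answer:
```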
Natural Language Processing
Natural language processing (NLP) is the concept of giving a computer the ability to understand and process written text. As an example, consider asking a computer, "If I have twelve apples and I give two to my brother and two to my sister, how many apples do I have left?" The computer processes the natural language to understand what is happening (12 - 2 - 2), formulates an answer (8), and gives back a natural language response (eight apples).
Natural language processing and natural language responses are among the biggest draws of GenAI and a large part of how it can unlock potential for users.
Conclusion
Is your company trying to figure out where to go with generative AI? Consider finding a partner who can help you get there. At Caylent, we have a full suite of generative AI offerings. Starting with our Generative AI Strategy Catalyst, we can start the ideation process and guide you through the art of the possible for your business. Using these new ideas, we can implement our Generative AI Knowledge Base Catalyst to build a quick, out-of-the-box solution integrated with your company's data to enable powerful search capabilities using natural language queries. Finally, Caylent's Generative AI Flight Plan Catalyst will help you build an AI roadmap for your company and demonstrate how generative AI will play a part. As part of these Catalysts, our teams will help you understand your custom roadmap for generative AI and how Caylent can help lead the way.