Generative AI Essentials and Tech


Understand key concepts like Large Language Models (LLMs), Retrieval Augmented Generation (RAG), and Prompt Engineering, and arm yourself with the knowledge needed to leverage the remarkable capabilities of GenAI.

Generative AI has taken the world by storm. New, AI-powered tools like ChatGPT have become household names seemingly overnight and are generating intense interest. As with any new technology, figuring out where to start can be challenging. My goal in this post is to remove some of that friction and provide a path forward. I want to start by introducing the concepts and topics you need to gain a general understanding of how these tools work. Armed with this knowledge, you can dive deeper into the more technical details behind these concepts.

What is…

Generative AI (GenAI)

Traditional analytical artificial intelligence (AI) and machine learning (ML) techniques look at a set of provided historical data that is structured and labeled, called training data, and use it to make predictions about future data. This includes things like predictive analytics, image recognition, voice-to-text, and many of the existing, out-of-the-box capabilities of AWS AI/ML services. GenAI differs from this by taking unstructured training data, processing it in an unsupervised manner, and creating something new with it. We have all seen examples of how ChatGPT can create text output such as stories, poems, or break-up letters by combining words in ways that have possibly never been done before. In addition to text output, there are other examples, such as GenAI image generators that take a text-based prompt and create a never-before-seen image.

Behind the scenes, GenAI and traditional AI/ML also have some other differences. Traditional AI/ML is generally trained with a very focused set of data and is used to solve a very focused problem. The infamous “hotdog, not hotdog” episode from the TV show Silicon Valley is a prime example: that model was likely trained with a lot of images of hot dogs and a lot of images of things that are not hot dogs. Unlike this fun example, current GenAI models are not tied to specific use cases. GenAI models have been trained on massive amounts of data compared to traditional AI/ML, so much so that training them requires specialized hardware, which makes them dramatically more expensive to develop. The data set for many of these models is publicly available data from the internet.

LLM

Large Language Models (LLMs) are the technology underlying text-based GenAI. GPT-4, for instance, is the LLM that powers the popular ChatGPT service. An LLM is a model trained on a large amount of generally available data, which enables it to recognize, translate, predict, or generate various types of content. Today, LLMs natively handle text input and output, but there are other model types and tools in this space that can work with images, video, and audio. Because of the way LLMs are built (foundationally on neural networks), they are able to recognize and summarize large amounts of text. This is why, when you ask ChatGPT a question, it can provide a concise answer in return instead of just references to the large amounts of data it was trained with.

These models work by taking in a text-based prompt, which is broken into “tokens” (see more on tokens below). Those tokens are the starting point from which the model generates its response. It generates the response much like predictive text does on many mobile SMS apps, but on a much larger scale, trying to recognize the context from the tokens that were passed in and predict what should come next. In this way, it is really just a giant auto-complete. As we have all experienced with auto-complete, it can be way off. If an LLM is way off, it is likely that the prompt did not provide enough input tokens and therefore not enough context. Another possibility is that the LLM doesn't have enough (or any) data on the topic asked, so it isn't able to predict a reasonable next word with high enough confidence.
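
To make the “giant auto-complete” idea concrete, here is a minimal sketch of next-token generation. It assumes the open-source Hugging Face transformers library and the small GPT-2 model purely for illustration; hosted models like GPT-4 are only reachable through their APIs, but the prompt-in, tokens-out flow is conceptually the same.

```
# A minimal sketch of next-token prediction, assuming the open-source
# Hugging Face `transformers` library and the small GPT-2 model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The best thing about working in the cloud is"
inputs = tokenizer(prompt, return_tensors="pt")  # text in, tokens out

# The model repeatedly predicts the most likely next token and appends it
# to the context - auto-complete, just at a much larger scale.
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```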

Examples of common LLMs currently available include GPT-4 and GPT-3.5 from OpenAI, Claude from Anthropic, Llama 2 from Meta, and Titan from Amazon.

Retrieval Augmented Generation (RAG)

RAG is a method by which you can pass custom or curated data to an LLM (see LLM) in order to provide extra context to improve the answer to your prompt. This is particularly useful if the data you are asking about is something that the LLM doesn't or couldn’t know based on the data that was used to train it. One of the most common use cases for this method is to leverage an internal knowledge base. For example, there is no way for an LLM to know what your internal company HR policies are. But if you can give it custom, curated HR policy information via RAG, it could then answer questions about PTO and medical benefits with high confidence since the additional context was provided. 

Breaking RAG down a bit further: it often uses a vector data store (see Vector Stores) to index all of the documents. This indexing is done by breaking the text into consumable chunks - think roughly 100-1,000 words each. It then uses an LLM embedding model (see Embeddings) to encode each chunk and saves the result as a searchable index.

When a prompt is passed in, the first step is to run the prompt through the same embedding model, search the vector store for chunks of text with a similar semantic meaning (via the previously mentioned index), and then pass those chunks of text as additional context to the LLM. A simple prompt to the LLM might look like the sample below.

```
Using the additional context below, answer the question.

Question: How many days of PTO do I get after being here for 5 years?

Context: <insert the 3-6 chunks of text returned from the vector store>
```

This method uses a lot more tokens, and is therefore more expensive, but it gives the LLM a lot of additional context with which to provide the best answer to the question. In addition to the answer from the LLM, the response can also include any metadata stored in the vector store, such as the source document, URL, Slack channel, or Google Drive folder the information came from, to enable source attribution and further investigation if need be.
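
To make this flow concrete, below is a minimal RAG sketch. It assumes the open-source sentence-transformers library for embeddings and uses a plain in-memory list as a stand-in for the vector store; the HR policy chunks are hypothetical, and a real system would send the final prompt to an LLM API.

```
# A minimal RAG sketch: embed chunks, retrieve the most similar ones,
# and build the augmented prompt. Assumes the open-source
# `sentence-transformers` library; the HR policy chunks are hypothetical.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Indexing: break documents into chunks and embed each one.
chunks = [
    "Employees accrue 15 days of PTO per year for their first 4 years.",
    "After 5 years of service, employees accrue 20 days of PTO per year.",
    "Medical benefits begin on the first day of the month after hire.",
]
chunk_vectors = embedder.encode(chunks, convert_to_tensor=True)

# Query time: embed the question and find the most similar chunks.
question = "How many days of PTO do I get after being here for 5 years?"
question_vector = embedder.encode(question, convert_to_tensor=True)
scores = util.cos_sim(question_vector, chunk_vectors)[0]
top_chunks = [chunks[int(i)] for i in scores.argsort(descending=True)[:2]]

# Build the augmented prompt that would be sent to the LLM.
prompt = (
    "Using the additional context below, answer the question.\n"
    f"Question: {question}\n"
    "Context:\n" + "\n".join(top_chunks)
)
print(prompt)  # in a real system, send this to the LLM API of your choice
```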

Embeddings

Embeddings are a way to encode a word or word part based on its meaning. This is important because words with similar meanings end up near each other in the resulting vector space. The encoding is represented numerically as a vector and is generally stored in a vector database.
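
For a quick intuition, here is a small sketch, again assuming the open-source sentence-transformers library, showing that words with related meanings produce embeddings that score as more similar than unrelated words.

```
# A small sketch of semantic similarity between embeddings,
# again assuming the open-source `sentence-transformers` library.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["dog", "puppy", "spreadsheet"])

# Related words score as more similar than unrelated ones.
print(util.cos_sim(vectors[0:1], vectors[1:2]).item())  # "dog" vs "puppy": higher
print(util.cos_sim(vectors[0:1], vectors[2:3]).item())  # "dog" vs "spreadsheet": lower
```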

Tokens

Tokens are how inputs and outputs are measured for LLMs. A token is defined slightly differently from one LLM to another, but the easiest way to think about it is that a token is roughly equivalent to an input word. A long or compound word might be split into word parts, with each part becoming a token, while very short words or common character sequences might be combined into a single token. Once again, the exact behavior depends on the LLM (and tokenizer) that you use.
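
As a concrete example, OpenAI models use the open-source tiktoken library for tokenization. Counting tokens might look like the sketch below; the exact split varies by model and tokenizer.

```
# A sketch of tokenization, assuming OpenAI's open-source `tiktoken` library.
# Other providers use different tokenizers, so counts will vary by model.
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")
text = "Generative AI has taken the world by storm."
tokens = encoding.encode(text)

print(len(tokens))                             # number of billable tokens
print([encoding.decode([t]) for t in tokens])  # how the text was split up
```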

Some common uses and things to keep in mind concerning tokens:

  • LLM inputs and outputs always have a token limit, but the limits vary per model
  • Tokens are often the unit of measurement for billing purposes
  • Tokens generated per second is a common measure of an LLM's speed and performance
  • Prompt optimization generally focuses on getting the best results with the fewest tokens (and therefore the lowest cost)
  • LLM optimization focuses on the speed at which tokens can be generated

Vector Stores

Vector Stores are specialized databases that allow you to search data in a way that is generally useful for GenAI. The method of search requires some complex math, but the gist of it is finding things that have a similar meaning. The meaning of something is determined by using embeddings (see embeddings). Those embeddings are stored in a specialized database index and then queried.

Another way to think about vector stores is that they are standard databases with standard fields, plus a specialized index created using embeddings. That specialized index is the main way you query for data in this type of database; the other columns are simply metadata.
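
Here is a minimal sketch of that idea, assuming the open-source FAISS library for the specialized index and a plain Python list for the metadata columns; the embedding vectors are random placeholders standing in for output from a real embedding model.

```
# A minimal vector store sketch, assuming the open-source `faiss` library.
# The FAISS index holds the embeddings; a parallel list holds the metadata.
import faiss
import numpy as np

dimension = 384                       # must match the embedding model's output size
index = faiss.IndexFlatIP(dimension)  # inner-product similarity index

# Placeholder embeddings and metadata; in practice the vectors come from an
# embedding model and the metadata describes where each chunk came from.
chunk_vectors = np.random.rand(3, dimension).astype("float32")
metadata = [
    {"source": "hr-handbook.pdf", "page": 12},
    {"source": "benefits-faq.docx", "page": 3},
    {"source": "onboarding-wiki", "page": 1},
]
index.add(chunk_vectors)

# Query with an embedded question and get back the closest chunks plus metadata.
query_vector = np.random.rand(1, dimension).astype("float32")
scores, ids = index.search(query_vector, 2)
print([metadata[int(i)] for i in ids[0]])
```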

Prompt Engineering

Prompt engineering is the science (although it is likely more of an art) of structuring your input, also known as a prompt, to give the LLM (see LLM) the most information using the fewest tokens, while still getting a correct response. Often, when you don't get back exactly what you wanted, it is because your prompt did not provide enough tokens (or context).

Some simple ways to do this are listed below, followed by an example prompt that combines them:

  • Give the prompt a persona - “As a developer…”, “As a SQL Server administrator…”, or “As a children's book author…”
  • Give the prompt tokens about what is expected out of the response - “Respond in one word…”, “Using a 1st-grade reading level…”, “In the style of Shakespeare…”
  • Give the prompt examples of entire questions and answers to help guide the response
  • Avoid negatives - instead, tell the model what you want it to do
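
Putting several of these techniques together, a prompt might look like the example below. The persona, format constraints, and example question and answer are all hypothetical, but they show how each bullet adds context for the model.

```
As an HR benefits specialist, answer the employee's question below.
Respond in three sentences or fewer, using a 1st-grade reading level.

Example question: How much does the dental plan cost?
Example answer: The dental plan costs $10 each month. It comes out of your paycheck. You can change plans once a year.

Question: How many days of PTO do I get after being here for 5 years?
```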

Natural Language Processing

Natural language processing (NLP) is the concept of giving a computer the ability to understand and process written text. For example, you might ask a computer, “If I have twelve apples and I give two to my brother and two to my sister, how many apples do I have left?” The computer is able to process the natural language to understand what is happening (12 - 2 - 2), formulate an answer (8), and give a natural language response (“eight apples”).

Natural language processing and natural language responses are one of the big draws of GenAI and a big part of how it can unlock so much potential for users.

Conclusion

Is your company trying to figure out where to go with generative AI? Consider finding a partner who can help you get there. At Caylent, we have a full suite of generative AI offerings. Starting with our Generative AI Strategy Catalyst, we can start the ideation process and guide you through the art of the possible for your business. Using these new ideas, we can implement our Generative AI Knowledge Base Catalyst to build a quick, out-of-the-box solution integrated with your company's data to enable powerful search capabilities using natural language queries. Finally, Caylent's Generative AI Flight Plan Catalyst will help you build an AI roadmap for your company and demonstrate how generative AI will play a part. As part of these Catalysts, our teams will help you understand your custom roadmap for generative AI and how Caylent can help lead the way.



Clayton Davis

Clayton Davis is the Director of the Cloud Native Applications practice at Caylent. His passion is partnering with potential clients and helping them realize how cloud-native technologies can help their businesses deliver more value to their customers. His background spans the landscape of AWS and IT, having spent most of the last decade consulting clients across a plethora of industries. His technical background includes application development, large scale migrations, DevOps, networking and technical product management. Clayton currently lives in Milwaukee, WI and as such enjoys craft beer, cheese, and sausage.

