NLP for Database Queries Guide


Learn how to implement Natural Language Processing (NLP) using LangChain and the OpenAI API through our comprehensive step-by-step guide.

Data is, more often than not, stored in a database. This makes sense: a database provides structure, and structured data is easier to store, query, and optimize. In fact, the entire industry of business intelligence (BI) exists to help non-technical users unlock meaningful data from these structured datastores. What if there were an even easier way? What if, instead of submitting a new report request into a queue, you could simply ask your question in natural language and interact with your data directly? This is exactly the functionality that Natural Language Processing (NLP) paired with an LLM delivers.

Let’s look at how we can implement NLP using LangChain and an OpenAI API. If you are not yet familiar with GenAI and some of its concepts, please see the earlier blog for an Overview of GenAI Keywords and Technologies.

Before diving into an in-depth example, some basic knowledge about LangChain is required. LangChain uses the concept of “chains” (wonder where it got its name). A chain is simply a way to link together different steps to reach an outcome. To explore this, we will use the SQLDatabaseChain for our example. This chain takes three inputs (a database, an LLM, and a prompt) and performs several steps behind the scenes to produce a plain-text answer to the prompt. Let’s dive in and start unpacking some of the magic we’ll see in this example.

Step 1 - Get database metadata

We first need to get metadata about our database in order to pass it to the LLM. Without this metadata, the LLM will not know which tables to query or which columns to fetch. The metadata typically comes in the form of CREATE TABLE statements plus a couple of rows of example data from each table.

The example below shows how to create the database object that will get passed into the chain. We are using pymysql as the driver for the MySQL database connection.
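Here is a minimal sketch of that setup, assuming LangChain's SQLDatabase utility; the connection string values are placeholders for your own environment.

```python
# Minimal sketch: create the database object the chain will use.
# The connection string values below are placeholders.
from langchain_community.utilities import SQLDatabase

db = SQLDatabase.from_uri(
    "mysql+pymysql://user:password@localhost:3306/mydb"
)

# Optional: inspect the metadata that will be passed to the LLM --
# CREATE TABLE statements plus a few sample rows per table.
print(db.table_info)
```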

If you have a particularly large database, you may run up against the token limits of your LLM, or you may just simply want to reduce the amount of tokens you are consuming. Let’s review a few ways to do this.

Options: Limit tables

You can limit which tables to include metadata for by populating a list for either include_tables or ignore_tables. Limiting the tables for metadata also limits the tables available for querying, which has pros and cons: if the LLM doesn’t know about a table, it can’t query it, but this can also be a way to keep sensitive data away from users.
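A quick sketch of both parameters, using hypothetical table names (note that include_tables and ignore_tables are mutually exclusive):

```python
# Expose only the listed tables to the LLM. Alternatively, use ignore_tables
# to hide specific tables instead; you cannot set both at once.
db = SQLDatabase.from_uri(
    "mysql+pymysql://user:password@localhost:3306/mydb",
    include_tables=["customers", "orders"],
    # ignore_tables=["audit_log", "user_credentials"],
)
```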

Options: Limit sample rows and strings

You can also limit the number of sample rows and the amount of data taken from them using sample_rows_in_table_info and max_string_length, respectively. These are parameters you may need to experiment with and fine-tune to ensure you are still providing enough context to the LLM to get the expected outcome.
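For example, here is a starting point you might tune from (the values are illustrative, not recommendations):

```python
# Trim the sample data included in the table metadata.
db = SQLDatabase.from_uri(
    "mysql+pymysql://user:password@localhost:3306/mydb",
    sample_rows_in_table_info=2,  # default is 3; 0 omits sample rows entirely
    max_string_length=100,        # truncate long string values (default is 300)
)
```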

Options: Create views and denormalize

If you have a large amount of normalized data that is causing table sprawl, or if your tables contain several metadata columns that aren’t particularly useful, it might be worth creating a view or materialized view. By creating your own view, you can consolidate tables and pare the columns down to only what is useful, limiting the amount of data that needs to be passed to the LLM. As you can see in the example below, a view name is handled exactly like a table name in the include_tables list when you set view_support to True.
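A sketch with a hypothetical consolidated view name:

```python
# Treat views like tables by enabling view_support; customer_orders_view is
# a hypothetical view you would create yourself in MySQL.
db = SQLDatabase.from_uri(
    "mysql+pymysql://user:password@localhost:3306/mydb",
    include_tables=["customer_orders_view"],
    view_support=True,
)
```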

One final option for reducing token consumption is to manually include the table metadata in your prompt for the LLM. In this scenario, you control exactly which tables, columns, and sample rows are shared, ensuring it is all relevant. One downside is that this option does not allow you to use the SQLDatabaseChain provided by LangChain, since you are performing one of its steps manually, as sketched below.
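A rough sketch of that manual approach; the schema snippet and question are hypothetical, and you would execute the generated SQL yourself:

```python
# Hand-rolled metadata: you decide exactly what the LLM sees, but you also
# take over the metadata and query-execution steps yourself.
schema = """CREATE TABLE orders (
    id INT PRIMARY KEY,
    tenant_id INT,
    total DECIMAL(10, 2),
    created_at DATETIME
);
/* Sample rows: (1, 42, 19.99, '2024-05-01'), (2, 7, 250.00, '2024-05-02') */"""

manual_prompt = (
    f"Given the following MySQL tables:\n{schema}\n\n"
    "Write a SQL query that answers: What was last month's total order value?"
)
```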

Step 2 - Create your prompt

The prompt allows you to provide your natural language question as well as any additional instructions or clarifications about your database. In the example below, see how an explanation is used to tell the LLM how multi-tenancy is structured in the MySQL database. We define input variables to keep the description flexible and to pass in the input question.
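A sketch of such a prompt follows. It mirrors the Question/SQLQuery/SQLResult format that SQLDatabaseChain's default prompt uses; the tenant_id rule is a hypothetical example of schema-specific guidance, while {input}, {table_info}, and {dialect} are filled in by the chain at run time.

```python
from langchain.prompts import PromptTemplate

# tenant_id is our own illustrative variable; input, table_info, and dialect
# are populated by SQLDatabaseChain at run time.
TEMPLATE = """Given an input question, first create a syntactically correct {dialect}
query to run, then look at the results of the query and return the answer.

Every table has a tenant_id column; always filter on tenant_id = {tenant_id}.

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}"""

prompt = PromptTemplate(
    input_variables=["input", "table_info", "dialect", "tenant_id"],
    template=TEMPLATE,
).partial(tenant_id="42")  # hypothetical tenant for this example
```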

Step 3 - Execute your chain

Now that you have a database object from which the chain can extract metadata and a prompt that includes your question, it is time to run the chain. The last piece to set up is the LLM itself, which we will do as part of this step.

SQLDatabaseChain accepts the options verbose=True and return_intermediate_steps=True if you want to see exactly how it executes, as shown in the sketch below.
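Here is a minimal sketch wiring the pieces together; the model name and question are illustrative choices, and the OpenAI API key is read from the OPENAI_API_KEY environment variable:

```python
from langchain_openai import ChatOpenAI
from langchain_experimental.sql import SQLDatabaseChain

llm = ChatOpenAI(model="gpt-4", temperature=0)

chain = SQLDatabaseChain.from_llm(
    llm,
    db,
    prompt=prompt,
    verbose=True,                    # print each step as it runs
    return_intermediate_steps=True,  # include the steps in the result object
)

result = chain("How many orders were placed last month?")
print(result["result"])
```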

This full process will do the following:

  1. Get metadata about your database
  2. Using your prompt template, create a natural language prompt and append the database metadata
  3. Receive a SQL query from the LLM
  4. Execute the SQL query
  5. Pass the query results back to the LLM along with the original question and ask it to answer based on those results
  6. Return a natural language response to the question

By default, the answer is returned in the output variable. However, if you set return_intermediate_steps=True, the chain returns an object that contains both the plain-text output and all of the intermediate steps.
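Continuing the sketch above:

```python
# With return_intermediate_steps=True, the chain returns a dict, not a string.
print(result["result"])              # the plain-text answer
print(result["intermediate_steps"])  # generated SQL, raw query results, etc.
```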

Conclusion

Is your company trying to figure out where to go with generative AI? Consider finding a partner who can help you get there. At Caylent, we have a full suite of generative AI offerings. Starting with our Generative AI Strategy Catalyst, we can start the ideation process and guide you through the art of the possible for your business. Using these new ideas, we can implement our Generative AI Knowledge Base Catalyst to build a quick, out-of-the-box solution integrated with your company's data to enable powerful search capabilities using natural language queries. Finally, Caylent’s Generative AI Flight Plan Catalyst will help you build an AI roadmap for your company and demonstrate how generative AI will play a part. As part of these Catalysts, our teams will help you understand your custom roadmap for generative AI and how Caylent can help lead the way.



Clayton Davis

Clayton Davis is the Director of the Cloud Native Applications practice at Caylent. His passion is partnering with potential clients and helping them realize how cloud-native technologies can help their businesses deliver more value to their customers. His background spans the landscape of AWS and IT, having spent most of the last decade consulting clients across a plethora of industries. His technical background includes application development, large scale migrations, DevOps, networking and technical product management. Clayton currently lives in Milwaukee, WI and as such enjoys craft beer, cheese, and sausage.

