2025 GenAI Whitepaper

Using SQLDatabaseChain to implement NLP for Database Queries

Generative AI & LLMOps
Databases

Learn how to implement Natural Language Processing (NLP) using LangChain, SQLDatabaseChain, and the OpenAI API through our step-by-step guide.

Let’s look at how we can implement NLP using LangChain and the OpenAI API. If you are not yet familiar with GenAI and some of its concepts, please see our earlier blog post, an Overview of GenAI Keywords and Technologies.

Natural language queries make for better BI

A database provides structure, and structured data is easier to store, query, and optimize. The entire business intelligence (BI) industry exists to help non-technical users unlock meaningful insights from these structured data stores.

But what if there was an even easier way? What if instead of submitting a new report request into a queue, you could simply ask your natural language question and directly interact with your data? This is exactly the functionality that Natural Language Processing (NLP) paired with an LLM delivers.

LangChain framework

Before diving into an in-depth example, some basic knowledge about LangChain is required. LangChain is built around the concept of “chains”.

A chain is simply a way to link together different steps to get to an outcome. To help explore this, we will use the SQLDatabaseChain for our example. 

SQLDatabaseChain basics

SQLDatabaseChain is a specialized chain in LangChain that creates a bridge between large language models and SQL databases. It's designed to handle the entire process of converting natural language queries into SQL commands, executing them, and returning human-readable results.

This chain takes three inputs (a database, an LLM, and a prompt) and performs several steps behind the scenes to produce a plain-text answer to the prompt.

SQLDatabaseChain's architecture is built on three fundamental inputs that work together to process and execute database queries:

  1. Database Connection: Provides access to the SQL database, including its schema, tables, and relationships. This component handles the actual data storage and retrieval.
  2. Language Model (LLM): Serves as the intelligence layer that understands natural language and generates SQL queries. The LLM is responsible for both translating user questions into SQL and converting query results back into natural language.
  3. Prompt Template: Structures how questions are formatted and presented to the LLM, including necessary context about the database schema and any specific instructions or constraints.

Behind the scenes, SQLDatabaseChain follows several key steps to produce a plain text answer:

  1. Schema Inspection: The chain first examines the database schema to understand available tables, columns, and relationships.
  2. Context Formation: It combines the user's question with relevant schema information and any additional context provided in the prompt template.
  3. Query Generation: The LLM processes this context to generate appropriate SQL code.
  4. Query Sanitization: The generated SQL is validated and checked for safety to prevent harmful operations.
  5. Execution: The validated query is run against the database to retrieve results.
  6. Result Processing: Raw query results are formatted and structured for readability.
  7. Natural Language Generation: The LLM transforms the processed results into a coherent, human-readable response that directly answers the original question.

These components and steps work in concert to provide a seamless experience where users can interact with databases using natural language while maintaining security and accuracy.
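The steps above can be made concrete with a toy, dependency-free sketch. Everything here is a stand-in: the fake_llm function plays the role of a real LLM call, and the in-memory sales table plays the role of a production database.

```python
import sqlite3

def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned responses."""
    if prompt.endswith("SQL:"):
        return "SELECT COUNT(*) FROM sales"
    return "There were 2 sales."

# A throwaway in-memory database playing the role of the SQL store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.99), (2, 5.00)])

# Steps 1-2: schema inspection and context formation.
schema = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'sales'"
).fetchone()[0]
context = f"{schema}\nQuestion: How many sales occurred?\nSQL:"

# Step 3: query generation (stubbed LLM).
sql = fake_llm(context)

# Steps 4-5: (sanitization elided) execute the generated query.
rows = conn.execute(sql).fetchall()

# Steps 6-7: result processing and natural language generation (stubbed).
answer = fake_llm(f"Question: How many sales occurred?\nSQLResult: {rows}\nAnswer:")
print(answer)  # There were 2 sales.
```

The real chain performs the same loop, except the two fake_llm calls are genuine LLM invocations and the schema string includes sample rows.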

How SQLDatabaseChain works with natural language

First, a user submits a natural language question (for example, "How many sales occurred last month?"). This question is then processed by the chain and formatted according to the prompt template, which includes relevant context about the database schema and tables.

SQLDatabaseChain then leverages the LLM to analyze the formatted question and generate appropriate SQL code. During this translation phase, the LLM considers the database schema, relationships between tables, and the specific requirements of the question to create a valid SQL query.

Once the SQL query is generated, the chain executes it against the connected database. The execution phase involves running the query, handling any potential errors, and collecting the raw results from the database.

Finally, the raw SQL results are passed back through the LLM, which transforms them into natural language responses that are easy for users to understand. This might include formatting the data, providing explanations, or highlighting key insights from the query results, making the information accessible to users who may not be familiar with SQL or database structures.

Let’s dive in and unpack some of the magic in this example.

Step 1 - Get database metadata in SQLDatabaseChain

We first need to get metadata about our database in order to pass it into the LLM. Without this metadata, the LLM will not know which tables to query or columns to fetch. The metadata typically comes in the form of CREATE TABLE statements and a couple of rows of example data from that table.

Creating the database object that will get passed into the chain is shown in the example below. We are also using pymysql for the MySQL database connection.

If you have a particularly large database, you may run up against the token limits of your LLM, or you may simply want to reduce the number of tokens you are consuming. Let’s review a few ways to do this.

Options: Limit tables

You can limit which tables to include metadata for by populating a list for either include_tables or ignore_tables. Limiting the tables for metadata also limits the tables available for querying, which has both pros and cons: if the LLM doesn’t know about a table, it can’t query it, but that can also be a deliberate way to keep sensitive data away from users.

Options: Limit sample rows and strings

You can also limit the number of sample rows and the length of string values using sample_rows_in_table_info and max_string_length, respectively. These are parameters you may need to experiment with and fine-tune to ensure you are still providing enough context to the LLM to get the expected outcome.

Options: Create views and denormalize

If you have a large amount of normalized data that is causing table sprawl, or if you have several metadata columns in your table that aren’t particularly useful, it might be worth creating a view or materialized view. By creating your own view you can consolidate tables and pare down the columns to only what is useful. This helps to limit the amount of data that needs to be passed to the LLM. And as you can see in the example below, a view name is handled exactly like a table name in the include_tables list when you set view_support to True.

One final option for reducing token consumption would be to manually include the table metadata in your prompt for the LLM. In this scenario, you can control exactly which tables, columns, and sample rows are shared, thereby making sure it is all relevant. One downside is that this option would not allow you to use the SQLDatabaseChain provided by LangChain as you are doing one of the steps manually.

Step 2 - Create your prompt

The prompt allows you to provide your natural language question as well as any additional instructions or clarifications about your database. In the example below, see how an explanation is used to tell the LLM how multi-tenancy is structured in the MySQL database. We use input variables to keep the description flexible and to include the input question.

Step 3 - Execute your chain

Now that you have a database object from which the chain can extract metadata and your prompt that includes your question, it is time to run it. The last piece we need to set up is to point to an LLM, which we will do as part of this step as well.

SQLDatabaseChain includes the options verbose=True and return_intermediate_steps=True if you want to see exactly how the chain executes.

This full process will do the following:

  1. Get metadata about your database
  2. Using your prompt template, create a natural language prompt and append the database metadata
  3. Receive a SQL query from the LLM
  4. Execute the SQL query
  5. Pass the SQL query response back to the LLM with the question and ask it to answer the question based on the database return
  6. Return a natural language response to the question

By default the answer is returned in the output variable. However, if you set return_intermediate_steps=True, the output is an object that contains both the plain-text answer and all of the intermediate steps.

Caylent’s approach to generative AI products

At Caylent, we have a full suite of generative AI offerings. Starting with our Generative AI Strategy Catalyst, we can start the ideation process and guide you through the art of the possible for your business. 

Using these new ideas we can implement our Generative AI Knowledge Base Catalyst to build a quick, out-of-the-box solution integrated with your company's data to enable powerful search capabilities using natural language queries. 

Finally, Caylent’s Generative AI Flight Plan Catalyst, will help you build an AI roadmap for your company and demonstrate how generative AI will play a part. As part of these Catalysts, our teams will help you understand your custom roadmap for generative AI and how Caylent can help lead the way.

Clayton Davis

Clayton Davis is the Director of the Cloud Native Applications practice at Caylent. His passion is partnering with potential clients and helping them realize how cloud-native technologies can help their businesses deliver more value to their customers. His background spans the landscape of AWS and IT, having spent most of the last decade consulting clients across a plethora of industries. His technical background includes application development, large scale migrations, DevOps, networking and technical product management. Clayton currently lives in Milwaukee, WI and as such enjoys craft beer, cheese, and sausage.

