By Tobi Beck

How to Simplify SQL with Text-to-SQL Technology

In today's data-driven world, leveraging insights from analytics of internal and external data sources is crucial to achieving business outcomes. To the data-savvy, this means writing SQL queries. However, writing SQL code can be a daunting, complex, and time-consuming task, often requiring significant technical expertise. Enter Text-to-SQL technology, a transformative approach that leverages Natural Language Processing (NLP) and Generative AI (Gen AI) to convert natural language questions into SQL code, making data accessible to non-technical users. In this blog post, we'll explore how Text-to-SQL works, how Gen AI enhances its capabilities, its benefits and challenges, and how a data fabric can amplify the power of this technology.


Understanding Text-to-SQL


Text-to-SQL technology bridges the gap between natural language and SQL code, allowing users to generate SQL queries by simply asking questions in their everyday language. The process can be broken down into three primary components:

  1. User Input: The user provides a query in natural language, such as "What are the revenues from product X over the last year?" This input can be in English or any other language, making it highly accessible.

  2. NLP Processing: Natural Language Processing (NLP) techniques parse the natural language query to understand the user's intent and context. This may involve several steps, including:

    1. Tokenization: The query is broken down into meaningful parts, or tokens, which makes it easier to search for relevant data.

    2. Named Entity Recognition: Identifies relevant elements such as dates, product names, and other key entities that need to be included in the SQL query.

  3. Query Generation: A large language model (LLM) that has been trained on a corpus of internal data, often the data warehouse, uses the structured information from the NLP processing to construct a corresponding SQL query. The system ensures that the generated query is syntactically correct and ready to execute.
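
To make these three stages concrete, here is a minimal sketch in Python. The schema, the entity list, and the `complete()` LLM wrapper are all illustrative assumptions, not a real implementation:

```python
import re

# Assumed example table; in practice this comes from your warehouse catalog.
SCHEMA = "sales(product_name TEXT, revenue REAL, order_date DATE)"

def complete(prompt: str) -> str:
    # Hypothetical LLM call; swap in your provider's SDK here.
    raise NotImplementedError

def tokenize(question: str) -> list[str]:
    # NLP step (a): break the question into lowercase word tokens.
    return re.findall(r"[a-z0-9]+", question.lower())

def extract_entities(tokens: list[str]) -> dict[str, list[str]]:
    # NLP step (b): a crude stand-in for named entity recognition;
    # a real system would use an NER model for dates, products, etc.
    time_words = {"year", "month", "quarter", "last"}
    return {"time": [t for t in tokens if t in time_words]}

def generate_sql(question: str) -> str:
    # Query generation: hand schema, tokens, and entities to the LLM.
    tokens = tokenize(question)
    entities = extract_entities(tokens)
    prompt = (
        f"Schema: {SCHEMA}\n"
        f"Entities: {entities}\n"
        f"Question: {question}\n"
        "Return one syntactically correct SQL query and nothing else."
    )
    return complete(prompt)

# generate_sql("What are the revenues from product X over the last year?")
# would return something like:
#   SELECT SUM(revenue) FROM sales
#   WHERE product_name = 'X' AND order_date >= DATE('now', '-1 year');
```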

While this process sounds relatively straightforward in theory, implementing it in practice is far more challenging. Broadly speaking, there are two levels of complexity in Text-to-SQL applications:


  1. Text-to-SQL on Pre-Curated Datasets: This is the easier approach, as the datasets are already curated and structured. The system only needs to generate SQL queries against predefined datasets, reducing complexity and improving accuracy. This approach is easier because:

    1. The data is already defined. The user does not need to discover or determine where to pull the raw data from to construct the curated dataset.

    2. The data is already curated. The user does not need to do complex data prep and transformation, as the data is already in a "pretty," easy-to-understand format (e.g., column names that make sense in a business context rather than technical column names, or business semantic logic that has already been applied so each column can simply be selected and plotted).

    3. As a result, the LLM simply needs to match the words in the question to the closest column names and generate a relatively simple SELECT statement, perhaps with a GROUP BY clause and some filters, against what is essentially one table (or even a few pre-curated tables). A minimal sketch of this kind of matching follows this list.

  2. Building Datasets from Raw Tables: This approach is more challenging because it involves generating SQL queries from scratch using raw, un-curated tables. The system must understand the underlying data structure and relationships between tables, making it a complex task.
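
To illustrate why the curated case is so much easier, here is a toy sketch of the token-to-column matching described above. The table and column names are illustrative assumptions, and a real system would let the LLM do this matching far more robustly:

```python
import re

# Illustrative curated table with business-friendly column names.
CURATED_TABLE = "curated_sales"
CURATED_COLUMNS = ["product_name", "revenue", "region", "order_year"]

def match_columns(question: str) -> list[str]:
    # A column matches if any part of its name appears in the question.
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    return [c for c in CURATED_COLUMNS if tokens & set(c.split("_"))]

def curated_sql(question: str) -> str:
    cols = match_columns(question)
    if not cols:
        return f"SELECT * FROM {CURATED_TABLE};"  # nothing matched; fall back
    # Treat revenue as the measure to aggregate, everything else as dimensions.
    measure = "revenue" if "revenue" in cols else cols[-1]
    dims = [c for c in cols if c != measure]
    select = ", ".join(dims + [f"SUM({measure}) AS total_{measure}"])
    group = f" GROUP BY {', '.join(dims)}" if dims else ""
    return f"SELECT {select} FROM {CURATED_TABLE}{group};"

print(curated_sql("Show revenue by region"))
# SELECT region, SUM(revenue) AS total_revenue FROM curated_sales GROUP BY region;
```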

It is important to understand the difference between these two. While most Text-to-SQL tools deliver accurate and performant results on curated datasets, the business reality of most organizations is that not all data is stored in the same location, much less in pre-curated datasets. There are many memes about this, and it is an inside joke among data professionals. Here's one of my favorites, which all of us have encountered if we've been on the receiving end of a request from the business.


[Image: A Twitter post joking about the expectations vs. reality of data analysis in corporations]

This fact makes adoption of Text-to-SQL much harder in practice, because discovering and selecting the raw data to assemble into that "ideal, clean and pristine" table is a very time-intensive and complex task.


Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is crucial in enabling effective Text-to-SQL functionality. RAG works by integrating a retrieval mechanism with the generation process to enhance the accuracy and relevance of the generated SQL queries. Here's how that works in the context of Text-to-SQL:

[Image: An outline of the Retrieval Augmented Generation (RAG) system as it can be used for Text-to-SQL generation]

  1. Prompt: The user inputs a prompt in the form of natural language, without needing to know the technical names of the data in question.

  2. Query: The RAG system searches databases and other data sources for relevant information.

  3. Relevant Information for Enhanced Context: The RAG system retrieves relevant information for additional context. Vector databases play a critical role here by enabling efficient similarity searches to find the most contextually relevant information. 

  4. Prompt, Query, and Enhanced Context: The system enriches the retrieved information with additional context and metadata.

  5. Generate Text Response: The enhanced context and query are passed to the LLM to generate an accurate SQL query.

  6. SQL Output: The final SQL query is ready to execute, providing the user with the desired data.

RAG enhances the Text-to-SQL process by ensuring that the generated queries are not only syntactically correct but also contextually relevant and accurate. This framework is key to making Text-to-SQL a practical and reliable tool for data access and analysis. More sophisticated frameworks add additional layers to challenge, test, and verify the SQL outputs, but they essentially follow the same logic.
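
As a rough illustration of the retrieval step, the sketch below ranks table descriptions by vector similarity to the question and uses only the best matches to build the prompt. The `embed()` function is a hashed bag-of-words stand-in for a real embedding model, and the table descriptions are assumed examples:

```python
import numpy as np

# Assumed table descriptions; in practice these come from the catalog layer.
TABLE_DOCS = {
    "sales": "orders, revenue, product names, order dates",
    "patients": "patient visits, test results, physicians",
    "inventory": "stock levels, warehouses, suppliers",
}

def embed(text: str) -> np.ndarray:
    # Hashed bag-of-words stand-in for a real embedding model; it only
    # exists to make this sketch self-contained and runnable.
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    return v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_context(question: str, k: int = 2) -> list[str]:
    # Steps 2-3: rank table descriptions by similarity to the question,
    # mimicking what a vector database does at scale.
    q = embed(question)
    ranked = sorted(TABLE_DOCS, key=lambda t: cosine(q, embed(TABLE_DOCS[t])), reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    # Step 4: enrich the prompt with only the most relevant schema context.
    context = "\n".join(f"- {t}: {TABLE_DOCS[t]}" for t in retrieve_context(question))
    return f"Relevant tables:\n{context}\nQuestion: {question}\nWrite the SQL."
```

In production, a vector database performs this ranking at scale over millions of metadata entries rather than a hand-written dictionary.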


The Pros and Cons of Automated SQL Generation

Intuitively, automated SQL generation offers a number of key advantages to businesses and users:


  • It’s efficient: Significantly speeds up the query creation process, freeing up time for data analysts to focus on more critical tasks.

  • It’s iterative: Users can continue to modify their prompts after seeing results in real time until they get the output they desire.

  • It’s accessible: Makes data queries accessible to non-technical users, empowering them to make data-driven decisions without needing to write SQL code.

However, while the benefits are clear, there are also some challenges that need to be addressed in order for Text-to-SQL to work at scale:


  • Making results relevant and accurate: Ensuring the AI accurately interprets the context of a user query is crucial. Generated code must be verified and potentially debugged. But most importantly, it needs to be executable. LLMs have a tendency to hallucinate, which can result in improper SQL syntax or inaccurate table selection, especially when there are many tables and/or columns with similar names.

  • Checking for and handling errors: Automated code needs to be checked for errors and adjusted as needed. This brings up a need for transparency: does the LLM provide the details on how and why it generated the output, and does it do so in a way that is both transparent and editable, so that users can trust the output? A minimal validation sketch follows this list.

  • Adapting to changing data structures: Data structures and schemas often change, requiring adaptable query generation. Similarly, data is dispersed across multiple data sources.
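
One common way to address the first two challenges is to dry-run every generated query before showing it to the user. The sketch below, using only Python's standard library, compiles the query with SQLite's EXPLAIN against an empty copy of the schema, so syntax errors and hallucinated table or column names surface immediately; the schema DDL is an assumed example:

```python
import sqlite3

# Assumed example schema; mirror your real schema here, without any data.
SCHEMA_DDL = "CREATE TABLE sales (product_name TEXT, revenue REAL, order_date DATE);"

def validate_sql(sql: str) -> tuple[bool, str]:
    # EXPLAIN compiles the query without executing it, so syntax errors and
    # references to nonexistent tables or columns surface immediately.
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA_DDL)
        conn.execute(f"EXPLAIN {sql}")
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)  # feed this back to the LLM for a retry
    finally:
        conn.close()

print(validate_sql("SELECT SUM(revenue) FROM sales"))  # (True, 'ok')
print(validate_sql("SELECT total FROM salez"))         # (False, 'no such table: salez')
```

The error message can then be fed back to the LLM for an automatic retry, which is exactly the kind of challenge-test-verify layer mentioned in the RAG section above.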

Data Fabric: Enhancing Text-to-SQL

A data fabric is a solution that provides a unified and consistent view of all the relevant data in an enterprise coupled with integrated capabilities enabling seamless data integration, management, and access. By creating a unified abstracted data layer, a data fabric simplifies and enhances data operations, making it easier to work with data from multiple sources in a cohesive manner.


[Image: Promethium Text-to-SQL demonstration]


As we established earlier, most organizations do not have all of their data in a single warehouse. And even when they do, a data warehouse is missing the necessary semantics to deliver accurate results. That is where a data fabric architecture has significant benefits for Text-to-SQL:

  • A single API to offload prompt engineering: A data fabric provides a single API for accessing diverse data sources, eliminating the need to manage multiple connections and interfaces. This simplifies data integration and ensures that data can be accessed and queried seamlessly, regardless of where it is stored. There's a great report from Gartner on why this crucial data fabric capability can introduce the much-needed enterprise context required for prompt engineering to retrieve accurate results.

[Image: A Gartner diagram showing how a data fabric offloads prompt engineering and introduces enterprise context for AI]

  • Higher relevance and better context: Utilizing active and passive metadata alike, a data fabric provides both the necessary context and the accurate data selection and assembly required to generate accurate SQL queries. Organizations have learned to use passive metadata such as taxonomies, tags, data quality, and lineage to help humans with data discovery. Active metadata, by contrast, includes information about semantics, data usage, data quality, lineage, and user ratings, which helps the system understand the data better and generate more relevant queries. Again, there's great research from Gartner on why using active metadata in a data fabric is a perfect complement to LLMs.

  • AI-powered workflow: A data fabric enables AI-powered workflows, automating the entire process of query generation. From parsing natural language queries to executing SQL code, the workflow is streamlined, reducing manual intervention and errors. This means that the entire process is not only accurate and contextually relevant but also complete. Rather than only having the SQL generation automated by the LLM, a data fabric also automates the steps before it, namely accurate table selection and transformation, as well as the output creation and validation afterward, which not only saves time but creates much-needed transparency and agility.

Because the data fabric offers seamless access to diverse data sources through a unified layer, enriches data with contextual information and metadata, and facilitates the generation of accurate, actionable insights through integrated data and advanced analytics, an underlying data fabric architecture has key advantages:

  • It improves data retrieval: A data fabric enables faster and more efficient access to diverse data sources through a unified data layer. This ensures that the Text-to-SQL system can quickly find and retrieve the relevant data needed to generate accurate queries.

  • It enhances data quality: By enriching data with contextual information and metadata, a data fabric improves the quality of the data being queried. This leads to more accurate and meaningful SQL queries, providing better insights.

  • It accurately generates insights: The combination of a data fabric's robust data integration capabilities and Text-to-SQL's query generation ensures that the insights generated are accurate and actionable. This is achieved through the integrated data and advanced analytics provided by the data fabric.

  • It provides relevant results: The data fabric layer stitches together all parts of the ecosystem. It provides a semantic layer that understands relationships between data sources and data. This active metadata fuels the recommendation engine of the Text-to-SQL generation and enables the system to automatically find and integrate data across various source systems.

  • It’s scalable and flexible: A data fabric's scalable architecture adapts to changing data environments and business needs, ensuring long-term viability. This flexibility is crucial for maintaining accurate and relevant query generation as data sources and structures evolve.

  • It’s secure and ensures privacy: Instead of either having to load all data to a single warehouse or having to send data to a public LLM, the data fabric centrally enforces governance and access rights and ensures that no actual data is shared with LLMs.

Without a data fabric, there is a real risk of "LLM silos," where each data source has its own LLM, leading to isolated knowledge and non-real-time data access. We are already seeing this today. Every data platform, visualization tool, and data catalog has its own integrated LLM. This creates two challenges: having to use multiple, different prompts, and making sure that the data retrieved from those multiple prompts is actually the same and consistent across tools (more on this topic in an upcoming blog post). We already have multiple sources of data. Do we really need to use multiple LLMs? Doesn't that defeat the purpose of simplicity? At Promethium, we believe that a unified approach ensures performance and relevance, addressing both data privacy challenges and the inefficiencies of traditional solutions. Have one LLM (of your choice), but have that LLM work not just across all your data but across all your tools to provide integrated context.


Putting Text-to-SQL to Work

In practice, the synergistic relationship between data fabric and Text-to-SQL boils down to the following components:

  1. Retrieval Augmented Generation (RAG): This framework is key to enabling Text-to-SQL. The RAG system searches for relevant information across databases, enriches it with context, and uses it to generate accurate SQL queries.

  2. Real-Time Data Access: Provides instant access to data, allowing for timely and relevant query responses.

  3. Metadata Utilization: Leverages Active Metadata to enhance the accuracy and relevance of SQL code, ensuring that queries are both syntactically correct and contextually appropriate. Activating available metadata is the key ingredient to providing the relevant context to make Text-to-SQL work across multiple databases.

[Image: A flow chart breaking down the Text-to-SQL workflow]

Let us take a look under the hood at how Promethium makes Text-to-SQL a reality. It is a process that leverages all the steps we have discussed in this post to generate an accurate and relevant SQL statement (a simplified sketch of steps 2 through 4 follows the list):

  1. User Input: The user starts by asking a question or providing a query in natural language.

  2. Words Parsed into Tokens: The NLP engine breaks down the input into tokens, which are meaningful parts of the query.

  3. Token Search: These tokens are then searched against an index to identify relevant data elements.

  4. AI Determination: AI algorithms determine the best available choices of tokens that match the user’s intent.

  5. Query Construction: The tokens and user query are passed to an LLM, which constructs the corresponding SQL query.

  6. Query Execution: The generated SQL query is executed against the relevant data sources to provide the desired results.
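
Here is a simplified sketch of steps 2 through 4, matching question tokens against a metadata index to determine the best candidate tables. The index contents are assumed examples; Promethium's actual index and ranking are, of course, far richer:

```python
import re
from collections import defaultdict

# Assumed metadata: table -> descriptive words from its columns and tags.
METADATA = {
    "sales": ["revenue", "product", "order", "date"],
    "patients": ["patient", "visit", "test", "result"],
}

def build_index(metadata: dict) -> dict:
    # Invert the metadata into a token -> tables index (step 3's search target).
    index = defaultdict(set)
    for table, words in metadata.items():
        for word in words:
            index[word].add(table)
    return index

def best_tables(question: str, index: dict) -> list[str]:
    # Steps 2-4: tokenize the question, look each token up in the index,
    # and rank tables by how many tokens they match.
    tokens = re.findall(r"[a-z]+", question.lower())
    scores = defaultdict(int)
    for token in tokens:
        for table in index.get(token, ()):
            scores[table] += 1
    return sorted(scores, key=scores.get, reverse=True)

index = build_index(METADATA)
print(best_tables("What is the revenue per product?", index))  # ['sales']
```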

Use Case: Creating Patient Insights at a Fortune 500 Blood Testing Company

Promethium's Text-to-SQL technology has been successfully implemented at a Fortune 500 blood testing company, revolutionizing their data management and accessibility. By integrating Promethium's data fabric, the company gained comprehensive visibility across all their data sources. This seamless connection to diverse datasets enabled the company to build and query datasets on demand, whether they were raw or pre-curated.


The data fabric's integration allowed anyone with access to easily query and consume data through an intuitive chat interface. This capability significantly improved the company's ability to respond quickly to patient, physician, and customer needs. The prompt and efficient access to critical data not only enhanced operational efficiency but also ensured higher quality service and patient care. The result was a faster, more responsive organization capable of making informed decisions in real time.


Conclusion

Text-to-SQL technology has the potential to revolutionize the way we interact with data, making it more accessible and efficient. By combining the power of Gen AI with a data fabric architecture, you get a powerful solution that simplifies SQL query generation.

Compared to other approaches, the combination is able to truly democratize data access with a solution that delivers accurate and relevant results, regardless of whether you know where your data is located.


If you are curious to learn more about Text-to-SQL and see how Promethium works in practice, watch the full webinar by our founder Kaycee Lai here: Using GenAI to Generate SQL: What You Really Need to Know


If you want to learn more about how Promethium can help your organization democratize data access and become more data-driven, contact us now to schedule a demo: https://get.promethium.ai/contact-sales 



