
November 20, 2025

Evaluating Text-to-SQL Systems: Why Syntax Is Just the Beginning

Text-to-SQL benchmarks measure syntax correctness, but enterprise deployments reveal a bigger challenge: generating relevant, accurate answers to ambiguous questions on heterogeneous data. Discover why context matters more than correct SQL.

Text-to-SQL systems promise to democratize data access—allowing anyone to ask questions in plain English and get accurate answers from their databases. For teams evaluating these systems, benchmark leaderboards and accuracy metrics seem to tell a clear story: models now achieve 85-90% accuracy on standard tests, with some approaching human performance.

But here’s the reality check: deploy these same systems in an enterprise environment, and accuracy often plummets to 6-16% on real-world tasks. The gap isn’t a modeling problem — it’s an architecture problem. Generating syntactically correct SQL is just the first hurdle. The bigger challenge is generating queries that answer ambiguous questions accurately across heterogeneous data sources with incomplete context.


Want to understand where text-to-SQL implementations stand at the end of 2025? Read our comprehensive trend report.


 

The Text-to-SQL Benchmark Landscape

Before understanding why enterprise deployments struggle, let’s examine how the industry currently measures text-to-SQL performance.

Spider: The Academic Gold Standard

Spider remains the most influential benchmark since its 2018 introduction. The dataset contains 10,181 questions and 5,693 unique SQL queries spanning 200 databases across 138 different domains. What distinguishes Spider from earlier benchmarks is its cross-domain evaluation: training and test sets contain completely different databases, forcing models to genuinely generalize rather than memorize patterns.

The benchmark emphasizes real-world complexity through multi-table databases requiring joins, nested queries, aggregations, and set operations. Questions range from straightforward lookups to intricate queries involving multiple tables with foreign key relationships. This complexity stratification helps identify where models struggle — basic schema linking versus advanced compositional reasoning.

Leading systems on Spider now achieve impressive results, with top entries reporting execution accuracy in the 85-90% range.

These figures suggest the problem is nearly solved. But Spider 2.0 reveals the real challenge.

Spider 2.0: The Reality Check

Released in 2024, Spider 2.0 represents a quantum leap in difficulty. Rather than isolated queries, it tests 632 real-world text-to-SQL workflows derived from enterprise use cases. The benchmark incorporates production environments—Snowflake, BigQuery, DuckDB—with schemas exceeding 3,000 columns.

Performance drops dramatically: even the strongest models successfully complete only a small fraction of these workflows.

The dramatic gap between Spider (86%) and Spider 2.0 (6%) reveals what benchmarks miss: complex schemas, enterprise-scale databases, multiple SQL dialects, and multi-step reasoning requirements.

Complementary Benchmarks

Other benchmarks address specific dimensions:

SParC extends Spider with conversational context, requiring models to interpret queries across dialogue turns rather than treating each question independently.

CoSQL adds response generation and intention classification tasks to conversational scenarios.

BIRD focuses on 95 large enterprise databases with dirty data, external knowledge requirements, and efficiency considerations. GPT-4 achieves 52% accuracy compared to 93% for human experts.

BIRD-INTERACT introduces multi-turn interactions with ambiguous queries and execution errors. GPT-5 completes only 8-17% of tasks.

The pattern is clear: as benchmarks incorporate real-world complexity—ambiguous questions, conversational context, dirty data, massive schemas—accuracy collapses.

 

Text-to-SQL Evaluation Metrics: Measuring the Right Thing

The evolution of evaluation metrics reveals growing sophistication in understanding what “correct” means for text-to-SQL systems.

Exact Set Match: Simple But Flawed

Exact Set Match (ESM) decomposes predicted and gold SQL queries into component sets — SELECT clauses, WHERE conditions, GROUP BY elements—then compares these sets for exact matches. While more sophisticated than raw string comparison, ESM suffers from significant false negatives.

Semantically equivalent queries with different syntactic representations incorrectly fail evaluation. For instance, SELECT * FROM table WHERE age > 25 and SELECT * FROM table WHERE 25 < age are logically equivalent but fail exact matching. Queries may use different join orders, equivalent subqueries versus joins, or alternative aggregate formulations while producing identical results.

ESM measures syntax similarity, not semantic correctness.
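
To make the mechanics concrete, here is a minimal, illustrative sketch of the clause-set comparison behind ESM. It uses a toy regex splitter rather than a real SQL parser (official implementations parse queries properly), so treat it as a didactic approximation only.

    import re

    # Clause keywords in the order they can appear in a simple query.
    CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]

    def clause_sets(sql: str) -> dict:
        """Decompose a query into per-clause component sets (a toy stand-in for ESM)."""
        # Normalize whitespace, trailing semicolons, and case so trivial formatting differences vanish.
        normalized = re.sub(r"\s+", " ", sql.strip().rstrip(";")).upper()
        # Locate each clause keyword, then slice the text between consecutive keywords.
        positions = sorted(
            (m.start(), kw) for kw in CLAUSES for m in re.finditer(rf"\b{kw}\b", normalized)
        )
        sets = {}
        for i, (start, kw) in enumerate(positions):
            end = positions[i + 1][0] if i + 1 < len(positions) else len(normalized)
            body = normalized[start + len(kw):end].strip()
            # Comma-separated items become an unordered set, as ESM does for column lists.
            sets[kw] = frozenset(item.strip() for item in body.split(","))
        return sets

    def exact_set_match(pred_sql: str, gold_sql: str) -> bool:
        return clause_sets(pred_sql) == clause_sets(gold_sql)

    # Logically equivalent queries still fail, because the WHERE components differ as text.
    print(exact_set_match("SELECT * FROM t WHERE age > 25",
                          "SELECT * FROM t WHERE 25 < age"))  # False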

Execution Accuracy: Better But Incomplete

Execution accuracy evaluates whether the predicted SQL query produces the same result as the reference query when executed on a test database. This captures semantic equivalence that ESM misses and has become widely adopted.

However, execution accuracy introduces its own limitation: false positives. Semantically different queries may coincidentally produce identical outputs on specific database instances. A query that accidentally returns correct results on test data might fail when database content changes or when executed on different instances.

Research shows execution accuracy can overestimate performance, particularly when test databases have limited diversity or specific value distributions.
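
Below is a minimal sketch of execution-based comparison, using Python's built-in sqlite3 module and an assumed toy table; it also shows how a semantically wrong query can slip through on one particular database instance.

    import sqlite3

    def execution_match(pred_sql: str, gold_sql: str, conn: sqlite3.Connection) -> bool:
        """Compare result multisets of two queries on a single database instance."""
        return sorted(conn.execute(pred_sql).fetchall()) == sorted(conn.execute(gold_sql).fetchall())

    # One illustrative instance (table and values are assumptions, not benchmark data).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO employees VALUES (?, ?)", [("Ada", 30), ("Lin", 22)])

    # The equivalent pair that ESM rejected now passes.
    print(execution_match("SELECT name FROM employees WHERE age > 25",
                          "SELECT name FROM employees WHERE 25 < age", conn))   # True

    # The false-positive risk: a query with different semantics also passes on this
    # particular instance, because only one row happens to satisfy either predicate.
    print(execution_match("SELECT name FROM employees WHERE age >= 30",
                          "SELECT name FROM employees WHERE age > 25", conn))   # True here, wrong in general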

Test Suite Accuracy: The Current Standard

To address execution accuracy’s false positive vulnerability, test suite accuracy was introduced and adopted as the official Spider metric in 2020. Rather than executing queries on a single database instance, test suite accuracy generates a carefully curated collection of diverse database instances and validates queries across all of them.

The test suite is created through systematic fuzzing: generating many random database instances while ensuring high code coverage for the gold query. This makes it far less likely that an incorrect query will coincidentally succeed across all test cases.

Analysis shows traditional execution accuracy correlates with test suite accuracy at approximately 74-90%, depending on query complexity. Harder queries show greater divergence—exactly where accurate evaluation matters most.
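
Continuing the toy example, here is a simplified sketch of the fuzzing idea: the predicted query must match the gold query on every one of many randomized instances, not just one. The schema and value ranges below are assumptions for illustration; Spider's official test suites are distilled far more carefully.

    import random
    import sqlite3

    def random_instance(seed: int) -> sqlite3.Connection:
        """Build one randomized database instance (a stand-in for a distilled test suite member)."""
        rng = random.Random(seed)
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE employees (name TEXT, age INTEGER)")
        conn.executemany("INSERT INTO employees VALUES (?, ?)",
                         [(f"emp_{i}", rng.randint(18, 65)) for i in range(20)])
        return conn

    def test_suite_match(pred_sql: str, gold_sql: str, n_instances: int = 50) -> bool:
        """Pass only if the predicted query matches the gold result on every instance."""
        for seed in range(n_instances):
            conn = random_instance(seed)
            if sorted(conn.execute(pred_sql).fetchall()) != sorted(conn.execute(gold_sql).fetchall()):
                return False
        return True

    # The coincidental single-instance match above almost certainly fails somewhere in the suite...
    print(test_suite_match("SELECT name FROM employees WHERE age >= 30",
                           "SELECT name FROM employees WHERE age > 25"))   # False (with near certainty)
    # ...while genuinely equivalent queries keep passing.
    print(test_suite_match("SELECT name FROM employees WHERE 25 < age",
                           "SELECT name FROM employees WHERE age > 25"))   # True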

Emerging Approaches

Recent research proposes even more sophisticated evaluation:

Enhanced Tree Matching (ETM) compares abstract syntax trees of queries after applying verifiable equivalence transformations, reducing both false positives and false negatives.

Valid Efficiency Score (VES) evaluates both correctness and query performance, recognizing that production systems must balance accuracy with computational cost.
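
For intuition, here is a simplified, efficiency-weighted score in the spirit of VES; the exact BIRD formulation differs in details such as repeated timing runs and outlier handling, so this is only an illustrative sketch (assuming a sqlite3-style connection object).

    import math
    import time

    def efficiency_weighted_score(pairs, conn) -> float:
        """pairs: list of (pred_sql, gold_sql) tuples; conn: a DB-API connection (e.g. sqlite3)."""
        total = 0.0
        for pred_sql, gold_sql in pairs:
            correct = sorted(conn.execute(pred_sql).fetchall()) == sorted(conn.execute(gold_sql).fetchall())
            if not correct:
                continue  # incorrect results contribute zero, no matter how fast they run
            t0 = time.perf_counter(); conn.execute(gold_sql).fetchall(); gold_t = time.perf_counter() - t0
            t0 = time.perf_counter(); conn.execute(pred_sql).fetchall(); pred_t = time.perf_counter() - t0
            # Reward predictions that are no slower than the reference; sqrt dampens timing noise.
            total += math.sqrt(gold_t / max(pred_t, 1e-9))
        return total / len(pairs)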

These metrics represent progress, but they still fundamentally measure the same thing: whether the generated SQL syntax is correct. They don’t measure whether the query answers the question the user actually asked.

 

The Enterprise Reality: Syntax Correctness Isn’t Enough

Here’s the problem with benchmark-driven development: text-to-SQL evaluation metrics assume questions have unambiguous answers and that the challenge is generating syntactically correct SQL to retrieve those answers.

Enterprise reality is fundamentally different.

The Ambiguity Problem

Consider a simple question: “Show me revenue by region for last quarter.”

On a benchmark with a single database schema, this maps to a clear SQL query. In an enterprise environment, this question requires interpretation:

  • Which revenue definition? Gross revenue, net revenue, recognized revenue, billed revenue?
  • Which regions? Sales territories, geographic regions, organizational divisions?
  • Last quarter relative to what? Fiscal quarter or calendar quarter, and when does the quarter start?
  • Include adjustments? Exclude returns? Apply exchange rates?

The question isn’t ambiguous to the person asking—they have implicit context about what these terms mean in their organization. But that context isn’t in the database schema.
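
To see how wide that interpretation gap is, here are two defensible readings of the same question, written against a hypothetical warehouse schema (every table, column, metric, and date boundary below is an assumption, not a real definition):

    # Two defensible interpretations of "revenue by region for last quarter".
    GROSS_BY_SALES_TERRITORY_CALENDAR_Q3 = """
    SELECT t.territory_name AS region, SUM(o.gross_amount) AS revenue
    FROM orders o
    JOIN sales_territories t ON o.territory_id = t.territory_id
    WHERE o.order_date >= DATE '2025-07-01' AND o.order_date < DATE '2025-10-01'
    GROUP BY t.territory_name
    """

    NET_BY_GEOGRAPHY_FISCAL_Q3 = """
    SELECT g.geo_region AS region,
           SUM(o.gross_amount - o.returns_amount - o.discount_amount) AS revenue
    FROM orders o
    JOIN geo_regions g ON o.ship_to_country = g.country_code
    WHERE o.order_date >= DATE '2025-08-01' AND o.order_date < DATE '2025-11-01'  -- fiscal quarter offset
    GROUP BY g.geo_region
    """

    # Both queries are valid SQL and would pass a syntax-oriented evaluation in isolation;
    # only organizational context determines which one answers the question that was asked.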

The Heterogeneous Data Problem

Benchmarks assume data lives in a single, well-designed database. Enterprises have data scattered across:

  • Cloud data warehouses (Snowflake, Databricks, BigQuery)
  • On-premise databases (Oracle, SQL Server, DB2)
  • SaaS applications (Salesforce, Workday, ServiceNow)
  • Data lakes and object storage (S3, ADLS)

Answering “Show me revenue by region” might require joining Salesforce opportunity data with ERP transaction data and mapping both to a shared regional hierarchy defined in a completely separate system.

Generating syntactically correct SQL against any single database doesn’t solve this problem. The system needs to understand which systems contain relevant data, how concepts map across systems, and what business logic connects them.
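
A hypothetical federated query, written in the catalog.schema.table style of a distributed engine such as Trino, gives a sense of what "correct" actually requires here; every identifier below is an illustrative assumption:

    FEDERATED_REVENUE_BY_REGION = """
    SELECT r.region_name,
           SUM(i.invoice_amount) AS recognized_revenue
    FROM salesforce.sales.opportunity o
    JOIN erp.finance.invoice i
      ON i.opportunity_ref = o.opportunity_id          -- cross-system key mapping
    JOIN mdm.reference.region_hierarchy r
      ON r.territory_code = o.territory_code           -- shared hierarchy from a third system
    WHERE o.stage = 'Closed Won'
      AND i.invoice_date >= DATE '2025-07-01' AND i.invoice_date < DATE '2025-10-01'
    GROUP BY r.region_name
    """
    # Writing this correctly requires knowing which system holds which concept and how the
    # keys line up, which is context no single database schema can provide.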

The Definition Problem

Benchmarks provide clean schemas where column names clearly indicate contents. Enterprise databases have columns like amt_1, stat_cd, and eff_dt—requiring tribal knowledge to understand.

Even when column names seem clear, definitions matter. “Customer” might mean:

  • Anyone who has ever purchased (historical definition)
  • Active accounts with purchases in the last 12 months (operational definition)
  • Accounts meeting minimum revenue thresholds (strategic definition)

Each definition is “correct” for different contexts. Generating the right SQL requires knowing which definition applies to the question being asked.
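
A small sketch of what that looks like in practice, with the three definitions expressed as SQL predicates over a hypothetical accounts/orders schema (the names and the revenue threshold are assumptions):

    CUSTOMER_DEFINITIONS = {
        "historical":  "EXISTS (SELECT 1 FROM orders o WHERE o.account_id = a.account_id)",
        "operational": ("EXISTS (SELECT 1 FROM orders o WHERE o.account_id = a.account_id "
                        "AND o.order_date >= CURRENT_DATE - INTERVAL '12' MONTH)"),
        "strategic":   "a.trailing_12m_revenue >= 100000",
    }

    def count_customers_sql(definition: str) -> str:
        """Build the counting query for whichever definition the question actually calls for."""
        return f"SELECT COUNT(*) FROM accounts a WHERE {CUSTOMER_DEFINITIONS[definition]}"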

 

Solving the Context Problem: The Five Levels of Accuracy

The gap between benchmark performance and enterprise accuracy comes down to context. As you add layers of context, accuracy increases—but most text-to-SQL systems stop at Level 1 or 2.

Level 1: Raw Technical Metadata (Schema, Tables, Columns)

This is what benchmarks test: understanding database schemas and generating syntactically correct SQL. Models achieve 70-86% accuracy at this level on controlled datasets.

But in enterprise environments with thousands of tables and unclear naming conventions, even this becomes challenging. Without additional context, models must guess which tables are relevant and what columns mean.

Level 2: Semantic Layer (Metrics, Hierarchies, Shared Definitions)

Semantic layers—from tools like dbt, Looker, or AtScale—provide business-friendly names and calculations on top of raw schemas. “Revenue” becomes a defined metric with explicit logic. “Region” becomes a dimensional hierarchy with clear relationships.

This helps models understand business concepts but still doesn’t solve ambiguity. Which revenue metric? Which regional hierarchy? Semantic layers define multiple versions of the same concepts for different use cases.

Level 3: Catalog & Business Definitions (Glossary, Certified Data, Golden Queries, Ownership)

Data catalogs like Alation, Collibra, or Atlan provide additional layers:

  • Business glossaries defining official organizational terms
  • Data quality indicators showing which tables are trustworthy
  • Golden queries showing how questions have been answered correctly before
  • Ownership information identifying subject matter experts

This context helps models understand organizational conventions and precedents. When a question is similar to a previously answered query, the model can reference that pattern.

Level 4: Cross-System Business Logic (Joins, Rules, Calculations, Policies)

The next layer captures how data from different systems connects:

  • How Salesforce opportunities map to ERP transactions
  • Business rules for calculating metrics across systems
  • Policies for combining data with different grain or definitions
  • Approved methods for reconciling conflicts between sources

This context enables federated queries that correctly combine heterogeneous data according to organizational standards.

Level 5: Tribal Knowledge & Memory (Preferences, Patterns, Reinforcement)

The highest accuracy level incorporates knowledge that exists only in people’s heads:

  • How different departments interpret the same terms differently
  • Implicit assumptions about data quality and usage
  • Patterns in how specific executives prefer to see metrics
  • Corrections and refinements from past interactions

Capturing this through human reinforcement—subject matter experts validating and correcting answers—creates a learning system that gets more accurate over time.
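
One way to picture what "all five levels" means operationally is a single context bundle assembled before any SQL is generated. The sketch below is a generic illustration of that idea, not a description of any particular product's internals.

    from dataclasses import dataclass, field

    @dataclass
    class ContextBundle:
        """Everything a text-to-SQL system would ideally see before writing a query."""
        # Level 1: raw technical metadata
        schemas: dict = field(default_factory=dict)         # {source: {table: [columns]}}
        # Level 2: semantic layer
        metrics: dict = field(default_factory=dict)         # {"net_revenue": "SUM(amount) - SUM(returns)"}
        hierarchies: dict = field(default_factory=dict)     # {"region": ["territory", "country", "region"]}
        # Level 3: catalog and business definitions
        glossary: dict = field(default_factory=dict)        # {"customer": "active account, last 12 months"}
        golden_queries: list = field(default_factory=list)  # previously validated question/SQL pairs
        # Level 4: cross-system business logic
        join_rules: list = field(default_factory=list)      # e.g. "sf.opportunity_id = erp.opportunity_ref"
        # Level 5: tribal knowledge and memory
        corrections: list = field(default_factory=list)     # SME feedback captured from past answers

        def relevant_context(self, question: str) -> dict:
            """Assemble the slices of context relevant to one question (retrieval logic omitted)."""
            return {"question": question, "metrics": self.metrics, "glossary": self.glossary}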

 

Why Most Text-to-SQL Systems Stop at Level 2

Building a text-to-SQL system that excels on benchmarks requires mastering Levels 1 and 2: understanding schemas and semantic layers. This is what most vendors and research teams focus on because it’s what evaluation metrics measure.

But enterprise accuracy requires all five levels. Here’s why most systems don’t get there:

Architectural limitations: Systems built around single databases can’t naturally incorporate context from multiple catalogs, semantic layers, and business systems. They’d need to denormalize all context into their primary database—an impossible maintenance burden.

No context aggregation: Even systems that theoretically connect to multiple sources don’t aggregate and unify context. They query each source separately rather than building a unified understanding across them.

No learning mechanism: Benchmark-trained models generate queries from static knowledge. They don’t learn from corrections, remember successful patterns, or improve through organizational-specific reinforcement.

No collaboration workflow: Production systems need subject matter experts to validate answers and provide corrections. Most text-to-SQL tools are black boxes with no mechanism for human expertise to improve the system.

 

The AI Insights Fabric Approach

Achieving enterprise-grade accuracy requires architectural changes, not just better models. This is why Promethium built the AI Insights Fabric with three integrated layers:

Universal Query Engine provides zero-copy access to all enterprise data sources—cloud warehouses, on-premise databases, SaaS applications. Questions can span Snowflake, Salesforce, Oracle, and S3 without moving data.

360° Context Hub aggregates context from all five levels:

  • Technical metadata from data sources (Level 1)
  • Semantic models from BI tools and dbt (Level 2)
  • Business definitions from data catalogs (Level 3)
  • Cross-system rules and joins (Level 4)
  • Organizational memory and human reinforcement (Level 5)

The context hub doesn’t just collect this information—it unifies it into a single graph that models can query to understand ambiguous questions correctly.

Answer Orchestrator uses aggregated context to generate queries that answer what users mean, not just what they literally asked. When questions are ambiguous, the agent asks clarifying questions. When answers are validated by subject matter experts, that reinforcement improves future responses.

The result: systems that achieve benchmark-level syntax correctness while also delivering enterprise-level semantic accuracy.
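
As a generic illustration of the reinforcement loop described above (not Promethium's implementation), validated answers can be stored and retrieved as grounding examples for similar future questions:

    import difflib

    class AnswerMemory:
        """Generic sketch of reinforcement from validated answers; not any vendor's actual code."""

        def __init__(self):
            self.golden = []   # (question, validated_sql) pairs approved by subject matter experts

        def record_validation(self, question: str, sql: str) -> None:
            """Store an answer once an expert confirms it is correct."""
            self.golden.append((question, sql))

        def similar_examples(self, question: str, k: int = 3):
            """Retrieve the closest previously validated questions to ground a new query."""
            scored = [
                (difflib.SequenceMatcher(None, question.lower(), q.lower()).ratio(), q, sql)
                for q, sql in self.golden
            ]
            return sorted(scored, reverse=True)[:k]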

 

Evaluating Text-to-SQL for Enterprise Deployment

If you’re evaluating text-to-SQL systems for your organization, here’s what to look for beyond benchmark scores:

Test with Organizational Questions

Don’t evaluate on generic questions. Test with real questions your teams ask that require organizational context:

  • Questions using company-specific terminology
  • Questions spanning multiple systems
  • Questions with ambiguous terms that have clear meaning internally
  • Questions that have been answered before through manual analysis

If the system can’t handle these with high accuracy, benchmark scores are irrelevant.

Measure Semantic Accuracy, Not Just Syntax

Don’t just check if the SQL is syntactically valid. Evaluate whether:

  • The query accesses the right data sources
  • The results align with how your organization defines terms
  • The answer would satisfy the person who asked the question
  • Subject matter experts would trust the results

This requires human evaluation, not just automated metrics.
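
One lightweight way to operationalize that human evaluation is to score each answer on execution match and SME approval separately; the record format below is an assumption, and the sketch is purely illustrative.

    def semantic_accuracy(evaluations) -> dict:
        """evaluations: dicts like {"executes": True, "matches_reference": True, "sme_approved": False}."""
        n = len(evaluations)
        return {
            "syntax_valid":    sum(e["executes"] for e in evaluations) / n,
            "execution_match": sum(e["matches_reference"] for e in evaluations) / n,
            # The number that actually matters for deployment decisions:
            "sme_approved":    sum(e["sme_approved"] for e in evaluations) / n,
        }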

Assess Context Integration Capabilities

Ask vendors:

  • How do you integrate with our data catalog and semantic layer?
  • Can you incorporate custom business rules and calculations?
  • How do you learn from corrections and validated answers?
  • Can subject matter experts provide reinforcement to improve accuracy?

Systems that can’t answer these questions will struggle with enterprise complexity.

Plan for Ongoing Improvement

The best text-to-SQL systems get more accurate over time as they learn organizational patterns. Evaluate:

  • How the system captures and applies feedback
  • Whether validated answers become reusable templates
  • If the system identifies when it needs human guidance
  • How tribal knowledge gets formalized and shared

Static systems that don’t improve will require constant manual correction.




 

The Path to Production-Ready Text-to-SQL

The gap between 86% benchmark accuracy and 6% real-world accuracy isn’t a failure of language models. It’s a failure to recognize that generating correct SQL syntax is only the beginning.

Enterprise accuracy requires solving the context problem—aggregating technical metadata, semantic definitions, business rules, and tribal knowledge into a unified understanding that enables systems to interpret ambiguous questions correctly.

This is why Promethium focuses on context first and queries second. The Universal Query Engine, 360° Context Hub, and Answer Orchestrator work together to deliver accuracy that improves over time as organizational knowledge gets captured and applied.

Text-to-SQL benchmarks measure progress on syntax correctness. Enterprise deployments require semantic correctness. The difference is everything.

Want to see how context-driven text-to-SQL achieves production-level accuracy in your environment? Download our white paper on our new approach to data architecture or talk to one of our experts today.