Why Text-to-SQL Fails and Architectures That Solve This Challenge

One of the most common questions I hear when discussing AI and analytics is: 

“Why can’t I just point an LLM at my SQL database and get answers?” 

At first glance, this seems reasonable. If large language models can write SQL, why not let them query the database directly? For simple questions such as: 

“How many customers do I have?”  

A single LLM may perform well because the answer likely lives in a single table and the SQL is straightforward. However, a more realistic enterprise question might look something like this:

“How many customers do I have in each state in the United States who have made more than five purchases?” 
 
Now, the AI model suddenly has to: 

  • Understand multiple domain concepts 
  • Evaluate relationships across several tables 
  • Perform joins, grouping, filtering, and conditional aggregation 
  • Retrieve domain knowledge that possibly exists outside of the database and table schemas
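
To make the jump in complexity concrete, here is one plausible shape of the SQL behind that question, sketched in Python against a hypothetical schema (the customers, orders, and addresses tables are illustrative assumptions, not a real enterprise model):

```python
import sqlite3

# Hypothetical minimal schema: customers(id), orders(id, customer_id),
# addresses(customer_id, state, country). Real enterprise schemas
# rarely map this cleanly onto the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE addresses (customer_id INTEGER, state TEXT, country TEXT);
""")

# "Customers per US state with more than five purchases": the
# per-customer threshold must be applied in a subquery *before*
# grouping by state.
query = """
SELECT a.state, COUNT(*) AS customer_count
FROM addresses a
JOIN (
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
    HAVING COUNT(*) > 5
) frequent ON frequent.customer_id = a.customer_id
WHERE a.country = 'US'
GROUP BY a.state;
"""
for state, n in conn.execute(query):
    print(state, n)
```

Note that the per-customer threshold has to be aggregated separately from the per-state grouping; collapsing the two aggregations into one is exactly the kind of subtle mistake a model can make.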

In other words, the model must reason across semantics, context, and enterprise-specific logic, all without hallucinating, and this is where LLMs can break down. To better understand this challenge and how to design architectures that solve it, let’s examine a recent study, “Text-to-SQL for Enterprise Data Analytics,” conducted by Chen et al. (2025).

Why Single LLMs Are Not Enough

Chen et al. (2025) start their research by asking a fair question:

“Do we even need agents? Or can a single LLM solve this?”

It turns out that single LLMs struggle heavily with complex enterprise data structures, as a comparison of the Spider 1.0 and Spider 2.0 benchmarks shows.

| Benchmark | Test Criteria | Single-LLM Accuracy |
| --- | --- | --- |
| Spider 1.0 | Simple, academic queries | GPT-4: ~90% |
| Spider 2.0 | True enterprise complexity | GPT-4 drops to 13%; o1 and o3 scored slightly better, in the 20–30% range |
| Snowflake internal eval | Real enterprise data | ~51% (GPT-4o) |

These benchmark results confirm a key point: enterprise text-to-SQL truly requires multi-stage, agentic workflows.

Key Architectural Components

To overcome this limitation, Chen et al. (2025) propose and evaluate agentic design principles that can bridge these performance gaps.

A Comprehensive Enterprise Knowledge Graph

First, the researchers designed and implemented a knowledge graph. A knowledge graph is crucial for helping the LLM reason about semantics, and the graph built in the study integrates the following sources:

  • Table schema metadata
  • Documentation, business definitions, and organizational jargon (such as abbreviations, metric names, products, team names, etc.)
  • Historical SQL queries
  • Code repositories

The graph, shown in Figure 1, is designed to provide the LLM with contextual understanding of how data is used, what tables mean, and how various systems relate, something a standalone model cannot infer.
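
To make this concrete, here is a minimal sketch of what such a graph can look like in code. The node labels, relations, and helper below are illustrative assumptions, not the authors’ actual schema:

```python
from collections import defaultdict

# Toy adjacency-list knowledge graph linking tables to columns, docs,
# historical queries, and code repos. Names are hypothetical.
graph = defaultdict(list)

def add_edge(src, relation, dst):
    graph[src].append((relation, dst))

# Table schema metadata
add_edge("table:dim_customers", "has_column", "column:customer_id")
# Business definitions and organizational jargon
add_edge("term:ARR", "defined_as", "doc:finance_glossary#arr")
add_edge("term:ARR", "computed_from", "table:fact_subscriptions")
# Historical SQL queries that touched a table
add_edge("table:dim_customers", "queried_by", "sql:weekly_active_report")
# Code repositories that produce or consume a table
add_edge("table:fact_subscriptions", "produced_by", "repo:billing-etl")

def context_for(node):
    """Collect one-hop context to hand to the LLM as grounding."""
    return [f"{node} -[{rel}]-> {dst}" for rel, dst in graph[node]]

print("\n".join(context_for("table:dim_customers")))
```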

Chen et al. (2025), Figure 1. Text-to-SQL for Enterprise Data Analytics. https://arxiv.org/abs/2507.14372

Table Clustering

In an enterprise, we may have thousands if not millions of tables, but not all are relevant for every user query. Chen et al. (2025) explain that clustering data helps AI agents interpret enterprise data with greater accuracy, because the process removes the noise and low-relevance context that can lead to hallucinations. The clustering process used in the study includes the following key steps (a minimal sketch follows the list):

  • Prepare user access data: evaluate each dataset for total access counts and for which users are accessing it
  • Filter and scale the data: reduce noise by keeping only datasets with a sufficient number of total and unique user accesses, removing rarely used tables
  • Perform dimensionality reduction: apply Independent Component Analysis (ICA) to score each table against each component
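
As referenced above, here is a minimal sketch of this pipeline, assuming a table-by-user matrix of access counts; the synthetic data and thresholds are illustrative, and the paper’s exact scoring details may differ:

```python
import numpy as np
from sklearn.decomposition import FastICA

# Synthetic stand-in for real access logs: rows are tables, columns
# are users, values are access counts.
rng = np.random.default_rng(0)
n_tables, n_users = 500, 200
access = rng.poisson(0.3, size=(n_tables, n_users))

# Filter: keep tables with enough total and unique user accesses,
# dropping rarely used tables that would only add noise.
total = access.sum(axis=1)
unique_users = (access > 0).sum(axis=1)
keep = (total >= 20) & (unique_users >= 5)
filtered = np.log1p(access[keep])  # scale down heavy-tailed counts

# Dimensionality reduction: ICA scores each kept table against each
# independent component; tables loading on the same component group
# together as one usage "domain".
ica = FastICA(n_components=10, random_state=0, max_iter=500)
scores = ica.fit_transform(filtered)            # (n_kept_tables, 10)
cluster = np.abs(scores).argmax(axis=1)         # hard assignment per table
print(f"kept {keep.sum()} of {n_tables} tables "
      f"in {np.unique(cluster).size} clusters")
```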

In their study, Chen et al. (2025) found that without clustering, accuracy was as low as 9%, compared with 48% with clustering. These results validate how essential clustering is for enterprise RAG and agentic SQL workflows.

Multi-Stage Retrieval and Query-Writing Agent

Lastly, rather than feeding the model everything, the team built a multi-stage process that:

  • Retrieves all potentially relevant tables
  • Ranks them
  • Selects a top subset
  • Enriches context using the knowledge graph
  • Passes only the highest-quality context to the Query Writer Agent
  • Generates the SQL query
  • Validates and refines the output

Using this approach, the team was able to dramatically boost the relevance and accuracy of the responses. We can see this workflow in Figure 2: the query writer agent has four main steps, retrieving the context, ranking, writing the query, and fixing the query, and the process can loop multiple times depending on errors or hallucinations. A minimal sketch of this loop follows.
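
The helper functions in this sketch are stand-ins for real retrieval, an LLM call, and a SQL validator; they are assumptions for illustration, not the authors’ implementation:

```python
# Retrieve -> rank -> write -> fix loop, in the spirit of the paper.
MAX_ATTEMPTS = 3

def retrieve_tables(question):      # e.g. vector search over the graph
    return ["dim_customers", "fact_orders", "dim_addresses"]

def rank_and_select(tables, question, k=2):
    return tables[:k]               # keep only the top-k candidates

def enrich(tables):                 # pull definitions, sample queries
    return {t: f"metadata for {t}" for t in tables}

def write_sql(question, context):   # stand-in for the LLM query writer
    return "SELECT state, COUNT(*) FROM dim_customers GROUP BY state"

def validate(sql):                  # e.g. EXPLAIN against the warehouse
    return (True, "")

def answer(question):
    tables = rank_and_select(retrieve_tables(question), question)
    context = enrich(tables)
    sql = write_sql(question, context)
    for _ in range(MAX_ATTEMPTS):   # fix loop: re-prompt on errors
        ok, error = validate(sql)
        if ok:
            return sql
        sql = write_sql(question + f"\nFix this error: {error}", context)
    return None

print(answer("How many customers do I have in each state?"))
```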

Chen et al. (2025), Figure 2. Text-to-SQL for Enterprise Data Analytics. https://arxiv.org/abs/2507.14372

Lastly, in Figure 3, we see the user interface and high-level architecture for the chatbot-style experience the team developed. Using a “Classifier,” the chatbot determines the user’s intent and routes the prompt to the appropriate agent (a minimal routing sketch follows the figure).

Chen et al. (2025), Figure 3. Text-to-SQL for Enterprise Data Analytics. https://arxiv.org/abs/2507.14372
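
To illustrate the routing idea, here is a minimal sketch of such a classifier layer; the intent labels and keyword heuristic are assumptions standing in for what would realistically be an LLM-based classifier:

```python
# Route a user prompt to the right agent based on classified intent.
def classify_intent(prompt: str) -> str:
    """Stand-in for an LLM-based intent classifier."""
    if "sql" in prompt.lower() or "how many" in prompt.lower():
        return "text_to_sql"
    return "general_chat"

def text_to_sql_agent(prompt):  return f"[SQL agent] handling: {prompt}"
def general_chat_agent(prompt): return f"[Chat agent] handling: {prompt}"

ROUTES = {
    "text_to_sql": text_to_sql_agent,
    "general_chat": general_chat_agent,
}

def route(prompt: str) -> str:
    return ROUTES[classify_intent(prompt)](prompt)

print(route("How many customers signed up last week?"))
```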

Evaluation: Measuring Quality in Real Enterprise Conditions

To gauge the quality of the results, Chen et al. (2025) used the following performance metrics and criteria (a small worked example follows the list):

  • 133 benchmark questions across 10 product domains
  • Multiple valid SQL answers per question
  • Human judges and LLM judges
  • A 1–5 scoring system, where 4–5 = acceptable or perfect
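
Here is how that rubric translates into an acceptance-rate metric; the judge scores below are made-up illustration, not the study’s data:

```python
# Scores of 4-5 count as "acceptable or perfect" under the rubric.
scores = [5, 4, 2, 5, 3, 4, 1, 5]  # hypothetical judge ratings

acceptable = sum(1 for s in scores if s >= 4)
print(f"acceptance rate: {acceptable / len(scores):.0%}")
```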

Key Takeaways

1. Data organization matters more than the model
  • Clustering, metadata, table popularity, and historical queries are as important as model choice.
2. Agentic architecture dramatically boosts performance
  • A multi-stage workflow elevates accuracy by 20–50%.
3. Multi-agent feedback loops reduce hallucinations
  • Retrieve → Rank → Write → Fix patterns are essential for correctness.

Final Thoughts

The research from Chen et al. (2025) demonstrates what it takes to operationalize text-to-SQL in large enterprises. It also reveals an important truth about enterprise AI:

If an AI system isn’t working well, the problem may not be the LLM; it may be the data and the context you’re feeding it.

Achieving reliable results requires far more than pointing an LLM at a database. It demands carefully structured context, well-organized data, knowledge graphs that encode enterprise semantics, and agentic workflows designed to retrieve, rank, write, and iteratively fix SQL. In the end, LLMs will not replace enterprise data architecture; they will amplify it. Most importantly, it is the teams that invest in organizing their data, designing agentic systems, and creating user-centric interfaces who will unlock the true potential of Gen AI for analytics.

Until next time!

References

Chen, A., Bundele, M., Ahlawat, G., Stetz, P., Wang, Z., Fei, Q., Jung, D., Chu, A., Jayaraman, B., Panth, A., Arora, Y., Jain, S., Varma, R., Ilin, A., Melnychuk, I., Chueh, C., Sil, J., & Wang, X. (2025, July 18). Text-to-SQL for Enterprise Data Analytics. arXiv. https://doi.org/10.48550/arXiv.2507.14372