Database Context Compression for Text-to-SQL on Real-World Large Databases
arXiv:2606.28601v1. Abstract: Recent progress in Text-to-SQL has been driven by stronger language models and prompting strategies, yet performance on real enterprise benchmarks such as Spider 2.0 and BIRD remains far below that on classical academic datasets. We argue that the...
The Bottleneck Is Not the Model, It's the Database
A new preprint on arXiv tackles a persistent blind spot in Text-to-SQL research: the assumption that language models can gracefully handle the full schema of a large, real-world enterprise database. The paper, "Database Context Compression for Text-to-SQL on Real-World Large Databases," identifies a fundamental mismatch between academic benchmarks and production environments. Models like GPT-4 and Claude achieve impressive results on curated datasets like Spider, but their accuracy remains far lower on benchmarks like Spider 2.0 and BIRD, whose databases are much larger and can span hundreds of tables and thousands of columns.
The core problem is context window saturation. Modern LLMs can process tens of thousands of tokens or more, but a full enterprise schema can consume most of that budget or exceed it outright. Including the full schema as prompt context forces the model to sift through irrelevant tables and columns, diluting attention on the few elements actually needed for a given query. The authors argue that this "schema noise" is a primary driver of the performance gap between academic and real-world settings.
Why This Matters for Enterprise AI
This research addresses a critical operational hurdle. Many organizations have invested in Text-to-SQL pipelines only to find that their complex, normalized databases produce unreliable results. The paper's proposed solution, database context compression, is not about building a better model but about smarter prompt construction. By dynamically selecting only the subset of the schema relevant to the user's natural language query, the approach aims to reduce token usage, lower latency, and improve accuracy.
For AI practitioners, this shifts the focus from model selection to data architecture. The insight is that an uncompressed, full-schema prompt is not just inefficient—it is actively harmful. The model's ability to reason about SQL is degraded when it must first filter out thousands of irrelevant column names. This aligns with broader findings in retrieval-augmented generation (RAG): providing too much context can be worse than providing too little.
Implications for Practitioners
First, schema pruning should be a standard preprocessing step in any production Text-to-SQL system. Techniques like embedding-based table retrieval, foreign-key traversal, or even simple keyword matching can dramatically reduce the context size before the LLM ever sees the prompt.
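The pruning step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's method: it combines simple keyword matching with one hop of foreign-key traversal. The `Table` class, the `prune_schema` function, and the toy schema are hypothetical names introduced only for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical schema record used only for this sketch."""
    name: str
    columns: list[str]
    # Names of tables this table points to through foreign keys.
    references: list[str] = field(default_factory=list)


def tokenize(text: str) -> set[str]:
    """Lowercase and split on non-alphanumerics: 'order_date' -> {'order', 'date'}."""
    return {t for t in re.split(r"[^a-z0-9]+", text.lower()) if t}


def prune_schema(question: str, schema: dict[str, Table], hops: int = 1) -> list[str]:
    """Return the sorted names of tables worth keeping for this question."""
    q_tokens = tokenize(question)

    # Step 1: keyword matching. Keep a table when its name or any of its
    # columns shares at least one token with the question.
    selected = {
        name
        for name, table in schema.items()
        if q_tokens & tokenize(" ".join([table.name, *table.columns]))
    }

    # Step 2: foreign-key traversal. Add tables linked to the current
    # frontier in either direction, so join paths are not cut off.
    frontier = set(selected)
    for _ in range(hops):
        neighbours: set[str] = set()
        for name, table in schema.items():
            if name in frontier:
                neighbours.update(table.references)
            elif frontier & set(table.references):
                neighbours.add(name)
        frontier = {n for n in neighbours if n in schema} - selected
        selected |= frontier

    return sorted(selected)


schema = {
    "customers": Table("customers", ["id", "name", "country"]),
    "orders": Table("orders", ["id", "customer_id", "order_date", "total"], ["customers"]),
    "order_items": Table("order_items", ["id", "order_id", "product_id", "quantity"], ["orders", "products"]),
    "products": Table("products", ["id", "title", "price"]),
    "audit_log": Table("audit_log", ["id", "event", "created_at"]),
}

print(prune_schema("What is the total revenue per country in 2023?", schema))
# ['customers', 'order_items', 'orders']
```

Exact token matching is deliberately naive: it misses plurals and synonyms ("customer" does not match a table named `customers`), which is one reason embedding-based retrieval is often preferred for the first step. The foreign-key hop trades a slightly larger prompt for a lower risk of dropping a table that is needed only as a join bridge.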
Second, benchmark selection matters. Relying solely on academic datasets like Spider gives a misleading picture of model capability. Real-world deployments should be tested against schemas that mirror enterprise complexity: hundreds of tables, cryptic column names, and sparse documentation.
Third, cost and latency are not just engineering concerns—they are accuracy concerns. A prompt that requires 40,000 tokens for schema context is not only expensive; it likely produces worse queries than a compressed prompt using 4,000 tokens. The paper reinforces that thoughtful context management is a performance multiplier.
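To make the token arithmetic concrete, the sketch below estimates prompt size for a full schema versus a pruned one. It relies on two assumptions that are not from the paper: a rough heuristic of about four characters per token (real tokenizers vary by model and by text), and a synthetic schema of 200 tables with 20 columns each. The helper names `render_schema`, `estimate_tokens`, and `compression_report` are hypothetical.

```python
import math


def render_schema(tables: dict[str, list[str]]) -> str:
    """Render each table as a compact 'name(col1, col2, ...)' line."""
    return "\n".join(f"{name}({', '.join(cols)})" for name, cols in tables.items())


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4-chars-per-token ratio is an assumption."""
    return math.ceil(len(text) / chars_per_token)


def compression_report(tables: dict[str, list[str]], keep: set[str]) -> dict[str, float]:
    """Compare estimated schema tokens before and after keeping only `keep`."""
    full = estimate_tokens(render_schema(tables))
    kept = {name: cols for name, cols in tables.items() if name in keep}
    compressed = estimate_tokens(render_schema(kept))
    ratio = compressed / full if full else 0.0
    return {"full_tokens": full, "compressed_tokens": compressed, "ratio": ratio}


# Synthetic schema: 200 tables with 20 columns each (illustrative only).
tables = {f"table_{i}": [f"col_{j}" for j in range(20)] for i in range(200)}
report = compression_report(tables, keep={"table_3", "table_7"})
print(report)
```

Because the estimate depends only on character counts, it is best used for relative comparisons between prompt variants. For billing or hard context limits, the tokenizer of the target model should be used instead.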
Key Takeaways
- Enterprise Text-to-SQL fails not from model weakness but from schema overload — large database contexts drown out relevant information, degrading LLM reasoning.
- Context compression is a practical, model-agnostic fix — dynamically selecting relevant schema elements improves accuracy, reduces cost, and lowers latency.
- Academic benchmarks are poor proxies for production — real-world databases with hundreds of tables expose failure modes that curated datasets hide.
- Prompt design must treat schema as a retrieval problem — treating the entire database as context is counterproductive; targeted schema retrieval is essential for reliable Text-to-SQL.