BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation via Lens of Dynamic Interactions

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Interactive, Text-to-SQL, LLM, Code Generation
Abstract:

Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short of capturing this complexity, either by treating conversation histories as static context or by limiting evaluation to narrow, read-only (SELECT-only) operations, and therefore fail to reflect the challenges faced by production-grade database assistants. In this work, we introduce BIRD-INTERACT, a benchmark that restores this missing realism through: (1) a comprehensive interaction environment that couples each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from execution errors without human supervision; (2) two evaluation settings that mirror real-world interaction: a pre-defined conversational protocol (c-Interact) and a more open-ended agentic setting (a-Interact) in which the model autonomously decides when to query the user simulator or explore the database environment; (3) a challenging task suite that covers the full CRUD spectrum for both business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks, requiring LLMs to engage in dynamic interaction. The suite is organized into two sets: a full set (BIRD-INTERACT-FULL) of 600 tasks, unfolding into up to 11,796 dynamic interactions for a comprehensive view of performance, and a lite set (BIRD-INTERACT-LITE) of 300 tasks with simplified databases for detailed behavioral analysis of interactions and fast method development.
Our empirical results highlight the difficulty of BIRD-INTERACT: the most recent flagship model, GPT-5, completes only 8.67% of tasks in the c-Interact setting and 17.00% in the a-Interact setting on the full task suite. Further analyses via memory grafting and Interaction Test-time Scaling (ITS) validate the importance of effective interaction for success in complex, dynamic text-to-SQL tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces BIRD-INTERACT, a benchmark for multi-turn interactive text-to-SQL that couples databases with hierarchical knowledge bases, metadata, and a function-driven user simulator. It resides in the 'Conversational Benchmark and Evaluation Frameworks' leaf alongside three sibling papers: CoSQL, Dynamic Multi-turn SQL, and another conversational evaluation framework. This leaf contains four papers total within a taxonomy of 46 papers across 18 leaf nodes, indicating a moderately populated but not overcrowded research direction focused specifically on multi-turn conversational evaluation rather than single-turn or execution-only methods.

The taxonomy reveals neighboring branches addressing related but distinct challenges. 'Proactive Ambiguity Detection and Question Generation' (four papers) focuses on pre-generation clarification, while 'Multi-Turn Agentic and Reinforcement Learning Approaches' (three papers) explores agent-based long-horizon tasks. 'Conversational Dialogue Systems and Interfaces' (four papers) emphasizes end-to-end dialogue management rather than benchmark design. BIRD-INTERACT bridges evaluation infrastructure with interaction modeling, distinguishing itself from execution-guided refinement branches that automate correction without user involvement and from post-generation feedback systems that rely on explicit user corrections after SQL generation.

Among 15 candidates examined through limited semantic search, none clearly refute the three identified contributions. The comprehensive interaction environment (nine candidates examined, zero refutable) and dual evaluation settings of c-Interact and a-Interact (six candidates examined, zero refutable) show no substantial prior overlap within this search scope. The function-driven user simulator received no candidate examination, suggesting either novelty or insufficient search coverage in that specific dimension. These statistics reflect a constrained literature search rather than exhaustive field coverage, indicating that within the examined top-15 semantically similar papers, no direct precedents emerged.

Based on the limited search scope of 15 candidates, the work appears to occupy a distinct position within conversational text-to-SQL benchmarking, particularly through its integration of knowledge bases, metadata, and autonomous user simulation. The taxonomy structure confirms this sits in a moderately active but not saturated research direction. However, the analysis does not cover broader benchmark literature outside the top-15 semantic matches, and the zero-refutation finding reflects search limitations rather than definitive novelty claims across the entire field.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: multi-turn interactive text-to-SQL with dynamic user clarification. The field addresses scenarios where natural language queries are inherently ambiguous or incomplete, requiring systems to engage users in clarifying dialogues before generating accurate SQL.

The taxonomy organizes research into several main branches: Interactive Clarification and Ambiguity Resolution focuses on detecting and resolving unclear user intent through targeted questions; Multi-Turn Conversational Text-to-SQL examines how systems maintain context across dialogue turns, with benchmarks like CoSQL[14] and Dynamic Multi-turn SQL[17] establishing evaluation frameworks; User Feedback-Based Query Correction explores mechanisms for incorporating explicit corrections; Execution-Guided Refinement and Validation leverages query results to iteratively improve outputs; Query Generation and Refinement Techniques develops core algorithmic advances; Domain-Specific and Application-Oriented Systems tailors solutions to particular use cases; and Surveys and Comprehensive Reviews synthesize broader trends, as seen in Conversational Text-to-SQL Survey[18] and LLM Information Systems Survey[8].

Recent work reveals contrasting strategies for handling ambiguity and multi-turn interaction. Some approaches emphasize proactive clarification, such as AmbiSQL[7] detecting ambiguous queries and Expected Information Gain[19] optimizing question selection, while others like Fisql Interactive Feedback[3] and SQLucid[4] focus on refining queries through iterative user feedback loops. Execution-guided methods, including Execution Feedback Reasoning[23] and ExCoT[24], validate generated SQL against database results to trigger refinement.

BIRD-INTERACT[0] situates itself within the conversational benchmark branch alongside CoSQL[14] and Dynamic Multi-turn SQL[17], providing a framework for evaluating how well systems handle dynamic clarification across multiple turns.
Compared to BIRD-INTERACT LLM[35], which explores LLM-specific strategies on the same benchmark, BIRD-INTERACT[0] emphasizes the broader evaluation infrastructure. The interplay between proactive ambiguity detection, feedback incorporation, and execution validation remains an active area, with open questions around balancing user burden against query accuracy.

Claimed Contributions

BIRD-INTERACT benchmark with comprehensive interaction environment

The authors develop a new benchmark featuring an interactive environment that includes databases, hierarchical knowledge bases, metadata files, and a function-driven user simulator. This environment enables models to solicit clarifications, retrieve knowledge, and recover from execution errors without human supervision, addressing limitations of static conversation transcripts in existing benchmarks.

9 retrieved papers
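The claimed environment can be pictured as a thin wrapper that exposes a database, a knowledge base, metadata, and a user-simulator hook to the model. The following is a minimal, illustrative sketch, not the paper's actual API: all class and method names are assumptions, and SQLite stands in for the real DBMS.

```python
import sqlite3
from dataclasses import dataclass, field

# Hypothetical sketch of the interaction environment: a database coupled
# with a hierarchical knowledge base, metadata files, and a hook to a
# user simulator. Errors are surfaced to the model instead of raised,
# so it can attempt recovery without human supervision.

@dataclass
class InteractionEnvironment:
    conn: sqlite3.Connection            # stand-in for the task database
    knowledge_base: dict                # topic -> list of knowledge entries
    metadata: dict                      # e.g. schema and column descriptions
    history: list = field(default_factory=list)

    def execute_sql(self, sql: str) -> dict:
        """Run SQL; return rows on success or the error message on failure."""
        try:
            rows = self.conn.execute(sql).fetchall()
            return {"ok": True, "rows": rows}
        except sqlite3.Error as err:
            return {"ok": False, "error": str(err)}

    def lookup_knowledge(self, topic: str) -> list:
        """Retrieve knowledge-base entries filed under a topic."""
        return self.knowledge_base.get(topic, [])

    def ask_user(self, question: str, simulator) -> str:
        """Route a clarification request to the user simulator and log it."""
        reply = simulator(question)
        self.history.append((question, reply))
        return reply
```

The key design point mirrored here is that failed executions become observations rather than terminal failures, which is what allows error recovery to be measured at all.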
Two evaluation settings: c-Interact and a-Interact

The authors propose two distinct evaluation modes: c-Interact tests models' ability to follow structured conversational protocols, while a-Interact evaluates autonomous planning where models decide when to query users or explore the database environment. These settings reflect different real-world interaction scenarios for database assistants.

6 retrieved papers
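The a-Interact setting, as described, amounts to an agent loop in which the model chooses each turn among asking the user, exploring the database, or submitting a final answer, under an interaction budget. A hedged sketch of such a loop follows; the action names, the budget mechanism, and the `policy` callable (standing in for the LLM) are illustrative assumptions, not the paper's protocol.

```python
# Illustrative a-Interact-style loop: the policy (an LLM in practice)
# observes the latest feedback and picks one of three actions until it
# submits an answer or exhausts its interaction budget.

def a_interact(policy, run_sql, user_sim, budget: int):
    observation = "task started"
    for _ in range(budget):
        action, payload = policy(observation)
        if action == "ask_user":
            observation = user_sim(payload)   # clarification from simulator
        elif action == "execute":
            observation = run_sql(payload)    # DB rows or an error message
        elif action == "submit":
            return payload                    # final SQL answer
    return None                               # budget exhausted, no answer
```

By contrast, c-Interact would fix the turn order in advance, so the model never chooses the action, only the content of each turn.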
Function-driven user simulator with two-stage approach

The authors introduce a two-stage user simulator design where an LLM first parses clarification requests into predefined symbolic actions (AMB, LOC, UNA), then generates responses based on these actions and annotated ground-truth SQL. This approach prevents ground-truth leakage and ensures predictable, controllable simulator behavior while maintaining context-aware interactions.

0 retrieved papers
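The two-stage design above separates parsing a clarification request into a symbolic action (AMB, LOC, UNA) from rendering a reply off the annotations, so the ground-truth SQL itself never reaches the model. The sketch below is a toy approximation: keyword matching stands in for the LLM parser, and the annotation dictionaries are invented for illustration.

```python
# Toy two-stage user simulator. Stage 1 maps a model question to a
# symbolic action: AMB (resolve an annotated ambiguity), LOC (locate a
# schema element), or UNA (unanswerable). Stage 2 renders the reply from
# annotations only, so ground-truth SQL is never leaked verbatim.

def parse_action(question: str, ambiguities: dict) -> tuple:
    """Stage 1: classify the question into a symbolic action."""
    q = question.lower()
    for term in ambiguities:            # annotated ambiguous terms
        if term in q:
            return ("AMB", term)
    if "table" in q or "column" in q:   # schema-location request
        return ("LOC", q)
    return ("UNA", None)                # outside the simulator's scope

def respond(action: tuple, ambiguities: dict, schema_hints: dict) -> str:
    """Stage 2: generate a controlled reply from the symbolic action."""
    kind, arg = action
    if kind == "AMB":
        return ambiguities[arg]         # annotated clarification text
    if kind == "LOC":
        hits = [hint for key, hint in schema_hints.items() if key in arg]
        return "; ".join(hits) or "No matching schema element."
    return "Sorry, I cannot answer that."
```

Routing every reply through a closed action set is what makes the simulator's behavior predictable and auditable, at the cost of refusing (UNA) anything outside its annotations.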

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

BIRD-INTERACT benchmark with comprehensive interaction environment

Contribution

Two evaluation settings: c-Interact and a-Interact

Contribution

Function-driven user simulator with two-stage approach
