A Microsoft ISE team evaluated multiple AI agent approaches for converting natural language questions to SQL queries on poorly documented, messy databases. They tested GitHub Copilot CLI (with Claude Sonnet 4.5 and Gemini 3.0), Microsoft Agent Framework (GPT-5 Mini), and Azure Databricks AI/BI Genie, achieving up to ~75-80% accuracy. Key findings: runtime query execution is essential (removing it dropped accuracy to 38%), schema metadata and domain hints significantly boost performance, and model choice matters (Claude Sonnet 4.5 outperformed GPT-5 Mini by ~11 points). The primary remaining failure mode is business logic errors: semantic misunderstandings that require domain expertise rather than technical fixes. Practical takeaways include starting with schema documentation and runtime validation, designing evaluation criteria early, and budgeting for iterative domain expert review.
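
To make the runtime-execution finding concrete, here is a minimal sketch of the idea, not the team's implementation: each candidate query is run against the database, and either sample rows or the error text are kept as feedback. It assumes an in-memory SQLite database; the table `ord_hdr`, the helpers `try_query` and `first_working_query`, and the hard-coded candidate list (standing in for agent output) are all hypothetical.

```python
import sqlite3


def try_query(conn, sql, max_rows=5):
    """Run a candidate query; return sample rows or the error text."""
    try:
        rows = conn.execute(sql).fetchmany(max_rows)
        return {"ok": True, "rows": rows, "error": None}
    except sqlite3.Error as exc:
        return {"ok": False, "rows": [], "error": str(exc)}


def first_working_query(conn, candidates):
    """Return the first candidate that executes, plus all feedback gathered."""
    feedback = []
    for sql in candidates:
        result = try_query(conn, sql)
        feedback.append((sql, result))
        if result["ok"]:
            return sql, feedback
    return None, feedback


# Hypothetical messy schema: terse, undocumented table and column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ord_hdr (id INTEGER, cust TEXT, amt REAL)")
conn.executemany(
    "INSERT INTO ord_hdr VALUES (?, ?, ?)",
    [(1, "acme", 10.0), (2, "acme", 5.5), (3, "zeta", 7.0)],
)

# Stand-in for agent output: a guess at the names, then a corrected attempt.
candidates = [
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer",
    "SELECT cust, SUM(amt) FROM ord_hdr GROUP BY cust ORDER BY cust",
]
chosen, feedback = first_working_query(conn, candidates)
```

In an agent loop, the error text from a failed attempt would typically be passed back to the model before the next attempt. Note that execution only catches technical failures such as wrong table or column names; a query that runs but encodes the wrong business logic passes this check, which matches the remaining failure mode described above.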

From devblogs.microsoft.com (13 min read)
Table of contents
Introduction
Research Foundation
Dataset
Approach and Solution
Evaluation Methodology
Experiments
Findings
Conclusion
