SQLCoder: a state-of-the-art LLM for SQL generation

  •   SQLCoder, an open-source model from Defog, converts natural language questions into SQL queries.
  •   It outperforms many open-source models and even edges out gpt-3.5-turbo and text-davinci-003, models that are 10 times its size.
  •   You can test SQLCoder using the provided interactive demo.

Technical Details

  •   SQLCoder is a 15B-parameter large language model (LLM), fine-tuned from StarCoder.
  •   It was fine-tuned on hand-crafted SQL queries of varying complexity.
  •   On certain individual database schemas, SQLCoder rivals or even surpasses GPT-4 in performance.
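A text-to-SQL model needs both the question and the database schema in its input. The sketch below shows one way to combine them into a single prompt. The section labels, the `build_prompt` helper, and the example schema are all hypothetical, chosen for illustration; they are not SQLCoder's documented prompt format.

```python
def build_prompt(question: str, schema_ddl: str) -> str:
    """Combine a question and a schema into one text-to-SQL prompt.

    The section layout here is an assumption for illustration,
    not a format prescribed by SQLCoder.
    """
    return (
        "### Task\n"
        f"Write a SQL query that answers this question: {question.strip()}\n\n"
        "### Database schema\n"
        f"{schema_ddl.strip()}\n\n"
        "### SQL\n"
    )


# Hypothetical two-table schema used only for this example
schema = """
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
"""

prompt = build_prompt("Which five customers spent the most?", schema)
print(prompt)
```

The resulting string would be passed to the model, which is expected to continue the text after the final section label with a SQL query.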

Motivation

  •   Over the past three months, enterprises in healthcare, finance, and government have used SQLCoder.
  •   The primary advantage: it can be self-hosted, so sensitive data stays on the organization's own servers.
  •   The release is Defog's way of giving back to the community, since SQLCoder itself builds on existing open models such as StarCoder.

Approach

  •   Defog crafted a unique dataset centered on text-to-SQL tasks derived from 10 varied schemas. An additional evaluation dataset was produced from 7 new schemas.
  •   The dataset's complexity was ensured by selecting intricate schemas comprising 4-20 tables.
  •   Each question was categorized based on difficulty, using a method inspired by the Spider dataset.
  •   Fine-tuning was split into two stages: the first used the simpler questions, and the second moved on to the more complex ones.
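The staged training described above can be sketched as a simple partition of the labelled questions. The difficulty labels follow Spider's four levels, but the mapping of labels to stages, the `split_into_stages` helper, and the sample questions are assumptions made for illustration.

```python
from collections import defaultdict

# Assumed mapping of difficulty label to training stage (illustrative only)
STAGE_OF = {"easy": 1, "medium": 1, "hard": 2, "extra_hard": 2}


def split_into_stages(examples):
    """Group labelled examples by training stage, simpler questions first."""
    stages = defaultdict(list)
    for example in examples:
        stages[STAGE_OF[example["difficulty"]]].append(example)
    # Always return one list per stage, in stage order, even if a stage is empty
    return [stages[stage] for stage in sorted(set(STAGE_OF.values()))]


# Hypothetical labelled questions
examples = [
    {"question": "How many users are there?", "difficulty": "easy"},
    {"question": "Monthly revenue by region?", "difficulty": "hard"},
    {"question": "Top product per category?", "difficulty": "medium"},
]

stage_one, stage_two = split_into_stages(examples)
print(len(stage_one), len(stage_two))
```

In a real pipeline, a model would be fine-tuned on `stage_one` first, and the resulting checkpoint would then be fine-tuned on `stage_two`.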

Evaluation

  •   Assessing the accuracy of generated SQL is inherently tricky because a single question can be answered by multiple valid queries.
  •   Defog therefore built a custom framework to gauge the correctness of SQL queries, and has open-sourced both the framework and the accompanying dataset.
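One common way to handle multiple valid queries is to compare what the queries return rather than how they are written. The sketch below illustrates that idea with Python's built-in `sqlite3` module. It is a simplified illustration, not Defog's framework; the helper names, table, and queries are hypothetical.

```python
import sqlite3
from collections import Counter


def result_signature(conn, sql):
    """Run a query and return its rows as an order-insensitive multiset."""
    rows = conn.execute(sql).fetchall()
    # Sort the values inside each row by repr so column order does not matter
    return Counter(tuple(sorted(map(repr, row))) for row in rows)


def same_result(conn, generated_sql, reference_sql):
    """True if both queries run and return the same rows, ignoring order."""
    try:
        return result_signature(conn, generated_sql) == result_signature(conn, reference_sql)
    except sqlite3.Error:
        # A query that fails to execute counts as incorrect
        return False


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders (customer, total) VALUES ('ada', 10.0), ('bob', 25.0), ('ada', 5.0);
""")

reference = "SELECT customer, SUM(total) AS spend FROM orders GROUP BY customer"
rewritten = "SELECT SUM(o.total), o.customer FROM orders AS o GROUP BY o.customer ORDER BY 1 DESC"
wrong = "SELECT customer, MAX(total) FROM orders GROUP BY customer"

print(same_result(conn, rewritten, reference))  # True
print(same_result(conn, wrong, reference))      # False
```

This comparison is deliberately loose. Ignoring row and column order accepts differently written but equivalent queries, but it can also accept a query that swaps two columns of the same type, and it cannot check questions where the ordering itself is part of the answer.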

Results

  •   On Defog's evaluation framework, SQLCoder outperforms every notable model except GPT-4.
  •   Notably, it beats some models that are much larger than it.
  •   For specific database schemas, its performance and responsiveness match or surpass those of OpenAI's GPT-4.

Future Prospects

  Defog plans to enhance SQLCoder by:

  •   Incorporating more curated data and broader questions.
  •   Utilizing advanced training techniques like Reward Modeling and RLHF.
  •   Introducing a specialized model for data analysis combining SQL and Python.


  The model can be explored and tested via Defog's interactive demo.

