Simon Willison Jul 2, 2026 8/10 signal

Using DSPy to evaluate and improve Datasette Agent's SQL system prompts

eval, agentic, tool-use

What happened

Simon Willison used Claude Code (running Claude Fable 5) to orchestrate an asynchronous task: evaluating and optimizing Datasette Agent's SQL system prompts with DSPy, tested against GPT-4.1 mini and nano. The evaluation revealed a specific failure mode. A prompt instruction telling the agent not to call `describe_table` if it already had the information led it to guess column names blindly when the schema listing provided only table names.

Why it matters

Systematic evaluation using DSPy reveals non-obvious failure modes in agentic tool use that manual prompt tweaking often misses.

The take

This is a great practical example of using DSPy for systematic prompt evaluation rather than manual "vibes-based" engineering. It also highlights how agentic loops can fail when prompt constraints conflict (e.g., an instruction to minimize tool calls that leads the agent to hallucinate schemas).

Do this

Read the full post to see how to set up DSPy to evaluate SQL agent prompts, and review your own agent prompts to ensure "efficiency" constraints aren't causing tool-bypass hallucinations.
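
To make the review concrete, here is a minimal sketch of a check for this kind of tool-bypass failure. It is not the setup from the post: the `AgentRun` record, the helper names, and the `schema_has_columns` flag are all hypothetical, and the table extraction is a rough regex, not a SQL parser. The `(example, pred, trace=None)` signature follows the convention DSPy metrics use, so a function of this shape could in principle be plugged into a DSPy evaluation, but the code below runs on plain Python with no DSPy dependency.

```python
import re
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Hypothetical record of one agent episode."""
    tool_calls: list  # (tool_name, argument) tuples, in call order
    final_sql: str    # the SQL the agent ultimately produced


def tables_in_sql(sql):
    """Rough extraction of table names after FROM or JOIN (assumption: simple queries)."""
    pattern = r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)"
    return {name.lower() for name in re.findall(pattern, sql, flags=re.IGNORECASE)}


def described_tables(run):
    """Tables the agent inspected with describe_table during the episode."""
    return {arg.lower() for name, arg in run.tool_calls if name == "describe_table"}


def undescribed_tables(run, schema_has_columns):
    """Tables the query touches whose columns the agent never saw."""
    if schema_has_columns:
        # The schema listing already included columns, so skipping the tool is fine.
        return set()
    return tables_in_sql(run.final_sql) - described_tables(run)


def schema_grounding_metric(example, pred, trace=None):
    """Score 1.0 when every queried table had its columns available, else 0.0."""
    missing = undescribed_tables(pred, example["schema_has_columns"])
    return 0.0 if missing else 1.0


# Hypothetical scenario: the schema listing contained table names only.
example = {"schema_has_columns": False}
sql = "SELECT name FROM users JOIN orders ON users.id = orders.user_id"

guessed = AgentRun(tool_calls=[], final_sql=sql)
checked = AgentRun(
    tool_calls=[("describe_table", "users"), ("describe_table", "orders")],
    final_sql=sql,
)

print(schema_grounding_metric(example, guessed))  # agent skipped describe_table
print(schema_grounding_metric(example, checked))  # agent inspected both tables
```

Scoring runs this way separates "the agent skipped the tool because it had the columns" from "the agent skipped the tool and guessed", which is the distinction an efficiency instruction can blur.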

