Simon Willison Jul 5, 2026 8/10 signal

Better Models: Worse Tools

tool-use agentic

What happened

Armin Ronacher reports that newer Anthropic models (Opus 4.8, Sonnet 5) perform worse with Pi's custom edit tool schema than older models did. The newer models hallucinate extra fields that match the search-and-replace schema of Anthropic's own Claude Code tool. This suggests that RL training aimed at proprietary developer tools has degraded their ability to follow arbitrary, user-defined tool schemas.
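One way to notice this failure mode in your own agent is to compare each tool call's arguments against the schema you declared. The sketch below is illustrative only: `EDIT_TOOL_SCHEMA`, its field names, and `check_tool_args` are hypothetical and are not Pi's or Claude Code's actual schema.

```python
from typing import Any

# Hypothetical custom edit-tool schema; field names are illustrative only.
EDIT_TOOL_SCHEMA: dict[str, Any] = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "edits": {"type": "array"},
    },
    "required": ["path", "edits"],
    "additionalProperties": False,
}


def check_tool_args(schema: dict[str, Any], args: dict[str, Any]) -> dict[str, list[str]]:
    """Report argument keys the schema does not declare, and required keys that are absent."""
    declared = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    return {
        "unexpected": sorted(set(args) - declared),
        "missing": sorted(required - set(args)),
    }


# A call that mixes in search-and-replace style fields the schema never declared.
report = check_tool_args(
    EDIT_TOOL_SCHEMA,
    {"path": "app.py", "old_text": "foo", "new_text": "bar"},
)
print(report)  # {'unexpected': ['new_text', 'old_text'], 'missing': ['edits']}
```

Logging the `unexpected` list per model version would show whether a newer model drifts toward fields your schema never asked for.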

Why it matters

Training bias toward a proprietary tool schema (such as Claude Code's) can actively degrade a model's performance on custom tool schemas, so a model upgrade can make an existing custom tool work worse than it did before.

The take

This is a critical warning for agent builders. Over-tuning models on specific tool formats (like Claude Code's search/replace or OpenAI's apply_patch) biases them toward those formats. If you build custom coding agents, you may have to adapt your tool schemas to match the proprietary formats the frontier models were trained on, rather than expecting the models to follow your custom JSON schemas flexibly.

Do this

If your custom coding agent's tool calls are failing on newer models, consider refactoring your tool schemas to mimic the formats those models were trained on, such as Claude Code's search-and-replace edits or OpenAI's apply_patch.
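A minimal sketch of what a search-and-replace style edit tool could look like follows. The tool name, the field names (`path`, `old_string`, `new_string`), and the exactly-one-match rule are assumptions chosen for illustration; they are not a confirmed copy of any vendor's schema.

```python
# Hypothetical tool definition; the name and field names are assumptions for illustration.
SEARCH_REPLACE_TOOL = {
    "name": "edit_file",
    "description": "Replace one exact occurrence of old_string with new_string.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "old_string": {"type": "string"},
            "new_string": {"type": "string"},
        },
        "required": ["path", "old_string", "new_string"],
        "additionalProperties": False,
    },
}


def apply_search_replace(text: str, old_string: str, new_string: str) -> str:
    """Replace old_string with new_string, refusing missing or ambiguous matches."""
    if not old_string:
        raise ValueError("old_string must not be empty")
    count = text.count(old_string)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for old_string, found {count}")
    return text.replace(old_string, new_string, 1)


# Usage: apply one edit to in-memory file contents.
updated = apply_search_replace("a = 1\nb = 2\n", "b = 2", "b = 3")
```

Rejecting zero or multiple matches keeps the edit unambiguous, and the error message gives the model something concrete to correct on its next attempt.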

