Hamel Husain Jun 29, 2026 ★ 9/10 signal

“It’s Hard to Eval” Is a Product Smell

eval, agentic

What happened

Hamel Husain argues that difficulty in evaluating AI products is a fundamental product design flaw, not just an engineering hurdle. If an LLM or agent output cannot be easily verified by an eval, it cannot be easily verified by the user either. He advocates for designing products to output 'checkable artifacts' (such as intermediate steps, source data, precise definitions, or code) rather than just raw answers. This makes verification easier for both human users and automated evaluation pipelines.
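One way to picture a checkable artifact is an answer that carries the verbatim source spans it relies on, so that a reader or a test can confirm each span exists. The sketch below is illustrative only: `Citation`, `CheckableAnswer`, `unsupported_citations`, and the sample corpus are hypothetical names invented for this example, not anything taken from the original post.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    doc_id: str  # hypothetical identifier of the cited source document
    quote: str   # verbatim span the answer claims to rely on


@dataclass
class CheckableAnswer:
    answer: str
    citations: list[Citation] = field(default_factory=list)


def unsupported_citations(result: CheckableAnswer, corpus: dict[str, str]) -> list[Citation]:
    """Return citations whose quote does not appear verbatim in the cited document."""
    return [c for c in result.citations if c.quote not in corpus.get(c.doc_id, "")]


# Toy corpus and answer, made up for illustration.
corpus = {"policy": "Refunds are accepted within 30 days of purchase."}
result = CheckableAnswer(
    answer="You can get a refund within 30 days.",
    citations=[
        Citation("policy", "within 30 days of purchase"),
        Citation("policy", "within 90 days"),  # not in the source text
    ],
)
bad = unsupported_citations(result, corpus)
```

The same function can back an automated eval assertion and a "show sources" view in the product, which is the overlap the post points at. A verbatim substring match is a deliberately crude check; it confirms that the quote exists, not that it supports the answer.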

Why it matters

Designing AI systems to produce verifiable intermediate artifacts solves both user trust and automated evaluation bottlenecks at the same time.

The take

This reframes evaluation for AI PMs and engineers: rather than a post-hoc testing problem, it becomes a core product design constraint. An agent that must emit structured intermediate artifacts gives users something concrete to inspect and gives the eval harness something concrete to assert against.

Do this

Review your current LLM or agentic workflows: if you are struggling to write evals for a complex output, redesign the system to expose intermediate, checkable artifacts (like SQL queries or source citations) to both the eval harness and the user.
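As a rough sketch of the SQL case, assume the agent returns the query it ran alongside the number it reports. An eval can then re-run the query and compare the two. Everything here is an assumption made for illustration: `AgentResult`, `check_sql_artifact`, `demo_connection`, and the `orders` table are hypothetical and do not come from the original post.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class AgentResult:
    answer: float  # the number shown to the user
    sql: str       # the query the agent says produced that number


def check_sql_artifact(result: AgentResult, conn: sqlite3.Connection, tol: float = 1e-9) -> bool:
    """Re-run the exposed query and confirm it reproduces the stated answer."""
    row = conn.execute(result.sql).fetchone()
    if row is None or row[0] is None:
        return False
    return abs(row[0] - result.answer) <= tol


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway in-memory table (hypothetical data) to check against."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1, 10.0), (2, 15.5), (3, 4.5)],
    )
    return conn


conn = demo_connection()
good = AgentResult(answer=30.0, sql="SELECT SUM(amount) FROM orders")
wrong = AgentResult(answer=31.0, sql="SELECT SUM(amount) FROM orders")
good_ok = check_sql_artifact(good, conn)
wrong_ok = check_sql_artifact(wrong, conn)
```

This check only confirms that the answer matches the query; whether the query matches the user's question still needs a human or a separate eval. In a real system, agent-written SQL would presumably run against a read-only connection or a sandboxed copy of the data.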

Read the source →
