Why AI evals are the hottest new skill for product builders
Summary of “Why AI evals are the hottest new skill for product builders” | Hamel Husain & Shreya Shankar.
most important concepts (quick view)
- Evals = systematic measurement + improvement of an AI product using your own logs/traces, not just benchmarks.
- Start with manual error analysis (read real traces, write one concise note per trace), then cluster notes into failure modes and count them to prioritize.
- Prefer code-based evaluators; use LLM-as-judge only for narrow, subjective failure modes — and make the judge binary (pass/fail), avoid Likert scales (1–5 ratings).
- Validate the judge against human labels (confusion matrix; see the sketch after this list), then run evaluators in CI and on production samples (weekly/daily) to watch for drift.
- Treat evals as living PRDs: explicit behavior rules you enforce continuously.
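A minimal sketch of the judge-vs-human check mentioned above, assuming you already have paired binary labels per trace (the `human`/`judge` field names are illustrative, not from the talk):

```python
from collections import Counter

def confusion_matrix(labels):
    """Compare a binary LLM judge against human labels.

    `labels` is a list of dicts like {"human": True, "judge": False};
    the schema is a placeholder, not a prescribed format.
    """
    counts = Counter()
    for row in labels:
        counts[(row["human"], row["judge"])] += 1

    tp = counts[(True, True)]    # judge and human both flag the failure
    fn = counts[(True, False)]   # judge misses a failure the human caught
    fp = counts[(False, True)]   # judge flags something the human did not
    tn = counts[(False, False)]  # both agree the trace is fine

    # Report per-class rates rather than raw % agreement, since rare
    # failure modes can hide behind a high overall agreement number.
    tpr = tp / (tp + fn) if (tp + fn) else None  # recall on true failures
    tnr = tn / (tn + fp) if (tn + fp) else None  # specificity on passes
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn, "tpr": tpr, "tnr": tnr}

if __name__ == "__main__":
    sample = [
        {"human": True, "judge": True},
        {"human": True, "judge": False},
        {"human": False, "judge": False},
    ]
    print(confusion_matrix(sample))
```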
core summary
- What evals are: a lightweight, repeatable way to quantify how well your AI app behaves in the wild and to improve it without relying on “vibes.”
- Begin with data, not tests: sample ~40–100 traces, write one “open code” (the first upstream error) per trace, and stop at theoretical saturation (when new issues stop appearing).
- Synthesize failure modes: cluster open codes into a handful of actionable axial codes (e.g., “missed human handoff,” “hallucinated feature,” “conversation flow break”) and count them to pick the top problems to fix first (a counting sketch follows this list).
- Fix obvious issues fast: prompt/format/engineering bugs may not need evaluators.
- Automate evaluators where it matters:
  - Code-based checks (cheap/deterministic) for structure/format/guardrails.
  - LLM-as-judge for a single, specific failure mode with a binary verdict.
- Align the judge to humans: compare judge vs human labels; iterate the judge prompt until disagreements shrink (especially on rare errors). Don’t trust raw “% agreement” alone.
- Operationalize: run evaluators in unit tests/CI and on real production samples to catch regressions and monitor drift, with roughly 30 minutes/week of maintenance.
- Relationship to PRDs & A/B: evaluators become enforceable PRDs; A/B tests complement them by measuring product/business impact at runtime.
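A rough sketch of the “cluster and count” step referenced above, assuming open-code notes have already been tagged with an axial failure mode (the notes, mode names, and field names are invented for illustration):

```python
from collections import Counter

# One open-code note per trace, already tagged with an axial failure mode.
# Both the notes and the mode names below are made-up examples.
open_codes = [
    {"trace_id": "t1", "note": "kept pitching after user asked for a person", "mode": "missed human handoff"},
    {"trace_id": "t2", "note": "claimed a gym the property does not have", "mode": "hallucinated feature"},
    {"trace_id": "t3", "note": "ignored the question and restarted the script", "mode": "conversation flow break"},
    {"trace_id": "t4", "note": "no escalation on a sensitive complaint", "mode": "missed human handoff"},
]

counts = Counter(code["mode"] for code in open_codes)

# The top of this list is your prioritized fix queue.
for mode, n in counts.most_common():
    print(f"{mode}: {n}")
```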
a simple 6-step checklist (you can run this week)
- Sample traces: pull 60–100 recent, diverse conversations/runs.
- Open code: write one short note per trace (first upstream error only).
- Axial code + count: cluster notes into 5–8 failure modes; count each mode (e.g., in a pivot table) to prioritize.
- Quick fixes first: repair obvious prompt/UX/engineering issues immediately.
- Add evaluators: code-based where possible; add 1–3 binary LLM judges for subjective modes.
- Validate + automate: align judges to human labels, then run in CI plus nightly/weekly production sampling (a minimal CI test sketch follows this checklist); review dashboards weekly.
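One way the “run in CI” step could look, sketched with pytest, a hypothetical `load_sample_traces` helper, and a simple code-based check; none of these names come from the talk:

```python
import json
import pytest

def load_sample_traces():
    """Hypothetical helper: return a handful of curated traces.

    In practice this might read fixture files checked into the repo.
    """
    return [
        {"id": "t1", "output": '{"handoff": false, "reply": "Sure, 2pm works."}'},
        {"id": "t2", "output": '{"handoff": true, "reply": "Connecting you now."}'},
    ]

def output_is_valid_json(trace):
    """Code-based evaluator: the model's output must parse as JSON."""
    try:
        json.loads(trace["output"])
        return True
    except json.JSONDecodeError:
        return False

@pytest.mark.parametrize("trace", load_sample_traces(), ids=lambda t: t["id"])
def test_output_is_valid_json(trace):
    assert output_is_valid_json(trace)
```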
implementation tips
- Keep judges binary; avoid “1–5” ratings. They’re slower, fuzzier, and harder to interpret.
- Label only the first upstream error per trace to stay fast and consistent.
- Create a “none of the above” bucket while clustering; it reveals missing failure modes.
- Small team? Appoint a benevolent dictator (domain expert) to make final labeling calls quickly.
- Build a minimal data review UI (or use existing observability tools) to remove friction from trace review; a bare-bones terminal sketch follows these tips.
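If a dedicated review UI is overkill, even a throwaway terminal loop lowers the friction of open coding. This sketch assumes traces are stored as a JSON list on disk; the file name and fields are illustrative:

```python
import json

# Hypothetical input: a JSON file containing a list of {"id": ..., "transcript": ...}.
with open("sampled_traces.json") as f:
    traces = json.load(f)

notes = []
for trace in traces:
    print("=" * 60)
    print(f"Trace {trace['id']}")
    print(trace["transcript"])
    # One short note per trace: the first upstream error only.
    note = input("Open code (enter to skip): ").strip()
    if note:
        notes.append({"trace_id": trace["id"], "note": note})

with open("open_codes.json", "w") as f:
    json.dump(notes, f, indent=2)

print(f"Labeled {len(notes)} of {len(traces)} traces.")
```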
glossary
- Likert scale: a psychometric response scale with ordered options (e.g., 1–5 from “strongly disagree” to “strongly agree”). In evals, avoid such graded scores for judges; prefer binary pass/fail to make results crisp, comparable, and automatable.
- Open coding: free-form note taking on traces to capture observed issues without a predefined taxonomy.
- Axial coding: grouping open-code notes into a small set of actionable failure modes used for counting and prioritization.
- Theoretical saturation: the point in analysis when reviewing more traces stops producing new issue types—your cue to move on.
- LLM-as-judge: using a model to make a narrow, binary evaluation about a specific failure mode (e.g., “Should this have been handed to a human? TRUE/FALSE”).
- Code-based evaluator: a deterministic check written in code (e.g., JSON validity, length limits, schema adherence); see the sketch after this glossary.
- Benevolent dictator: a single domain expert empowered to make fast, consistent labeling and taxonomy decisions.
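A small example of the “code-based evaluator” entry above: deterministic checks for JSON validity, a length cap, and required keys. The schema and limits are placeholders, not recommendations from the talk.

```python
import json

MAX_REPLY_CHARS = 800                  # placeholder limit
REQUIRED_KEYS = {"reply", "handoff"}   # placeholder schema

def check_structure(raw_output: str) -> dict:
    """Return pass/fail results for a few deterministic checks."""
    results = {"valid_json": False, "has_required_keys": False, "within_length": False}
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return results

    results["valid_json"] = True
    results["has_required_keys"] = REQUIRED_KEYS.issubset(data)
    results["within_length"] = len(data.get("reply", "")) <= MAX_REPLY_CHARS
    return results

if __name__ == "__main__":
    print(check_structure('{"reply": "We have 2-bedroom units available.", "handoff": false}'))
```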
minimal judge prompt (example, binary)
Given the full trace, output only TRUE if a human handoff should have occurred; else FALSE. Handoff required if any of: (1) user explicitly requests a human; (2) policy-mandated topics; (3) sensitive complaints/escalations; (4) missing/failed tool data; (5) same-day tour scheduling. Return exactly TRUE or FALSE.
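A sketch of how that prompt might be wired up, assuming a generic `call_llm(prompt) -> str` function stands in for whatever model API you use (it is not a real library call):

```python
JUDGE_PROMPT = """Given the full trace, output only TRUE if a human handoff should have
occurred; else FALSE. Handoff required if any of: (1) user explicitly requests a human;
(2) policy-mandated topics; (3) sensitive complaints/escalations; (4) missing/failed tool
data; (5) same-day tour scheduling. Return exactly TRUE or FALSE.

Trace:
{trace}
"""

def call_llm(prompt: str) -> str:
    """Placeholder for your model provider's API call."""
    raise NotImplementedError

def judge_handoff(trace_text: str) -> bool:
    """Run the binary judge and fail loudly on anything but TRUE/FALSE."""
    raw = call_llm(JUDGE_PROMPT.format(trace=trace_text)).strip().upper()
    if raw not in {"TRUE", "FALSE"}:
        raise ValueError(f"Judge returned non-binary output: {raw!r}")
    return raw == "TRUE"
```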
references & further reading
- Why AI evals are the hottest new skill for product builders - YouTube
- Lenny’s Newsletter post
- Shankar et al., Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences (UIST 2024), arXiv preprint
- Building eval systems that improve your AI product (guest post): Lenny’s Newsletter article