Evals Are the Highest ROI for AI Products: The Complete Workflow from Error Analysis to LLM Judge

(Summary from Lenny’s podcast) https://www.youtube.com/watch?v=BsWxPI9UM4c

🧠 Why “Doing Evals” Is Becoming a Core Skill for AI Products

The clearest message from this conversation: to build great AI products, you need to be good at evals (evaluations). This isn't a "nice-to-have"; it's high-ROI work that can directly determine product success or failure. The host mentioned that two years ago he had never heard the word "evals"; now it's one of the most frequently discussed topics on the show, and the CPOs of both Anthropic and OpenAI say evals are becoming the most important new skill for product builders.

Even more interesting: some of the fastest-growing companies in the world are essentially "building, selling, and providing evals for AI labs." The host mentioned he had just interviewed the CEO of Mercor, which made him feel that "something big is happening."

The guests, Hamel Husain and Shreya Shankar, are key figures in pushing evals from "niche and mysterious" to "required AI product skill." They teach an evals course on Maven that is the #1 ranked course on the platform; they've taught 2,000+ PMs and engineers across 500 companies, including many members of OpenAI and Anthropic teams and people from nearly every major AI lab.

The host's opening endorsement: this episode is the deepest yet most accessible introduction to evals he's heard; it even made him excited to "want to write an eval" despite not having a product that needs one.


📌 What Exactly Are Evals? Not Just “Unit Tests”

The conversation starts by grounding “what is an eval” very practically.

1) Hamel’s Definition: Systematically Measuring and Improving AI Applications

Hamel says evals are a way to systematically measure and improve AI applications. It doesn't need to be intimidating: essentially, you treat your LLM application as a system, do data analysis, and build metrics when necessary, so you can iterate, experiment, and improve.

2) A More Concrete Intuition: From “Guessing” to “Having Feedback Signals”

He gives an example: you built a “real estate assistant” app and might encounter:

  • Emails written not in the style you want
  • Tool calling errors
  • Various unexpected errors

Without evals, you can only:

  • Change the prompt, then “hope it doesn’t break something else”
  • Rely on vibe checks—fine at first, but quickly loses control as the product grows

The value of evals: they give you a feedback signal to follow, letting you improve the product with confidence instead of guessing.

3) The Host’s Analogy: Are Evals Like Unit Tests?

The host imagines evals as “unit tests for LLMs.” Shreya’s response is key: this analogy isn’t entirely wrong, but it’s too narrow.

She says evals are a broad spectrum:

  • Unit tests are just a small part: checking “non-negotiable” functionality
  • But AI assistants often handle open-ended tasks, so you also need to evaluate fuzzier aspects, like:
    • Facing entirely new types of user requests
    • New user segments (distribution shift / new cohorts)
    • Long-term tracking of user feedback (e.g., thumbs up, engagement)
    • Periodically reviewing data to find new problems

Conclusion: unit tests are part of evals, but evals ≠ unit tests.


🧪 “Show, Not Tell”: They Demo Evals Starting Points with Real Product Data

An important feature of this episode: not abstractly explaining concepts, but directly looking at product logs.

1) Case Company: Nurture Boss (Property Management AI Assistant)

Hamel shows a company he’s worked with, Nurture Boss, an AI assistant serving “apartment property managers,” handling:

  • Inbound leads
  • Customer service
  • Booking appointments
  • Multi-channel interactions: chat / text / voice
  • Many tool calls (booking tours, checking availability, etc.)
  • RAG retrieval (fetching resident, property info)

Hamel emphasizes it’s great for teaching because it contains typical complexity of “modern AI applications.” He also mentions the data has been anonymized (e.g., apartment names changed to Acme Apartments), thanking them for allowing its use for teaching.

2) Observability Tools and Traces: First You Need to “See” What the System Is Doing

Hamel loads the logs in an observability tool. He uses Braintrust, but emphasizes that the specific tool doesn't matter, also mentioning:

  • Arize Phoenix (used in the blog-post version of this walkthrough they did with the host; the host mentions that Aman's article also uses Arize Phoenix)
  • LangSmith

These are all viable options.

The screen shows a trace of an interaction: a sequence of recorded events (engineering term). He says the trace concept has existed for a long time, but is especially important for AI applications because systems are composed of multiple components, tools, retrievals, etc.
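
To make the trace idea concrete, here is a minimal illustrative sketch in Python of what one trace could look like as structured data (the field names are hypothetical, not Nurture Boss's or any particular tool's actual schema):

```python
# A minimal, illustrative trace: an ordered list of events for one interaction.
# Field names are hypothetical placeholders, not a real observability schema.
trace = {
    "trace_id": "abc-123",
    "channel": "text",
    "events": [
        {"type": "system_prompt", "content": "You are the Acme Apartments leasing assistant..."},
        {"type": "user_message", "content": "Do you have a 1BR with study?"},
        {"type": "tool_call", "name": "check_availability", "args": {"unit_type": "1BR"}},
        {"type": "tool_result", "content": "[...]"},
        {"type": "assistant_message", "content": "There are several 1BRs, but none listed with a study..."},
    ],
    "notes": [],  # open codes get attached here during error analysis
}
```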

Example 1: Real System Prompt

He shows a system prompt (the host is surprised this is usually a company’s “crown jewels”), including:

  • You are the Acme Apartments leasing team AI assistant
  • Primarily responding to text messages
  • Goal is to provide accurate, helpful information
  • Plus various guidelines for tour scheduling, applications, different persona speaking styles, etc.

Example 2: User Asks About 1BR with Study, AI Response Is Suboptimal

User: "Do you have a 1BR with study? I saw it on virtual tours"

The LLM:

  • Calls a tool to check the individual's information
  • Checks community availability

AI: "There are several 1BRs, but none specifically listed with a study. Here are some options…"
User: "When will one with a study be available?"
AI: "I currently don't have specific information about 1BRs with study availability."
User: "Thanks"
AI: "You're welcome, feel free to ask if you have more questions."

Hamel asks the host: for a lead management product, is this good? Host says “not ideal.” Hamel adds: many people think AI being “honest” about not knowing is correct, but from a product perspective, it should “hand off to a human” or provide more proactive guidance. So the trace note says: “should have handed off to a human.”


📝 First Key Step: Error Analysis = Open Coding

This section is the episode’s core: they emphasize most people make mistakes from the start—jumping straight to writing tests.

1) Why Can’t You Just “Write Tests”?

Hamel says a common trap is “immediately writing tests,” but for LLM applications this usually isn’t the best starting point because:

  • LLM systems are more "stochastic" and have a larger surface area
  • You need to first use data to “ground” yourself: where exactly is it breaking, where does it break most often

So you should first do error analysis: sample the data, take notes on each trace.

2) How to Do Open Coding? Principle: Only Write “The Most Upstream, First Error You See”

Hamel says the approach is actually quite “chill”:

  • Sample first, don’t need to see all logs
  • Wear your product hat while viewing each trace
  • Only write down the first error you see (the most upstream error), then stop and move to the next trace
  • The first few might be painful, but you’ll quickly get faster
  • Every observability tool basically allows writing notes

He even says, "Everyone who does it gets addicted," because you learn so much about your product in very little time.
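
As a rough sketch of the mechanics, assuming your observability tool can export traces as a JSONL file (the file name and fields below are hypothetical), sampling a batch and setting up a note-taking sheet could look like this:

```python
import csv
import json
import random

# Sample ~100 traces for open coding. "traces.jsonl" is a hypothetical export
# (one JSON object per line) from whatever observability tool you use.
with open("traces.jsonl") as f:
    traces = [json.loads(line) for line in f]

sample = random.sample(traces, k=min(100, len(traces)))

# One row per trace; the reviewer fills in a single note per trace:
# only the first, most upstream error they see.
with open("open_codes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["trace_id", "open_code"])
    for trace in sample:
        writer.writerow([trace["trace_id"], ""])
```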

3) Open Coding Demo: Various “Real World Messiness”

He shows different types of errors in succession, presenting “error diversity”:

Example 3: Text Messages Getting Fragmented, Conversation Flow Is Janky

User messages get split into short fragments, sometimes even mid-word, and the AI doesn't know how to respond. Hamel says this is more of a technical issue (a characteristic of the text-message channel), so the note can simply be: "conversation flow is janky because of text message" (he uses "janky" very colloquially, which the host notices is "very informal but works").

Example 4: AI Hallucination: Says They Offer Virtual Tours, But They Don’t

The user asks about 2-3BR units and whether there are virtual tours. The AI's response not only gives apartment prices but also says "we offer virtual tours." In reality, there are no virtual tours at all. So the note is: hallucinating a feature that doesn't exist.


🤖 Common Misconception Debunked: Can You Have AI Do Error Analysis For You?

Host asks: can LLMs replace humans looking at traces?

Shreya's answer is very direct: free-form note-taking isn't something LLMs are good at, because:

  • LLMs will likely say “this trace looks fine”
  • They lack product context: e.g., the virtual tour example, LLM might judge “answered well,” but humans know the feature doesn’t exist

She clarifies: other parts of the workflow may use LLM assistance now or in the future, but at least for now, open coding itself shouldn't be delegated to LLMs.


👑 “Benevolent Dictator”: Avoiding Committee Paralysis

This is a very memorable concept from the interview (host says he loves it): benevolent dictator.

Hamel explains: many teams get stuck in “everyone discusses together, meetings, consensus” quicksand during open coding; but in many situations this is completely unnecessary. What you need is:

  • Designate one person whose taste you trust
  • Keep the process affordable and sustainable
  • Otherwise the cost gets too high, eventually no one does it, and that means losing

Who should be this person?

  • Domain expert for sure
    • Legal → legal counsel / lawyer
    • Mental health → psychiatrists or similar experts
    • This apartment leasing case → someone who understands leasing business
  • Often also the Product Manager (PM)

The host adds: this person might feel it’s unfair (why am I the dictator), but “it’s benevolent”—the point is moving forward quickly, not pursuing perfection.


🔎 How Many Traces Are Enough? 100 Traces, Theoretical Saturation

For “how many traces to review?” they give two levels of answers.

1) Operationally: Start with 100, So You Don’t Get Mentally Stuck

Hamel says they recommend at least 100 traces, not because 100 is a magic number, but because:

  • By 20 traces, you usually already feel it’s super useful and want to continue
  • Saying “100” helps people psychologically overcome the “endless” fear: you know to just finish this batch first

2) Theoretically: Until “You Stop Learning New Things”

Shreya adds the academic term: theoretical saturation, meaning:

  • When viewing more won’t produce new types of notes/concepts
  • Or won’t substantially change what you’ll do next
  • Beginners not having intuition is normal; after a few rounds you’ll build the feeling—some might saturate at 40, 60, or even 15 traces, depending on the product and your experience level

🧩 From Open Codes to Axial Codes: Start Using AI for “Organization and Categorization”

Once you’ve accumulated many open codes (free-form notes), the next step is turning chaos into actionable structure.

1) What Are Axial Codes? = Category Labels for Failure Modes

Shreya explains clearly:

  • Open codes aren’t 100 independent problems—many are just different ways of saying the same type of problem
  • You shouldn’t force a taxonomy during open coding
  • Axial code can be understood as category labels for “failure modes”
  • Goal is to cluster problems into “a few categories,” finding the most common/important ones to attack

2) This Step Can Use LLM Help, But Keep Humans in the Loop

Hamel demos exporting notes to CSV, feeding it to AI tools for analysis. He mentions using three AI tools for classification before recording:

  • Claude (he shows Claude parsing CSV, producing categories)
  • ChatGPT (same prompt works)
  • Julius AI (he likes it because it follows a notebook/Jupyter data science style)

He also mentions a detail: he uses “open codes” and “axial codes” terminology directly in prompts, because these concepts come from social science and have existed for a long time—LLMs know these terms and can help you quickly align on the task.
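
A hedged sketch of what such a prompt might look like in code (the wording is mine, not Hamel's actual prompt, and `call_llm` is a placeholder for whatever model client you use):

```python
def call_llm(prompt: str) -> str:
    """Placeholder: swap in whichever model client you actually use."""
    raise NotImplementedError

def propose_axial_codes(open_codes: list[str]) -> str:
    # Use the social-science terms "open codes" / "axial codes" directly,
    # as Hamel does, since models already know them.
    prompt = (
        "Below are open codes: free-form error notes from reviewing traces of an AI assistant.\n"
        "Cluster them into axial codes: a short list of specific, actionable failure-mode categories.\n"
        "Avoid vague buckets like 'capability limitations'.\n\n"
        + "\n".join(f"- {code}" for code in open_codes)
    )
    return call_llm(prompt)
```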

Claude produces a batch of categories (e.g., capability limitations, misrepresentation, process/protocol violations, human handoff issues, communication quality, etc.). But Hamel says:

  • Some are too broad (e.g., "capability limitations" is too vague to be actionable)
  • Humans need to intervene to rename and make categories more actionable

📊 The Most Basic Yet Powerful: Count + Pivot Table, Turning “Chaos” into “Priorities”

They emphasize an almost “counterintuitive but true” point: the strongest analysis technique is often just basic counting.

The workflow is:

  1. Compile your preferred axial codes into a list (Hamel even demos using Excel formulas to make codes comma-separated)
  2. Use spreadsheet AI features (he uses Gemini in Sheets) to auto-categorize each open note into an axial code
    • Important detail: open codes must be specific enough
    • You can’t just write “janky”—then AI (or even humans) won’t know which kind of janky
  3. Set up a “none of the above” category (Shreya suggests), to detect incomplete taxonomy, prompting you to add new categories or rewrite existing ones
  4. Make a pivot table counting frequency of each category
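
A minimal sketch of steps 2-4 in plain Python, in case you prefer a script over a spreadsheet (the axial codes listed and the `call_llm` helper are illustrative placeholders; in the demo this is done with Gemini in Sheets plus a pivot table):

```python
from collections import Counter

def call_llm(prompt: str) -> str:
    """Placeholder: swap in whichever model client you actually use."""
    raise NotImplementedError

# Illustrative axial codes; yours come from the previous step.
AXIAL_CODES = [
    "conversational flow issues",
    "hallucinated features",
    "missed human handoff",
    "none of the above",  # Shreya's catch-all: flags an incomplete taxonomy
]

def categorize(open_code: str) -> str:
    """Ask the model to assign one open code to exactly one axial code."""
    prompt = (
        "Assign this error note to exactly one category.\n"
        f"Categories: {', '.join(AXIAL_CODES)}\n"
        f"Note: {open_code}\n"
        "Reply with the category name only."
    )
    answer = call_llm(prompt).strip()
    return answer if answer in AXIAL_CODES else "none of the above"

def count_failure_modes(open_codes: list[str]) -> Counter:
    """The pivot-table step: how often each failure mode shows up."""
    return Counter(categorize(code) for code in open_codes)
```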

Results (in their demo data):

  • conversational flow issues appeared 17 times

Hamel says pivot tables are super useful: you can double-click a cell to trace back to all the original data for that category.

But he also reminds: frequency doesn’t necessarily equal priority. Some low-frequency issues might be higher risk and should be fixed first.

He also points out: not every problem needs an eval. Formatting errors might just be unclear prompts—fix the prompt directly; some checks can be done with pure code (no LLM needed), because it’s cheaper.


🧑‍⚖️ Two Types of “Automated Evaluators”: Code-based vs LLM-as-a-Judge

Here they split “automated evaluation” very clearly.

1) Code-based Evaluator: Use Code When You Can

Shreya says that instead of debating "does an eval equal some specific form," you should think in terms of an automated evaluator.

Examples suitable for code:

  • Is output valid JSON
  • Does it comply with Markdown
  • Is it too long/short
  • Does the format match specifications

These can all use Python functions or rule-based approximations: cheap and stable.
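
For instance, a few of those checks written as plain Python functions (a sketch; the specific rules and thresholds are illustrative, not from the episode):

```python
import json

def is_valid_json(output: str) -> bool:
    """Code check: does the output parse as JSON?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def within_length(output: str, max_chars: int = 1200) -> bool:
    """Code check: reply isn't too long for an SMS-style channel (limit is arbitrary)."""
    return len(output) <= max_chars

def is_plain_text(output: str) -> bool:
    """Code check: no markdown artifacts in a plain-text channel."""
    return not any(token in output for token in ("```", "# ", "**"))
```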

2) LLM-as-a-Judge: For “Hard to Express in Code” Narrow Problems

For more subjective, context-dependent failure modes, use LLM as judge, but note two principles:

  • Narrow scope: judge only one failure mode at a time (e.g., “should handoff have occurred”)
  • Binary output: pass/fail, true/false

They strongly oppose 1-5 or 1-7 Likert scales because:

  • They easily become a way of avoiding decisions
  • 3.2 vs 3.7 on a report: no one knows what that means
  • They also lead people to lose trust in evals

Shreya adds that an LLM judge isn't as hard as you might think: because the judge does only one thing and the output is binary, it can be very reliable. It isn't just for CI/unit tests, but also for production monitoring: sampling 1,000 traces daily, running the judge, and tracking the failure rate.


✅ LLM Judge Isn’t Done Once Written: Must Do “Human Annotation Alignment” and Confusion Matrix

This section is their key solution to “why people get burned by evals.”

1) Hamel’s Handoff Judge Prompt

He shows a handoff judge prompt (he says an LLM can help generate it, but humans must review and edit it). The prompt instructs the judge to output true/false and lists when a handoff should occur, such as:

  • User explicitly requests human
  • User is ignored or stuck in loop
  • Policy requires transfer
  • Sensitive resident issues
  • Tool data unavailable
  • Same-day walk-in or tour requests, etc.
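
A hedged sketch of what such a judge could look like in code, condensing the criteria above into a prompt (this is not Hamel's actual prompt; `call_llm` again stands in for your model client):

```python
HANDOFF_JUDGE_PROMPT = """You are reviewing one conversation from an apartment leasing AI assistant.
Decide whether the assistant SHOULD have handed off to a human but did not.
A handoff is required when, for example: the user explicitly asks for a human,
the user is ignored or stuck in a loop, policy requires a transfer, the issue is a
sensitive resident matter, tool data is unavailable, or the user requests a
same-day walk-in or tour.

Conversation:
{conversation}

Answer with exactly one word: true (missed handoff) or false (no missed handoff)."""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in whichever model client you actually use."""
    raise NotImplementedError

def missed_handoff(conversation: str) -> bool:
    """Binary judge for a single, narrow failure mode."""
    answer = call_llm(HANDOFF_JUDGE_PROMPT.format(conversation=conversation))
    return answer.strip().lower().startswith("true")
```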

2) Biggest Landmine: Treating Judge as Gospel, Deploying Directly

Hamel warns: many people write a judge and simply "believe it," assuming that whatever the judge flags as wrong must be wrong. This leaves the evals misaligned with reality; eventually the team loses trust and "anti-eval" sentiment is born.

So you must do judge alignment. The method: compare the judge's decisions against human labels (your annotations from axial coding).

3) Don’t Just Look at “Agreement Percentage”: It Can Be Misleading

Hamel says looking only at agreement is dangerous because errors are often in the long tail. Suppose a certain error occurs in only 10% of traces: a judge that always says "pass" still gets 90% agreement, which looks high but is completely useless.

Correct approach: look at the confusion matrix, focusing on:

  • human=false, judge=true (false positives)
  • human=true, judge=false (false negatives)

That is, the off-diagonal error cells of the matrix. If someone just says "75% agreement is OK" without a matrix and without iterating on the prompt to reduce errors, that's a "bad smell."
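
A minimal sketch of that alignment check in plain Python, using the 10%/90% example from above as toy data:

```python
from collections import Counter

def confusion_matrix(human: list[bool], judge: list[bool]) -> Counter:
    """Counts of (human_label, judge_label) pairs; the off-diagonal cells are the errors."""
    return Counter(zip(human, judge))

# Toy illustration of why raw agreement misleads: the failure occurs in only 10%
# of traces, and a judge that never flags anything still "agrees" 90% of the time.
human = [True] * 10 + [False] * 90   # True = the trace really is a failure
judge = [False] * 100                # lazy judge: always says "pass"

cm = confusion_matrix(human, judge)
agreement = (cm[(True, True)] + cm[(False, False)]) / len(human)
print(f"agreement: {agreement:.0%}")            # 90% -- looks fine
print(f"false negatives: {cm[(True, False)]}")  # 10  -- misses every real failure
print(f"false positives: {cm[(False, True)]}")  # 0
```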

🧾 “Evals Are the New PRD?”—Their Answer: Similar, But More Dynamic, From Real Data

The host makes a powerful observation: this judge prompt looks like a PRD (Product Requirements Document):

  • Explicitly lists behavior requirements
  • Can run automatically and continuously
  • Almost like executable specs for “how the product should respond”

Hamel agrees, but Shreya adds a deeper point: you discover requirements from data that you couldn’t have imagined beforehand. Meaning:

  • Don’t lock in all “expectations” before seeing data
  • Because you don’t know what failure modes will look like
  • You’ll keep revising “what counts as good”

This also echoes Shreya’s research.


📄 Research Support: Shreya’s Paper “Who Validates the Validators”

Hamel specifically introduces a research report: “Who Validates the Validators” by Shreya and collaborators (he calls it “one of the coolest research pieces if you want to understand evals”).

Shreya explains their user study from late 2023, in which developers tried to write LLM judges or validate outputs. The key finding is criteria drift:

  • You can’t fully define the rubric from the start
  • Experts only think of new failure modes after seeing 10 outputs
  • "What's good/bad" changes as you see more data

This is why this wave of AI development is "especially hard": not because ML/AI is new, but because the output space and your understanding of the requirements are constantly evolving.

🧰 When Are Evals “Deployed”? Unit Tests + Production Monitoring + Dashboard

They describe “what comes next” very practically:

  • Put LLM judge into unit tests/CI: run on every code change, ensuring previously seen failure cases don’t recur
  • Do production monitoring: daily sampling running judge, tracking failure rate
  • Build dashboard: long-term visualization of specific failure mode performance
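
A rough sketch of the production-monitoring half (the file name, sample size, and judge function are placeholders; in practice this would be a scheduled job feeding whatever dashboard you already use):

```python
import json
import random
from datetime import date

def missed_handoff(conversation: str) -> bool:
    """Placeholder for an aligned binary judge like the handoff judge above."""
    raise NotImplementedError

def daily_failure_rate(log_path: str = "traces.jsonl", sample_size: int = 1000) -> float:
    """Sample today's traces, run the judge, and return the failure rate."""
    with open(log_path) as f:
        traces = [json.loads(line) for line in f]
    sample = random.sample(traces, k=min(sample_size, len(traces)))
    failures = sum(missed_handoff(t["conversation"]) for t in sample)
    return failures / len(sample)

if __name__ == "__main__":
    # Run on a schedule (e.g., cron) and append to whatever your dashboard reads.
    print(date.today().isoformat(), daily_failure_rate())
```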

Hamel adds a very realistic observation: many companies actually do this systematically but don't talk about it publicly, because it's a moat. If you build an email-writing assistant, this "how to make quality better" methodology and the data assets behind it aren't something you want competitors to copy easily.


⚔️ Why Are Evals So Controversial? They Think It’s Two Main Reasons

The host mentions seeing lots of “drama” on X (Twitter), with people saying don’t do evals, or just rely on vibes.

Shreya's view is conciliatory: people aren't necessarily on opposing sides; more often it's because:

1) Everyone’s Definition of “Evals” Is Too Narrow, Talking Past Each Other

Some think evals only means unit tests; some think only LLM judge; some think it excludes product metrics/monitoring. Different definitions lead to arguments.

2) Many People Have Been Burned by “Poorly Done Evals”

Typical burning:

  • Using Likert scale judges
  • Not doing alignment and confusion matrices
  • Later discovering judge differs greatly from human expectations → lose faith in evals → become anti-eval

Shreya says she totally empathizes, and she’s also “anti-Likert scale LLM judge.”


🧑‍💻 “Claude Code Says They Don’t Do Eval, Just Vibes”—How They Respond

The host brings the controversy to a specific event: someone (the host says possibly the head engineer of Claude Code) went on a podcast saying "we don't do evals, we just vibe." So are evals really necessary?

Shreya’s response has two layers:

  1. Claude Code stands on the shoulders of many evals. The underlying (fine-tuned) Claude model itself was extensively evaluated on various coding benchmarks, with publicly visible results, so you can't deny there's an eval system behind it.

  2. They're likely doing "invisible evals." Even without using the word "eval," they might be doing:

  • User and interaction metrics monitoring (how many users, conversation length, etc.)
  • Internal dogfooding
  • Problems go into a queue and get fed back to the responsible person

These are essentially error analysis and evaluation.

Hamel adds an important distinction: coding agent context is special because:

  • Developers themselves are domain experts
  • Using it all day (strong dogfooding)
  • Can "see the output code" and quickly judge its quality

So its eval process can take more shortcuts; but don't generalize this context to healthcare, legal, etc. (doctors won't tolerate "AI giving wrong advice" as a form of dogfooding).

🧪 “AB Testing vs Evals” Is a Straw Man Argument: AB Test Is Also a Type of Eval

They believe “AB tests vs evals” itself is confused.

  • AB test needs metrics to compare, and metrics are essentially evals
  • Their definition of evals is: systematically measuring quality, so AB test is naturally included

But Shreya reminds: many people do AB tests too early, based on “imagined requirements” rather than first doing error analysis. The right way is:

  • First find real failure modes from data (like text message fragmentation, weird handoff issues)
  • Then use them to form hypotheses and AB tests

Otherwise AB tests might be testing a bunch of unimportant or wrong-direction things.

🏢 Host Follow-up: What Does OpenAI Acquiring Statsig (A/B Tool) Mean?

The host asks whether Statsig being acquired by OpenAI means AB testing is more important. Hamel says he doesn't know the inside story and only speculates it might be strategic; what he wants to emphasize more is:

  • Labs have always done evals (MMLU, HumanEval for foundation models, etc.)
  • OpenAI even analyzes Twitter sentiment and Reddit complaints, trying to pull external signals back to product improvement
  • Hopes future labs will emphasize more “product-specific” evals, not just generic benchmarks (because handoff issues aren’t related to math problem scores)
  • Many labs' eval products are still stuck at generic tools ("cosine similarity," "hallucination score"); long term this isn't enough, and they need more data-science-style error analysis processes

Shreya even half-jokes: “We shouldn’t be the only two people on earth pushing this structured method, it’s absurd.”


❌ Three Most Common Misconceptions

Hamel mentions the two most common:

  1. "Buy a tool, plug it in, and AI will do the evals for me." He says this is the biggest misconception: everyone wants this so badly that there are vendors selling it, but it "doesn't work."

  2. Not looking at the data. He says that when consulting, clients often come with problems, but when he says "let's look at the traces," they freeze. If you actually look at traces, you will 100% learn a lot, and you'll often find the problems directly.

He adds a third point: there's no single right answer for evals; there are many wrong methods, but also more than one right method. You need to design your approach based on product stage, resources, and context; but regardless, "some form of error analysis" is almost indispensable.


✅ Two Most Important Implementation Tips: Goal Is Improving Product, Not Making Eval Look Pretty

Shreya’s advice focuses on mindset:

  • Don’t fear imperfection: eval’s goal isn’t perfection, but “actionably improving the product”
  • Use LLMs throughout the process: to organize your thoughts, improve the PRD, and turn open codes back into better requirement documents

But the key is: don't let AI replace you (you still need humans in the loop).

Hamel’s advice is more tool-oriented:

  • Since "looking at data" has the highest ROI, use AI to help minimize the friction of viewing data. He shows screenshots of Nurture Boss's internal tool interface, including:
    • voice/email/text split by tab
    • Thread list
    • System prompt hidden by default (improves readability)
    • Axial-coding error-count visualization (counts shown in red)

He says these interfaces can be built in a few hours (easier with AI help), and "one size fits all" is hard, so just build whatever fits your needs best.

⏱️ How Long Will This Take? One Week to Start, Then 30 Minutes Weekly

Host asks the most realistic question: how long for the first time?

Shreya gives her rhythm:

  • Initial: 3-4 days, doing multiple rounds of error analysis, lots of labeling, until you can make a spreadsheet like Hamel’s and produce some LLM judge evaluators
  • Ongoing: hook it into tests, write scripts, and set a cron job to run weekly

Maintenance time might be just 30 minutes per week. She admits she's "data hungry" and often spends more time out of curiosity, but it's not required.

🎉 This Process Is Actually “Fun”: The Joy and Cruelty of Wearing the Product Hat

Hamel shares a story from the previous day looking at client data: an app that auto-sends recruiting emails. Opening the trace, they see the email starts with “Given your background, …” He immediately says:

  • “I hate these emails, I see ‘Given your background’ and delete”
  • Client initially felt AI got the name, links, info right—it’s “correct”
  • But from a product perspective, it feels "very generic, canned"

He uses this story to emphasize: the fun of error analysis comes from using product taste to challenge "seemingly correct" outputs.

💡 The Real Claim of This Interview (All Details in One Sentence)

Their entire method isn’t promoting some tool or mysterious engineering—it’s pulling “AI product development” back to a repeatable fundamentals process:

  • First look at traces (observation)
  • Do open coding (manual, domain expert, benevolent dictator)
  • Use LLM for axial coding (organizing, not replacing judgment)
  • Use basic count/pivot table to find priorities
  • Use code for evaluators when you can, not LLM
  • When necessary do LLM-as-judge, but must have binary output, do alignment and confusion matrix
  • Put evaluators in CI and production monitoring, creating improvement flywheel
  • Don’t pursue perfect eval, pursue actionable product improvement

And all controversies (vibes vs evals, AB tests vs evals) in their eyes are mostly definition misunderstandings or getting burned by doing it badly. Truly mature teams, whether they call it evals or not, ultimately do some form of systematic quality measurement—the difference is just: do you have methodology, can it scale, can it be trusted.