
Build an eval harness in an afternoon (no engineering)

Updated Jun 2026 · Calibrated to the strong-hire bar

Evals are the breakout AI PM skill in 2026, and nearly every piece of prep treats them abstractly. This page shows you how to actually build one this afternoon, so you can speak from experience in an eval-design interview and own quality from your first week on the job.

The 2026 reframe first: feasibility is essentially free. The model can usually do the task. Your job as a PM is to define “lovable” precisely enough to measure it and honestly enough to gate a ship on it. An eval harness is the operational artifact of that definition. It is not an engineering deliverable. It is your answer to “what does good mean here?” written in a form that compounds over time. The eval set is the IP, not the prompt.

Step 1: Build a golden set

Start with 50 real inputs from production logs, support tickets, or staged user sessions. Each row needs five columns: input, expected output, actual output (blank until you run the eval), score, and notes. That is the whole schema.

Here is what a concrete row looks like for a support agent:

Input | Expected output | Actual output | Score | Notes
“My subscription renewed but I didn’t get access to the new features” | Acknowledge the issue, check entitlement status, restore access or escalate to billing if access cannot be restored in-session | “Please contact billing at…” | 0.3 | Skipped the entitlement check; escalated prematurely
“Can I transfer my account to another email?” | Confirm it is possible, explain the steps, warn about active sessions | “Yes, go to Settings > Account > Transfer” | 1.0 | Correct, concise, no hallucination

And for a coding assistant:

Input | Expected output | Actual output | Score | Notes
“Write a Python function to deduplicate a list while preserving order” | Working function using dict.fromkeys() or equivalent, no extra explanation unless asked | Correct function plus a four-paragraph essay on deduplication approaches | 0.6 | Correct but over-suggested; faithfulness failure

Weight the set deliberately. A useful composition:

  • 60% core scenarios: the bread-and-butter tasks your feature handles most of the time
  • 25% edge cases: uncommon inputs that reveal brittle behavior
  • 15% adversarial inputs: attempts to jailbreak, confuse, or get the model to hallucinate

Most PMs make their golden sets too easy. The painful tail is where the product actually fails users. Grow toward 100-200 cases over time, but 50 well-chosen rows beat 200 random ones.
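If you export the sheet, the 60/25/15 mix can be checked mechanically. This is a minimal sketch, not a required step. It assumes you add a sixth `category` column (not part of the five-column schema above) with the hypothetical labels `core`, `edge`, and `adversarial`. The 5-point tolerance is also an assumption.

```python
from collections import Counter

# Target mix from the composition above. The category labels are assumed names.
TARGET_MIX = {"core": 0.60, "edge": 0.25, "adversarial": 0.15}


def composition_report(categories, tolerance=0.05):
    """Return {category: (actual_share, target_share, within_tolerance)}."""
    counts = Counter(categories)
    total = len(categories)
    report = {}
    for name, target in TARGET_MIX.items():
        actual = counts.get(name, 0) / total if total else 0.0
        report[name] = (round(actual, 2), target, abs(actual - target) <= tolerance)
    return report


# Illustrative 50-row set: one category label per golden-set row.
golden_set_categories = ["core"] * 30 + ["edge"] * 12 + ["adversarial"] * 8

for name, (actual, target, ok) in composition_report(golden_set_categories).items():
    status = "ok" if ok else "rebalance"
    print(f"{name}: {actual:.0%} (target {target:.0%}) {status}")
```

A set that drifts toward easy cases shows up as `rebalance` on the edge or adversarial line, which is the signal to go find harder rows.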

No-code tools for this step: Google Sheets or Airtable work fine. Rows, not code.

Step 2: Define 2-4 metrics that match your product

For a support agent: resolution accuracy (did the agent actually solve the problem), hallucination rate (did it invent a policy or product feature that does not exist), and escalation appropriateness (when it cannot resolve, did it hand off correctly). For a coding assistant: correctness (does the code run), faithfulness (does it match the user’s intent), and over-suggestion rate (does it pad the response with unrequested changes).
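To make the support-agent metrics concrete, here is a rough sketch of how the three numbers fall out of labeled rows. The field names (`resolved`, `hallucinated`, `needed_escalation`, `escalated_correctly`) are hypothetical labels a reviewer or judge would fill in. They are not part of any tool's schema.

```python
def support_metrics(rows):
    """Compute the three support-agent metrics from labeled rows."""
    if not rows:
        raise ValueError("need at least one labeled row")
    total = len(rows)
    resolved = sum(1 for r in rows if r["resolved"])
    hallucinated = sum(1 for r in rows if r["hallucinated"])
    # Escalation is only judged on rows where a handoff was actually needed.
    needing = [r for r in rows if r["needed_escalation"]]
    correct_handoffs = sum(1 for r in needing if r["escalated_correctly"])
    return {
        "resolution_accuracy": resolved / total,
        "hallucination_rate": hallucinated / total,
        "escalation_appropriateness": (
            correct_handoffs / len(needing) if needing else None
        ),
    }


# Four illustrative rows, invented for this example.
rows = [
    {"resolved": True, "hallucinated": False, "needed_escalation": False, "escalated_correctly": False},
    {"resolved": False, "hallucinated": True, "needed_escalation": True, "escalated_correctly": False},
    {"resolved": False, "hallucinated": False, "needed_escalation": True, "escalated_correctly": True},
    {"resolved": True, "hallucinated": False, "needed_escalation": False, "escalated_correctly": False},
]
print(support_metrics(rows))
```

Note the denominator choice: escalation appropriateness is computed only over rows that needed a handoff, so a run with no such rows reports `None` rather than a misleading 0% or 100%.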

Avoid measuring what is easy instead of what matters. Format checks and latency are worth tracking, but they are not the eval. If your only metrics are structural, your eval is covering for a harder conversation about what “good” actually means.

Step 3: Pick a grader and calibrate it before you automate

Use exact match where there is a binary right answer. For open-ended outputs, use an LLM-as-judge with a written rubric: for each dimension, one quality criterion described in 1-2 plain-English sentences. This is the G-Eval approach, available as a metric in DeepEval, the open-source framework from Confident AI. You describe what you want, and the judge scores it. No code required.
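None of this needs code, but it helps to see how little the two graders amount to. The sketch below is illustrative only. The rubric wording, the function names, and the prompt layout are all assumptions, and it does not use the DeepEval or G-Eval API. It only builds the text you would hand to whichever judge model your tool calls.

```python
def exact_match(expected, actual):
    """Binary grader: 1.0 if the answers match after trimming case and spacing."""
    def normalize(text):
        return " ".join(text.lower().split())
    return 1.0 if normalize(expected) == normalize(actual) else 0.0


# One plain-English criterion per dimension (hypothetical wording).
RUBRIC = {
    "resolution": "The reply solves the user's stated problem or hands it off with a clear next step.",
    "no_fabrication": "The reply does not state any policy or product feature that is absent from the reference answer.",
}


def build_judge_prompt(user_input, expected, actual, rubric=RUBRIC):
    """Assemble the instructions a judge model would receive for one row."""
    criteria = "\n".join(f"- {name}: {text}" for name, text in rubric.items())
    return (
        "Score the reply from 0.0 to 1.0 on each criterion and explain briefly.\n"
        f"Criteria:\n{criteria}\n\n"
        f"User input: {user_input}\n"
        f"Reference answer: {expected}\n"
        f"Reply to grade: {actual}\n"
    )


print(exact_match("Yes", "  yes "))
print(build_judge_prompt(
    "Can I transfer my account to another email?",
    "Confirm it is possible, explain the steps, warn about active sessions",
    "Yes, go to Settings > Account > Transfer",
))
```

The rubric dictionary is the part a PM owns. Everything else is plumbing that a no-code tool handles for you.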

But do not automate the judge before calibrating it. Have 2-3 people independently score the same 20 outputs against your rubric. Then look for divergence. Where human scorers disagree, your rubric is ambiguous. Fix the rubric before you trust the automated judge to run at scale. This one step is what separates a PM who has built an eval from one who has only read about them.
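The divergence check can be done by eye in a spreadsheet, or with a few lines like the sketch below. The rater names, the 0-to-1 scoring scale, and the 0.25 spread threshold are assumptions for illustration. The example uses five outputs instead of 20 to stay short.

```python
def divergent_items(scores_by_rater, max_spread=0.25):
    """Return (item_index, spread) for outputs where raters disagree too much."""
    raters = list(scores_by_rater)
    lengths = {len(scores_by_rater[r]) for r in raters}
    if len(lengths) != 1:
        raise ValueError("every rater must score the same outputs")
    flagged = []
    for i in range(lengths.pop()):
        item_scores = [scores_by_rater[r][i] for r in raters]
        spread = max(item_scores) - min(item_scores)
        if spread > max_spread:
            flagged.append((i, round(spread, 2)))
    return flagged


# Three people independently scoring the same five outputs (invented numbers).
scores = {
    "pm": [1.0, 0.5, 0.0, 1.0, 0.5],
    "support_lead": [1.0, 0.5, 0.5, 1.0, 0.5],
    "designer": [1.0, 0.0, 0.0, 1.0, 0.5],
}
for index, spread in divergent_items(scores):
    print(f"output {index}: spread {spread}, revisit the rubric wording")
```

Each flagged output is a prompt for a conversation: read the rubric sentence aloud, ask each rater why they scored it the way they did, and rewrite the sentence until the disagreement goes away.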

Three-layer metric stack in practice:

  1. Automated format checks (fast, cheap, run every time)
  2. LLM-as-judge on the 2-4 dimensions you defined in Step 2 (the core of your harness)
  3. Weekly human audit of a random sample to catch judge blind spots

Tools by situation: Braintrust has a no-code UI and works well for teams already in experiment-tracking workflows. Confident AI is the fastest starting point for PM-led evals with no engineering help. LangSmith (from LangChain) pairs tracing with evals, useful when you need to see the full chain of calls, not just inputs and outputs. Promptfoo is open-source and YAML-driven: strong for prompt comparison, though it takes slightly more setup than the others. For the golden set itself, Google Sheets or Airtable is fine until you need automation.

Step 4: Set a launch bar and document who owns it

Green, yellow, and red are product decisions, not engineering ones.

  • Green: all dimensions meet thresholds, no regressions vs. the previous version. Ship.
  • Yellow: core scenarios pass, edge cases are below threshold but monitored. Ship with caveats.
  • Red: any safety failure, core scenarios below threshold, or significant regression. Do not ship.

The threshold is yours to set and defend. For a support agent, you might draw the line at 85% resolution accuracy and zero tolerance on fabricated policy language. Document the number, who signed off, and what “red” triggers in practice. If you cannot defend why you chose the threshold, the number is not doing its job.
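Written down as a rule, the green/yellow/red bar above might look like the sketch below. The 85% core threshold comes from the support-agent example. The separate edge-case threshold, the parameter names, and the idea of passing "significant regression" in as a yes/no flag are assumptions, since the text leaves those choices to you.

```python
def launch_status(core_accuracy, edge_accuracy, safety_failures, regressed,
                  core_bar=0.85, edge_bar=0.85):
    """Map eval results to green / yellow / red per the launch bar."""
    # Red: any safety failure, core below threshold, or significant regression.
    if safety_failures > 0 or core_accuracy < core_bar or regressed:
        return "red"
    # Yellow: core passes but edge cases are below threshold (ship with caveats).
    if edge_accuracy < edge_bar:
        return "yellow"
    # Green: every dimension meets its threshold and nothing regressed.
    return "green"


print(launch_status(core_accuracy=0.90, edge_accuracy=0.70,
                    safety_failures=0, regressed=False))
```

Putting the defaults in one place has a side benefit: the thresholds become a visible, reviewable line that someone has to sign off on, not a number that lives in a slide.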

You do not need to write evaluation code. You must own the eval strategy: deciding what to measure, building the test dataset, setting pass/fail thresholds, and interpreting the results to make ship decisions. That ownership is the PM’s job, regardless of who runs the tooling.

Step 5: Close the loop (this is the compounding asset)

This step gets one bullet on most pages. It deserves its own section.

Every production failure that reaches a real user is a gap in your golden set. When something goes wrong in production, add that input to the golden set with the failure mode labeled. The eval then catches that class of failure on every future prompt change. The set becomes harder over time. The model that passes your eval in six months has cleared a bar the model could not clear at launch.

This is the flywheel: production failures feed the test set, the test set gates future ships, future ships produce fewer failures. A PM who owns this loop owns the quality trajectory of the product, not just a snapshot of it.
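In a spreadsheet, closing the loop is just "add a row." For teams that later automate it, the same move can be sketched as below. The dictionary keys mirror the five-column schema from Step 1. The function name, the duplicate check, and leaving `expected_output` blank for a human to fill in are assumptions.

```python
def add_production_failure(golden_set, user_input, bad_output, failure_mode):
    """Append a production failure as a new golden-set row. Skip duplicates."""
    def normalize(text):
        return " ".join(text.lower().split())

    if any(normalize(row["input"]) == normalize(user_input) for row in golden_set):
        return False  # this input is already covered
    golden_set.append({
        "input": user_input,
        "expected_output": "",  # a human writes the correct behavior later
        "actual_output": bad_output,
        "score": None,  # unscored until the next eval run
        "notes": f"production failure: {failure_mode}",
    })
    return True


golden_set = []
added = add_production_failure(
    golden_set,
    "Can I get a refund after 45 days?",
    "Yes, refunds are available for 90 days.",
    "fabricated policy",
)
print(added, len(golden_set))
```

The label in `notes` matters as much as the row itself, because it lets you later count failures by class and see which ones the eval has stopped letting through.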

What interviewers actually ask and what clears the bar

At AI-native companies (Anthropic, OpenAI, Glean, Cursor, Perplexity), interviewers can distinguish a PM who has built an eval from one who has only read about evals within the first 90 seconds. The tell is in the specifics.

When asked “walk me through how you’d build an eval for our support agent,” the strong answer names: the composition of the golden set and why it reserves a deliberate share for edge cases and adversarial inputs, the specific metrics chosen and why (not a generic list), the calibration protocol before automating the judge, the launch bar and who signed off on it, and how production failures feed back into the golden set.

strong

"I'd start with a golden set: 50 real support tickets with the correct resolution labeled, split 60/25/15 across core cases, edge cases, and adversarial inputs. I'd define three metrics: resolution accuracy, hallucination rate (specifically fabricated policy language), and escalation appropriateness. For grading: exact match on binary questions, and an LLM-as-judge with a written rubric for open-ended responses. Before I trust the automated judge, I'd have two teammates independently score the same 20 outputs, then fix any rubric language where our scores diverged. I'd set a launch bar: above 85% resolution accuracy, zero tolerance on fabricated policy. I'd document who signed off on that threshold. Then I'd close the loop: every production failure gets added to the golden set so the eval gets harder over time, not stale."

weak

"I'd define what good looks like, build a test dataset, and use an LLM-as-judge to score outputs." This names the right concepts but cannot answer how to weight the test set, how to calibrate the judge before automating it, or how to set and defend a launch bar. It signals the candidate has read about evals but has not built one, which is exactly what the interviewer is probing for.

What not to do

  • Edit a prompt without re-running evals. This is the most common PM mistake and it is invisible until production breaks.
  • Pick a variant “by feel.” If you have an eval, use it. If you do not run it, you do not have one.
  • Let the golden set go stale. An eval built at launch and never updated reflects the product you shipped, not the one you are running today.
  • Measure only format and latency. These are table stakes, not quality.
  • Run evals only before launch. For active products, weekly is the minimum cadence. High-traffic or high-risk workflows may need daily checks.
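The first two mistakes in this list come down to skipping a comparison between the old prompt and the new one. A rough sketch of that comparison follows. It assumes every metric is "higher is better" (a rate like hallucination would need to be inverted first). The 2-point tolerance and the metric names and numbers are illustrative.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {metric: drop} for metrics that fell by more than the tolerance."""
    drops = {}
    for metric, old_score in baseline.items():
        new_score = candidate.get(metric)
        if new_score is None:
            drops[metric] = None  # metric missing from the new run
            continue
        drop = round(old_score - new_score, 4)
        if drop > tolerance:
            drops[metric] = drop
    return drops


# Scores from the previous prompt and from the edited prompt (invented numbers).
baseline = {"resolution_accuracy": 0.88, "escalation_appropriateness": 0.90}
candidate = {"resolution_accuracy": 0.84, "escalation_appropriateness": 0.91}
print(find_regressions(baseline, candidate))
```

An empty result means the edit did not measurably hurt anything on the golden set. A non-empty one is the evidence to bring to the ship discussion in place of a judgment "by feel."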

Why this is the moat

When the model is a commodity, the eval is what you own. It is your definition of “good,” your safety gate, and your flywheel. A PM who has built one reasons about AI quality concretely. A PM who has not waves at “it seems good.” That gap is audible in seconds in an interview and visible in weeks on the job.