Shipping an LLM-powered feature without an evaluation framework is the ML equivalent of deploying without tests. A prompt change that “seems fine” in dev introduces a regression in a tone edge case you didn’t test. A model upgrade improves average quality but degrades on a specific task segment. Without evals, you find out from users.
The three-layer eval stack
1. Unit evals
Deterministic assertions on known inputs. If the output for a specific input should always contain a citation, assert it. If it should never start with “I”, assert that. These run in CI on every prompt change and take seconds.
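A minimal sketch of the two assertions mentioned above, assuming citations appear as bracketed numbers such as [1]. The function names and the citation format are illustrative assumptions, not a prescribed convention.

```python
import re


def has_citation(output: str) -> bool:
    """True if the output contains at least one bracketed citation like [1]."""
    # Assumption: citations are rendered as [<number>]; adjust to your format.
    return re.search(r"\[\d+\]", output) is not None


def starts_with_i(output: str) -> bool:
    """True if the first word is the pronoun "I" (also matches "I'm", "I'll")."""
    # \b stops "In" or "It" from matching, since "I" must be a whole word.
    return re.match(r"\s*I\b", output) is not None


def run_unit_evals(output: str) -> list[str]:
    """Return the names of failed checks; an empty list means the output passed."""
    failures = []
    if not has_citation(output):
        failures.append("missing citation")
    if starts_with_i(output):
        failures.append("starts with 'I'")
    return failures
```

Wrapped in a test runner, checks like these can run against a fixed set of inputs on every prompt change.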
2. Model-graded evals
For qualities that can’t be asserted deterministically (helpfulness, tone, factual grounding), we use a judge model with a rubric. The judge prompt is versioned alongside the application prompt. These are slower and noisier, but catch regressions unit evals miss.
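One way a model-graded eval could look, as a sketch under stated assumptions. The `judge` argument is a hypothetical callable that sends a prompt to whatever judge model you use and returns its raw text; the rubric wording, the 1 to 5 scale, the JSON reply format, and the version string are all illustrative. Taking the median of several samples is one simple way to dampen judge noise, not the only one.

```python
import json
import statistics
from typing import Callable

# Hypothetical version tag, bumped whenever the rubric text changes.
JUDGE_PROMPT_VERSION = "v1"

# Illustrative rubric; doubled braces are literal braces after .format().
JUDGE_PROMPT = (
    "You are grading an assistant reply for helpfulness, tone, and factual grounding.\n"
    "Score it from 1 (poor) to 5 (excellent).\n"
    'Reply with JSON only: {{"score": <1-5>, "reason": "<one sentence>"}}\n\n'
    "User input:\n{user_input}\n\nAssistant reply:\n{output}\n"
)


def parse_score(raw: str) -> int:
    """Extract the integer score from the judge's JSON reply."""
    score = int(json.loads(raw)["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score


def grade(judge: Callable[[str], str], user_input: str, output: str, samples: int = 3) -> float:
    """Ask the judge several times and return the median score."""
    prompt = JUDGE_PROMPT.format(user_input=user_input, output=output)
    scores = [parse_score(judge(prompt)) for _ in range(samples)]
    return statistics.median(scores)
```

Passing the judge in as a callable keeps the grading logic testable with a stub, and lets the judge model change without touching the rubric code.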
3. Human evals
A sample of real production outputs, rated by a small panel on a defined rubric. We run these before any major prompt change or model upgrade. They’re expensive but they’re ground truth.
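The mechanical parts of a human eval round, drawing the sample and summarizing the panel's ratings, can be sketched as follows. The function names, the fixed seed, and the use of a plain mean per rubric criterion are assumptions for illustration.

```python
import random
import statistics


def sample_for_review(outputs: list[str], k: int, seed: int = 0) -> list[str]:
    """Draw up to k outputs without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))


def aggregate(ratings: dict[str, list[int]]) -> dict[str, float]:
    """Average the panel's scores for each rubric criterion."""
    return {criterion: statistics.mean(scores) for criterion, scores in ratings.items()}
```

A seeded sample means the same batch can be re-rated later, for example after a rubric revision.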
The golden dataset
The foundation of all three layers is a golden dataset of 200–500 examples: real user inputs, expected output characteristics, and known failure cases. Building this dataset is the hardest and most important work in LLM evaluation. It compounds over time, because every production failure becomes a new golden example.
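As a rough sketch, a golden example could be stored as one JSON object per line. Every field name below (`must_contain`, `must_not_contain`, `never_fail`, `source`) is a hypothetical choice for illustration. Substring checks are only one simple way to encode "expected output characteristics".

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenExample:
    id: str
    user_input: str
    must_contain: list[str] = field(default_factory=list)      # expected characteristics
    must_not_contain: list[str] = field(default_factory=list)  # known failure markers
    never_fail: bool = False   # True for inputs that must pass on every change
    source: str = "curated"    # e.g. "production_failure" for examples added after an incident


def check(example: GoldenExample, output: str) -> bool:
    """Case-insensitive check of the expected and forbidden substrings."""
    lowered = output.lower()
    has_required = all(s.lower() in lowered for s in example.must_contain)
    has_forbidden = any(s.lower() in lowered for s in example.must_not_contain)
    return has_required and not has_forbidden


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize examples as JSON Lines, one example per line."""
    return "\n".join(json.dumps(asdict(e)) for e in examples)


def from_jsonl(text: str) -> list[GoldenExample]:
    """Parse JSON Lines back into examples, skipping blank lines."""
    return [GoldenExample(**json.loads(line)) for line in text.splitlines() if line.strip()]
```

Keeping the file as JSON Lines under version control makes each new example a one-line diff, which suits a dataset that grows with every production failure.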
Making deployment decisions
We set a threshold: a change must not regress by more than 2% on the golden dataset, and it must not introduce any new failures on a list of “never fail” inputs. If it clears both bars, it ships; if it misses either one, it doesn’t.
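The ship/no-ship rule can be written as a small gate function. This sketch makes two assumptions the text leaves open: "2%" is read as a drop in pass rate measured in percentage points of the dataset, and a "new failure" means a never-fail input that passed on the baseline but fails on the candidate. The function name and the result format (example id mapped to pass/fail) are hypothetical.

```python
def should_ship(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    never_fail_ids: set[str],
    max_regression: float = 0.02,
) -> bool:
    """Apply both deployment conditions to per-example pass/fail results."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same examples")

    # Compare pass counts rather than rates to avoid float rounding at the threshold.
    dropped = sum(baseline.values()) - sum(candidate.values())
    regression_ok = dropped <= max_regression * len(baseline)

    # A new failure: passed on the baseline, fails on the candidate.
    new_failures = [i for i in never_fail_ids if baseline[i] and not candidate[i]]

    return regression_ok and not new_failures
```

For a 100-example dataset, a candidate that newly fails one ordinary example would pass this gate, while one that newly fails a single never-fail input would not, regardless of its overall pass rate.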