method · engineering · qa · product · origin: AI-native practice, ~2023 onward · upd 2026-06-11

Eval-driven development EMERGES

Executable definitions of correct — the discipline the whole engine turns on.

What it was for

Nothing — that is the point. Evaluation engineering has no pre-AI ancestor because it answers a problem that did not exist: how do you maintain quality when the producer works at a speed no reviewer can match and with a confidence no tone can betray?

The verdict

EMERGES — and sits closer to the centre of the engine than any inherited practice. Evaluations are the load-bearing half of the gear mesh: TDD's logic generalised beyond code correctness to product behaviour, tone, safety, cost, and judgment calls that used to live only in a senior reviewer's head. They are how the inner loop verifies its own revolutions, and how human standards get enforced at machine speed.

What changes

Everything downstream of "how do we know it's right?" The craft has its own emerging stack — golden datasets, graded rubrics, regression suites for behaviour rather than functions — and its own failure modes: evals that overfit, evals that drift, evals that measure the measurable instead of the important. Writing them well is applied epistemology with a build pipeline, and it is the most defensible new skill on the board.

The strongest objection

Not everything that matters can be made executable; chasing total coverage produces metric theatre. Fully conceded — evaluations bound the space where machines may move freely; human judgment governs the rest. Knowing where that boundary sits is itself the senior skill.

Clarified: evaluation is not itself new — acceptance tests, fuzzing, golden datasets, canary analysis, contract tests, and SRE error budgets all predate the engine. What emerges is evaluation as the dominant control plane for machine-produced work: executable judgment over behaviour, tone, safety, and cost, maintained as a living asset rather than a release gate.

Falsification: downgrade to TRANSFORMS if what teams adopt turns out to be existing test and observability practice rescaled, rather than a new authoring discipline.

Related: Code review · Manual QA as a phase · Test-driven development

EVAL-DRIVEN DEVELOPMENT
DWG NO: TSE-LDG-EVAL-DRIVEN-
REV: 1.1 · 2026-06-11
SCALE: 1 : N
The Two-Speed Engine · from The Product GuyEN-IN · changelog forthcoming · ratio 1 : N