Eval-driven development → EMERGES

What it was for

Nothing — that is the point. Evaluation engineering has no pre-AI ancestor because it answers a problem that did not exist: how do you maintain quality when the producer works at a speed no reviewer can match and with a confidence no tone can betray?

The verdict

EMERGES — and sits closer to the centre of the engine than any inherited practice. Evaluations are the load-bearing half of the gear mesh: TDD's logic generalised beyond code correctness to product behaviour, tone, safety, cost, and judgment calls that used to live only in a senior reviewer's head. They are how the inner loop verifies its own revolutions, and how human standards get enforced at machine speed.

What changes

Everything downstream of "how do we know it's right?" The craft has its own emerging stack (golden datasets, graded rubrics, regression suites for behaviour rather than functions) and its own failure modes: evals that overfit, evals that drift, evals that measure the measurable instead of the important. The surface it has to span is everything a good reviewer used to hold in their head at the same time: not just functional correctness but security, performance, accessibility, maintainability, cost, and comprehensibility, each reduced to a check that runs rather than a quality you eyeball. Writing evals well is applied epistemology with a build pipeline, and it is the most defensible new skill on the board.
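To make the stack concrete, here is a minimal sketch of a golden dataset, a graded rubric, and a behavioural regression check. Everything named in it is hypothetical and chosen for illustration: GoldenCase, grade, find_regressions, the refund prompt, and the two stub producers standing in for a model. A real rubric would grade far more than phrase matching; the point is only that a reviewer's judgment becomes a score that runs.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One hand-curated example plus the rubric a reviewer would apply to it."""
    prompt: str
    must_include: tuple[str, ...] = ()   # phrases a good answer contains
    must_avoid: tuple[str, ...] = ()     # phrases that signal a tone or safety failure


def grade(case: GoldenCase, output: str) -> float:
    """Graded rubric: fraction of rubric checks the output passes, from 0.0 to 1.0."""
    text = output.lower()
    checks = [phrase.lower() in text for phrase in case.must_include]
    checks += [phrase.lower() not in text for phrase in case.must_avoid]
    return sum(checks) / len(checks) if checks else 1.0


def find_regressions(
    producer: Callable[[str], str],
    cases: list[GoldenCase],
    baseline: dict[str, float],
) -> list[str]:
    """Return the prompts whose grade fell below the recorded baseline."""
    regressions = []
    for case in cases:
        if grade(case, producer(case.prompt)) < baseline[case.prompt]:
            regressions.append(case.prompt)
    return regressions


# Hypothetical golden dataset with a single case (illustrative data only).
GOLDEN = [
    GoldenCase(
        prompt="What is the refund window?",
        must_include=("30 days",),
        must_avoid=("obviously", "guarantee"),
    )
]


def good_producer(prompt: str) -> str:
    """Stub standing in for a model whose behaviour is acceptable."""
    return "Refunds are accepted within 30 days of purchase."


def drifted_producer(prompt: str) -> str:
    """Stub standing in for a model whose behaviour has drifted."""
    return "Obviously we guarantee a refund."


# The baseline is recorded once from behaviour that a human has accepted.
BASELINE = {case.prompt: grade(case, good_producer(case.prompt)) for case in GOLDEN}

if __name__ == "__main__":
    print(find_regressions(good_producer, GOLDEN, BASELINE))
    print(find_regressions(drifted_producer, GOLDEN, BASELINE))
```

The sketch also shows where the failure modes come from: a rubric this narrow is easy to overfit, and a baseline that is never re-recorded will drift away from what reviewers now accept.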

The strongest objection

Not everything that matters can be made executable; chasing total coverage produces metric theatre. Fully conceded — evaluations bound the space where machines may move freely; human judgment governs the rest. Knowing where that boundary sits is itself the senior skill.

Clarified: evaluation is not itself new — acceptance tests, fuzzing, golden datasets, canary analysis, contract tests, and SRE error budgets all predate the engine. What emerges is evaluation as the dominant control plane for machine-produced work: executable judgment over behaviour, tone, safety, and cost, maintained as a living asset rather than a release gate.

Falsification: downgrade to TRANSFORMS if what teams adopt turns out to be existing test and observability practice rescaled, rather than a new authoring discipline.

Corroboration — "set the constraints" (Osmani, 2026)

Addy Osmani reaches this verdict from the harness side: "when code generation scales beyond review, quality … has to live somewhere else. It moves into the harness, environment and operating system around the agent." The checks he names (unit, property, acceptance and mutation tests, plus quality metrics) line up closely with this entry's stack, reframed as back-pressure: constraints that "resist bad work before it becomes somebody else's problem." Two things are worth keeping. First, the taxonomy is explicit and composable: correctness, maintainability, security, performance (his diagram adds accessibility, cost, and comprehensibility), each a gate with its own checks, wired into a single exit gate where only output that clears every gate ships. Second, it is a rate-limiter, not a wall: you hand a loop only as much autonomy as you can cheaply verify. This is the same claim as the verdict, reached independently: evaluation is the control plane, and it lives in the harness around the agent, not the review after it. (Osmani, "Set the constraints around your agents," 2026; harness thesis in "Agent Harness Engineering," Apr 2026.)
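The composable taxonomy and the rate-limiter can be sketched in a few lines. This is an illustration under stated assumptions, not Osmani's implementation: exit_gate, GATES, max_unattended_steps, and the three toy checks are all hypothetical names, and reading "as much autonomy as you can cheaply verify" as a step budget is one possible interpretation.

```python
from typing import Callable, Dict, List, Tuple

# A check takes the produced artifact (here, source text) and returns pass/fail.
Check = Callable[[str], bool]

# Hypothetical toy checks; real gates would run tests, scanners, and linters.
GATES: Dict[str, List[Check]] = {
    "correctness": [lambda src: "def " in src],
    "security": [lambda src: "eval(" not in src],
    "maintainability": [lambda src: all(len(line) <= 100 for line in src.splitlines())],
}


def exit_gate(output: str, gates: Dict[str, List[Check]]) -> Tuple[bool, List[str]]:
    """Single exit gate: the output ships only if every named gate passes.

    Returns (ships, names_of_failed_gates).
    """
    failed = [
        name
        for name, checks in gates.items()
        if not all(check(output) for check in checks)
    ]
    return (not failed, failed)


def max_unattended_steps(budget_seconds: int, verify_seconds_per_step: int) -> int:
    """Rate-limiter (assumed reading): autonomy is capped by what can be verified.

    The loop may take only as many steps as the verification budget covers.
    """
    if verify_seconds_per_step <= 0:
        raise ValueError("verification cost per step must be positive")
    return budget_seconds // verify_seconds_per_step


if __name__ == "__main__":
    safe = "def add(a, b):\n    return a + b\n"
    risky = "def run(s):\n    return eval(s)\n"
    print(exit_gate(safe, GATES))
    print(exit_gate(risky, GATES))
    print(max_unattended_steps(600, 45))
```

Because each gate is a named list of checks, a new concern such as accessibility or cost is added as another dictionary entry, without touching the exit gate itself.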