What it was for
Nothing, and that is the point. Evaluation engineering has no pre-AI ancestor because it answers a problem that did not exist: how do you maintain quality when the producer works at a speed no reviewer can match, and with a confidence whose tone never wavers, even when the output is wrong?
The verdict
EMERGES — and sits closer to the centre of the engine than any inherited practice. Evaluations are the load-bearing half of the gear mesh: TDD's logic generalised beyond code correctness to product behaviour, tone, safety, cost, and judgment calls that used to live only in a senior reviewer's head. They are how the inner loop verifies its own revolutions, and how human standards get enforced at machine speed.
What changes
Everything downstream of "how do we know it's right?" The craft has its own emerging stack — golden datasets, graded rubrics, regression suites for behaviour rather than functions — and its own failure modes: evals that overfit, evals that drift, evals that measure the measurable instead of the important. Writing them well is applied epistemology with a build pipeline, and it is the most defensible new skill on the board.
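To make the stack concrete, here is a minimal sketch in Python of a golden dataset, a weighted rubric, and a behavioural regression run. Everything named in it is a hypothetical stand-in chosen for illustration: `GoldenCase`, `Criterion`, `run_suite`, the stubbed producer, the weights, and the 0.8 threshold. It shows one plausible shape for the idea, not a standard harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GoldenCase:
    """One entry in a golden dataset: a prompt plus a fact the reply must contain."""
    prompt: str
    must_mention: str


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a name, a weight, and a check scoring an output in [0, 1]."""
    name: str
    weight: float
    check: Callable[[GoldenCase, str], float]


def mentions_required_fact(case: GoldenCase, output: str) -> float:
    return 1.0 if case.must_mention.lower() in output.lower() else 0.0


def avoids_overpromising(case: GoldenCase, output: str) -> float:
    # Crude tone check (assumption): flag a few absolute words.
    banned = ("guarantee", "always", "never fails")
    return 0.0 if any(word in output.lower() for word in banned) else 1.0


def stays_concise(case: GoldenCase, output: str) -> float:
    # Crude cost proxy (assumption): cap the reply at 40 words.
    return 1.0 if len(output.split()) <= 40 else 0.0


RUBRIC = [
    Criterion("correctness", 0.5, mentions_required_fact),
    Criterion("tone", 0.3, avoids_overpromising),
    Criterion("cost", 0.2, stays_concise),
]

GOLDEN = [
    GoldenCase("How long do refunds take?", "5 business days"),
    GoldenCase("Can I export my data?", "CSV"),
]


def grade(case: GoldenCase, output: str, rubric: List[Criterion]) -> float:
    """Weighted average of the rubric checks, normalised to [0, 1]."""
    total_weight = sum(c.weight for c in rubric)
    return sum(c.weight * c.check(case, output) for c in rubric) / total_weight


def run_suite(
    producer: Callable[[str], str],
    cases: List[GoldenCase],
    rubric: List[Criterion],
    threshold: float = 0.8,
) -> List[Tuple[str, float]]:
    """Return (prompt, score) for every case scoring below the threshold."""
    failures = []
    for case in cases:
        score = grade(case, producer(case.prompt), rubric)
        if score < threshold:
            failures.append((case.prompt, score))
    return failures


def stub_producer(prompt: str) -> str:
    # Stand-in for the model under evaluation; replies are canned for the demo.
    canned = {
        "How long do refunds take?": "Refunds usually arrive within 5 business days.",
        "Can I export my data?": "We guarantee exports always work.",
    }
    return canned[prompt]


if __name__ == "__main__":
    for prompt, score in run_suite(stub_producer, GOLDEN, RUBRIC):
        print(f"FAIL {score:.2f} {prompt}")
```

The keyword checks here are deliberately naive, and they illustrate the failure modes named above: a check like `avoids_overpromising` measures what is easy to measure, and it will drift as the product's vocabulary changes. In practice the rubric checks are where most of the authoring effort would go.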
The strongest objection
Not everything that matters can be made executable; chasing total coverage produces metric theatre. Fully conceded — evaluations bound the space where machines may move freely; human judgment governs the rest. Knowing where that boundary sits is itself the senior skill.
Clarified: evaluation is not itself new — acceptance tests, fuzzing, golden datasets, canary analysis, contract tests, and SRE error budgets all predate the engine. What emerges is evaluation as the dominant control plane for machine-produced work: executable judgment over behaviour, tone, safety, and cost, maintained as a living asset rather than a release gate.
Falsification: downgrade to TRANSFORMS if what teams adopt turns out to be existing test and observability practice rescaled, rather than a new authoring discipline.