100% Mutation Efficacy, Zero Mutants Killed: A Gate That Asserts Nothing

The headline read effective_test_efficacy = 100.0000%. Underneath: mutants_killed = 0, mutants_lived = 0, mutants_timed_out = 603. The perfect score was arithmetic — (killed + timed_out) / (killed + lived + timed_out) = 603 / 603 = 100%. Not one of the 603 mutants in the security-critical packages was caught by a test assertion firing. The gate was about to merge as a quality improvement.

Can a quality gate report 100% and assert nothing at the same time?

Yes, and this one did. This gate scores (killed + timed_out) / (killed + lived + timed_out), so a run of all timeouts and zero kills scores a flat 100% while proving no test ever caught a fault. The perfect number carries no test-quality information at all; it is the loudest possible signal that the harness never once watched an assertion fire.

The number that should have stopped the merge

The same package had killed 20 mutants three months earlier; the new harness regressed that to zero while still reporting a perfect score. A gate that catches nothing reports the same score as a gate that catches everything, and that is the trap.

Quick refresher: mutation testing deliberately breaks your code — flips a comparison, deletes a return — then runs your tests. If a test fails, the mutant is killed: your tests would catch that bug. If the tests still pass, the mutant lived: a real gap. The score measures whether your tests would detect faults.

A timeout is neither killed nor lived. It means the mutated process ran past a time limit — here, three times the baseline — before any test could render a verdict.
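The three outcomes can be sketched as a small classifier. This is a minimal illustration, not the harness described here: the `classify` function, the `Outcome` enum, and the convention that a non-zero exit code means "a test failed" are all assumptions made for the example. Only the three-times-baseline limit comes from the text.

```python
import subprocess
import sys
from enum import Enum


class Outcome(Enum):
    KILLED = "killed"        # a test failed: the mutant was caught
    LIVED = "lived"          # every test passed: a real gap
    TIMED_OUT = "timed_out"  # no verdict was ever rendered


def classify(test_cmd: list[str], baseline_seconds: float,
             factor: float = 3.0) -> Outcome:
    """Run the test command against a mutant (hypothetical helper).

    Assumes exit code 0 means the tests passed and anything else
    means at least one test failed.
    """
    try:
        result = subprocess.run(
            test_cmd,
            timeout=baseline_seconds * factor,  # three times the baseline
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # The limit expired first, so no assertion had the chance to
        # pass or fail. This is an absence of evidence, not a kill.
        return Outcome.TIMED_OUT
    return Outcome.LIVED if result.returncode == 0 else Outcome.KILLED


# Stand-in "test suites": tiny Python child processes.
failing = [sys.executable, "-c", "raise SystemExit(1)"]
passing = [sys.executable, "-c", "pass"]
hanging = [sys.executable, "-c", "import time; time.sleep(10)"]

print(classify(failing, baseline_seconds=5.0))
print(classify(passing, baseline_seconds=5.0))
print(classify(hanging, baseline_seconds=0.1))
```

Note that the timeout branch returns before the exit code is ever inspected, which is why a timed-out mutant says nothing about the tests in either direction.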

So 603 timeouts and 0 kills is not "perfect tests." It is the count of times the harness gave up before any test could weigh in — and the scoring formula folds every one of those into the 100%. The metric tells you nothing about the tests, and yet it reads perfect.
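The arithmetic is easy to reproduce. The sketch below uses the counts reported above; the `MutationRun` type and the `kill_rate` function are names invented for this example, with `kill_rate` shown only as one possible companion number that would have exposed the problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MutationRun:
    killed: int
    lived: int
    timed_out: int

    @property
    def total(self) -> int:
        return self.killed + self.lived + self.timed_out


def effective_efficacy(run: MutationRun) -> float:
    """The gate's formula: timeouts are counted as if they were kills."""
    if run.total == 0:
        raise ValueError("no mutants were run")
    return 100.0 * (run.killed + run.timed_out) / run.total


def kill_rate(run: MutationRun) -> float:
    """Share of mutants caught by a failing test (illustrative metric)."""
    if run.total == 0:
        raise ValueError("no mutants were run")
    return 100.0 * run.killed / run.total


current = MutationRun(killed=0, lived=0, timed_out=603)
print(f"effective_test_efficacy = {effective_efficacy(current):.4f}%")  # 100.0000%
print(f"kill_rate               = {kill_rate(current):.4f}%")           # 0.0000%
```

The two numbers come from the same three counters. Only the first one was on the dashboard.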

A gate that asserts nothing is worse than no gate

The gate is confidently useless: it occupies the slot where a real signal should be, and it shows green.

Picture the regression this gate exists to catch. A future change deletes the assertions from a security package's tests but leaves the test bodies in place. The code still runs — just nothing checks the result. Under this harness, those mutants still time out, so effective efficacy stays 100%, the "no survivors" floor stays satisfied, and the gate stays green. The exact "tests run but catch nothing" failure the gate was built to detect is invisible to it.

It's worse in the other direction too. The gate's one real tooth is the floor that says zero mutants may survive. But if the harness times out indiscriminately, a genuine survivor that happens to run slow gets misclassified as a timeout instead of a survivor — so it never trips the floor. The gate is blind to precisely the survivors it was supposed to catch.

A green gate that asserts nothing is indistinguishable from no gate. The difference is that the team trusts the green one.

The history made it a regression, not a quirk

This wasn't always the score. The package's own history showed a healthy run three months earlier: killed = 20, lived = 0, timed_out = 310. Twenty genuine kills. The current run: killed = 0, timed_out = 603. The genuine-kill count dropped from twenty to zero while timeouts roughly doubled.

That is the difference between "fail-closed code naturally produces timeouts" (a hand-wave with no per-mutant evidence) and "something in the harness rewrite stopped detecting anything." A healthy run of well-tested security code records genuine kills; even the timeout-heavy run three months earlier recorded twenty. A run with zero kills and 600-plus timeouts is a broken meter, not a clean bill of health. The baseline note asserted the timeouts were fine. It never showed why a single mutant timed out instead of failing fast.
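A comparison against the previous run is cheap to automate. The following is a sketch under assumptions: the report shape (a dict with `killed`, `lived`, `timed_out`) and the `regression_findings` function are hypothetical, and the two rules it applies are only the two symptoms described above.

```python
def regression_findings(previous: dict[str, int],
                        current: dict[str, int]) -> list[str]:
    """Compare two mutation reports and list suspicious shifts.

    Hypothetical helper: flags kills collapsing to zero and
    timeouts growing between runs.
    """
    findings: list[str] = []
    if previous["killed"] > 0 and current["killed"] == 0:
        findings.append(
            f"genuine kills dropped from {previous['killed']} to 0"
        )
    if previous["timed_out"] > 0 and current["timed_out"] > previous["timed_out"]:
        ratio = current["timed_out"] / previous["timed_out"]
        findings.append(
            f"timeouts grew {ratio:.2f}x "
            f"({previous['timed_out']} -> {current['timed_out']})"
        )
    return findings


# Counts taken from the two runs described in this section.
earlier = {"killed": 20, "lived": 0, "timed_out": 310}
latest = {"killed": 0, "lived": 0, "timed_out": 603}

for finding in regression_findings(earlier, latest):
    print(finding)
# genuine kills dropped from 20 to 0
# timeouts grew 1.95x (310 -> 603)
```

Neither finding depends on the headline percentage, which is the point: the regression is visible only in the raw counters.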

Filed-as-resolved is not solved

Here is the meta-lesson, and it's the one that generalizes past mutation testing. The work item behind this gate was on track to be marked resolved. The governance scaffolding was excellent — JSON reporting, a committed baseline, registration in the strictness budget, a promotion path. The work was about 80% there.

The missing 20% was the only part that mattered: proof that the gate actually signals. There was a unit test asserting the scoring script fails when handed a synthetic surviving-mutant report. There was no proof that the real harness on the real packages ever produces a survivor the gate catches, because the only real run produced zero survivors. A gate's unit tests passing tells you the scorer works; it does not tell you the gate would catch a regression on live code.

Filed-as-resolved is not solved. A closure that enshrines a degenerate run as the project's quality floor isn't a quality improvement — it's a green light wired to nothing. The correct close requires a seeded-survivor proof: deliberately introduce a fail-open in one security package, run the gate, and show it surfaces the mutant as lived and exits red. Until that red-then-green demonstration exists, the gate is asserted, not demonstrated — and the work item stays open.
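The red-then-green demonstration can be expressed as a check of its own. This is a sketch, not the project's tooling: `gate_exit_code`, `prove_gate_signals`, and both stand-in harnesses are hypothetical. In a real proof, `run_harness(seeded=True)` would run the actual mutation harness against a package with a deliberately introduced fail-open.

```python
from typing import Callable

Report = dict[str, int]


def gate_exit_code(report: Report) -> int:
    """Hypothetical gate: red (1) if any mutant lived, green (0) otherwise."""
    return 1 if report["lived"] > 0 else 0


def prove_gate_signals(run_harness: Callable[[bool], Report]) -> bool:
    """Return True only if the gate goes red on a seeded survivor
    and green again once the seed is removed."""
    red = gate_exit_code(run_harness(True))     # fail-open seeded
    green = gate_exit_code(run_harness(False))  # seed removed
    return red != 0 and green == 0


def working_harness(seeded: bool) -> Report:
    # Stand-in with illustrative numbers: the seeded fail-open
    # surfaces as one surviving mutant.
    return {"killed": 20, "lived": 1 if seeded else 0, "timed_out": 310}


def degenerate_harness(seeded: bool) -> Report:
    # Stand-in for the run described above: everything times out,
    # so the seeded survivor never shows up as lived.
    return {"killed": 0, "lived": 0, "timed_out": 603}


print(prove_gate_signals(working_harness))     # True
print(prove_gate_signals(degenerate_harness))  # False
```

A harness that only ever reports timeouts fails this proof even though its score is perfect, which is the condition under which the work item should stay open.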

When a timeout-dominated run is acceptable

The honest caveat: not every timeout is a broken meter. Mutation testing on genuinely slow code — heavy crypto, network round-trips, deliberate fail-closed paths that block rather than return — will produce timeouts that are real, and chasing each one to a clean kill can cost more than the signal is worth. If you have per-mutant evidence that a given timeout is the code doing its slow-but-correct job, classifying it as "not a survivor" is defensible.

What is never acceptable is the inference this gate made: timeouts in bulk, with no per-mutant evidence, treated as equivalent to kills, and a history of real kills silently replaced by them. The line is the evidence. A timeout you can explain mutant-by-mutant is a measurement; a 100% built from 603 unexplained timeouts is a guess wearing a perfect score. And there is a cheaper move than arguing about the score at all: if a package's tests are too slow to mutation-test honestly, say so and exclude it explicitly — an excluded package is visible, a degenerate one is not.
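One way to hold that line in code is to make every timeout carry its own explanation and to print exclusions rather than hide them. The sketch below is an assumption about how such a check could look; the mutant IDs, the package name, and the `evidence` and `excluded_packages` structures are invented for illustration.

```python
def unexplained_timeouts(timed_out_ids: list[str],
                         evidence: dict[str, str]) -> list[str]:
    """Return timed-out mutants with no recorded per-mutant explanation."""
    return [
        mutant_id
        for mutant_id in timed_out_ids
        if not evidence.get(mutant_id, "").strip()
    ]


def evidence_gate(timed_out_ids: list[str],
                  evidence: dict[str, str],
                  excluded_packages: dict[str, str]) -> int:
    """Hypothetical gate step: red (1) if any timeout lacks evidence."""
    # Exclusions are printed on every run so they stay visible.
    for package, reason in sorted(excluded_packages.items()):
        print(f"EXCLUDED {package}: {reason}")
    missing = unexplained_timeouts(timed_out_ids, evidence)
    for mutant_id in missing:
        print(f"UNEXPLAINED TIMEOUT {mutant_id}")
    return 1 if missing else 0


# Illustrative inputs only; none of these identifiers come from a real run.
timed_out = ["auth/m001", "auth/m002"]
evidence = {"auth/m001": "mutant blocks on the fail-closed path by design"}
excluded = {"slow_crypto": "suite too slow to mutation-test honestly"}

exit_code = evidence_gate(timed_out, evidence, excluded)
print(f"exit code: {exit_code}")
```

With these inputs the step exits red because `auth/m002` has no explanation, while the excluded package shows up in the output instead of disappearing into the score.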

The rule I would staple to every gate before it ships: do not close gate work on a green run — close it on a run you watched go red first. Seed a survivor, prove the gate catches it, then let it pass. A gate's one job is to be red when something is wrong; this one inverted that purpose while reporting a perfect score.

That closes the series. Five failure classes, one shape: the code compiled, the tests passed, the score read perfect — and the verification still missed the thing it was built to catch. Part 1 was a git flag that swept sibling work into the wrong commit. Part 2, four reviews that checked what was present and never what was missing. Part 3, a rewrite that passed every test while silently dropping operations. Part 4, code that existed but was never wired in. This one is the last and the quietest: the gate that watches it all and reports nothing wrong, because nothing was ever checked. AI moved the bottleneck from writing code to verifying it. Everything in this series broke in that gap, and every fix was the same move — make the green prove itself.

A gate you have only ever seen green is a gate you have never seen work.

Series linkage

Part 5 of 5 (final) — "AI-Native Verification Failure Classes":

1. git-footgun-parallel-agents: staging corruption under shared-index parallelism.
2. review-tunnel-vision-positive-space-negative-space: reviews that check what's present, never what's absent.
3. abstraction-collapse-ai-rewrite-trap: the whole missing layer a perfect-looking rewrite never built.
4. code-exists-not-wired: code that compiles, tests, and reviews clean but is never called.
5. mutation-gate-100-percent-zero-kills (this piece): a quality gate reporting 100% while catching nothing.

About AmanERP

AmanERP is an AI-native ERP for SMBs, built in public behind quality gates we have watched go red before trusting their green. Aman means peace — calm software, honestly measured. www.amanerp.com