181 Tests Passed. Zero Features Worked.

181 tests passed. Zero features worked.

Both were true on the same build: 97 unit tests and 84 regression tests, all green. Underneath that green bar, the database had no tables, real tokens had been rejected for four sprints, and two of three observability metrics emitted nothing.

The tests told the truth about the functions. They did not tell the truth about the system.

What the Green Tests Proved

The pass rate was real. Every unit test measured exactly what it was built to measure: given this input, does this function produce this output? The answers were correct. The functions worked.

What those functions connected to was never tested. PostgreSQL, the authentication provider, and the metrics path were replaced by small structs returning predetermined values. The mocks never opened a real connection, fetched a JWKS endpoint, or proved that a Prometheus counter was incremented on a real request. They returned the values the test expected, and the tests recorded a pass.

Three failures in the running system were invisible to all 181 tests simultaneously:

The database had zero tables. Migrations had not been applied to the fresh environment. The server started, the connection pool opened, and the health endpoint returned {"status":"ok"} — because the pool connects when the database exists, not when the schema exists. Every mock test bypassed this entirely; mocks do not care whether the table they are pretending to query actually exists.
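One way to close this gap is a readiness check that asks about the schema, not just the connection. The sketch below is a minimal illustration under stated assumptions: the tableLister interface, the /readyz path, and the table names tenants and users are all hypothetical, and a real implementation would back ListTables with a query against the live database (for example information_schema.tables) rather than the empty stand-in used here to show the response.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
)

// tableLister is a hypothetical seam. A real implementation would query the
// live database, for example information_schema.tables, through the pool.
type tableLister interface {
	ListTables(ctx context.Context) ([]string, error)
}

// missingTables returns the required tables absent from present, sorted.
func missingTables(required, present []string) []string {
	have := make(map[string]bool, len(present))
	for _, t := range present {
		have[t] = true
	}
	var missing []string
	for _, t := range required {
		if !have[t] {
			missing = append(missing, t)
		}
	}
	sort.Strings(missing)
	return missing
}

// readyHandler answers 200 only when every required table exists.
func readyHandler(db tableLister, required []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		present, err := db.ListTables(r.Context())
		if err != nil {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]any{"status": "error", "error": err.Error()})
			return
		}
		if missing := missingTables(required, present); len(missing) > 0 {
			w.WriteHeader(http.StatusServiceUnavailable)
			json.NewEncoder(w).Encode(map[string]any{"status": "schema_missing", "missing": missing})
			return
		}
		json.NewEncoder(w).Encode(map[string]any{"status": "ok"})
	}
}

// emptyDB stands in for a fresh database with no migrations applied.
type emptyDB struct{}

func (emptyDB) ListTables(context.Context) ([]string, error) { return nil, nil }

func main() {
	// Table names are illustrative, not taken from the real schema.
	h := readyHandler(emptyDB{}, []string{"tenants", "users"})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/readyz", nil))
	fmt.Println(rec.Code, rec.Body.String())
}
```

The check only earns its keep when ListTables talks to the real database. Wired to a mock, it would report whatever the mock was told to report.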

Authentication had been broken for four sprints. The identity provider uses ECDSA with the P-384 curve for token signing. The mock OIDC server used RSA-2048. The gateway's JWK struct had fields only for RSA keys (N, E); EC fields (X, Y, Crv) were never defined. Go's encoding/json silently ignores JSON keys that have no matching struct field, so the EC key data was dropped without an error and the RSA fields were left at their zero values. Every mock test used RSA tokens and passed. Every real token was rejected at the signing-method check: unexpected signing method: ES384.
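The silent decode can be reproduced with the standard library alone. This is a sketch, not the gateway's code: the rsaOnlyJWK struct, the helper names, and the placeholder key values are assumptions made for illustration. It contrasts json.Unmarshal, which accepts the EC key and leaves the RSA fields empty, with a json.Decoder configured with DisallowUnknownFields, which returns an error instead.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// rsaOnlyJWK mirrors the failure described above: only RSA fields are declared.
// The struct and field names are illustrative, not the gateway's actual code.
type rsaOnlyJWK struct {
	Kty string `json:"kty"`
	Kid string `json:"kid"`
	N   string `json:"n"`
	E   string `json:"e"`
}

// ecKeyJSON is shaped like an EC P-384 JWK. The x and y values are
// placeholders, not real key material.
const ecKeyJSON = `{"kty":"EC","crv":"P-384","kid":"k1","x":"placeholder-x","y":"placeholder-y"}`

// decodeLenient uses json.Unmarshal, which ignores JSON keys that have no
// matching struct field.
func decodeLenient(data []byte) (rsaOnlyJWK, error) {
	var k rsaOnlyJWK
	err := json.Unmarshal(data, &k)
	return k, err
}

// decodeStrict rejects any JSON key the struct does not declare.
func decodeStrict(data []byte) (rsaOnlyJWK, error) {
	var k rsaOnlyJWK
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&k)
	return k, err
}

func main() {
	// No error, and N and E stay empty: the key looks valid but is unusable.
	k, err := decodeLenient([]byte(ecKeyJSON))
	fmt.Printf("lenient: err=%v kty=%q n=%q e=%q\n", err, k.Kty, k.N, k.E)

	// The strict decoder surfaces the undeclared EC fields as an error.
	_, err = decodeStrict([]byte(ecKeyJSON))
	fmt.Printf("strict:  err=%v\n", err)
}
```

Strict decoding is a blunt instrument for JWKS documents, since providers may publish extra fields the verifier does not need. Branching on kty and failing loudly on an unsupported key type is a reasonable alternative.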

Two of three Prometheus metrics were dead code. The rate limiting feature worked correctly against real infrastructure, but two metric counters — defined, registered, never called — emitted nothing. The mock store returning a predetermined count does not exercise the code path that records the metric.

One curl command against the real JWKS endpoint in Sprint 2 would have caught the authentication failure: curl -s http://localhost:3001/oidc/jwks | jq '.keys[0].kty' returns "EC". It was never run.
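The same check can run in CI instead of depending on someone remembering to type it. The following is a minimal sketch: the function names, the sample document, and the assumption that the verifier parses only RSA keys are illustrative. Given a URL argument it fetches the JWKS document; without one it falls back to an embedded sample shaped like the EC key described above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// jwks declares only the fields this check reads.
type jwks struct {
	Keys []struct {
		Kty string `json:"kty"`
		Kid string `json:"kid"`
	} `json:"keys"`
}

// unsupportedKeyTypes returns the kty of each published key that the
// verifier cannot parse, in the order the keys appear.
func unsupportedKeyTypes(body []byte, supported map[string]bool) ([]string, error) {
	var set jwks
	if err := json.Unmarshal(body, &set); err != nil {
		return nil, fmt.Errorf("decode JWKS: %w", err)
	}
	if len(set.Keys) == 0 {
		return nil, errors.New("JWKS contains no keys")
	}
	var bad []string
	for _, k := range set.Keys {
		if !supported[k.Kty] {
			bad = append(bad, k.Kty)
		}
	}
	return bad, nil
}

// fetch downloads the JWKS document from url.
func fetch(url string) ([]byte, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("GET %s: status %d", url, resp.StatusCode)
	}
	return io.ReadAll(resp.Body)
}

// sampleJWKS is a stand-in used when no URL is given; values are placeholders.
const sampleJWKS = `{"keys":[{"kty":"EC","crv":"P-384","kid":"k1","x":"placeholder","y":"placeholder"}]}`

func main() {
	body := []byte(sampleJWKS)
	live := len(os.Args) > 1
	if live {
		var err error
		if body, err = fetch(os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, "fetch JWKS:", err)
			os.Exit(1)
		}
	}
	// Assumption: the verifier parses RSA keys only, as in the incident above.
	bad, err := unsupportedKeyTypes(body, map[string]bool{"RSA": true})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	if len(bad) > 0 {
		fmt.Println("verifier cannot parse key types:", bad)
		if live {
			os.Exit(1) // fail the pipeline only when checking a real endpoint
		}
		return
	}
	fmt.Println("all published key types are supported")
}
```

Run against the endpoint from the curl command, for example go run . http://localhost:3001/oidc/jwks, the non-zero exit turns the algorithm assumption into a failing pipeline step.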

The Confidence Gap Is in the Label

A mock-only suite is not automatically bad. It is just a smaller claim than teams often make with it.

Mocks serve a real purpose: they isolate function-level logic from infrastructure, making tests fast, deterministic, and runnable without real services. Keep that. The problem starts when "the mocked component behaved" gets reported as "the feature works."

Mocks prove business rules at the component level. They say nothing about whether SQL queries compile against the actual schema, whether sqlc types round-trip correctly through pgxpool, whether the JWT algorithm your mock uses matches the algorithm your identity provider defaults to, or whether a metric counter is called on the code path that handles real requests.

That gap is where this class of failure lives. Wiring bugs, schema mismatches, algorithm assumptions, and dead observability code are not logic errors. They are integration errors. A suite that never integrates cannot see them.

Renaming mock-based tests "regression tests" compounds the damage. The label implies the tests catch regressions in the real system. They catch regressions in the mocked simulation of the system. These are different objects.

The Fix Is One Layer, Not a Rewrite

The correction is additive: keep the unit tests, add the layer they cannot reach.

A single integration test that starts the actual server, connects to a real database, executes one authenticated request, and checks the relevant metric would have changed the outcome. The missing schema would have failed at query time. The ECDSA mismatch would have rejected the token. The metric assertion would have shown whether the code path records what the dashboard claims it records.
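A rough shape for that one check, written as a small Go program that runs against an already started server, might look like the sketch below. Everything environment-specific is an assumption: the SMOKE_* variable names, the /metrics path, and the idea that the protected endpoint reads from the database are hypothetical and would need to match the real service. The metric parsing handles the Prometheus text exposition format only as far as this check needs.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"os"
	"strconv"
	"strings"
	"time"
)

// counterValue sums all samples of the named metric in Prometheus text
// exposition output. It reports false when the metric has no samples.
func counterValue(exposition, name string) (float64, bool) {
	var total float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, name) {
			continue // also skips blank lines and "# HELP" / "# TYPE" comments
		}
		rest := line[len(name):]
		if rest == "" || (rest[0] != ' ' && rest[0] != '{') {
			continue // a different metric that shares the prefix
		}
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:] // drop the label set
		}
		fields := strings.Fields(rest) // value, then an optional timestamp
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil {
			total += v
			found = true
		}
	}
	return total, found
}

// get performs a GET, optionally with a bearer token, and requires a 200.
func get(c *http.Client, url, token string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	resp, err := c.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("GET %s: status %d: %s", url, resp.StatusCode, body)
	}
	return string(body), nil
}

// smoke scrapes the metric, makes one authenticated request, and scrapes again.
func smoke(baseURL, path, token, metric string) error {
	c := &http.Client{Timeout: 10 * time.Second}
	before, err := get(c, baseURL+"/metrics", "")
	if err != nil {
		return fmt.Errorf("scrape before: %w", err)
	}
	// A missing schema or a rejected token surfaces here as a non-200,
	// assuming the endpoint reads from the database.
	if _, err := get(c, baseURL+path, token); err != nil {
		return fmt.Errorf("authenticated request: %w", err)
	}
	after, err := get(c, baseURL+"/metrics", "")
	if err != nil {
		return fmt.Errorf("scrape after: %w", err)
	}
	b, _ := counterValue(before, metric) // absent before the first increment counts as 0
	a, ok := counterValue(after, metric)
	if !ok || a <= b {
		return fmt.Errorf("metric %s did not increase (before=%v after=%v)", metric, b, a)
	}
	return nil
}

func main() {
	base := os.Getenv("SMOKE_BASE_URL") // hypothetical variable names throughout
	if base == "" {
		fmt.Println("SMOKE_BASE_URL not set; skipping")
		return
	}
	if err := smoke(base, os.Getenv("SMOKE_PATH"), os.Getenv("SMOKE_TOKEN"), os.Getenv("SMOKE_METRIC")); err != nil {
		fmt.Fprintln(os.Stderr, "smoke check failed:", err)
		os.Exit(1)
	}
	fmt.Println("smoke check passed")
}
```

The token has to come from the real identity provider, not the mock OIDC server; otherwise the check inherits the same blind spot as the suite it is meant to complement.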

That one test costs far less than finding these failures late. Four sprints of broken authentication, discovered by a human running a manual check, is an expensive way to learn that the mock was too friendly.

The change that prevents recurrence is a definition of done that distinguishes integration coverage from mock coverage. "All tests pass" is only meaningful if it says what kind of tests passed. Require at least one integration test exercising real infrastructure per feature, and the gap closes before it becomes an incident.

The Test Boundary to Remember

Every test level has a boundary. That boundary is not a defect; it is part of the design. The defect is forgetting where it is.

Unit tests cover function-level logic. They do not cover SQL query correctness against a real schema, connection pool behavior, JWT algorithm compatibility, metric emission, proxy path prefixes, or environment variable presence.

Mock-based acceptance tests cover business rules at the component level. They do not cover the type mappings sqlc generates, how pgxpool handles JSONB serialization, whether the real identity provider uses the algorithm your mock assumes, or whether a handler is registered at the correct route.

Integration tests cover real infrastructure contracts. They do not cover multi-tenant isolation at scale, failure modes under load, browser rendering, or end-user flows.

Each layer is necessary. None is sufficient alone. The failure mode in the 181-test example is not that unit tests exist — it is that nothing else does.

When a Mock-Only Test Is the Right Call

None of this is an argument against mocks. For a pure business-logic unit — a tax calculation, a state machine, a validation rule — a mock-isolated test is faster, deterministic, and exactly right; reaching for real infrastructure there is wasted ceremony. The trap is the label: a mock-only suite called "100% green," read as a statement about the system instead of about its components.

A green test suite is not evidence the system works. It is evidence the system satisfies the specific properties the tests were designed to check. Know the properties. Name the gaps. Test both.

About AmanERP

AmanERP is an AI-native ERP for SMBs, built in public — where "the tests pass" and "the system works" are kept as separate claims, on purpose. Aman means peace: business software that feels calm instead of chaotic. www.amanerp.com