AmanERP
VERIFICATION

Evidence Beats Assertion in AI Engineering

The sentence to delete from every agent workflow is simple: "it should work." Replace it with the command that proves the claim.

Niraj Kumar, 2026-06-16, 5 min read
[Image: A verification console where command output, exit codes, and screenshots are pinned above a blocked completion stamp.]

The most expensive sentence in an AI engineering session is "it should work." It sounds harmless. It usually appears after the code was changed, the agent's mental model was satisfied, and the agent is ready to move on. It is also the moment where verification is most likely to be skipped.

Engineering does not run on confidence. It runs on evidence. For AI agents, that distinction has to be explicit because language models are very good at producing fluent status and very bad at feeling the social cost of being wrong.

What counts as evidence

Evidence is not a summary. Evidence is the command, the output, the exit code, the screenshot, the trace, the query result, or the file diff that a reviewer can inspect. If a test passed, name the test command and the result. If a page renders, capture the browser state. If a migration is safe, cite the migration lines and the compatibility check.

The agent does not need to paste every line of a 2,000-line log. It does need to report the command that ran, whether it exited zero, and the failure line if it did not. Without that, the user is being asked to trust the agent's conclusion instead of the system's evidence.
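As a rough illustration of that minimum report, here is a small Python sketch. The `Evidence` record and the `run_with_evidence` helper are hypothetical names, not part of any existing tool, and the rule for picking the failure line (last non-empty line of stderr, falling back to stdout) is an assumption that will not suit every test runner.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Evidence:
    """Hypothetical record: the minimum a reviewer needs to inspect."""
    command: str
    exit_code: int
    failure_line: Optional[str]  # None when the command exited zero


def run_with_evidence(argv: Sequence[str]) -> Evidence:
    """Run a command and report the command, its exit code, and a failure line."""
    proc = subprocess.run(list(argv), capture_output=True, text=True)
    failure_line = None
    if proc.returncode != 0:
        # Assumption: the last non-empty line of stderr (else stdout)
        # is the most useful single line to show a reviewer.
        for stream in (proc.stderr, proc.stdout):
            lines = [ln for ln in stream.splitlines() if ln.strip()]
            if lines:
                failure_line = lines[-1]
                break
    return Evidence(
        command=shlex.join(argv),
        exit_code=proc.returncode,
        failure_line=failure_line,
    )
```

Calling `run_with_evidence(["pytest", "-q"])` would, under these assumptions, yield a record that can be pasted into a status report in place of a sentence like "tests pass". The full log can still be stored elsewhere; the record only keeps what the paragraph above asks for.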

Separate gates that fail separately

One common error is gate conflation. A local test passing is not CI passing. A build passing is not live validation. A linter passing is not a runtime flow working. Each gate answers a different question, and the report should keep those questions separate.

  • Local tests: does this code satisfy the local behavioral checks?
  • Build: does the application compile and bundle?
  • CI: does the branch pass the required remote checks?
  • Live validation: does the actual deployed or browser-driven flow work?

A green row in one column must not imply green in another. The format should make uncertainty visible: passed, failed, skipped, pending, or not run. Ambiguity is where false confidence hides.
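One way to keep the gates from bleeding into each other is to give every gate its own explicit status and to default anything unreported to "not run". The sketch below assumes the four gates listed above and the five states named in this section; `Status`, `GATES`, `gate_report`, and `format_report` are illustrative names only.

```python
from enum import Enum
from typing import Dict


class Status(Enum):
    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"
    PENDING = "pending"
    NOT_RUN = "not run"


# The four gates from the list above, kept as separate columns.
GATES = ("local tests", "build", "CI", "live validation")


def gate_report(results: Dict[str, Status]) -> Dict[str, Status]:
    """Return one status per gate. A gate with no result is 'not run';
    it is never inferred from another gate's result."""
    unknown = set(results) - set(GATES)
    if unknown:
        raise ValueError(f"unknown gates: {sorted(unknown)}")
    return {gate: results.get(gate, Status.NOT_RUN) for gate in GATES}


def format_report(report: Dict[str, Status]) -> str:
    """Render one line per gate so uncertainty stays visible."""
    return "\n".join(f"{gate}: {status.value}" for gate, status in report.items())
```

For example, `gate_report({"local tests": Status.PASSED, "build": Status.PASSED})` leaves CI and live validation as "not run" rather than letting two green rows suggest four.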

What to change Monday

Ban completion claims without fresh evidence. When an agent says "done", ask what command proves that. If there is no command, artifact, or inspection that would catch the wrong outcome, the work is not verified. It may be implemented, but implemented and verified are different states.
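The rule can be expressed as a small check that runs before any completion claim is accepted. This is a sketch under assumptions: "fresh" is taken to mean "captured after the last change to the code", and the `Proof` record and `completion_state` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Proof:
    """Hypothetical record of the command offered as evidence."""
    command: str
    exit_code: int
    captured_at: datetime


def completion_state(last_change_at: datetime, proof: Optional[Proof]) -> str:
    """Classify a 'done' claim. Only fresh, passing evidence counts as verified."""
    if proof is None:
        return "implemented, not verified"
    if proof.captured_at < last_change_at:
        # Assumption: evidence older than the last change is stale.
        return "implemented, evidence is stale"
    if proof.exit_code != 0:
        return "verification failed"
    return "verified"
```

The point of returning a state instead of a boolean is that "implemented" and "verified" stay distinguishable in the report, and stale evidence is called out rather than silently accepted.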

Evidence is not ceremony. It is how a team keeps the trust model sane when a non-human worker can produce confident prose at machine speed.

Series linkage

Part 4 of 10 in Prompt Library to Operating System. After orchestration, the next control is proof: every claim needs an artifact.

About AmanERP

AmanERP treats green checks, review notes, and operational evidence as product infrastructure. Calm systems are observable systems.