I have a confession to make. When I first set up the verification pass for my AI-generated knowledge store, I expected it to find problems. Lots of them. Broken links, inconsistent terminology, conflicting SLOs — the usual documentation entropy that accumulates when 44 files are generated in sequence.
Instead, it found zero issues. And that result taught me something more important than any bug would have.

What Is a Forensic Verification Pass?
Think of it as a consistency audit for your entire knowledge store. While unit tests check individual pieces of code, a forensic verification pass checks relationships between documents — the kind of cross-cutting concerns that no single test can catch.
In my Knowledge Store Generator (from Part 2), Step 10 is the final step. The Product Architect persona returns after Steps 1-9 have generated all 44 documents, and performs a systematic cross-examination:
Cross-Reference Integrity (47 links checked): Every time Document A references Document B, verify that Document B exists, contains the referenced section, and uses the same terminology.
Terminology Consistency (10 domain names): Take every domain name, persona name, and component name, search for it across all 44 files, and confirm it is spelled and capitalized identically everywhere.
Numerical Consistency (7 SLO chains): Every number that appears in more than one document must match. If RELIABILITY.md says “< 10 minutes per job” and the product spec says “< 600 seconds per job,” those are numerically equivalent, but the mismatch in units is flagged for clarity.
Traceability Verification (6 chain types): For each of the traceability chains described in Part 5, walk both ends and confirm the connection is real, not just assumed.
PRD Coverage (20 requirements): Every requirement from the original Product Requirements Document must appear in at least one design doc, product spec, or architecture section.
The Surprising “Zero Issues” Result
The verification report came back clean:
- 20/20 PRD coverage
- 0 broken links out of 47
- 0 terminology fixes needed
- 0 numerical inconsistencies
- All 6 traceability chains intact
My first reaction: great, the process works perfectly. My second reaction, after reading industry research: wait — this is suspicious.
CodeRabbit’s data shows AI-generated code has 1.7x more issues than human code and 3x more readability problems. Martin Fowler’s team found that “almost right” is the most common AI failure mode — 66% of developers cite it as their biggest frustration. A perfect score from a self-verifying system is a yellow flag, not a green flag.
The Self-Grading Problem
Here is the fundamental issue, and it is one that every Agentic Engineering practitioner needs to internalize: the same agent that generated the documents is the one verifying them.
It is like grading your own exam. You tend to be generous. You might not catch subtle issues because you made the same assumptions during generation and verification. The patterns you are blind to during writing, you remain blind to during review.
The research is clear on this point. Anthropic’s published best practices and OpenAI’s teams both recommend using separate agent sessions for writing and for reviewing. A reviewing agent has clean context and will not be biased toward content it generated.
This is why the role-based persona approach from Part 3 is necessary but not sufficient. Different personas within the same session share the same context and biases. True independent review requires a separate session.

What the Verification Pass Does Catch
Despite the self-grading limitation, the forensic pass catches a category of issues that no other mechanism can: cross-document inconsistencies.
Unit tests verify individual components. Integration tests verify interactions between components. But neither checks whether your architecture document and your product spec agree on how many domains exist, or whether your SLO targets match across the reliability doc and the test assertions.
These cross-document issues are especially dangerous because they create silent misalignment. The agent reads one document when building Module A and a different document when building Module B. If those documents disagree on a fundamental assumption, the modules will work individually but fail together.
The verification pass is the only mechanism that reads all 44 documents and checks for this kind of systemic consistency. It is the knowledge store equivalent of integration testing — and just as important.
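The “10 minutes versus 600 seconds” case from the checklist shows why a naive string comparison is not enough: equivalent SLOs can be worded differently across documents. One way to handle it — a sketch, not my production check, and the regex only covers the second/minute phrasing from that example — is to normalize every duration bound to seconds before comparing:

```python
import re

UNIT_SECONDS = {"second": 1, "seconds": 1, "minute": 60, "minutes": 60}

def slo_seconds(text: str) -> set[float]:
    """Extract '< N seconds/minutes' style SLO bounds, normalized to seconds."""
    return {
        float(num) * UNIT_SECONDS[unit]
        for num, unit in re.findall(r"<\s*(\d+(?:\.\d+)?)\s*(seconds?|minutes?)", text)
    }

def slos_agree(doc_a: str, doc_b: str) -> bool:
    # Two documents agree when their normalized SLO sets match.
    return slo_seconds(doc_a) == slo_seconds(doc_b)
```

With this normalization, `slos_agree("< 10 minutes per job", "< 600 seconds per job")` returns `True`, so the pass can report the values as consistent while still flagging the unit mismatch for readability.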
How I Am Improving It
Based on the research findings, I am making three changes to bring this in line with Agentic Engineering best practices:
1. Independent Reviewer: Step 10 will be split into a separate prompt executed in a fresh Claude Code session. The reviewer reads all 44 files cold — no memory of generating them — and checks for internal contradictions, unstated assumptions, and over-generic content.
2. Adversarial Testing: Instead of just checking that everything is present and consistent, the verification will intentionally look for weaknesses. “Where is the thinnest spec?” “Which design doc has the fewest alternatives considered?” “Which acceptance criterion is hardest to test?”
3. Confidence Qualifiers: The report will distinguish between verified-correct (actually confirmed through multiple sources) and assumed-correct (consistent but not independently validated). A “Grade A (verified)” means something different from “Grade A (assumed).”
These improvements also connect to the dual quality gates I describe in Part 7 — the principle that verification and testing should be independent processes with independent failure modes.
The Agentic Engineering Principle
The broader lesson for Agentic Engineering is this: verification is not optional, but self-verification is not sufficient.
Build the verification pass. Run it every time. Trust its ability to catch cross-document inconsistencies. But never trust a perfect score from a self-grading system. When your AI agent reports that everything is flawless, that is exactly when you should bring in an independent reviewer.
The verification is not the final word. It is the starting point for a deeper investigation.
Have you built verification passes into your AI-assisted workflows? I would love to hear what you are checking for and what you have found.