Agentic Engineering, Part 7: Dual Quality Gates — Why Validation and Testing Must Be Separate Processes

Okay so here is something that took me embarrassingly long to figure out: validation and testing are not the same thing. I know, I know — obvious in hindsight. But when you are building with AI agents, the distinction becomes critical in a way it never was with traditional development.

Let me explain what I mean.

The Difference That Matters

Validation asks: “Did we build the right thing?” It checks whether the application matches the specifications — the same specs generated by the Knowledge Store Generator from Part 2. Does Feature X work the way the product spec says it should? Are all acceptance criteria met? Do the SLOs hold?

Testing asks: “Does this work from the user’s perspective?” It opens a browser, navigates through the actual user journey, clicks buttons, fills forms, and checks that the experience is correct end-to-end.

Most AI-assisted development workflows have neither. The better ones have tests. Almost none have both as separate, structured processes. In my Agentic Engineering practice, I consider this separation a core principle.

Two Prompts, Two Perspectives

In my workflow, I have two standalone prompt files that serve as independent quality gates after the build phase:

Validation.md — The Spec Checker

This prompt instructs the agent to verify the implementation against the specs — the acceptance criteria, the SLOs, and the test and lint results.

The output is a validation report.

This is implementation verification. It answers: “Does the code do what the specs say?”

User_Journey_testing.md — The User Experience Checker

This prompt takes a completely different approach. It instructs the agent to drive the application through a real browser — walking the actual user journey, clicking buttons, filling forms — and verify the experience end-to-end.

This is experience validation. It answers: “Can a user actually accomplish their goals?”

The distinction is subtle but important. A system can pass all 222 tests and still be unusable if the user journey has friction that specs did not anticipate. Conversely, a system can feel great during manual testing but violate spec constraints that only surface under edge conditions.

Why Two Separate Prompts?

I could combine these into a single “quality gate” prompt. I deliberately did not, for three reasons:

1. Different Context Requirements: The validation prompt needs access to specs, test infrastructure, and linting tools. The testing prompt needs access to a browser, CDP, and screenshot capabilities. Loading both contexts into a single prompt bloats it unnecessarily — the same context management concern from the entry point design in Part 4.

2. Independent Failure Modes: If validation passes but user testing fails, I know the implementation is spec-correct but has UX issues. If user testing passes but validation fails, I know the experience is good but some spec constraint is violated. Separate prompts give me separate diagnostic signals.

3. Independent Execution: I can re-run validation after a code change without re-running the full user journey test suite, and vice versa. This makes the feedback loop faster during the fix-and-verify cycle.

The “Mocked vs. Real” Problem

Here is the thing that keeps me up at night about validation — and it connects directly to the self-grading concern from Part 6: my current validation runs with mocked tests.

The validation report explicitly notes: “Mocked tests — no real LLM calls, no real browser automation.” This means the tests verify logic correctness but not real-world behavior. The agent pipeline tests mock the LLM responses. The browser automation tests mock Playwright interactions.

A mocked test that passes tells you: “If the LLM returns a well-formed response, your code handles it correctly.” It does not tell you: “The LLM actually returns well-formed responses for this prompt.”

This is why the User Journey testing prompt is so important — it uses real browser automation, not mocks. It catches the class of issues that mocked validation cannot: real rendering problems, real navigation failures, real timing issues.

What the Research Says

The research from Anthropic and OpenAI confirms that having separate validation and testing layers is a practice only the most mature AI-assisted teams have adopted. Most teams rely on either manual testing or a single automated test suite, but not both.

CodeRabbit’s data is particularly relevant here: AI-generated code shows roughly 8x as many excessive-I/O patterns as human-written code. A unit test that mocks I/O will never catch this. Only a real-world validation — one that actually measures I/O behavior — will surface these issues.
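One way to make that concrete (a hedged sketch, not CodeRabbit's method) is a guardrail that counts real file opens instead of mocking them away; `process_records` is a made-up example of the anti-pattern:

```python
import builtins, tempfile
from functools import wraps

def count_file_opens(fn):
    # Wrap a function and count how many times it really calls open().
    @wraps(fn)
    def wrapper(*args, **kwargs):
        real_open = builtins.open
        wrapper.opens = 0
        def counting_open(*a, **kw):
            wrapper.opens += 1
            return real_open(*a, **kw)
        builtins.open = counting_open
        try:
            return fn(*args, **kwargs)
        finally:
            builtins.open = real_open
    return wrapper

@count_file_opens
def process_records(path, n):
    # Anti-pattern: reopens the same file once per record.
    return [open(path).readline() for _ in range(n)]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("record\n")

process_records(f.name, 100)
assert process_records.opens == 100  # 100 opens for 1 file: guardrail fires
```

A mocked `open` would report zero I/O here; counting the real calls surfaces the pattern immediately.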

The Agentic Engineering Principle

The principle here extends beyond just testing: in Agentic Engineering, every quality assertion needs at least two independent perspectives.

The knowledge store has the forensic verification pass (Part 6). The implementation has validation and user testing. The product values have traceability chains (Part 5) and measurable guardrails (Part 9). No single check is sufficient. Layered, independent quality gates catch what individual checks miss.

Getting Started

If you are building with AI agents and only have one quality gate, start with Validation.md. It is the faster, more predictable check. Write your acceptance criteria in Given/When/Then format, and create a prompt that verifies each one.
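For instance, a hypothetical acceptance criterion and the test that verifies it — `Cart`, `Item`, and the price are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    price: float
    in_stock: bool

class Cart:
    def __init__(self, user: str):
        self.user, self.items = user, []
    def add(self, item: Item):
        self.items.append(item)
    def total(self) -> float:
        return sum(i.price for i in self.items)

# Acceptance criterion (hypothetical):
#   Given a logged-in user with an empty cart
#   When they add an in-stock item
#   Then the cart total equals that item's price
def test_add_item_updates_cart_total():
    cart = Cart(user="demo-user")                              # Given
    cart.add(Item(name="widget", price=9.99, in_stock=True))   # When
    assert cart.total() == 9.99                                # Then
```

Writing the criterion as a comment directly above its test keeps the spec-to-check mapping one-to-one, which is what the validation prompt then walks through.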

Once that is stable, add the User Journey testing layer. Start with the happy path — the most common user journey. Then add error paths and edge cases.

The goal is not perfect coverage from day one. The goal is two independent perspectives on quality: one from the spec side, and one from the user side.

Have you built separate validation and testing workflows for your AI-assisted projects? I am curious how others are approaching this — especially the mocked-vs-real testing challenge. What do you think?
