The Verification Paradox: What 86 Experiments Taught Us About AI Code Review

Our most expensive model scored 6 out of 12. Not Haiku. Not Sonnet. Opus — the flagship — produced the worst result in the study when given a complex task without a verification step. Sonnet, at one-fifth the cost, scored 11 with a single added instruction: “Re-read your output. Remove anything you can’t prove.”

Epistemic discipline scales better than compute. That’s the core finding from 86 controlled experiments across three series. Here’s the data.

The Experiment

The core series, H-MATRIX, tested 23 combinations of prompt structure (plain vs. structured), model tier (Haiku, Sonnet, Opus), and verification (present vs. absent) across three task types: code review of production bash scripts with sealed ground truth, content generation with verifiable facts, and development (finding all instances of a specific bug pattern).

Every experiment was scored on a 0–12 scale across four dimensions: accuracy, depth, calibration, and actionability. The scorer didn’t see which configuration produced which output. One design note: each configuration was tested once. The dataset captures breadth — 86 distinct conditions — not statistical power within any single condition.
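A minimal sketch of how such a rubric could be represented. The equal 0 to 3 range per dimension is an assumption (the only stated facts are four dimensions and a 0–12 total), and the `RubricScore` name is hypothetical.

```python
from dataclasses import dataclass

# Assumption: each of the four dimensions is worth 0-3 points (4 x 3 = 12).
MAX_PER_DIMENSION = 3
DIMENSIONS = ("accuracy", "depth", "calibration", "actionability")


@dataclass(frozen=True)
class RubricScore:
    """One blinded score for one experiment output (hypothetical structure)."""

    accuracy: float
    depth: float
    calibration: float
    actionability: float

    def __post_init__(self) -> None:
        # Reject values outside the assumed per-dimension range.
        for name in DIMENSIONS:
            value = getattr(self, name)
            if not 0 <= value <= MAX_PER_DIMENSION:
                raise ValueError(
                    f"{name} must be in 0..{MAX_PER_DIMENSION}, got {value}"
                )

    @property
    def total(self) -> float:
        """Sum of the four dimensions, on the 0-12 scale."""
        return sum(getattr(self, name) for name in DIMENSIONS)


score = RubricScore(accuracy=3, depth=3, calibration=2, actionability=3)
print(score.total)  # 11
```

Keeping the per-dimension values rather than only the total makes it possible to see later which dimension a configuration gained or lost on.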

We then ran 27 additional experiments testing specific hypotheses: whether structured methods like CLAIM/MEASURE/PASS IF outperform plain analysis, whether domain knowledge compensates for missing verification, and how safety constraints affect quality. Combined with the 23 H-MATRIX experiments and 36 prior experiments in the same research program, the dataset spans 86 experiments across three series.

Configuration	Code Review	Content	Dev
Plain prompt, no verify	7	10	11
Plain prompt + verify	9	10	11
Structured prompt, no verify	11	10	10
Structured prompt + verify	11	12	12
Structured + Opus + verify	12	—	—
Structured + Haiku + verify	8	0 (refused)	10
3-agent team + verify	9	10	12

Two patterns jump out: verification and structure interact in unexpected ways, and adding more agents actively hurts.

The Synergy Effect

On content and development, neither structured prompts nor verification helped alone. But combined, they hit 12 out of 12 — perfect scores on both.

This isn’t additive. It’s multiplicative. Structure tells the LLM what to look for. Verification tells it what to throw away. Without structure, there’s nothing to check. Without verification, the structure is decoration.

Code review was different: structure alone jumped the score from 7 to 11. Why? Because code review is inherently falsifiable — you look at the code and see whether a finding is real. The task contains its own verification. Content and development don’t have this built-in check. You can write a plausible paragraph with a fabricated statistic and not notice until someone asks for the source.
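The interaction can be checked with simple arithmetic on the results table. The sketch below uses the standard 2x2 interaction contrast: the combined gain minus the sum of the two individual gains. The scores are copied from the table; because each cell is a single run, treat the output as descriptive rather than statistically tested.

```python
# Scores from the results table, per task:
# (plain, plain + verify, structured, structured + verify)
SCORES = {
    "Code Review": (7, 9, 11, 11),
    "Content": (10, 10, 10, 12),
    "Dev": (11, 11, 10, 12),
}


def interaction(plain, verify_only, structure_only, both):
    """Combined gain minus the sum of the individual gains.

    Zero means the two factors add up; positive means they reinforce
    each other; negative means they overlap.
    """
    combined_gain = both - plain
    additive_gain = (verify_only - plain) + (structure_only - plain)
    return combined_gain - additive_gain


for task, cells in SCORES.items():
    print(task, interaction(*cells))
# Code Review -2
# Content 2
# Dev 2
```

Content and development show a positive interaction of 2 points each, while code review shows a negative one: structure alone already captured the full gain, so verification had nothing left to add.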

More Agents, Worse Results

This was the uncomfortable finding.

Task	Solo	3-Agent Team	Delta
Code Review	11	9	-2
Content	12	10	-2
Dev	12	12	0

The pattern is clear: teams matched solo performance only on the most mechanical task. On everything requiring judgment — calibrating severity, matching editorial voice — teams actively degraded quality.

The mechanism: in code review, a false positive about bash heredoc expansion propagated through both multi-agent experiments. The orchestration layer truncated the context window, dropping the safety instruction that would have prevented the error. Every additional agent was another opportunity for a plausible-sounding error to propagate — and the coordination overhead destroyed nuance.

We call these cognitive prions: false patterns that propagate through multi-agent pipelines, amplified by coordination rather than contained by it. The mechanism resembles information cascades in human group decision-making: each downstream agent treats upstream output as authoritative signal rather than uncertain input, compounding rather than canceling errors.
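A toy model shows why each added agent raises the risk. Assume every agent independently introduces a plausible-sounding error with the same probability, and that no downstream agent ever corrects upstream output. Both are simplifying assumptions, and the 10% rate below is made up for illustration, not measured in these experiments.

```python
def p_any_uncaught_error(n_agents: int, p_error: float) -> float:
    """Probability that at least one of n agents emits an error.

    Assumes independent agents and no downstream correction, so any
    single error survives to the final output.
    """
    if n_agents < 0:
        raise ValueError("n_agents must be non-negative")
    if not 0.0 <= p_error <= 1.0:
        raise ValueError("p_error must be a probability in [0, 1]")
    return 1.0 - (1.0 - p_error) ** n_agents


# Hypothetical per-agent error rate of 10%.
for n in (1, 3, 5):
    print(n, round(p_any_uncaught_error(n, 0.10), 3))
# 1 0.1
# 3 0.271
# 5 0.41
```

Under these assumptions the error risk grows with every agent added, unless some stage treats upstream output as uncertain input and filters it.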

The Opus Paradox

In the prior experiment series, Opus without structure or verification scored 6/12, the worst score across all three series. Sonnet with both scored 11–12/12. The inversion: a model costing one-fifth as much outperformed the flagship by 5 to 6 points.

With verification, Opus clawed back to 12/12 — the only perfect code review score. Without it, fluency was a liability. A more capable LLM doesn’t make fewer mistakes. It makes more convincing ones.

If you’re upgrading models to improve accuracy, you’re solving the wrong problem. The bottleneck isn’t capability. It’s the inability to separate generation from evaluation.

“Be Careful” Makes AI Verification Worse

The most counterintuitive result involved safety constraints.

We tested three conditions on security audit tasks: a specific constraint naming the exact false pattern to avoid, no constraint at all, and a vague constraint — “be careful with security findings.”

Constraint Type	Score
Specific (“heredoc expansion is not recursive”)	11.5
None	10.5
Vague (“be careful”)	8.5

Vague instructions performed worse than no instructions at all. The mechanism connects directly to the autoregressive bias described in the next section: once “be careful” primes the model toward caution, every subsequent token is biased toward confirming that caution is appropriate. A vague exclusion fails the same way. Given “Don’t flag documented issues,” the model treats any plausible-sounding issue as “probably documented,” and real bugs get suppressed. Specific constraints escape this trap because they define exactly what to avoid, leaving the model free to evaluate everything else normally.

Name the specific trap, or say nothing. Vague caution is a degradation vector.
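One way to enforce this rule in a pipeline is to reject vague constraints before they reach the prompt. The sketch below is illustrative only: the `add_constraint` helper, the `CONSTRAINT:` section label, and the phrase list are all hypothetical, and a phrase match is a crude stand-in for a human judging whether a constraint names a concrete pattern.

```python
from typing import Optional

# Illustrative, not exhaustive: phrases that signal vague caution.
VAGUE_PHRASES = ("be careful", "be cautious", "use caution")


def add_constraint(prompt: str, constraint: Optional[str]) -> str:
    """Append a constraint section, or leave the prompt unchanged.

    Raises ValueError for vague caution, since in the experiments above
    a vague constraint scored lower than no constraint at all.
    """
    if constraint is None:
        return prompt
    lowered = constraint.lower()
    if any(phrase in lowered for phrase in VAGUE_PHRASES):
        raise ValueError(
            "Vague constraint: name the specific false pattern, or pass None."
        )
    return f"{prompt}\n\nCONSTRAINT:\n{constraint}"


print(add_constraint(
    "Audit this script for security issues.",
    "Heredoc expansion is not recursive.",
))
```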

Why Verification Works: Autoregressive Bias

Once an LLM commits to a claim, every subsequent token is biased toward supporting it. The model optimizes for coherence, not accuracy. Anthropic’s own research on model sycophancy documents a related pattern: models systematically adjust outputs toward what they predict the evaluator wants to hear, and the effect scales with model capability. A verification step breaks this by forcing a fresh, separate evaluation pass, flipping the optimization target from generation to adversarial evaluation.

The key word is separate. Appending “please double-check” doesn’t help — the model is already biased toward its output. In the prior series, a review without any verification step scored 8/12. The same model on the same task with a structurally separate verification section scored 11/12.

One more nuance: verification quality depends on the model performing it. Sonnet’s verification pass caught 100% of overclaimed findings. Haiku with the same instruction let 35% of noise through. The verification step works, but a stronger model runs a sharper verification.
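A minimal sketch of a structurally separate verification pass. It is not stated here whether the experiments used a second model call or a separate section within one prompt; this sketch uses a second call. The `call_model` parameter is a hypothetical stand-in for whatever client function sends a prompt and returns text, and the section labels are illustrative.

```python
from typing import Callable

VERIFY_INSTRUCTION = (
    "Re-read your output. "
    "Delete any finding you cannot trace to specific evidence."
)


def review_with_verification(
    task: str,
    material: str,
    call_model: Callable[[str], str],
) -> str:
    """Run generation and verification as two separate model calls."""
    # Pass 1: generation only. No verification language in this prompt.
    draft = call_model(f"TASK:\n{task}\n\nINPUT:\n{material}")

    # Pass 2: the draft is presented as material to evaluate against the
    # original input, not as text to continue.
    verify_prompt = (
        f"INPUT:\n{material}\n\n"
        f"DRAFT FINDINGS:\n{draft}\n\n"
        f"VERIFICATION:\n{VERIFY_INSTRUCTION}"
    )
    return call_model(verify_prompt)
```

Passing the model call in as a parameter keeps the sketch independent of any particular SDK and makes the two-pass structure easy to test with a stub.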

What Actually Works

After 86 experiments, the optimal configuration is embarrassingly simple:

Structured prompt — tell the model what to do, in what order, and how to judge quality
Sonnet — the middle-tier model
A separate verification step — “Re-read your output. Delete any finding you cannot trace to specific evidence.”
Specific constraints or none — never “be careful”

No agent swarms. No skill injection. No expensive models. This scored 11–12/12 on every task type.

What We Got Wrong

Only 4 of our 11 predictions were correct. We expected verification to help everywhere equally; instead, it is synergistic with structure, not additive. We expected teams to outperform solo agents; they don’t on judgment tasks. The structure-verification synergy was completely unpredicted. It emerged from the data, not from theory.

Limitations: Every configuration was tested once (N=1 per cell). The scorer also created the ground truth, introducing possible bias despite blinding. Team experiments used an orchestration layer that truncated context — the team degradation may partly reflect an implementation bug. All experiments used one model family (Claude). Results on GPT-4, Gemini, or open-weight models are untested.

Try This Today

Take your most important LLM pipeline
Add a structurally separate verification step: “Re-read your output. For each claim, identify the evidence. Remove any claim where the evidence is ambiguous.”
Measure the difference. In our experiments, a structurally separate verification step removed 15–35% of initial findings, and scores improved by 2–4 points on the 12-point scale. If your verification pass removes less than 10% of outputs, it likely isn’t doing real work: the instruction may not be structurally separate enough from the generation.
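The measurement in the last step can be sketched as follows, assuming findings can be compared as exact strings before and after verification. In practice the verification pass may reword findings, so matching would need to be looser. The function names are hypothetical; the 10% threshold comes from the guidance above.

```python
from typing import Sequence


def removal_rate(initial: Sequence[str], verified: Sequence[str]) -> float:
    """Fraction of initial findings that did not survive verification."""
    if not initial:
        raise ValueError("initial findings must not be empty")
    kept = set(verified)
    removed = [finding for finding in initial if finding not in kept]
    return len(removed) / len(initial)


def verification_looks_inert(rate: float, threshold: float = 0.10) -> bool:
    """True if the pass removed less than the threshold share of findings."""
    return rate < threshold


before = ["F1", "F2", "F3", "F4", "F5"]
after = ["F1", "F2", "F3", "F4"]
rate = removal_rate(before, after)
print(rate, verification_looks_inert(rate))  # 0.2 False
```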

A more capable LLM doesn’t make fewer mistakes. It makes more convincing ones.

This article draws on three experiment series: H-MATRIX (23 experiments), prior VERIFY research (36 experiments), and hypothesis validation (27 experiments) — all conducted in February 2026. Experiment data, ground truth files, and scoring rubrics are available in the project’s research directory. Methodology: From 13/19 to 19/19: Hypothesis-Driven Development for AI Systems.