The Verification Paradox: What 86 Experiments Taught Us About AI Code Review
Our most expensive model scored 6 out of 12. A mid-tier model with one extra instruction scored 11. Across three experiment series, we found that epistemic discipline scales better than compute, and that telling a model to 'be careful' literally makes its verification worse.