Blog|

Dispatches from the search for truth.


The Verification Paradox: What 86 Experiments Taught Us About AI Code Review

Our most expensive model scored 6 out of 12. A mid-tier model with one extra instruction scored 11. Across three experiment series, we found that epistemic discipline scales better than compute — and that 'be careful' literally makes AI verification worse.

Why I'm Not OpenClaw

180,000 stars. OpenAI acqui-hire. 135,000 exposed instances. What separates a tool that executes from a companion that thinks.