Writing

Notes on building LLM systems that hold up under adversarial conditions.

Long-form writing on agent eval design, distribution-shift collapse, server-side verification, and what it actually takes to ship production AI that doesn't lie about its own results.

Drafts in flight

First posts are landing soon. If a topic below would be especially useful to you, email me and I'll prioritize it.

draft in progress

Why I put the eval verdict on a server the agents can't reach

If you let an LLM agent grade its own work, it will grade itself into success. The fix isn't a better prompt — it's a structural one. How I designed the four server-side Forge-verified gates in solhunt-duel so the agents are forced to actually win.

draft in progress

The 67.7% → 13% collapse: what really happened when I tested solhunt on random samples

solhunt beat Anthropic's SCONE-bench on the curated set. On a random 95-contract draw, it fell off a cliff. The 54-point gap is the most useful number in the whole repo — here's why it exists, why I published it, and what it told me to build next.

draft in progress

Reverse-engineering three proprietary retail APIs in production

SweedPOS, Algolia, and Trulieve GraphQL behind Cloudflare and session auth. Playwright + cookie handoff + the things you only learn by getting blocked. How a 14-state competitive intel platform actually runs on autopilot.