LLM systems, agent evals, production data infrastructure. solhunt beat Anthropic's SCONE-bench on a curated set with a 67.7% exploit rate, and I published the collapse to 13% on a random sample right next to it.
$ forge test --match test/Beanstalk
[PASS] testExploit() (gas: 8,420,180)
drained: 184M USDC
cost: $0.65
time: 1m 44s
gates: 4/4 ✓ (server-side)
Adversarial red/blue agent system for smart-contract auditing. Red writes exploits, Blue writes patches, four server-side Forge-verified gates decide the verdict — the agents cannot see or modify them. Multi-provider LLM router across Anthropic, OpenAI, OpenRouter, and Ollama with cost controls and structured fallbacks.
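A router with cost controls and structured fallbacks could look roughly like the sketch below. This is an illustration only, not the project's code: the `route` function, the `(name, call, est_cost_usd)` provider tuples, and the stub providers `flaky` and `local_model` are all hypothetical, and it assumes a failed attempt still counts against the budget.

```python
def route(prompt, providers, budget_usd):
    """Try providers in order; skip any whose estimated cost would exceed the budget."""
    spent = 0.0
    failures = []  # structured record of why each earlier provider was passed over
    for name, call, est_cost_usd in providers:
        if spent + est_cost_usd > budget_usd:
            failures.append((name, "skipped: over budget"))
            continue
        spent += est_cost_usd  # assumption: a failed attempt still costs money
        try:
            text = call(prompt)
        except Exception as exc:
            failures.append((name, f"error: {exc}"))
            continue
        return {"provider": name, "text": text, "spent_usd": spent, "failures": failures}
    raise RuntimeError(f"no provider succeeded: {failures}")


# Hypothetical stand-ins for real provider clients.
def flaky(prompt):
    raise TimeoutError("upstream timeout")


def local_model(prompt):
    return "ok: " + prompt


result = route("hi", [("primary", flaky, 0.01), ("local", local_model, 0.0)], budget_usd=0.05)
print(result["provider"], result["failures"])
```

Returning the failure list alongside the answer keeps the fallback path visible to the caller instead of hiding it in logs.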
solhunt benchmark
curated · 32 · 67.7% · v Anthropic SCONE-bench 51.1%
random · 95 · 13.0% · honest
distribution-shift delta · 54 pts → origin of solhunt-duel
beanstalk · EXPLOITED · $0.73 · 1 contract
Autonomous AI agent: reads Solidity, writes a Foundry exploit test, executes it on a forked mainnet, and iterates against real compiler and execution feedback. Reproduced Beanstalk's $182M flash-loan hack in 1m 44s for $0.65. Beat Anthropic's SCONE-bench on a curated set, then collapsed on a random sample; I published both numbers and treated the gap as a design problem.
Live mainnet, $1,554 personal capital, 11 server-side risk gates. Regime-stratified backtest with a Bonferroni-adjusted promote gate. I ran 50 random wallets through it as a null distribution: 0 promoted at α=0.05. The harness, not the trades, is the artifact.
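The Bonferroni-adjusted gate can be sketched as follows. This is a minimal illustration, assuming each candidate reaches the gate with a single p-value from the backtest; the function name `bonferroni_promote` and the evenly spaced null p-values are hypothetical, not output from the real harness.

```python
def bonferroni_promote(p_values, alpha=0.05):
    """Return indices of candidates whose p-value clears the Bonferroni threshold."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m  # holds the family-wise error rate at alpha across m tests
    return [i for i, p in enumerate(p_values) if p < threshold]


# Hypothetical null run: 50 wallets with p-values spread evenly over (0, 1].
null_p_values = [(i + 1) / 50 for i in range(50)]
promoted = bonferroni_promote(null_p_values)
print(len(promoted))
```

With 50 candidates the per-candidate threshold drops to 0.05 / 50 = 0.001, which is why a gate like this is expected to promote nothing from random wallets.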
Production data engineering at scale. 14 states, 31 dispensary chains, 65+ stores, 50K+ products on automated 6-hour cycles. Reverse-engineered 3 proprietary retail APIs (SweedPOS, Algolia, Trulieve GraphQL) under Cloudflare/auth. BullMQ + Redis worker fleet, Postgres normalization, OCR pipeline, Telegram alerts. Used daily by my 4-person pricing team.
[INFO] helius ws connected · raydium + orca
[DETECT] front 0x4a.. victim 0x9c.. back 0x2b..
slots: 287_412_891 → 891 → 893 (Δ=2)
profit: 0.47 SOL
jito tip: 0.02 SOL
confidence: 90
Real-time Solana MEV sandwich detector. Rust + Helius enhanced WebSocket + bounded-backpressure parser pool + per-pool ring buffer with ≤3-slot detection window. IDL-correct pool extraction for Raydium AMM v4 + Orca Whirlpool. Jito tip detection across 8 known tip accounts. Idempotent Postgres persistence, SSE feed.
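The per-pool ring buffer and slot window can be illustrated with a small sketch. It is written in Python for brevity rather than the Rust the detector uses, and it makes simplifying assumptions: swaps arrive in slot order, a sandwich is a buy and a later sell by the same signer with another signer's buy in between, and the `Swap` and `SandwichDetector` names are hypothetical. The real detector may link the front and back legs differently.

```python
from collections import defaultdict, deque
from dataclasses import dataclass

WINDOW_SLOTS = 3  # from the text: detection window of at most 3 slots


@dataclass(frozen=True)
class Swap:
    slot: int
    pool: str
    signer: str
    is_buy: bool


class SandwichDetector:
    def __init__(self, capacity=64):
        # One bounded buffer per pool; the oldest swaps fall off automatically.
        self.buffers = defaultdict(lambda: deque(maxlen=capacity))

    def observe(self, swap):
        """Record a swap; return (front, victim, back) if it closes a sandwich."""
        buf = self.buffers[swap.pool]
        match = None
        if not swap.is_buy:
            # Only swaps within the slot window are candidates.
            recent = [s for s in buf if swap.slot - s.slot <= WINDOW_SLOTS]
            for i, front in enumerate(recent):
                if front.signer != swap.signer or not front.is_buy:
                    continue
                victims = [v for v in recent[i + 1:]
                           if v.signer != swap.signer and v.is_buy]
                if victims:
                    match = (front, victims[0], swap)
                    break
        buf.append(swap)
        return match


detector = SandwichDetector()
detector.observe(Swap(891, "pool-a", "attacker", True))
detector.observe(Swap(891, "pool-a", "victim", True))
hit = detector.observe(Swap(893, "pool-a", "attacker", False))
print(hit is not None)
```

Bounding each buffer with `maxlen` keeps memory flat per pool regardless of traffic, at the cost of missing sandwiches whose legs are separated by more swaps than the buffer holds.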
Production React app running in warehouse operations since 2024, currently v7.4.1. Barcode scanning, thermal label generation for Zebra ZT610 at 203 DPI, multi-source CSV/Excel imports, Supabase-backed master list with offline localStorage fallback. The first thing I ever shipped — still in daily use.
23. Started in warehouse operations, taught myself to code to fix the manual work my team was drowning in. Three years later I'm shipping adversarial agent evals, production data pipelines, and a live mainnet trading system on my own capital.
I write things down honestly. When solhunt's exploit rate dropped from 67.7% on curated contracts to 13% on a random sample, I published both numbers and treated the gap as a design problem — that's the origin of solhunt-duel's server-side verifier gates.
By day I build competitive intelligence infrastructure for a 4-person pricing team at a Fortune 500. By night I build LLM systems that can't lie about their own results.
See the harness tearsheet →
Long-form notes on building LLM systems that hold up under adversarial conditions. → /blog
Email me about LLM systems, evals, agents, or production AI work. Open to AI engineering roles, remote-friendly.