In 2026, “agentic” tooling is moving fast enough that yesterday’s workflow advice goes stale quickly. This series is my attempt to write down the parts that seem durable: how to give agents norms instead of scripts, how to coordinate multiple agents through the repo, and how those patterns connect back to fuzzing and stateful testing.
Skepticism is healthy here. Agent outputs still need oracles, reproducibility, and a bar for correctness. The tools are changing quickly; the craft is not.
Agents can draft tests fast; the hard part is still choosing the right oracles and insisting on reproducible failures.
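For concreteness, here is a minimal sketch of that distinction using Hypothesis; `normalize_path` is a made-up function standing in for whatever the agent drafted tests against, not anything from this series. The oracle is an explicit property (idempotence), and the pinned example plus `print_blob` are what keep a failure reproducible instead of anecdotal.

```python
# A sketch, not the series' code: a property test with an explicit oracle
# and a pinned, reproducible failure. `normalize_path` is hypothetical.
from hypothesis import example, given, settings, strategies as st


def normalize_path(p: str) -> str:
    # Hypothetical system under test: collapse duplicate slashes.
    while "//" in p:
        p = p.replace("//", "/")
    return p


@settings(print_blob=True)                      # print a reproduction token on failure
@given(st.text(alphabet="/ab", max_size=20))    # generated inputs
@example("a//b")                                # once a failure surfaces, pin it forever
def test_normalize_is_idempotent(p: str) -> None:
    # The oracle: normalizing twice must equal normalizing once.
    once = normalize_path(p)
    assert normalize_path(once) == once
```

An agent can draft the `@given` line in seconds; deciding that idempotence is the right property, and pinning the failing input once it shows up, is the part that still needs judgment.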
Examples use Claude Code because that’s what I run day-to-day, but the patterns are meant to travel to any agent that can read a codebase, run checks, and write down findings.
This is not a tutorial. It’s a practitioner’s notebook.
If there’s a unifying theme here, it’s that most bug-finding systems succeed or fail on three things: good oracles, reproducible failures, and a clear bar for correctness.
That’s also a useful way to read the series:
.claude/ context so agents adapt from norms instead of blindly following scripts (plus a linter to keep it from drifting).
If you want the deeper motivation for “why traces,” start here:
Part 5 — Self-hosted agents on Runpod (and friends)
Turning inference into a reliable test service: latency/cost knobs, guardrails, artifacts, and how to run agent loops against real repos.
Part 6 — Quantization as a feature: cheap tests when deep reasoning isn’t needed
Using smaller/quantized models for throughput work (scaffolding, formatting, test expansion) and reserving big models for judgment-heavy steps.
Part 7 — Corpus, shrink, triage: turning agent output into a fuzzing pipeline
How to dedupe/minimize failures and turn “agent finds” into reproducible bug packets and long-lived regression corpora.
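As a rough sketch of the dedupe step before Part 7 arrives (the real pipeline will look different; `Failure` and `signature` are illustrative names, not its API): bucket failures by a normalized signature so that a pile of agent-reported crashes collapses into a handful of distinct bugs before anyone spends time minimizing.

```python
# Rough sketch of failure dedupe, assuming each failure carries a message
# and a stack trace. Names and shapes here are illustrative only.
import hashlib
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Failure:
    message: str
    stack: list[str]  # frames as "file:function:line" strings


def signature(f: Failure, top_frames: int = 3) -> str:
    # Strip details that vary run to run (line numbers, addresses),
    # then hash the top of the stack plus the normalized message.
    norm = [re.sub(r"\d+", "N", frame) for frame in f.stack[:top_frames]]
    norm.append(re.sub(r"0x[0-9a-f]+|\d+", "N", f.message))
    return hashlib.sha256("|".join(norm).encode()).hexdigest()[:16]


def dedupe(failures: list[Failure]) -> dict[str, list[Failure]]:
    # One bucket per signature; keep every instance for later minimization.
    buckets: dict[str, list[Failure]] = defaultdict(list)
    for f in failures:
        buckets[signature(f)].append(f)
    return dict(buckets)
```

Normalizing before hashing is the whole trick: without it, every run produces “new” failures that differ only in line numbers or pointer values.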
Most of what I know about testing came from shipping production systems and learning in public through open source: contributing to AutoFixture starting around 2011, then maintaining Hedgehog, which once powered Echidna, an early and widely used property-based fuzzer for Ethereum smart contracts.
Along the way: Fare for regex-constrained test generation, a SplitMix port for reproducible failure discovery, and consensus fuzzers at Stacks that caught a production bug a 533-line integration test couldn’t reproduce.
That background is why I’m interested in AI tooling—not as a replacement for any of this, but as a way to do more of it.
The ideas in this series come from daily practice—shipping agent-assisted testing tools for real protocol security work. But daily practice has blind spots.
If you think I’m wrong about something, I’d like to hear it. If you think I’m right but missing a nuance, I’d especially like to hear that.