This article is part of the Oracles, Traces, Triage series.
Agent skills + agent swarms + stateful testing could compound into something stronger than any piece alone. While this combination hasn’t been tested at scale with agents yet, the individual components have proven effective in practice. The following explores how these pieces might integrate.
In The Bugs Between Calls, I argued:
> The December 2025 Prysm incident, the May 2023 finality delays—stacking failures. Every individual operation was valid. The trace was the problem.
If you want another non-Ethereum example of “trace-shaped” failures, the Stacks PoX-2 stack-increase bug is a good one to skim (thread). I wasn’t at Stacks at the time, and I’m not claiming I would have caught it; the point is simply that these failures often emerge from sequences and accounting state, not single calls.
Stateful property-based testing (described in The Bugs Between Calls) generates random command sequences, runs them, checks invariants after each step.
The framework doesn’t know why a particular sequence might be interesting. It just tries many and hopes to stumble on something broken.
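Here is a minimal sketch of that loop in Python with Hypothesis. The toy ledger `Model` and its conservation invariant are illustrative stand-ins, not anything from the incidents above:

```python
# A toy ledger model and a Hypothesis state machine over it. The Model class
# and its conservation invariant are illustrative stand-ins.
from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, invariant, rule

class Model:
    """Plain in-memory ledger, standing in for the system under test."""
    def __init__(self):
        self.balances = {}
        self.minted = 0

    def mint(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount
        self.minted += amount

    def transfer(self, src, dst, amount):
        if self.balances.get(src, 0) >= amount:
            self.balances[src] -= amount
            self.balances[dst] = self.balances.get(dst, 0) + amount

    def check(self):
        # Conservation: value is neither created nor destroyed except by mint.
        assert sum(self.balances.values()) == self.minted

class LedgerMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.model = Model()

    @rule(account=st.integers(0, 4), amount=st.integers(1, 100))
    def mint(self, account, amount):
        self.model.mint(account, amount)

    @rule(src=st.integers(0, 4), dst=st.integers(0, 4),
          amount=st.integers(1, 100))
    def transfer(self, src, dst, amount):
        self.model.transfer(src, dst, amount)

    @invariant()
    def conservation(self):
        # Checked after every generated command.
        self.model.check()

# Collected by pytest; Hypothesis generates, runs, and shrinks the sequences.
TestLedger = LedgerMachine.TestCase
```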
An AI agent is not blind. It can read the code, understand the invariants, and reason about which sequences are likely to stress them. It wouldn't replace the random search; it would augment it with intentional exploration.
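As a sketch of what intentional exploration could mean mechanically, an agent-proposed trace can be replayed through the same `Model` and invariant from the example above. The trace format here is an assumption for illustration:

```python
# Replay an explicit, intentionally chosen command sequence against the same
# Model and invariant used by the random search above (trace format assumed).
def replay(trace, model_cls=Model):
    model = model_cls()
    for name, kwargs in trace:
        getattr(model, name)(**kwargs)
        model.check()  # same invariant, checked after every step

# An agent that suspects re-minting into a drained account is fragile
# could propose exactly that sequence:
replay([
    ("mint", {"account": 0, "amount": 100}),
    ("transfer", {"src": 0, "dst": 1, "amount": 100}),
    ("mint", {"account": 0, "amount": 1}),
])
```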
Following the agent skills philosophy, you wouldn't give agents step-by-step procedures. You'd give them norms: the invariants the system must preserve, the properties every valid trace must satisfy, the accounting identities that must always balance.
These go into CLAUDE.md and agent skills. The agent figures out how to test them. At least, that’s the idea.
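Concretely, a CLAUDE.md excerpt might look something like this; the specific norms are illustrative, not from a real project:

```markdown
## Invariants (check these after every operation)

- Total balances always equal total minted minus total burned.
- Finalized state never reverts.
- Any sequence of individually valid operations leaves the system valid.

You decide how to exercise these. Prefer command interleavings the
production code is unlikely to have seen.
```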
With claude-swarm, you could run multiple agents against the same codebase, each exploring a different class of invariant: say, accounting identities for one, finality properties for another, operation-ordering edge cases for a third.
Each agent pushes to the same repo. When one discovers a failing trace, the others see it on the next fetch.
No message passing needed—the test failures are the messages.
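A sketch of those coordination mechanics, independent of claude-swarm itself; the paths, trace format, and helper names are assumptions:

```python
# Sketch of the coordination idea, not of claude-swarm's own mechanics: an
# agent that finds a failing trace commits it, and every agent replays all
# known traces at the start of each cycle. Paths and format are assumptions.
import json
import subprocess
from pathlib import Path

TRACES = Path("tests/failing_traces")

def publish(trace, name):
    """An agent found a failing trace: commit it to the shared repo."""
    TRACES.mkdir(parents=True, exist_ok=True)
    path = TRACES / f"{name}.json"
    path.write_text(json.dumps(trace, indent=2))
    subprocess.run(["git", "add", str(path)], check=True)
    subprocess.run(["git", "commit", "-m", f"failing trace: {name}"], check=True)
    subprocess.run(["git", "push"], check=True)

def sync_and_replay():
    """Each agent starts its cycle by picking up everyone else's discoveries."""
    subprocess.run(["git", "pull", "--rebase"], check=True)
    for path in sorted(TRACES.glob("*.json")):
        replay(json.loads(path.read_text()))  # replay() from the sketch above
```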
Today, you get a shrunk counterexample and figure out what it means yourself. With agents, the cycle could become:

1. Generate command sequences, both random and intentional.
2. Run them, checking invariants after each step.
3. Shrink any failure to a minimal trace.
4. Diagnose what the minimal trace actually means.
5. Propose a fix, or a sharper invariant.
Steps 4 and 5 are currently hours of manual work. They wouldn’t be free with agents—output still needs triage. But the iteration speed could be fundamentally different.
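A sketch of what that loop might look like, where `ask_agent()` and `triage()` are hypothetical stand-ins for the agent interface and the human review step:

```python
# Sketch of the loop with an agent handling steps 4 and 5. ask_agent() and
# triage() are hypothetical stand-ins; nothing here is a real API.
import traceback
from hypothesis.stateful import run_state_machine_as_test

def cycle():
    try:
        # Steps 1-3: Hypothesis generates, runs, and shrinks command sequences.
        run_state_machine_as_test(LedgerMachine)
    except AssertionError:
        # Steps 4-5: hand the shrunk counterexample to the agent for diagnosis.
        report = ask_agent(
            "A minimal command sequence violated an invariant:\n"
            + traceback.format_exc()
            + "\nDiagnose the root cause and propose a fix or a sharper invariant."
        )
        triage(report)  # the agent's output still needs human review
```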
While this complete loop hasn’t been tested end-to-end yet, each component exists and has proven valuable in isolation.
The agent's advantage is structured exploration. It matters most when the invariants are rich, the state machine is complex, and interesting traces require understanding to construct.
| Piece | Role |
|---|---|
| Agent skills | Encode what matters—norms, invariants, properties |
| Agent swarms | Provide parallel exploration—multiple agents, different state-space regions |
| Stateful testing | Provide the execution engine—command sequences, invariant checks, shrinking |
Each piece works alone. Together, they compound.
The hypothesis remains unproven at scale, but each component has demonstrated value in isolation, and their integration looks promising.
The best testing infrastructure usually emerges that way—you notice the pieces reinforcing each other before you design the integration.
Next: Agents as Fuzzers