
Testing the Bugs Between Calls

This article is part of the Oracles, Traces, Triage series.

The short version

Agent skills + agent swarms + stateful testing could compound into something stronger than any piece alone. While this combination hasn’t been tested at scale with agents yet, the individual components have proven effective in practice. The following explores how these pieces might integrate.

The problem

In The Bugs Between Calls, I argued:

The December 2025 Prysm incident, the May 2023 finality delays: both were stacking failures. Every individual operation was valid. The trace was the problem.

If you want another non-Ethereum example of “trace-shaped” failures, the Stacks PoX-2 stack-increase bug is a good one to skim (thread). I wasn’t at Stacks at the time, and I’m not claiming I would have caught it; the point is simply that these failures often emerge from sequences and accounting state, not single calls.

What agents could add

Stateful property-based testing (described in The Bugs Between Calls) generates random command sequences, runs them, checks invariants after each step.
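
To make that concrete, here is a minimal sketch in Rust with proptest (the library the feedback loop below already assumes). The `Ledger`, `Cmd`, and the cached-total invariant are hypothetical stand-ins, not code from the original article:

```rust
use proptest::prelude::*;
use std::collections::HashMap;

// Hypothetical commands: each one is valid on its own.
#[derive(Clone, Debug)]
enum Cmd {
    Credit { acct: u8, amount: u64 },
    Debit { acct: u8, amount: u64 },
}

// Hypothetical system under test: per-account balances plus a cached total.
#[derive(Default)]
struct Ledger {
    balances: HashMap<u8, u64>,
    total: u64, // cached aggregate; the invariant below keeps it honest
}

impl Ledger {
    fn apply(&mut self, cmd: &Cmd) {
        match cmd {
            Cmd::Credit { acct, amount } => {
                *self.balances.entry(*acct).or_default() += amount;
                self.total += amount;
            }
            Cmd::Debit { acct, amount } => {
                let bal = self.balances.entry(*acct).or_default();
                let taken = (*amount).min(*bal);
                *bal -= taken;
                self.total -= taken;
            }
        }
    }
}

// Strategy that produces random command sequences.
fn cmds() -> impl Strategy<Value = Vec<Cmd>> {
    prop::collection::vec(
        prop_oneof![
            (any::<u8>(), 1..1_000u64).prop_map(|(acct, amount)| Cmd::Credit { acct, amount }),
            (any::<u8>(), 1..1_000u64).prop_map(|(acct, amount)| Cmd::Debit { acct, amount }),
        ],
        1..64,
    )
}

proptest! {
    #[test]
    fn cached_total_matches_balances(seq in cmds()) {
        let mut ledger = Ledger::default();
        for cmd in &seq {
            ledger.apply(cmd);
            // Invariant checked after every command, not only at the end,
            // so a failure points at the step where the accounting drifted.
            let recomputed: u64 = ledger.balances.values().sum();
            prop_assert_eq!(ledger.total, recomputed);
        }
    }
}
```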

The framework doesn’t know why a particular sequence might be interesting. It just tries many and hopes to stumble on something broken.

An AI agent is not blind: it can reason about which sequences are likely to be interesting before it tries them.

It wouldn’t replace the random search. It would augment it with intentional exploration.
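
Mechanically, the augmentation could be as small as folding agent-proposed traces into the generation strategy. This sketch reuses the hypothetical `Cmd` type and `cmds()` strategy from above; the weighting is arbitrary:

```rust
// `seeds` holds sequences an agent flagged as interesting
// (it must be non-empty, since `select` panics on an empty list).
fn cmds_with_seeds(seeds: Vec<Vec<Cmd>>) -> impl Strategy<Value = Vec<Cmd>> {
    prop_oneof![
        // 1 in 4 runs: start from an agent-proposed trace and append a short
        // random suffix, exploring the neighborhood of the interesting state.
        1 => (prop::sample::select(seeds), cmds()).prop_map(|(mut seed, suffix)| {
            seed.extend(suffix);
            seed
        }),
        // 3 in 4 runs: keep the purely random exploration.
        3 => cmds(),
    ]
}
```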

Three pieces

1. Agent skills define testing norms

Following the agent skills philosophy, you wouldn’t give agents step-by-step procedures. You’d give them norms: which invariants must always hold, which properties to check, which parts of the state space are worth exploring.

These go into CLAUDE.md and agent skills. The agent figures out how to test them. At least, that’s the idea.
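
As an illustration, a norms file might contain entries like these (hypothetical wording, not taken from any real CLAUDE.md):

```markdown
<!-- Hypothetical excerpt from CLAUDE.md -->
## Testing norms
- After every operation, the sum of per-account balances must equal the cached total.
- Any sequence of individually valid operations must leave the system able to finalize.
- Prefer sequences that interleave rarely combined operations; always report the shortest failing trace.
```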

2. Agent swarms parallelize exploration

With claude-swarm, you could run multiple agents against the same codebase, each exploring a different class of invariant.

Each agent pushes to the same repo. When one discovers a failing trace, the others see it on the next fetch.

No message passing needed—the test failures are the messages.
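
One concrete mechanism for this already exists: proptest persists the seeds of failing cases in `proptest-regressions/` files alongside the test source. If those files are committed to the shared repo, every agent replays every other agent’s discoveries on its next run.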

3. The feedback loop tightens

Today, you get a shrunk counterexample and figure out what it means yourself. With agents, the cycle could become:

  1. Agent generates command sequences based on norms
  2. proptest executes them, finds a failure
  3. proptest shrinks the failure to a minimal trace
  4. Agent reads the trace, generates a root-cause hypothesis
  5. Agent writes a regression test

Steps 4 and 5 are currently hours of manual work. They wouldn’t be free with agents—output still needs triage. But the iteration speed could be fundamentally different.
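
Step 5 might produce something like the following, reusing the hypothetical `Ledger` and `Cmd` types from the earlier sketch. The three-command trace is illustrative, not a real shrunk counterexample:

```rust
// Pin the shrunk trace as an ordinary deterministic test.
#[test]
fn regression_minimal_trace_from_shrinker() {
    let mut ledger = Ledger::default();
    for cmd in [
        Cmd::Credit { acct: 1, amount: 500 },
        Cmd::Debit { acct: 1, amount: 500 },
        Cmd::Debit { acct: 1, amount: 1 },
    ] {
        ledger.apply(&cmd);
    }
    // The same invariant the property checks, now asserted on a fixed trace.
    let recomputed: u64 = ledger.balances.values().sum();
    assert_eq!(ledger.total, recomputed);
}
```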

While this complete loop hasn’t been tested end-to-end yet, each component exists and has proven valuable in isolation.

Where this probably won’t work

The agent’s advantage is structured exploration, which pays off when invariants are rich, the state machine is complex, and interesting traces require understanding. When the state space is small and the invariants are simple, blind random generation is probably enough on its own.

The combination

| Piece | Role |
| --- | --- |
| Agent skills | Encode what matters—norms, invariants, properties |
| Agent swarms | Provide parallel exploration—multiple agents, different state-space regions |
| Stateful testing | Provide the execution engine—command sequences, invariant checks, shrinking |

Each piece works alone. Together, they compound.

This hypothesis remains unproven at scale. However, the individual components have demonstrated value in isolation, and their integration appears promising.

The best testing infrastructure usually emerges that way—you notice the pieces reinforcing each other before you design the integration.


Next: Agents as Fuzzers