
Testing the Bugs Between Calls

This article is part of the Oracles, Traces, Triage series.

The short version

Agent skills + agent swarms + stateful testing could compound into something stronger than any piece alone. While this combination hasn’t been tested at scale with agents yet, the individual components have proven effective in practice. The following explores how these pieces might integrate.

The problem

In The Bugs Between Calls, I argued:

The December 2025 Prysm incident, the May 2023 finality delays: both were stacking failures. Every individual operation was valid. The trace was the problem.

If you want another non-Ethereum example of “trace-shaped” failures, the Stacks PoX-2 stack-increase bug is a good one to skim (thread). I wasn’t at Stacks at the time, and I’m not claiming I would have caught it; the point is simply that these failures often emerge from sequences and accounting state, not single calls.

What agents could add

Stateful property-based testing (described in The Bugs Between Calls) generates random command sequences, runs them, checks invariants after each step.
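
To make that concrete, here is a minimal sketch in Rust with proptest (the library the feedback loop below already assumes). The `Ledger`, `Cmd`, and the cached-total invariant are hypothetical stand-ins, not code from the original article:

```rust
use proptest::prelude::*;
use std::collections::HashMap;

// Hypothetical commands: each one is valid on its own.
#[derive(Clone, Debug)]
enum Cmd {
    Credit { acct: u8, amount: u64 },
    Debit { acct: u8, amount: u64 },
}

// Hypothetical system under test: per-account balances plus a cached total.
#[derive(Default)]
struct Ledger {
    balances: HashMap<u8, u64>,
    total: u64, // cached aggregate; the invariant below keeps it honest
}

impl Ledger {
    fn apply(&mut self, cmd: &Cmd) {
        match cmd {
            Cmd::Credit { acct, amount } => {
                *self.balances.entry(*acct).or_default() += amount;
                self.total += amount;
            }
            Cmd::Debit { acct, amount } => {
                let bal = self.balances.entry(*acct).or_default();
                let taken = (*amount).min(*bal);
                *bal -= taken;
                self.total -= taken;
            }
        }
    }
}

// Strategy that produces random command sequences.
fn cmds() -> impl Strategy<Value = Vec<Cmd>> {
    prop::collection::vec(
        prop_oneof![
            (any::<u8>(), 1..1_000u64).prop_map(|(acct, amount)| Cmd::Credit { acct, amount }),
            (any::<u8>(), 1..1_000u64).prop_map(|(acct, amount)| Cmd::Debit { acct, amount }),
        ],
        1..64,
    )
}

proptest! {
    #[test]
    fn cached_total_matches_balances(seq in cmds()) {
        let mut ledger = Ledger::default();
        for cmd in &seq {
            ledger.apply(cmd);
            // Invariant checked after every command, not only at the end,
            // so a failure points at the step where the accounting drifted.
            let recomputed: u64 = ledger.balances.values().sum();
            prop_assert_eq!(ledger.total, recomputed);
        }
    }
}
```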

The framework doesn’t know why a particular sequence might be interesting. It just tries many and hopes to stumble on something broken.

An AI agent is not blind: it can reason about which sequences are likely to be interesting before it tries them.

It wouldn’t replace the random search. It would augment it with intentional exploration.
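
Mechanically, the augmentation could be as small as folding agent-proposed traces into the generation strategy. This sketch reuses the hypothetical `Cmd` type and `cmds()` strategy from above; the weighting is arbitrary:

```rust
// `seeds` holds sequences an agent flagged as interesting
// (it must be non-empty, since `select` panics on an empty list).
fn cmds_with_seeds(seeds: Vec<Vec<Cmd>>) -> impl Strategy<Value = Vec<Cmd>> {
    prop_oneof![
        // 1 in 4 runs: start from an agent-proposed trace and append a short
        // random suffix, exploring the neighborhood of the interesting state.
        1 => (prop::sample::select(seeds), cmds()).prop_map(|(mut seed, suffix)| {
            seed.extend(suffix);
            seed
        }),
        // 3 in 4 runs: keep the purely random exploration.
        3 => cmds(),
    ]
}
```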

Three pieces

1. Agent skills define testing norms

Following the agent skills philosophy, you wouldn’t give agents step-by-step procedures. You’d give them norms: which invariants must always hold, which properties to check, which parts of the state space are worth exploring.

These go into CLAUDE.md and agent skills. The agent figures out how to test them. At least, that’s the idea.
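
As an illustration, a norms file might contain entries like these (hypothetical wording, not taken from any real CLAUDE.md):

```markdown
<!-- Hypothetical excerpt from CLAUDE.md -->
## Testing norms
- After every operation, the sum of per-account balances must equal the cached total.
- Any sequence of individually valid operations must leave the system able to finalize.
- Prefer sequences that interleave rarely combined operations; always report the shortest failing trace.
```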

2. Agent swarms parallelize exploration

With claude-swarm, you could run multiple agents against the same codebase, each exploring a different class of invariant.

Each agent pushes to the same repo. When one discovers a failing trace, the others see it on the next fetch.

No message passing needed—the test failures are the messages.
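
One concrete mechanism for this already exists: proptest persists the seeds of failing cases in `proptest-regressions/` files alongside the test source. If those files are committed to the shared repo, every agent replays every other agent’s discoveries on its next run.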

3. The feedback loop tightens

Today, you get a shrunk counterexample and figure out what it means yourself. With agents, the cycle could become:

  1. Agent generates command sequences based on norms
  2. proptest executes them, finds a failure
  3. proptest shrinks the failure to a minimal trace
  4. Agent reads the trace, generates a root-cause hypothesis
  5. Agent writes a regression test

Steps 4 and 5 are currently hours of manual work. They wouldn’t be free with agents—output still needs triage. But the iteration speed could be fundamentally different.
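
Step 5 might produce something like the following, reusing the hypothetical `Ledger` and `Cmd` types from the earlier sketch. The three-command trace is illustrative, not a real shrunk counterexample:

```rust
// Pin the shrunk trace as an ordinary deterministic test.
#[test]
fn regression_minimal_trace_from_shrinker() {
    let mut ledger = Ledger::default();
    for cmd in [
        Cmd::Credit { acct: 1, amount: 500 },
        Cmd::Debit { acct: 1, amount: 500 },
        Cmd::Debit { acct: 1, amount: 1 },
    ] {
        ledger.apply(&cmd);
    }
    // The same invariant the property checks, now asserted on a fixed trace.
    let recomputed: u64 = ledger.balances.values().sum();
    assert_eq!(ledger.total, recomputed);
}
```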

While this complete loop hasn’t been tested end-to-end yet, each component exists and has proven valuable in isolation.

Where this probably won’t work

The agent’s advantage is structured exploration, which pays off when invariants are rich, the state machine is complex, and interesting traces require understanding. When the state space is small and the invariants are simple, blind random generation is probably enough on its own.

The combination

| Piece | Role |
| --- | --- |
| Agent skills | Encode what matters—norms, invariants, properties |
| Agent swarms | Provide parallel exploration—multiple agents, different state-space regions |
| Stateful testing | Provide the execution engine—command sequences, invariant checks, shrinking |

Each piece works alone. Together, they compound.

This hypothesis remains unproven at scale. However, the individual components have demonstrated value in isolation, and their integration appears promising.

The best testing infrastructure usually emerges that way—you notice the pieces reinforcing each other before you design the integration.


Next: Agents as Fuzzers