
Agents as Fuzzers

2026-02-16T00:00:00+00:00

This article is part of the Oracles, Traces, Triage series.

The short version

A fuzzer is a search tool whose results must be triaged. An AI agent is a search tool whose results must be triaged. The parallel is not metaphorical. I think it’s structural.

Two search tools

A fuzzer explores the input space of a program, looking for inputs that violate some oracle—a crash, a hang, a property violation. When it finds something, you triage: real bug? Duplicate? Exploitable?

An AI agent explores the solution space of a problem, looking for code, fixes, or tests that satisfy some goal. When it produces something, you triage: correct? Complete? Does it address the problem?

Both search. Both produce results that need judgment. Both waste enormous time if pointed in the wrong direction.

The anatomy, side by side

Every fuzzer does four things:

Generates inputs (random, mutational, grammar-based, coverage-guided)
Executes the target with those inputs
Checks an oracle (crash? new coverage? property violation?)
Saves interesting results for triage

Every AI agent does the same four things:

Generates candidates (from prompt, codebase, agent skills)
Executes or applies them (writes code, runs tests, modifies files)
Checks an oracle (tests pass? linter clean? invariants hold?)
Saves results for triage (commits, PRs, logs)

Replace “inputs” with “candidates” and “crash” with “test failure.” The structure is identical.
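The four steps above can be written as one generic loop. This is a minimal sketch, not code from any real fuzzer or agent harness: `search`, `Verdict`, `Finding`, and the two closure hooks are hypothetical names chosen for illustration.

```rust
/// What the oracle says about one executed candidate.
#[derive(Debug, Clone, PartialEq)]
pub enum Verdict {
    Boring,
    /// Carries a human-readable reason, kept for triage.
    Interesting(String),
}

/// One saved result awaiting triage.
#[derive(Debug, Clone, PartialEq)]
pub struct Finding<C> {
    pub candidate: C,
    pub reason: String,
}

/// Generate, execute, check the oracle, save.
/// A fuzzer would plug a mutator and a crash detector into the two hooks;
/// an agent would plug in a model call and a test run (both are assumptions here).
pub fn search<C>(
    budget: usize,
    mut generate: impl FnMut(usize) -> C,
    mut execute_and_check: impl FnMut(&C) -> Verdict,
) -> Vec<Finding<C>> {
    let mut findings = Vec::new();
    for i in 0..budget {
        let candidate = generate(i);
        if let Verdict::Interesting(reason) = execute_and_check(&candidate) {
            findings.push(Finding { candidate, reason });
        }
    }
    findings
}
```

Only the two hooks differ between a fuzzer and an agent; the loop and the pile of findings that still needs triage are the same.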

What changes when the searcher understands context

Traditional fuzzers are context-blind. AFL doesn’t know what a function does. libFuzzer doesn’t understand the specification. They compensate with volume—millions of executions per second.

Context-blindness has costs:

Shallow oracles. “Did it crash?” works. “Does this violate the protocol invariant?” requires a custom harness—often harder to write than the code being tested.
Redundant exploration. Without understanding structure, the fuzzer wastes cycles in uninteresting regions of input space.
Triage burden. Many findings are duplicates, benign panics, or expected edge cases. You sort the signal from the noise.

An AI agent, by contrast:

Can read the specification
Can reason about which inputs trigger interesting behavior
Can write its own oracle and generate inputs designed to challenge it

The search becomes intentional without becoming rigid.

The convergence

Combine the pieces from this series:

Piece	Fuzzer equivalent	What it adds
Agent skills	Oracle	Richer than “did it crash?”—norms that agents translate into testable properties
Agent swarms	Multiple seeds	Parallel search where each instance can specialize, sharing findings via git
Stateful testing	Execution loop	For traces instead of single inputs

Together: context-aware search, parallel exploration, rich oracles.

Fuzzers still win at

Speed. Millions of executions/sec with a simple oracle (“did it crash?”). AFL and libFuzzer are unbeatable here.
Binary targets. No source code, no spec? Blind fuzzing is often the only option.
Deterministic reproduction. Fuzzers produce exact inputs. Agent traces may need work to become deterministic.
Corpus management. Mature fuzzers have corpus minimization, coverage tracking, seed scheduling. Agent ecosystems don’t—yet.

Agents win at

Rich invariants. “Does this sequence of state transitions preserve safety properties?” An agent can both formulate and check the invariant.
Spec-guided search. When the spec exists and is readable, agents generate targeted campaigns rather than relying on coverage alone.
Triage. An agent can produce a root-cause hypothesis before you ever see the failure. It can check for duplicates.
Harness generation. Writing fuzz harnesses is expert work. Agents can draft them from specs and iterate.
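The duplicate check mentioned under triage can start out very simple. Here is a rough sketch that buckets failure messages by a coarse signature. The normalization rule (collapse every run of digits into `#`) and the names `signature` and `bucket` are assumptions for illustration, not how any particular tool deduplicates.

```rust
use std::collections::BTreeMap;

/// Reduce a failure message to a coarse signature by replacing each run of
/// digits with a single '#', so "index 7 ..." and "index 12 ..." match.
pub fn signature(message: &str) -> String {
    let mut out = String::new();
    let mut in_number = false;
    for ch in message.chars() {
        if ch.is_ascii_digit() {
            if !in_number {
                out.push('#');
            }
            in_number = true;
        } else {
            out.push(ch);
            in_number = false;
        }
    }
    out
}

/// Group messages by signature, keeping every original message per bucket.
pub fn bucket(messages: &[&str]) -> BTreeMap<String, Vec<String>> {
    let mut buckets: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for m in messages {
        buckets.entry(signature(m)).or_default().push(m.to_string());
    }
    buckets
}
```

An agent can go further than string matching and compare root-cause hypotheses, but even a crude bucket like this cuts down the pile a human has to read.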

The spectrum

	Traditional Fuzzer	AI Agent
Input generation	Random / mutational / grammar	Context-aware / intentional
Oracle	Crash / coverage / property	Natural-language norm → property
Speed	Millions of executions/sec	Seconds to minutes per session
Context understanding	None	Deep
Triage	Manual	Agent-assisted
Parallelism	Independent seeds	Coordinated via git

The gap is narrowing. What matters is understanding which tool fits which problem—and being willing to combine them.

In practice

Traditional fuzzers for the fast, low-level search—serialization, encoding edge cases, roundtrip invariants. Simple oracles, enormous input spaces. Volume wins.
AI agents for the slow, high-level search—stateful invariants, cross-component interactions, spec compliance. Complex oracles, understanding required. Context wins.
Both together—agents generating hypotheses and fuzz harnesses, fuzzers executing at speed, agents triaging the results.

Fuzzing was barely known outside security research fifteen years ago. Standard practice after AFL and OSS-Fuzz. Table stakes today.

AI-assisted testing is on the same trajectory.

Testing the Bugs Between Calls

2026-02-15T00:00:00+00:00

This article is part of the Oracles, Traces, Triage series.

The short version

Agent skills + agent swarms + stateful testing could compound into something stronger than any piece alone. While this combination hasn’t been tested at scale with agents yet, the individual components have proven effective in practice. The following explores how these pieces might integrate.

The problem

In The Bugs Between Calls, I argued:

The most expensive failures don’t live in single function calls
They live in sequences—valid operations that, composed under load, trigger liveness incidents
Stateless property-based testing catches bugs in the bricks
Stateful property-based testing catches bugs in how the bricks stack

The December 2025 Prysm incident, the May 2023 finality delays—stacking failures. Every individual operation was valid. The trace was the problem.

If you want another non-Ethereum example of “trace-shaped” failures, the Stacks PoX-2 stack-increase bug is a good one to skim (thread). I wasn’t at Stacks at the time, and I’m not claiming I would have caught it; the point is simply that these failures often emerge from sequences and accounting state, not single calls.

What agents could add

Stateful property-based testing (described in The Bugs Between Calls) generates random command sequences, runs them, checks invariants after each step.

The framework doesn’t know why a particular sequence might be interesting. It just tries many and hopes to stumble on something broken.

An AI agent is not blind:

It can read the specification
It can study past incidents
It can reason about which sequences are likely to trigger interesting states

It wouldn’t replace the random search. It would augment it with intentional exploration.

Three pieces

1. Agent skills define testing norms

Following the agent skills philosophy, you wouldn’t give agents step-by-step procedures. You’d give them norms:

Idempotent imports: importing the same block twice must not double-apply side effects
Epoch boundaries: boundary logic must not run twice across reorgs
Invariant preservation: state transitions must preserve declared invariants

These go into CLAUDE.md and agent skills. The agent figures out how to test them. At least, that’s the idea.

2. Agent swarms parallelize exploration

With claude-swarm, you could run multiple agents against the same codebase, each exploring a different class of invariant:

One explores idempotence
Another targets epoch boundaries
Another maintains the test infrastructure

Each agent pushes to the same repo. When one discovers a failing trace, the others see it on the next fetch.

No message passing needed—the test failures are the messages.

3. The feedback loop tightens

Today, you get a shrunk counterexample and figure out what it means yourself. With agents, the cycle could become:

1. Agent generates command sequences based on norms
2. proptest executes them, finds a failure
3. proptest shrinks the failure to a minimal trace
4. Agent reads the trace, generates a root-cause hypothesis
5. Agent writes a regression test

Steps 4 and 5 are currently hours of manual work. They wouldn’t be free with agents—output still needs triage. But the iteration speed could be fundamentally different.

While this complete loop hasn’t been tested end-to-end yet, each component exists and has proven valuable in isolation.
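To make the shrinking step concrete, here is a deliberately naive shrinker. It is a stand-in for illustration only: proptest's own shrinking works differently and is more thorough. `Cmd`, `double_import`, and `shrink` are hypothetical names.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
pub enum Cmd {
    Tick,
    Import(u8), // block root, shortened to one byte for the sketch
}

/// Hypothetical failure predicate: some root is imported at least twice.
pub fn double_import(trace: &[Cmd]) -> bool {
    let mut seen = HashSet::new();
    for cmd in trace {
        if let Cmd::Import(root) = cmd {
            if !seen.insert(*root) {
                return true;
            }
        }
    }
    false
}

/// Single greedy pass: try dropping each command in turn and keep the
/// shorter trace whenever it still fails.
pub fn shrink<T: Clone>(trace: &[T], fails: impl Fn(&[T]) -> bool) -> Vec<T> {
    let mut current = trace.to_vec();
    let mut i = 0;
    while i < current.len() {
        let mut candidate = current.clone();
        candidate.remove(i);
        if fails(&candidate) {
            current = candidate; // command i was not needed to reproduce
        } else {
            i += 1; // command i is needed; keep it and move on
        }
    }
    current
}
```

The shrunk trace is what the agent would read in step 4. The shorter it is, the less room there is for a wrong root-cause hypothesis.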

Where this probably won’t work

Enormous state spaces, simple invariants. If your oracle is “did it crash,” a traditional fuzzer wins. Agents are slow by comparison.
Precise mathematical constraints. When the goal is “find the exact input satisfying this formal constraint,” SMT solvers are more reliable. Agents reason about code, but they don’t exhaustively search a constraint space.

The agent’s advantage: structured exploration—when invariants are rich, the state machine is complex, and interesting traces require understanding.

The combination

Piece	Role
Agent skills	Encode what matters—norms, invariants, properties
Agent swarms	Provide parallel exploration—multiple agents, different state-space regions
Stateful testing	Provide the execution engine—command sequences, invariant checks, shrinking

Each piece works alone. Together, they compound.

This hypothesis remains unproven at scale. However, the individual components have demonstrated value in isolation, and their integration appears promising.

The best testing infrastructure usually emerges that way—you notice the pieces reinforcing each other before you design the integration.

Next: Agents as Fuzzers

Agent Teams and claude-swarm

2026-02-08T00:00:00+00:00

This article is part of the Oracles, Traces, Triage series.

One agent hits a ceiling

A single Claude Code session can do one thing at a time. For small tasks—fix this function, write that test—that’s fine. But the work I care about is not small. Exploring multiple hypotheses in parallel, maintaining documentation while debugging, running specialized analysis while generating test harnesses.

One agent, one task, one context window. It doesn’t scale.

The agent-team pattern

In early February 2026, Anthropic published Building a C Compiler with Large Language Models—a detailed account of 16 Claude instances working in parallel to produce a 100,000-line Rust-based C compiler capable of building the Linux kernel. The total: nearly 2,000 Claude Code sessions, 2 billion input tokens, 140 million output tokens.

The architecture was surprisingly simple. No orchestrator. No message bus. No shared memory. Just git.

Each agent ran in a Docker container. Each cloned a shared bare repo, worked on a task, and pushed. When two agents tried to claim the same task, git's synchronization forced the second one to pick something else. Merge conflicts happened often; Claude was smart enough to resolve them.

The key insight: coordination through the codebase itself. The repo is the shared state. Commits are the messages. Locks are text files.
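"Locks are text files" can be sketched in a few lines. This covers only the local half, inside one working copy. Across containers, the agent would presumably commit the lock file and push, and a rejected push would signal that another agent claimed the task first. The `try_claim` function and the `<task>.lock` naming are assumptions for illustration, not the layout Anthropic or claude-swarm uses.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Try to claim `task` by creating `<locks_dir>/<task>.lock`.
/// Returns Ok(true) if this call created the file, Ok(false) if it already existed.
pub fn try_claim(locks_dir: &Path, task: &str, agent: &str) -> io::Result<bool> {
    std::fs::create_dir_all(locks_dir)?;
    let path = locks_dir.join(format!("{task}.lock"));
    // create_new(true) fails with AlreadyExists instead of overwriting.
    match OpenOptions::new().write(true).create_new(true).open(&path) {
        Ok(mut file) => {
            writeln!(file, "claimed by {agent}")?;
            Ok(true)
        }
        Err(e) if e.kind() == io::ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}
```

The lock is an ordinary file in the repo, so every other agent can read who holds it with nothing more than a fetch.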

claude-swarm

This pattern can be implemented through claude-swarm—a reusable harness currently wired for running multiple Claude Code sessions in Docker containers, coordinating through git. The coordination pattern itself is tool-agnostic; claude-swarm is just one concrete implementation.

export ANTHROPIC_API_KEY="sk-ant-..."
export AGENT_PROMPT="path/to/prompt.md"
./tools/claude-swarm/launch.sh start
./tools/claude-swarm/launch.sh status
./tools/claude-swarm/launch.sh stop

The design is minimal by conviction, not by laziness:

Host                        /tmp (bare repos)
~/project/ ── git clone ──> project-upstream.git (rw)
               --bare       project-mirror-*.git (ro)
                                     |
                                     | docker volumes
                                     |
               .---------------------+---------------------.
               |                     |                     |
           Container 1          Container 2          Container 3
           /upstream  (rw)      /upstream  (rw)      /upstream  (rw)
           /mirrors/* (ro)      /mirrors/* (ro)      /mirrors/* (ro)
               |                     |                     |
               v                     v                     v
           /workspace/          /workspace/          /workspace/
           (agent-work)         (agent-work)         (agent-work)

All containers mount the same bare repo. When one agent pushes, others see the changes on the next fetch. Each container runs harness.sh, which clones, resets to origin/agent-work, runs one Claude session, and loops. Agents stop after a configurable number of idle sessions with no commits.
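The stop condition is the only non-obvious part of that loop. Below is a Rust model of the control flow, not the actual harness.sh (which is shell). It assumes "idle" means consecutive sessions with zero commits, and the names `run_until_idle` and `SessionResult` are hypothetical.

```rust
/// Outcome of one agent session, as the harness sees it.
pub struct SessionResult {
    pub commits: usize,
}

/// Run sessions until `max_idle` consecutive sessions produce no commits.
/// Returns (sessions run, total commits).
pub fn run_until_idle(
    max_idle: usize,
    mut run_session: impl FnMut(usize) -> SessionResult,
) -> (usize, usize) {
    let mut idle = 0;
    let mut sessions = 0;
    let mut total_commits = 0;
    while idle < max_idle {
        // In the real harness this is where clone, reset, and one Claude session happen.
        let result = run_session(sessions);
        sessions += 1;
        total_commits += result.commits;
        if result.commits == 0 {
            idle += 1;
        } else {
            idle = 0; // any progress resets the idle counter
        }
    }
    (sessions, total_commits)
}
```

Resetting the counter on every productive session means an agent only stops once the repo has genuinely stopped offering it work.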

Why no orchestrator

The temptation is always to add a coordinator—something that assigns tasks, monitors progress, resolves conflicts. This approach avoids orchestration for the same reason it avoids workflow verbs in CLAUDE.md: centralized control tends to reduce agent autonomy and reasoning capabilities.

With no orchestrator, each agent must orient itself. It reads the README, checks the current state of the code, looks at what other agents have done, and decides what to work on next. This mirrors how good engineering teams actually function: shared context, local autonomy, coordination through artifacts.

Anthropic’s experience confirmed the pattern. Their agents maintained running docs of failed approaches. They took locks on tasks by writing text files. They specialized naturally—one agent coalescing duplicate code, another improving performance, another working on documentation.

Specialization is possible, not required

Right now, all agents in claude-swarm share the same prompt. They self-organize by looking at the repo and picking different things to work on.

Anthropic’s experience suggests that per-agent prompts (one focused on code quality, another on test coverage, another on documentation) can help at scale. claude-swarm supports that (just point AGENT_PROMPT at different files per container), but in practice, shared prompts often suffice for initial implementations. Agents typically self-organize effectively without specialized prompts.

This connects to the agent skills philosophy: the prompt shapes behavior. The harness just runs the loop.

When it works, when it doesn’t

Agent swarms work best when the problem decomposes into independent sub-tasks—many distinct failing tests, different modules, separate components. Each agent picks a different piece, and parallelism is trivial.

They struggle when the problem is monolithic. Anthropic hit this when compiling the Linux kernel: every agent would hit the same bug, fix it independently, and then overwrite the others’ changes. Their solution was to use GCC as an oracle and randomly split compilation between GCC and their compiler, letting each agent work on different failing file subsets.
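One way to picture the random split is a deterministic per-file coin flip. This is only a sketch of the idea, not Anthropic's mechanism: the `assign` function, the seed-per-agent scheme, and the percentage knob are all assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Compiler {
    Reference, // the trusted oracle, e.g. GCC
    Candidate, // the compiler under test
}

/// Assign a source file to a compiler, deterministically per (seed, path).
/// `candidate_percent` (0..=100) is the share of files sent to the candidate.
pub fn assign(seed: u64, path: &str, candidate_percent: u64) -> Compiler {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    path.hash(&mut hasher);
    if hasher.finish() % 100 < candidate_percent {
        Compiler::Candidate
    } else {
        Compiler::Reference
    }
}
```

If each agent uses a different seed, each sees a different subset compiled by the candidate. When a build breaks, the fault is confined to that subset, and two agents are less likely to chase the same bug.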

For testing work, the decomposition is usually natural. Different invariants to test. Different modules to fuzz. Different state-machine paths to explore. The swarm pattern fits.

What this is really about

claude-swarm is about 200 lines of shell. It’s not the point.

The point is that the agent-team pattern—N autonomous agents, shared codebase, no central control—is a genuine paradigm for how AI-assisted work can scale. It’s not about making one agent smarter. It’s about making many agents productive together, the same way you’d make a team of engineers productive: clear context, local ownership, shared truth in the repo.

The C compiler was the proof of concept. Fuzz testing is where I’m applying it.

Next: Testing the Bugs Between Calls

Agent Skills and claude-lint

2026-02-01T00:00:00+00:00

This article is part of the Oracles, Traces, Triage series.

The temptation

The first thing most people do with a CLAUDE.md file is write a recipe. Step 1, do this. Step 2, do that. If you see an error, run this command. Here’s a code block you can paste.

It works. For about a week. Then the codebase shifts, the recipe goes stale, and the model follows outdated instructions with the confidence of someone who doesn’t know they’re wrong.

I’ve seen this pattern before. It’s the same failure mode as over-specified test fixtures: the more you hard-code the steps, the more brittle the system becomes. The test passes for the wrong reasons. The agent succeeds for the wrong reasons.

Context should shape reasoning, not script behavior

This distinction proves crucial in practice. When .claude/ directories emphasize workflows over norms, models tend to follow outdated instructions rigidly. When structured around principles and facts, models demonstrate greater adaptability to changing contexts.

Whether this constitutes “reasoning from principles” in a deep sense remains an open question. However, the resulting outputs consistently demonstrate improved quality and relevance.

Think about it from a testing perspective. A unit test that asserts f(3) == 7 checks one input. A property that asserts for all x: f(f_inverse(x)) == x checks the relationship.

Change how f computes internally and the property still holds—it only cares that the roundtrip works. The hard-coded assertion breaks the moment the mapping shifts.
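Here is the roundtrip property in a small, dependency-free form. The varint codec is just a convenient example pair of inverse functions, and the xorshift generator stands in for the input generation a property-testing library would provide. All names are illustrative.

```rust
/// Encode a u64 as an unsigned LEB128 varint.
pub fn encode(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80); // high bit set: more bytes follow
    }
}

/// Decode an unsigned LEB128 varint; None on truncated, trailing, or over-long input.
pub fn decode(bytes: &[u8]) -> Option<u64> {
    let mut value = 0u64;
    for (i, byte) in bytes.iter().enumerate() {
        if i >= 10 {
            return None; // a u64 never needs more than 10 bytes
        }
        value |= u64::from(*byte & 0x7f) << (7 * i);
        if (*byte & 0x80) == 0 {
            return if i + 1 == bytes.len() { Some(value) } else { None };
        }
    }
    None
}

/// Tiny xorshift generator; the state must be non-zero.
pub fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// The property: for every sampled x, decode(encode(x)) == x.
pub fn roundtrip_holds(samples: usize, seed: u64) -> bool {
    let mut state = seed.max(1);
    (0..samples).all(|_| {
        let raw = xorshift(&mut state);
        let x = raw >> (raw % 64); // vary the magnitude so short encodings are covered too
        decode(&encode(x)) == Some(x)
    })
}
```

A hard-coded check like `encode(300) == [0xac, 0x02]` pins one mapping. `roundtrip_holds` keeps passing if the encoding changes, as long as decoding still inverts it.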

Same idea. A CLAUDE.md that says “run cargo test after every change” is a hard-coded assertion. A CLAUDE.md that says “all changes must pass the existing test suite” is a property. The model can figure out how to run the tests. What it needs from you is what matters.

The layers

Over time, I’ve settled on a layered structure for .claude/ directories:

Layer	What belongs	What doesn’t
`CLAUDE.md`	Norms, facts, project conventions	Workflow verbs (“step 1”, “then do”), code blocks
`agents/*.md`	Perspective, values (≤120 lines)	Procedures, code blocks
`skills/*/SKILL.md`	Capabilities (≤500 lines)	Success criteria, code blocks
`references/*.md`	Playbooks, optional reference material	Missing “optional” declaration

CLAUDE.md is the constitution. Short. Declarative. “This project uses Rust.” “Tests must pass before commits.” “Prefer explicit error handling over unwrap.” No instructions on how to do things—just what matters.

Agents get a perspective. If you have a code-quality agent, it gets values like “favor readability over cleverness” and “flag any function longer than 40 lines.” It doesn’t get a checklist.

Skills describe capabilities the model can use—not step-by-step procedures. A skill for “running fuzzers” says what the fuzzer does, what inputs it expects, what success looks like at a high level. It does not contain a bash script.

References are the escape hatch. Sometimes you genuinely need a playbook—a deployment procedure, a migration guide. References hold those, but they must declare themselves as optional. The model should know these are reference material, not marching orders.

claude-lint

A Rust CLI tool called claude-lint helps enforce these patterns by checking .claude/ directories for violations.

$ claude-lint .claude
ok: .claude passes all checks

$ claude-lint /path/to/.claude
error: /path/to/.claude/CLAUDE.md: contains workflow verb 'step 1'
error: /path/to/.claude/skills/foo/SKILL.md: contains fenced code block
2 error(s)

It checks for:

Workflow verbs in CLAUDE.md (e.g., “step 1”, “then run”, “next, do”)
Code blocks where they don’t belong (everywhere except references)
Line limits on agents (≤120) and skills (≤500)
Missing “optional” declarations in reference files

It’s deliberately strict. The point is not to make .claude/ directories pleasant to read. The point is to keep them in the shape where I’ve seen the model produce the best results.
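The checks listed above are simple enough to sketch. This is not claude-lint's source. It is a minimal illustration that assumes case-insensitive substring matching, uses only the three example phrases from the list as its verb table, and invents the wording of the last two error messages.

```rust
/// Example phrases only; the real tool's list is not shown in this post.
const WORKFLOW_VERBS: [&str; 3] = ["step 1", "then run", "next, do"];

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    ClaudeMd,
    Agent,
    Skill,
    Reference,
}

/// Return one message per violated rule for a single file's text.
pub fn lint(kind: Kind, text: &str) -> Vec<String> {
    let mut errors = Vec::new();
    let lower = text.to_lowercase();
    let fence = "`".repeat(3); // three backticks, built indirectly

    if kind == Kind::ClaudeMd {
        for verb in WORKFLOW_VERBS {
            if lower.contains(verb) {
                errors.push(format!("contains workflow verb '{verb}'"));
            }
        }
    }
    // Code blocks are allowed only in references.
    if kind != Kind::Reference
        && text.lines().any(|l| l.trim_start().starts_with(fence.as_str()))
    {
        errors.push("contains fenced code block".to_string());
    }
    let limit = match kind {
        Kind::Agent => Some(120),
        Kind::Skill => Some(500),
        _ => None,
    };
    if let Some(max) = limit {
        let n = text.lines().count();
        if n > max {
            errors.push(format!("{n} lines exceeds limit of {max}"));
        }
    }
    if kind == Kind::Reference && !lower.contains("optional") {
        errors.push("missing 'optional' declaration".to_string());
    }
    errors
}
```

Substring matching will produce false positives. For a linter that is meant to be strict, that is arguably the right direction to err in.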

Why this matters in practice

I use Claude Code this way in practice: given structured context, it explores edge cases, generates test harnesses, and reasons about state-machine invariants. In my experience, the quality of the output tracks the quality of the context it is given.

When I embed workflows, the model sticks to them—even when they’re wrong for the current situation. When I embed norms (“never skip precondition checks”, “all state transitions must be tested for idempotence”), I get output that adapts to whatever the model finds in the codebase.

Whether that’s “reasoning from norms” or just the model having more room to draw on its training, I can’t say for certain. What I can say is that the parallel to property-based testing feels right. Properties tell the system what must hold. The system figures out how to check it. Norms tell the model what matters. The model figures out how to act on it.

Same shape. I’ll take it.

Oracles, Traces, Triage (series index)
The Bugs Between Calls

Next: Agent Teams and claude-swarm

The Bugs Between Calls

2026-01-31T00:00:00+00:00

Anthropic just showed how far property-based testing can go when you can express a property at a function boundary. Their agent generated Hypothesis tests for real-world libraries and validated/reported several bugs in NumPy, Pandas, and SciPy.

One important gap is that, while we still see critical bugs in single calls (SSZ decoding, BLS edge cases), many of the most expensive recent failures live between calls.

In many cases, the protocol rules are fine; the failure is a valid-but-expensive trace that turns into a liveness incident under load.

December 2025: Shortly after Fusaka activated (Dec 3, 2025), Prysm hit a resource-exhaustion path processing certain attestations, dropping network participation to ~75% and pushing voting participation as low as ~74.7% in some epochs—uncomfortably close to the 2/3 stake threshold required for finality. In this incident, attestations referencing a previous-epoch block root could trigger repeated state recreation, replay, and epoch-transition recomputation, exhausting node resources under load. (See the Prysm mainnet postmortems for the primary write-up.)

May 2023: Mainnet finality was delayed twice within ~24 hours (first ~4 epochs, then ~9). The trigger was valid old-target attestations that forced expensive beacon-state regeneration in some clients; diversity helped the chain recover without intervention. (Postmortem: Ethereum Mainnet Finality Incident (May 2023).)

April 2023: Stacks hit a PoX-2 bug in stack-increase that impacted Stacking rewards for a cycle. The details are different, but the shape is familiar: stateful logic where correctness is about how an accounting state evolves over a sequence of actions, not a single call in isolation. I wasn’t at Stacks at the time, and I’m not claiming “I would have caught it” — I mention it because it’s a clean example of why tests that exercise traces (not just inputs) matter. (Thread: A bug in stacks-increase call is impacting Stacking rewards this cycle.)

Each operation was valid. The sequence proved problematic only under load.

Many expensive bugs live in sequences that look fine individually.

Stateless properties (and where they stop)

Stateless properties shine when:

the function boundary is the correctness boundary
behavior is local to a single invocation
invariants don’t depend on history

This covers a lot of “pure-ish” code: parsing, formatting, serialization, numerical edge cases.

But consensus software is not primarily pure functions.

Consensus clients are state machines

Ethereum’s consensus clients (e.g., Lighthouse, Prysm, Teku, Grandine, Nimbus, Lodestar) implement a long-lived state machine:

per-slot processing (slots advance, duties change, messages arrive out of order)
data availability (verifying that required data is available; evolving toward Data Availability Sampling via PeerDAS)
fork choice (multiple competing branches, attestation-weighted via LMD-GHOST)
finality (justified/finalized checkpoints that must only move forward)
storage and replay (idempotence, witness caching, pruning, reorgs)

Correctness is rarely “the output of one function call”. It’s “the system’s behavior over a trace”.

Examples of stateful invariants in consensus clients

Here are a few invariants that are naturally history-dependent:

Finality is monotonic: the finalized checkpoint’s epoch must never decrease. (Finality can stall; it must not regress.)
Fork choice respects finality: once a checkpoint is finalized, the selected head must be a descendant of it. (Heads can reorg; finalized history cannot.)
Data Availability gates what validators can accept/vote for: A block header is not enough in 2026. Availability is enforced via fork-choice/voting rules: validators should only accept and vote for blocks once sufficient data availability has been verified (today: all blobs; with PeerDAS: sampling cells/columns). Fork choice can only safely give full weight to blocks that validators can legally vote for. Testing the transition from “pending availability” to “available and valid” is a classic stateful trace.
Stake-weighted participation (MaxEB): Participation was always stake-weighted, but MaxEB makes the variance visible by raising the cap from 32 ETH to 2048 ETH. Not every validator will immediately sit at the cap, but stake weight per validator can now vary widely, so a bug that affects a handful of high-effective-balance validators can represent outsized stake impact.
Idempotent imports: importing the same block twice (same root) must not double-apply side effects (DB indexes, caches, votes, metrics, etc.).
Equivocations must be handled, not assumed away: you can see multiple distinct blocks for the same slot. A client shouldn’t panic or corrupt state just because reality is adversarial.
Epoch-boundary logic must not run twice: “do X once per epoch” bugs are classic state-machine failures when reorgs, retries, and partial persistence meet.

If you try to phrase these as “f(x) preserves P”, you end up smuggling “history” into x until it stops being a useful boundary.

Take “finality is monotonic.” You might try:

for all (old_finalized, new_finalized):
    process_block(...) implies new_finalized >= old_finalized

But now old_finalized is part of the input. Where does it come from? You have to generate it. And to generate a valid old state, you need to know what sequence of blocks led there. You’ve just reinvented traces—badly.

The honest framing is: “after any valid sequence of operations, the finalized epoch never decreases”. That’s a property over traces, not over inputs.
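Stated over a trace, the check itself becomes almost trivial. A minimal sketch, assuming the trace is recorded as a list of snapshots; `Checkpoint` and `first_finality_regression` are hypothetical names, and a real beacon state carries far more than one field.

```rust
/// Snapshot of the only field this property cares about.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Checkpoint {
    pub finalized_epoch: u64,
}

/// "Finality is monotonic" as a property over a trace of snapshots.
/// Returns the index of the first snapshot whose finalized epoch is lower
/// than its predecessor's, or None if finality never regresses.
pub fn first_finality_regression(trace: &[Checkpoint]) -> Option<usize> {
    trace
        .windows(2)
        .position(|pair| pair[1].finalized_epoch < pair[0].finalized_epoch)
        .map(|i| i + 1)
}
```

The hard part is not this function. It is generating valid sequences of operations that produce the snapshots, which is what the stateful machinery is for.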

Model-based, stateful property-based testing

Stateful testing makes the history explicit:

State --(Command)--> State'

Instead of generating inputs for a single call, you generate commands and run them as a scenario. The bug is often not in any single step, but in a particular ordering of steps.

This idea is old and battle-tested (QuickCheck state machine testing, Hedgehog, proptest-state-machine), but to my knowledge still underused in many production systems.

The same approach, built into madhouse-rs, caught a production bug in the Stacks blockchain that traditional testing missed. A 533-line integration test failed to reproduce it. A chaotic command sequence succeeded.

Model-based, stateful testing has also been applied to the Stacks PoX contracts, where it proved practical for ongoing use.

A minimal Rust harness (the “boring runner”)

The core trick is to keep the runner boring and put all the logic in commands. This is the same shape that scales in practice.

State and context

pub trait State: std::fmt::Debug {}

pub trait TestContext: std::fmt::Debug + Clone {}

For the examples below, assume an empty context:

#[derive(Debug, Clone, Default)]
pub struct BeaconContext;
impl TestContext for BeaconContext {}

Commands

use proptest::prelude::*;
use std::sync::Arc;

pub trait Command<S: State, C: TestContext>:
    std::fmt::Debug + Send + Sync
{
    // Precondition: is this command meaningful *now*.
    fn check(&self, state: &S) -> bool;

    // Apply the transition and assert postconditions.
    fn apply(&self, state: &mut S);

    // For debugging and shrunk traces.
    fn label(&self) -> String;

    // Generate commands.
    fn build(ctx: Arc<C>)
        -> impl Strategy<Value = CommandWrapper<S, C>>
    where
        Self: Sized;
}

Wrapper for heterogeneous sequences

// Debug is derived because proptest requires `Strategy::Value: Debug`.
#[derive(Clone, Debug)]
pub struct CommandWrapper<S: State, C: TestContext> {
    pub command: Arc<dyn Command<S, C>>,
}

impl<S: State, C: TestContext> CommandWrapper<S, C> {
    pub fn new<T>(t: T) -> Self
    where
        T: Command<S, C> + 'static,
    {
        Self { command: Arc::new(t) }
    }
}

Execution loop

pub fn execute_commands<S: State, C: TestContext>(
    commands: &[CommandWrapper<S, C>],
    state: &mut S,
) {
    for cmd in commands {
        if cmd.command.check(state) {
            cmd.command.apply(state);
        }
    }
}

The point is locality: generation, preconditions, transition logic, and invariants live together. That design choice is exactly the “data-open” side of the expression problem, and it’s why these harnesses survive contact with real systems.

A consensus-client-flavored example (with correct slot semantics)

One easy trap is to assume “there is only one block per slot”. In the spec there is one proposer per slot, but on the network you can see:

equivocations (two blocks for the same slot from the proposer)
different views due to propagation delays
reorgs that temporarily make a “worse” chain the head

So a stateful invariant should not be “reject a second block at slot s”. That’s not how fork choice works.

Instead, here’s a deliberately small example that matches real failure modes: idempotence by block root. If a client re-imports the same block (same root), it must not double-apply side effects.

Model state

use std::collections::{HashMap, HashSet};

#[derive(Debug, Default)]
struct BeaconModel {
    current_slot: u64,

    // Slot -> set of known block roots at that slot (forks allowed).
    known_by_slot: HashMap<u64, HashSet<[u8; 32]>>,

    // In 2026, participation is stake-weighted (MaxEB / EIP-7251).
    // Total weight of unique blocks we've imported.
    total_imported_weight: u64,

    // Track which block states are available to prevent the 2025 Prysm regression
    // (expensive state regeneration when validating attestations for uncached blocks).
    state_cache: HashSet<[u8; 32]>,
}
impl State for BeaconModel {}

Command: tick time

#[derive(Debug)]
struct TickSlot;

impl Command<BeaconModel, BeaconContext> for TickSlot {
    fn check(&self, _state: &BeaconModel) -> bool { true }

    fn apply(&self, state: &mut BeaconModel) {
        state.current_slot += 1;
    }

    fn label(&self) -> String { "TICK_SLOT".to_string() }

    // `build` (the generator required by the trait) is omitted in these examples for brevity.
}

Command: import a block (stake-weighted, idempotent on duplicates)

#[derive(Debug)]
struct ImportBlock {
    slot: u64,
    root: [u8; 32],
    weight: u64, // Stake-weighted via MaxEB.
}

impl Command<BeaconModel, BeaconContext> for ImportBlock {
    fn check(&self, state: &BeaconModel) -> bool {
        self.slot <= state.current_slot
    }

    fn apply(&self, state: &mut BeaconModel) {
        let entry = state
            .known_by_slot
            .entry(self.slot)
            .or_default();
        let is_new = entry.insert(self.root);

        // This is the invariant: same root must not be "new" twice.
        // Stake-weighting means a duplicate root shouldn't double-count weight.
        if is_new {
            state.total_imported_weight += self.weight;
            state.state_cache.insert(self.root);
        }
    }

    fn label(&self) -> String {
        format!("IMPORT_BLOCK(slot={}, weight={})", self.slot, self.weight)
    }
}

// The bug from December 2025: attestations for stale blocks 
// triggering expensive state regeneration.
#[derive(Debug)]
struct ProcessAttestation {
    slot: u64,
    block_root: [u8; 32],
}

impl Command<BeaconModel, BeaconContext> for ProcessAttestation {
    fn check(&self, state: &BeaconModel) -> bool {
        self.slot <= state.current_slot
    }

    fn apply(&self, state: &mut BeaconModel) {
        // Invariant: looking up state for an attestation must not 
        // cause a "miss" that triggers an expensive re-play.
        assert!(
            state.state_cache.contains(&self.block_root),
            "State cache miss for block {:?} at slot {}", 
            self.block_root, self.slot
        );
    }

    fn label(&self) -> String {
        format!("PROCESS_ATTESTATION(slot={})", self.slot)
    }
}

If the implementation accidentally increments counters, updates indexes, or applies cached transitions twice on duplicate import, a failing trace usually shrinks to something like:

[
  TICK_SLOT,
  IMPORT_BLOCK(slot=1, root=R),
  IMPORT_BLOCK(slot=1, root=R),
]

If the system is not truly idempotent, stateful testing will reduce a complex failure down to the smallest possible sequence—often just “process the same thing twice”—making the bug obvious and undeniable.

That is the shape of a lot of consensus-client failures: not “wrong return value”, but “the second time through a path, something subtle breaks”.
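To make that shape concrete, here is a minimal, std-only sketch. `NaiveImporter` is a hypothetical stand-in for an implementation that forgets to dedupe; it is not taken from any real client. `Model` mirrors the `ImportBlock` bookkeeping above, and `first_divergence` (also a made-up helper) replays a trace against both and reports the first step where the totals disagree.

```rust
use std::collections::HashSet;

type Root = [u8; 32];

/// Model: weight is counted once per unique root (the invariant).
#[derive(Default)]
struct Model {
    seen: HashSet<Root>,
    total_weight: u64,
}

impl Model {
    fn import(&mut self, root: Root, weight: u64) {
        // `insert` returns true only the first time a root is seen.
        if self.seen.insert(root) {
            self.total_weight += weight;
        }
    }
}

/// Hypothetical buggy implementation: counts weight on every import.
#[derive(Default)]
struct NaiveImporter {
    total_weight: u64,
}

impl NaiveImporter {
    fn import(&mut self, _root: Root, weight: u64) {
        self.total_weight += weight;
    }
}

/// Replays a trace against both and returns the index of the first
/// step where the totals diverge, if any.
fn first_divergence(trace: &[(Root, u64)]) -> Option<usize> {
    let mut model = Model::default();
    let mut sut = NaiveImporter::default();
    for (i, (root, weight)) in trace.iter().enumerate() {
        model.import(*root, *weight);
        sut.import(*root, *weight);
        if model.total_weight != sut.total_weight {
            return Some(i);
        }
    }
    None
}
```

Two distinct roots never diverge; the same root twice diverges on the second import, which is exactly the two-step trace a shrinker would hand back.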

Stateless + stateful is the real combination

You want both:

Stateless property-based testing (PBT) for pure-ish components: SSZ encoding, BLS wrappers, serialization, bitfields. This is where Anthropic’s approach shines.
Stateful PBT for the hard parts: fork choice, finality logic, DB/replay, reorg handling, epoch boundaries.

Stateless PBT finds bugs in the bricks (SSZ, BLS). Stateful PBT finds bugs in how the bricks stack—especially in the high-stakes world of PeerDAS and stake-weighted participation.

Anthropic showed us how to check the bricks really well. This post is about the wall.

Oracles, Traces, Triage

2026-01-25T00:00:00+00:00

In 2026, “agentic” tooling is moving fast enough that yesterday’s workflow advice goes stale quickly. This series is my attempt to write down the parts that seem durable: how to give agents norms instead of scripts, how to coordinate multiple agents through the repo, and how those patterns connect back to fuzzing and stateful testing.

Why

Skepticism is healthy here. Agent outputs still need oracles, reproducibility, and a bar for correctness. The tools are changing quickly; the craft is not.

Agents can draft tests fast; the hard part is still choosing the right oracles and insisting on reproducible failures.

Examples use Claude Code because that’s what I run day-to-day, but the patterns are meant to travel to any agent that can read a codebase, run checks, and write down findings.

This is not a tutorial. It’s a practitioner’s notebook.

The lens: oracles, traces, triage

If there’s a unifying theme here, it’s that most bug-finding systems succeed or fail on three things:

Oracles — how you decide something is wrong. Not just “did it crash?”, but invariants, spec checks, and properties that reflect what you actually care about.
Traces — many expensive bugs live in sequences, not calls. Stateful testing is about generating and shrinking traces until the failure is undeniable.
Triage — search produces noise. The work is making findings reproducible, minimal, deduped, and actionable (ideally as regression tests).

That’s also a useful way to read the series:

Part 1 (skills/norms) mostly expands the oracle surface (“what matters”).
Part 2 (agent teams) scales the search and improves the artifacts that enable triage.
Part 3 (between calls/stateful) is explicitly about traces.
Part 4 (agents as fuzzers) argues agents are search tools too—and the hard parts are still oracles + triage.

Articles

Part 1 - Agent Skills and claude-lint
How to structure .claude/ context so agents adapt from norms instead of blindly following scripts (plus a linter to keep it from drifting).
Part 2 - Agent Teams and claude-swarm
A practical pattern for parallel agents that coordinate through git, with no orchestrator—because the repo can be the shared state.
Part 3 - Testing the Bugs Between Calls
How agent skills, agent swarms, and stateful testing could combine to find consensus bugs that live in traces.
Part 4 - Agents as Fuzzers
A structural analogy: both fuzzers and AI agents search for failures that require triage and oracles.

Companion post

If you want the deeper motivation for “why traces,” start here:

The Bugs Between Calls

Part 5 - Self-hosted agents on Runpod (and friends)
Turning inference into a reliable test service: latency/cost knobs, guardrails, artifacts, and how to run agent loops against real repos.
Part 6 - Quantization as a feature: cheap tests when deep reasoning isn’t needed
Using smaller/quantized models for throughput work (scaffolding, formatting, test expansion) and reserving big models for judgment-heavy steps.
Part 7 - Corpus, shrink, triage: turning agent output into a fuzzing pipeline
How to dedupe/minimize failures and turn “agent finds” into reproducible bug packets and long-lived regression corpora.

Background

Most of what I know about testing came from shipping production systems and learning in public through open source: contributing to AutoFixture starting around 2011, then maintaining Hedgehog, which once powered Echidna, an early and widely used property-based fuzzer for Ethereum smart contracts.

Along the way: Fare for regex-constrained test generation, a SplitMix port for reproducible failure discovery, and consensus fuzzers at Stacks that caught a production bug a 533-line integration test couldn’t reproduce.

That background is why I’m interested in AI tooling—not as a replacement for any of this, but as a way to do more of it.

Feedback

The ideas in this series come from daily practice—shipping agent-assisted testing tools for real protocol security work. But daily practice has blind spots.

If you think I’m wrong about something, I’d like to hear it. If you think I’m right but missing a nuance, I’d especially like to hear that.

Next: Agent Skills and claude-lint

The Expression Problem in Practice: A Trait-Based Testing Harness

2025-03-25T00:00:00+00:00

This post is part of the Model-Based Stateful Testing with madhouse-rs series.

We started this series with a production bug that couldn’t be reproduced. We end with a framework that can not only catch that bug but also fundamentally change how we think about testing complex systems. The journey reveals practical lessons about the expression problem that extend far beyond testing.

The Design That Emerged

Through trial and error, madhouse-rs converged on a simple but powerful architecture, as described in the whitepaper commit:

Each Command follows a predictable lifecycle:

Generated by a proptest Strategy
Validated via check against current state
Applied via apply, mutating both model and real system
Verified through assertions and postconditions
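The lifecycle above can be sketched without any dependencies. This is a pared-down, hypothetical `Command` trait (no `build`, no context), not the madhouse-rs API: generation is replaced by a hand-written list, and the assumption that a command whose `check` fails is simply skipped is mine.

```rust
use std::fmt::Debug;

/// Hypothetical, pared-down stand-in for the Command trait:
/// just enough to show check -> apply -> verify.
trait Command<S>: Debug {
    fn check(&self, state: &S) -> bool;
    fn apply(&self, state: &mut S);
    fn label(&self) -> String;
}

#[derive(Debug, Default)]
struct Counter {
    value: u64,
}

#[derive(Debug)]
struct Add(u64);

impl Command<Counter> for Add {
    fn check(&self, state: &Counter) -> bool {
        // Precondition: the result must stay within a cap of 10.
        state.value + self.0 <= 10
    }

    fn apply(&self, state: &mut Counter) {
        let before = state.value;
        state.value += self.0;
        // Postcondition, verified right after the mutation.
        assert_eq!(state.value, before + self.0);
    }

    fn label(&self) -> String {
        format!("ADD({})", self.0)
    }
}

/// Runs already-generated commands: skips those whose check fails,
/// applies the rest, and returns the labels of what actually ran.
fn run(commands: &[Box<dyn Command<Counter>>], state: &mut Counter) -> Vec<String> {
    let mut executed = Vec::new();
    for cmd in commands {
        if cmd.check(state) {
            cmd.apply(state);
            executed.push(cmd.label());
        }
    }
    executed
}
```

The returned labels double as the trace: when an assertion trips, that list is what gets shrunk and reported.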

Why Traits Won Over Enums

The contrast with proptest-state-machine is instructive. Consider how each approach handles a new test operation:

Enum approach (proptest-state-machine):

// 1. Add to the central enum (affects everyone).
enum SystemTransition {
    ExistingOp1,
    ExistingOp2,
    NewOperation(NewOpData), // <- New variant.
}

// 2. Update the central apply function (affects everyone).
fn apply(state: State, transition: SystemTransition) -> State {
    match transition {
        SystemTransition::ExistingOp1 => { /* existing logic */ }
        SystemTransition::ExistingOp2 => { /* existing logic */ }
        SystemTransition::NewOperation(data) => { // <- New arm.
            // New logic scattered across this central function.
        }
    }
}

// 3. Update the transitions function (affects everyone).
fn transitions() -> BoxedStrategy<SystemTransition> {
    prop_oneof![
        existing_strategy_1(),
        existing_strategy_2(),
        new_operation_strategy(), // <- New generator.
    ].boxed()
}

Trait approach (madhouse-rs):

// Self-contained - zero impact on existing code.
struct NewOperationCommand {
    data: NewOpData,
}

impl Command<SystemState, SystemContext> for NewOperationCommand {
    fn check(&self, state: &SystemState) -> bool {
        // Preconditions logic here; this placeholder accepts every state.
        true
    }

    fn apply(&self, state: &mut SystemState) {
        // Application logic here.
    }

    fn label(&self) -> String {
        format!("NEW_OPERATION({:?})", self.data)
    }

    fn build(ctx: Arc<SystemContext>) -> impl Strategy<Value = CommandWrapper<SystemState, SystemContext>> {
        // Generation strategy here.
        new_operation_strategy()
            .prop_map(|data| CommandWrapper::new(NewOperationCommand { data }))
    }
}

The difference is profound: trait-based commands are autonomous. All logic—generation, preconditions, application, and labeling—lives in one place. No coordination required.

Real-World Scale: The PoX-4 Experience

Before madhouse-rs, we applied these principles with Radu Bahmata to test the Proof-of-Transfer (PoX-4) consensus using TypeScript and fast-check. The harness grew to include 20+ command types, each testing different aspects of the staking protocol:

StackStxCommand - Lock STX tokens for stacking
DelegateStxCommand - Delegate stacking rights to a pool
StackAggregationCommitCommand - Commit aggregated stacking transactions
RevokeDelegateStxCommand - Revoke previously delegated stacking rights
StackExtendCommand - Extend an existing stacking commitment
GetStackerInfoCommand - Query stacker information and verify state
… and many, many more.

The key insight: each command class was self-contained. A developer could add StackExtendCommand without understanding the internals of DelegateStxCommand. The framework composed them automatically.

When a test failed after 200+ operations, the shrinking algorithm would reduce it to something like:

Original sequence: [200+ operations]
Shrunk to: [
    DelegateStx(account, pool),
    StackAggregationCommit(pool, account),
    RevokeDelegateStx(account),
    StackAggregationCommit(pool, account)
]

This four-step sequence revealed a subtle bug: revoking delegation didn’t properly invalidate pending aggregation commits. Finding this manually would have taken weeks.

Lessons for System Design

The expression problem appears everywhere in software design, not just testing frameworks:

1. Plugin Architectures

Want users to extend your system with new functionality? Choose the “data-open” side—make plugins implement traits rather than forcing them to modify central enums.

2. Event Systems

Need to handle dozens of event types? Each event type should be its own struct implementing an Event trait, not variants in a central enum.

3. Command Patterns

Building a command-line tool with subcommands? Each subcommand should be its own type, not a variant in a central enum.

4. Middleware Systems

Web frameworks often choose the “data-open” side: each middleware is its own type implementing a common trait.

The Cost of Getting It Wrong

We’ve seen both sides of this trade-off in practice:

When the enum approach breaks down:

Central files become merge conflict magnets.
Adding new variants requires understanding the entire system.
Logic becomes scattered across multiple functions.
New contributors face a high barrier to entry.

When the trait approach breaks down:

Adding new operations to the trait forces updates everywhere.
Abstract operations are harder to optimize.
Dynamic dispatch can impact performance.
Trait objects introduce complexity.

For madhouse-rs, the trade-off was clear: we needed to add new test operations constantly, but the core operations (check, apply, label, build) were stable. The “data-open” choice was correct.

Performance Considerations

One concern with trait-based approaches is performance. CommandWrapper stores each command as an Arc-wrapped trait object (Arc<dyn Command<S, C>>), which involves heap allocation and dynamic dispatch. In our testing scenarios, this overhead was negligible compared to the actual blockchain operations being tested.
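For a sense of where that cost sits, here is a small std-only sketch. `Wrapper` and the one-method `Command` trait are hypothetical stand-ins, assuming the wrapper holds a reference-counted trait object; they are not the madhouse-rs types.

```rust
use std::sync::Arc;

/// Hypothetical, pared-down command trait for illustration.
trait Command {
    fn label(&self) -> String;
}

struct Tick;

impl Command for Tick {
    fn label(&self) -> String {
        "TICK".to_string()
    }
}

/// Hypothetical wrapper: erases the concrete command type behind a
/// reference-counted trait object.
#[derive(Clone)]
struct Wrapper {
    inner: Arc<dyn Command>,
}

impl Wrapper {
    fn new<C: Command + 'static>(cmd: C) -> Self {
        // One heap allocation per command, made once at construction.
        Wrapper { inner: Arc::new(cmd) }
    }

    fn label(&self) -> String {
        // Dynamic dispatch: the call goes through the vtable at runtime.
        self.inner.label()
    }

    fn shared_handles(&self) -> usize {
        // Cloning the wrapper bumps a counter instead of copying the command.
        Arc::strong_count(&self.inner)
    }
}
```

The allocation happens once per command and clones only increment a reference count, so the per-step cost is one indirect call. Next to mining a block or waiting on a node, that is noise.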

The Full Circle

We began with a simple question: how do you design systems that are easy to extend? The expression problem provided the theoretical framework, but the real learning came from building systems that needed to scale.

The Stacks blockchain bug that started this journey taught us that complexity is the enemy of correctness. Traditional testing assumes you can predict where bugs hide. Model-based testing with madhouse-rs assumes you can’t—so it generates the chaos systematically.

The trait-based design made this scalable. Instead of a monolithic test harness that becomes unmaintainable, we have an ecosystem of autonomous commands that compose naturally.

Practical Takeaways

Choose your trade-off consciously: The expression problem forces a choice. Understanding the trade-off helps you pick the right tool.
Favor autonomy at scale: When systems grow large, autonomous components (traits) usually scale better than centralized ones (enums).
Let chaos find the bugs: For complex systems, generated test scenarios often find bugs that manual tests miss.
Design for shrinking: When random tests fail, automatic reduction to minimal cases is invaluable.
Start simple, then scale: Both approaches work for small systems. The difference emerges at scale.

The expression problem isn’t academic theory—it’s a practical design constraint that affects every system you build. Understanding it helps you make better architectural choices, whether you’re building testing frameworks, plugin systems, or distributed applications.

In the end, good design isn’t about avoiding trade-offs. It’s about making them consciously, understanding their implications, and choosing the ones that align with how your system needs to grow.

References and Further Reading

The ideas in this series draw from decades of research and practice:

Series Complete: Model-Based Stateful Testing with madhouse-rs series.

Chaos Testing stacks-node with Model-Based Stateful Testing

2025-03-10T00:00:00+00:00

This post is part of the Model-Based Stateful Testing with madhouse-rs series.

Theory is useful, but does the trait-based design actually work in practice? Can it scale to test a real, complex distributed system? The answer came from an unexpected place: a production bug in the Stacks blockchain that refused to be reproduced.

The Bug That Couldn’t Be Caught

In early 2024, Stacks mainnet experienced a stall. After a reorg, miners would occasionally fail to build on their own blocks, disrupting the consensus mechanism. The behavior was intermittent and seemed to depend on precise timing and network conditions.

Core developer Brice Dobry attempted to write a traditional test—a masterfully crafted 533-line integration test with sophisticated setup and coordination:

“In this test, I attempted to reproduce the scenario we saw in mainnet, in which the miner mines a tenure change block in this reorg scenario, but then fails to mine another block building off of that one. I was unable to reproduce that behavior, but this still seems like a useful test to have.”

The test included complex manual orchestration:

Detailed miner setup with specific configuration.
Manual transaction submission and timing coordination.
Explicit waiting periods and state verification.
Hundreds of lines of boilerplate setup code.

Yet even with Brice’s expertise and this carefully crafted test, the production bug remained elusive. This wasn’t a reflection of the test quality—it highlighted just how subtle and context-dependent the bug was. That’s when a radical idea emerged:

“We could shift to a command-based model test. Each step, like ‘miner commits block’ or ‘signer accepts block,’ becomes its own command that updates a small state model and the actual chain. Then we run random sequences of these commands to reveal hidden corners.”

This comment became the genesis of madhouse-rs.

From Idea to Implementation

The insight was profound: instead of trying to predict where bugs might hide, let chaos find them. Model the entire blockchain testing scenario as a collection of autonomous commands, then generate thousands of random sequences.

Here’s how it looked in practice. The test harness included commands like:

Note: The following examples are conceptual illustrations that demonstrate the core patterns. The actual implementation uses more complex blockchain-specific types and operations.

// Each command encapsulates one blockchain operation.
struct MineBitcoinBlockCommand {
    pub block_height: u64,
}

impl Command<StacksState, StacksContext> for MineBitcoinBlockCommand {
    fn check(&self, state: &StacksState) -> bool {
        // Only mine if we're not too far ahead.
        self.block_height <= state.tip_height + 10
    }

    fn apply(&self, state: &mut StacksState) {
        // Update both the model state and the actual blockchain.
        state.bitcoin_blocks.push(self.block_height);
        state.tip_height = self.block_height;

        // Actual blockchain interaction.
        mine_bitcoin_block(self.block_height);

        // Verify post-conditions.
        assert_eq!(get_bitcoin_tip_height(), self.block_height);
    }

    fn label(&self) -> String {
        format!("MINE_BITCOIN_BLOCK({})", self.block_height)
    }

    fn build(ctx: Arc<StacksContext>) -> impl Strategy<Value = CommandWrapper<StacksState, StacksContext>> {
        (ctx.current_height..ctx.current_height + 5)
            .prop_map(|height| CommandWrapper::new(MineBitcoinBlockCommand { block_height: height }))
    }
}

struct SubmitBlockCommitCommand {
    pub miner_id: u32,
    pub bitcoin_block_height: u64,
}

impl Command<StacksState, StacksContext> for SubmitBlockCommitCommand {
    fn check(&self, state: &StacksState) -> bool {
        // Can only commit if the Bitcoin block exists.
        state.bitcoin_blocks.contains(&self.bitcoin_block_height) &&
        state.miners.contains_key(&self.miner_id)
    }

    fn apply(&self, state: &mut StacksState) {
        // Track the commit in model state.
        state.block_commits.push(BlockCommit {
            miner: self.miner_id,
            bitcoin_height: self.bitcoin_block_height,
        });

        // Submit to actual blockchain.
        submit_block_commit(self.miner_id, self.bitcoin_block_height);
    }

    fn label(&self) -> String {
        format!("SUBMIT_COMMIT(miner={}, btc_height={})", self.miner_id, self.bitcoin_block_height)
    }

    fn build(ctx: Arc<StacksContext>) -> impl Strategy<Value = CommandWrapper<StacksState, StacksContext>> {
        let miners = ctx.miners.clone();
        let heights = ctx.available_bitcoin_heights.clone();

        (prop::sample::select(miners), prop::sample::select(heights))
            .prop_map(|(miner, height)| {
                CommandWrapper::new(SubmitBlockCommitCommand {
                    miner_id: miner,
                    bitcoin_block_height: height,
                })
            })
    }
}

The Breakthrough: Real Chaos, Real Bugs

The actual test scenario that finally reproduced the production bug looked like this (from the PR that fixed it, where both Brice’s traditional script and the madhouse-rs approach successfully reproduced the issue):

scenario![
    test_context,
    SkipCommitOpMiner2,
    BootToEpoch3,
    SkipCommitOpMiner1,
    PauseStacksMining,
    MineBitcoinBlock,
    VerifyMiner1WonSortition,
    SubmitBlockCommitMiner2,
    ResumeStacksMining,
    WaitForTenureChangeBlockFromMiner1,
    MineBitcoinBlock,
    VerifyMiner2WonSortition,
    VerifyLastSortitionWinnerReorged,
    WaitForTenureChangeBlockFromMiner2,
    ShutdownMiners
]

But here’s the key: this wasn’t the only sequence tested. When run with MADHOUSE=1, the framework generated thousands of variations:

What if MineBitcoinBlock happened before SubmitBlockCommitMiner2?
What if PauseStacksMining occurred at different points?
What if multiple miners competed in different orders?

One of these chaotic permutations finally triggered the exact conditions that caused the production bug. The test failed, and the framework automatically shrank the failing sequence to a minimal reproduction case.

The Power of Shrinking

When madhouse-rs found a failing test scenario, it didn’t just report the full chaos sequence of 100+ steps. The framework systematically removed operations until it found the minimal case that still triggered the bug:

Original failing sequence: [120 operations...]
Shrunk to: [
    MineBitcoinBlock,
    DisconnectNode(node_2),
    SubmitBlockCommit(miner_1),
    ReconnectNode(node_2)
]

This minimal reproduction became the foundation for understanding and fixing the bug. What would have taken weeks of manual debugging was reduced to a four-step reproduction script.
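The "remove operations until it stops failing" idea fits in a few lines. The sketch below is a greedy, one-step-at-a-time shrinker written for illustration; it is not the algorithm madhouse-rs or proptest actually uses, and the `Op` enum and `fails` oracle are hypothetical, loosely modeled on the four-step trace above.

```rust
#[derive(Clone, Debug, PartialEq)]
enum Op {
    Mine,
    Disconnect,
    Commit,
    Reconnect,
    Tick, // Noise: never matters to the failure.
}

/// Hypothetical oracle: the failure needs a mined block, then a commit
/// made while disconnected, then a reconnect.
fn fails(trace: &[Op]) -> bool {
    let (mut mined, mut disconnected, mut stale_commit) = (false, false, false);
    for op in trace {
        match op {
            Op::Mine => mined = true,
            Op::Disconnect => disconnected = true,
            Op::Commit => {
                if mined && disconnected {
                    stale_commit = true;
                }
            }
            Op::Reconnect => {
                if stale_commit {
                    return true;
                }
                disconnected = false;
            }
            Op::Tick => {}
        }
    }
    false
}

/// Greedy shrinker: try dropping one step at a time and keep the shorter
/// trace whenever the failure still reproduces. Repeats until a full pass
/// removes nothing, so no single removal preserves the failure.
fn shrink<T: Clone>(trace: &[T], fails: impl Fn(&[T]) -> bool) -> Vec<T> {
    let mut current = trace.to_vec();
    loop {
        let mut changed = false;
        let mut i = 0;
        while i < current.len() {
            let mut candidate = current.clone();
            candidate.remove(i);
            if fails(&candidate) {
                current = candidate; // Step i was irrelevant; retry the same index.
                changed = true;
            } else {
                i += 1; // Step i is needed; move on.
            }
        }
        if !changed {
            return current;
        }
    }
}
```

Real shrinkers are smarter (they remove chunks first and also shrink command arguments), but the contract is the same: every step left in the result is there because removing it makes the failure go away.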

Why Traditional Testing Failed

The bug existed at the intersection of:

Network timing (when nodes reconnected).
Blockchain state (which blocks were mined when).
Miner behavior (who submitted commits and when).

Traditional integration tests assume you can predict these intersections. They script specific scenarios: “First do X, then Y, then Z.” But production bugs don’t follow scripts—they emerge from the unexpected combinations that nobody thought to test.

Model-based testing with madhouse-rs reverses this assumption: instead of predicting where bugs live, generate the combinations and let the bugs reveal themselves.

The Technical Architecture

The success of this approach depended on the trait-based design:

// The complete test setup.
#[derive(Debug, Default)]
struct StacksTestState {
    bitcoin_blocks: Vec<u64>,
    stacks_blocks: Vec<StacksBlock>,
    miners: HashMap<u32, MinerState>,
    network_partitions: Vec<Partition>,
    // ... dozens more fields tracking blockchain state.
}

impl State for StacksTestState {}

// Context with test parameters.
#[derive(Debug, Clone)]
struct StacksTestContext {
    num_miners: u32,
    bitcoin_block_time: Duration,
    network_delay_range: (Duration, Duration),
    // ... configuration parameters.
}

impl TestContext for StacksTestContext {}

Each command was self-contained. Adding a new blockchain operation—like NetworkPartitionCommand or RestartNodeCommand—required zero changes to existing commands. The trait-based design made it possible to build a test harness with 50+ distinct operations, each developed and tested independently.

The Real-World Impact

This wasn’t just an academic exercise. The chaos testing approach:

Found the production bug that traditional testing missed.
Provided a minimal reproduction case for debugging.
Validated the fix by running thousands of variations to ensure the bug was truly resolved.
Enabled ongoing regression testing with the same chaos generation.

The framework runs in CI, continuously generating new chaotic scenarios to catch regressions before they reach production.

From Chaos to Confidence

The lesson isn’t that traditional testing is worthless—it’s that certain classes of bugs only emerge from chaos. Race conditions, timing issues, and complex state interactions hide in the combinations that manual tests never explore.

Model-based testing with madhouse-rs turns chaos into a systematic testing strategy. The trait-based design makes it sustainable at scale. The automatic shrinking makes failures actionable.

This is how we can move from “I was unable to reproduce that behavior” to reproducible test cases that can catch production bugs before they happen.

Next: The Expression Problem in Practice: A Trait-Based Testing Harness

Scaling Model-Based Stateful Testing with madhouse-rs

2025-02-10T00:00:00+00:00

This post is part of the Model-Based Stateful Testing with madhouse-rs series.

In the previous post, we saw how proptest-state-machine’s enum-based design becomes a bottleneck when scaling to hundreds of operations. What if there was a different approach—one that embraced the “data-open” side of the expression problem?

madhouse-rs was born from this exact frustration. When trying to reproduce that elusive Stacks mainnet bug, the traditional enum approach simply couldn’t scale to the complexity needed.

The Trait-Based Approach

Instead of a central enum, madhouse-rs makes each command its own type implementing a stable Command trait. There is no central bottleneck—no enum to extend, no monolithic match statement to update.

Let’s return to our counter example from the previous post to see how this trait-based approach works in practice:

use madhouse::prelude::*;
use proptest::prelude::*;
use std::sync::Arc;

// Define your state and context.
#[derive(Debug, Default)]
struct CounterState {
    value: u64,
    max_value: u64,
}
impl State for CounterState {}

#[derive(Debug, Clone, Default)]
struct CounterContext {
    increment_range: (u64, u64),
}
impl TestContext for CounterContext {}

// Each operation is its own self-contained type.
struct IncrementCommand {
    amount: u64,
}

impl Command<CounterState, CounterContext> for IncrementCommand {
    // Check preconditions against the model state.
    fn check(&self, state: &CounterState) -> bool {
        state.value + self.amount <= state.max_value
    }

    // Apply the command to both model and real system.
    fn apply(&self, state: &mut CounterState) {
        state.value += self.amount;
        // In a real test, you'd also apply to the actual system here.
        println!("Incremented counter by {}, now at {}", self.amount, state.value);
    }

    // Human-readable label for debugging.
    fn label(&self) -> String {
        format!("INCREMENT({})", self.amount)
    }

    // Strategy for generating instances of this command.
    fn build(
        ctx: Arc<CounterContext>,
    ) -> impl Strategy<Value = CommandWrapper<CounterState, CounterContext>> {
        let (min, max) = ctx.increment_range;
        (min..=max).prop_map(|amount| CommandWrapper::new(IncrementCommand { amount }))
    }
}

struct ResetCommand;

impl Command<CounterState, CounterContext> for ResetCommand {
    fn check(&self, state: &CounterState) -> bool {
        state.value > 0  // Only reset if there's something to reset.
    }

    fn apply(&self, state: &mut CounterState) {
        state.value = 0;
        println!("Counter reset to 0");
    }

    fn label(&self) -> String {
        "RESET".to_string()
    }

    fn build(
        _ctx: Arc<CounterContext>,
    ) -> impl Strategy<Value = CommandWrapper<CounterState, CounterContext>> {
        Just(CommandWrapper::new(ResetCommand))
    }
}

Running the Scenario

With madhouse-rs, you compose test scenarios using the scenario! macro:

#[test]
fn test_counter_chaos() {
    let test_context = Arc::new(CounterContext {
        increment_range: (1, 100),
    });

    // Run the scenario - madhouse-rs handles the rest.
    scenario![
        test_context,
        IncrementCommand,
        ResetCommand,
        (IncrementCommand { amount: 42 })  // Fixed command instance.
    ];
}

The Power of Data-Open Design

What makes this approach scale? Each command is autonomous:

Self-contained logic: Generation, preconditions, and application logic all live together.
No central bottleneck: Adding DecrementCommand requires zero edits to existing code.
Composable: Mix and match commands freely in different test scenarios.
Maintainable: Each command can be developed, tested, and reviewed independently.

Real-World Impact

Update (June 14, 2025): This design proved its worth in the Stacks blockchain testing. Consider this actual test scenario from the stacks-core PR #6007 that was merged yesterday:

scenario![
    test_context,
    SkipCommitOpMiner2,
    BootToEpoch3,
    SkipCommitOpMiner1,
    PauseStacksMining,
    MineBitcoinBlock,
    VerifyMiner1WonSortition,
    SubmitBlockCommitMiner2,
    ResumeStacksMining,
    WaitForTenureChangeBlockFromMiner1,
    MineBitcoinBlock,
    VerifyMiner2WonSortition,
    VerifyLastSortitionWinnerReorged,
    WaitForTenureChangeBlockFromMiner2,
    ShutdownMiners
]

Each of those 14+ operations is a self-contained Command implementation. No central enum to maintain. No monolithic match statement. No coordination between developers adding new test operations.

More importantly, when the framework runs with MADHOUSE=1, it generates random permutations of these operations, creating chaotic scenarios that manual tests could never explore. This is how the framework can reproduce production bugs that traditional testing might miss.
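What matters about those random orderings is that they are reproducible: log the seed, and the failing order can be replayed exactly. Here is a std-only sketch of that idea using SplitMix64 and a Fisher-Yates shuffle. It illustrates seeded permutation in general and is not madhouse-rs's actual generator, which builds on proptest strategies; `permute` is a hypothetical helper.

```rust
/// SplitMix64: a small, fast, seedable PRNG. Same seed, same stream.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle driven by the seeded generator. The modulo
/// introduces a slight bias, which is acceptable for a sketch.
fn permute<T: Clone>(steps: &[T], seed: u64) -> Vec<T> {
    let mut rng = SplitMix64::new(seed);
    let mut out = steps.to_vec();
    for i in (1..out.len()).rev() {
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        out.swap(i, j);
    }
    out
}
```

Calling `permute(&steps, seed)` twice with the same seed yields the same order, so a CI failure that prints its seed can be rerun locally as-is.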

The Expression Problem Solved

By choosing the “data-open” side, madhouse-rs makes it trivial to add new command types while keeping the core operations (check, apply, label, build) stable. This is exactly the opposite trade-off from proptest-state-machine, and for model-based testing at scale, it’s the right choice.

Next: Chaos Testing stacks-node with Model-Based Stateful Testing

Model-Based Stateful Testing with proptest-state-machine

2025-01-10T00:00:00+00:00

This post is part of the Model-Based Stateful Testing with madhouse-rs series.

Imagine trying to test a distributed system with dozens of operations:

miners submitting blocks
nodes joining and leaving
transactions flooding the mempool
network partitions
and more.

Traditional unit tests can’t capture the chaotic, interleaved nature of these scenarios.

Model-based testing offers a solution: define all possible operations, let the framework generate random sequences, and check that your system behaves correctly. But… how you structure those operations determines whether your test harness scales from 5 commands to 500.

proptest-state-machine sits firmly on the operations-open, data-closed side of the expression problem. You define a central Transition enum that lists every possible operation.

The Enum Approach

Let’s explore this with a simple counter example first—while the real-world blockchain scenarios would be more complex to demonstrate initially, the core design patterns are identical. We’ll see the blockchain applications later in the series.

use proptest_state_machine::*;
use proptest::prelude::*;

// The model state.
#[derive(Clone, Debug)]
struct CounterModel {
    value: i32,
    max_value: i32,
}

// The real system under test.
struct Counter {
    value: i32,
    max_value: i32,
}

// All possible operations in one enum.
#[derive(Clone, Debug)]
enum CounterTransition {
    Inc,
    Dec,
    Reset,
}

impl StateMachine for CounterModel {
    type State = CounterModel;
    type Sut = Counter;
    type Transition = CounterTransition;

    fn init_state() -> BoxedStrategy<Self::State> {
        Just(CounterModel { value: 0, max_value: 100 }).boxed()
    }

    fn init_sut(state: &Self::State) -> BoxedStrategy<Self::Sut> {
        Just(Counter {
            value: state.value,
            max_value: state.max_value
        }).boxed()
    }

    fn transitions(_state: &Self::State) -> BoxedStrategy<Self::Transition> {
        prop_oneof![
            Just(CounterTransition::Inc),
            Just(CounterTransition::Dec),
            Just(CounterTransition::Reset),
        ].boxed()
    }

    fn apply(
        mut state: Self::State,
        sut: &mut Self::Sut,
        transition: Self::Transition,
    ) -> Self::State {
        match transition {
            CounterTransition::Inc => {
                if state.value < state.max_value {
                    state.value += 1;
                    sut.value += 1;
                }
            }
            CounterTransition::Dec => {
                state.value -= 1;
                sut.value -= 1;
            }
            CounterTransition::Reset => {
                state.value = 0;
                sut.value = 0;
            }
        }
        state
    }

    fn postconditions(state: &Self::State, sut: &Self::Sut) {
        assert_eq!(state.value, sut.value);
        assert_eq!(state.max_value, sut.max_value);
        assert!(state.value <= state.max_value);
    }
}

The Scalability Problem

This approach starts simple, but every new operation requires:

Adding a variant to the central CounterTransition enum.
Updating the apply function with a new match arm.
Updating the transitions function to include the new operation.

With 10 operations, this is manageable. With 100+ operations—like testing a blockchain node—it becomes unwieldy. The apply function grows into a monolithic match statement. Every developer adding a test command must touch this central file.

Real-World Complexity

Consider testing the Stacks blockchain, where operations include:

enum StacksTransition {
    MineBitcoinBlock,
    MineStacksBlock(BlockData),
    SubmitTransaction(Transaction),
    ConnectPeer(PeerInfo),
    DisconnectPeer(PeerId),
    NetworkPartition(Vec<NodeId>),
    RestoreNetwork,
    RestartNode(NodeId),
    // ... 50+ more operations.
}

The apply function becomes hundreds of lines. Adding new test scenarios requires editing this central bottleneck. Worse, the logic for each command is scattered—generation in transitions, preconditions mixed into apply, and postconditions in a separate function.
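The scattering can be sketched with two peer operations. This is a simplified, hypothetical model (peers reduced to `u32` ids, a `Model` holding a set of connected peers, `apply` returning whether the transition had any effect); none of these names come from the Stacks codebase.

```rust
use std::collections::BTreeSet;

#[derive(Clone, Debug, PartialEq)]
enum Transition {
    ConnectPeer(u32),
    DisconnectPeer(u32),
}

#[derive(Debug, Default)]
struct Model {
    peers: BTreeSet<u32>,
}

// Place 1: generation. It knows nothing about which peers are connected,
// so it may propose transitions whose preconditions do not hold.
fn transitions(peer_ids: &[u32]) -> Vec<Transition> {
    let mut out = Vec::new();
    for &id in peer_ids {
        out.push(Transition::ConnectPeer(id));
        out.push(Transition::DisconnectPeer(id));
    }
    out
}

// Place 2: apply. The preconditions are implicit in each match arm:
// connecting only has an effect if the peer is absent, disconnecting only
// if it is present. Returns true when the transition changed the model.
fn apply(model: &mut Model, transition: &Transition) -> bool {
    match transition {
        Transition::ConnectPeer(id) => model.peers.insert(*id),
        Transition::DisconnectPeer(id) => model.peers.remove(id),
    }
}

// Place 3: postconditions, checked separately from both of the above.
fn postconditions(model: &Model, max_peers: usize) {
    assert!(model.peers.len() <= max_peers);
}
```

To understand `DisconnectPeer` fully, a reader has to visit all three functions. With dozens of commands, each of those functions mixes the concerns of every command.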

This scaling limitation becomes apparent in complex scenarios like reproducing a Stacks mainnet bug. While that specific attempt used hand-written tests, the enum-based approach would face the same bottleneck—dozens of blockchain operations in a central enum and apply function, making it difficult to generate the chaotic scenarios needed to reveal subtle consensus issues.

Next: Scaling Model-Based Stateful Testing with madhouse-rs
