<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://blog.nikosbaxevanis.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://blog.nikosbaxevanis.com/" rel="alternate" type="text/html" /><updated>2026-02-16T14:20:14+00:00</updated><id>https://blog.nikosbaxevanis.com/feed.xml</id><title type="html">blog</title><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><entry><title type="html">Agents as Fuzzers</title><link href="https://blog.nikosbaxevanis.com/2026/02/16/agents-as-fuzzers/" rel="alternate" type="text/html" title="Agents as Fuzzers" /><published>2026-02-16T00:00:00+00:00</published><updated>2026-02-16T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2026/02/16/agents-as-fuzzers</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2026/02/16/agents-as-fuzzers/"><![CDATA[<p><em>This article is part of the Oracles, Traces, Triage <a href="/2026/01/25/oracles-traces-triage/">series</a>.</em></p>

<h2 id="the-short-version">The short version</h2>

<p><strong>A fuzzer is a search tool whose results must be triaged. An AI agent is a search tool whose results must be triaged.</strong> The parallel is not metaphorical. I think it’s structural.</p>

<h2 id="two-search-tools">Two search tools</h2>

<p><strong>A fuzzer</strong> explores the input space of a program, looking for inputs that violate some oracle—a crash, a hang, a property violation. When it finds something, you triage: real bug? Duplicate? Exploitable?</p>

<p><strong>An AI agent</strong> explores the solution space of a problem, looking for code, fixes, or tests that satisfy some goal. When it produces something, you triage: correct? Complete? Does it address the problem?</p>

<p>Both search. Both produce results that need judgment. Both waste enormous time if pointed in the wrong direction.</p>

<h2 id="the-anatomy-side-by-side">The anatomy, side by side</h2>

<p><strong>Every fuzzer</strong> does four things:</p>

<ol>
  <li><strong>Generates inputs</strong> (random, mutational, grammar-based, coverage-guided)</li>
  <li><strong>Executes the target</strong> with those inputs</li>
  <li><strong>Checks an oracle</strong> (crash? new coverage? property violation?)</li>
  <li><strong>Saves interesting results</strong> for triage</li>
</ol>

<p><strong>Every AI agent</strong> does the same four things:</p>

<ol>
  <li><strong>Generates candidates</strong> (from prompt, codebase, agent skills)</li>
  <li><strong>Executes or applies</strong> them (writes code, runs tests, modifies files)</li>
  <li><strong>Checks an oracle</strong> (tests pass? linter clean? invariants hold?)</li>
  <li><strong>Saves results</strong> for triage (commits, PRs, logs)</li>
</ol>

<p>Replace “inputs” with “candidates” and “crash” with “test failure.” The structure is identical.</p>
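<p>The four steps can be sketched as one generic loop. This is a minimal illustration, not how AFL, libFuzzer, or any agent harness is implemented; <code class="language-plaintext highlighter-rouge">search</code>, <code class="language-plaintext highlighter-rouge">target</code>, <code class="language-plaintext highlighter-rouge">gen_pair</code>, and <code class="language-plaintext highlighter-rouge">division_oracle</code> are hypothetical names chosen for the example.</p>

```python
import random

def search(generate, execute, oracle, budget, seed=0):
    """Generic search loop: generate, execute, check the oracle, save findings."""
    rng = random.Random(seed)      # fixed seed keeps the run reproducible
    findings = []
    for _ in range(budget):
        candidate = generate(rng)              # 1. generate an input or candidate
        try:
            result = execute(candidate)        # 2. execute the target
        except Exception as exc:               # a crash is the simplest oracle signal
            findings.append((candidate, repr(exc)))
            continue
        verdict = oracle(candidate, result)    # 3. check the oracle
        if verdict is not None:
            findings.append((candidate, verdict))   # 4. save for triage
    return findings


# Hypothetical target: integer division that crashes on a zero divisor.
def target(pair):
    a, b = pair
    return a // b

def gen_pair(rng):
    return (rng.randint(-5, 5), rng.randint(-5, 5))

def division_oracle(pair, result):
    a, b = pair
    # Property: quotient and remainder must reconstruct the dividend.
    if result * b + a % b != a:
        return "reconstruction property violated"
    return None

findings = search(gen_pair, target, division_oracle, budget=200)
print(len(findings), "findings to triage")
```

<p>Swapping <code class="language-plaintext highlighter-rouge">generate</code> for something that proposes patches and <code class="language-plaintext highlighter-rouge">oracle</code> for a test suite leaves the loop unchanged, which is the structural point.</p>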

<h2 id="what-changes-when-the-searcher-understands-context">What changes when the searcher understands context</h2>

<p>Traditional fuzzers are <strong>context-blind</strong>. AFL doesn’t know what a function does. libFuzzer doesn’t understand the specification. They compensate with <strong>volume</strong>—millions of executions per second.</p>

<p>Context-blindness has costs:</p>

<ul>
  <li><strong>Shallow oracles.</strong> “Did it crash?” works. “Does this violate the protocol invariant?” requires a custom harness—often harder to write than the code being tested.</li>
  <li><strong>Redundant exploration.</strong> Without understanding structure, the fuzzer wastes cycles in uninteresting regions of input space.</li>
  <li><strong>Triage burden.</strong> Many findings are duplicates, benign panics, or expected edge cases. You sort the signal from the noise.</li>
</ul>

<p>An AI agent, by contrast:</p>

<ul>
  <li>Can <strong>read the specification</strong></li>
  <li>Can <strong>reason about</strong> which inputs trigger interesting behavior</li>
  <li>Can <strong>write its own oracle</strong> and generate inputs designed to challenge it</li>
</ul>

<p>The search becomes <strong>intentional</strong> without becoming rigid.</p>

<h2 id="the-convergence">The convergence</h2>

<p>Combine the pieces from this series:</p>

<table>
  <thead>
    <tr>
      <th>Piece</th>
      <th>Fuzzer equivalent</th>
      <th>What it adds</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="/2026/02/01/agent-skills-and-claude-lint/">Agent skills</a></td>
      <td>Oracle</td>
      <td>Richer than “did it crash?”—norms that agents translate into testable properties</td>
    </tr>
    <tr>
      <td><a href="/2026/02/08/agent-teams-and-claude-swarm/">Agent swarms</a></td>
      <td>Multiple seeds</td>
      <td>Parallel search where each instance can <strong>specialize</strong>, sharing findings via git</td>
    </tr>
    <tr>
      <td><a href="/2026/02/15/testing-between-calls-with-agents/">Stateful testing</a></td>
      <td>Execution loop</td>
      <td>An execution loop over <strong>traces</strong> instead of single inputs</td>
    </tr>
  </tbody>
</table>

<p>Together: <strong>context-aware search, parallel exploration, rich oracles</strong>.</p>

<h2 id="fuzzers-still-win-at">Fuzzers still win at</h2>

<ul>
  <li><strong>Speed.</strong> Millions of executions/sec with a simple oracle (“did it crash?”). AFL and libFuzzer are unbeatable here.</li>
  <li><strong>Binary targets.</strong> No source code, no spec? Blind fuzzing is often the only option.</li>
  <li><strong>Deterministic reproduction.</strong> Fuzzers produce exact inputs. Agent traces may need work to become deterministic.</li>
  <li><strong>Corpus management.</strong> Mature fuzzers have corpus minimization, coverage tracking, seed scheduling. Agent ecosystems don’t—yet.</li>
</ul>

<h2 id="agents-win-at">Agents win at</h2>

<ul>
  <li><strong>Rich invariants.</strong> “Does this sequence of state transitions preserve safety properties?” An agent can both <em>formulate</em> and <em>check</em> the invariant.</li>
  <li><strong>Spec-guided search.</strong> When the spec exists and is readable, agents generate targeted campaigns rather than relying on coverage alone.</li>
  <li><strong>Triage.</strong> An agent can produce a root-cause hypothesis before you ever see the failure. It can check for duplicates.</li>
  <li><strong>Harness generation.</strong> Writing fuzz harnesses is expert work. Agents can draft them from specs and iterate.</li>
</ul>

<h2 id="the-spectrum">The spectrum</h2>

<table>
  <thead>
    <tr>
      <th> </th>
      <th>Traditional Fuzzer</th>
      <th>AI Agent</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Input generation</td>
      <td>Random / mutational / grammar</td>
      <td>Context-aware / intentional</td>
    </tr>
    <tr>
      <td>Oracle</td>
      <td>Crash / coverage / property</td>
      <td>Natural-language norm → property</td>
    </tr>
    <tr>
      <td>Speed</td>
      <td>Millions of executions/sec</td>
      <td>Seconds to minutes per session</td>
    </tr>
    <tr>
      <td>Context understanding</td>
      <td>None</td>
      <td>Deep</td>
    </tr>
    <tr>
      <td>Triage</td>
      <td>Manual</td>
      <td>Agent-assisted</td>
    </tr>
    <tr>
      <td>Parallelism</td>
      <td>Independent seeds</td>
      <td>Coordinated via git</td>
    </tr>
  </tbody>
</table>

<p>The gap is narrowing. What matters is understanding which tool fits which problem—and being willing to combine them.</p>

<h2 id="in-practice">In practice</h2>

<ul>
  <li><strong>Traditional fuzzers</strong> for the fast, low-level search—serialization, encoding edge cases, roundtrip invariants. Simple oracles, enormous input spaces. <strong>Volume wins.</strong></li>
  <li><strong>AI agents</strong> for the slow, high-level search—stateful invariants, cross-component interactions, spec compliance. Complex oracles, understanding required. <strong>Context wins.</strong></li>
  <li><strong>Both together</strong>—agents generating hypotheses and fuzz harnesses, fuzzers executing at speed, agents triaging the results.</li>
</ul>

<p>Fifteen years ago, fuzzing was barely known outside security research. After AFL and OSS-Fuzz it became standard practice. Today it is table stakes.</p>

<p>AI-assisted testing is on the same trajectory.</p>

<h2 id="related-posts">Related posts</h2>

<ul>
  <li><a href="/2026/01/25/oracles-traces-triage/">Oracles, Traces, Triage</a> (series index)</li>
  <li><a href="/2026/02/01/agent-skills-and-claude-lint/">Agent Skills and claude-lint</a></li>
  <li><a href="/2026/02/08/agent-teams-and-claude-swarm/">Agent Teams and claude-swarm</a></li>
  <li><a href="/2026/02/15/testing-between-calls-with-agents/">Testing the Bugs Between Calls</a></li>
  <li><a href="/2026/01/31/the-bugs-between-calls/">The Bugs Between Calls</a></li>
  <li><a href="/2023/03/03/property-tests-vs-fuzzing/">Property Tests Are Not A Fuzzer</a></li>
  <li><a href="/2023/12/15/fuzzing-meets-property-testing/">Fuzzing meets property testing</a></li>
</ul>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Part 4 of the Oracles, Traces, Triage series. A fuzzer is a search tool whose results must be triaged. An AI agent is a search tool whose results must be triaged. Perhaps that's not a coincidence.]]></summary></entry><entry><title type="html">Testing the Bugs Between Calls</title><link href="https://blog.nikosbaxevanis.com/2026/02/15/testing-between-calls-with-agents/" rel="alternate" type="text/html" title="Testing the Bugs Between Calls" /><published>2026-02-15T00:00:00+00:00</published><updated>2026-02-15T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2026/02/15/testing-between-calls-with-agents</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2026/02/15/testing-between-calls-with-agents/"><![CDATA[<p><em>This article is part of the Oracles, Traces, Triage <a href="/2026/01/25/oracles-traces-triage/">series</a>.</em></p>

<h2 id="the-short-version">The short version</h2>

<p><strong>Agent skills + agent swarms + stateful testing could compound into something stronger than any piece alone.</strong> While this combination hasn’t been tested at scale with agents yet, the individual components have proven effective in practice. The following explores how these pieces might integrate.</p>

<h2 id="the-problem">The problem</h2>

<p>In <a href="/2026/01/31/the-bugs-between-calls/">The Bugs Between Calls</a>, I argued:</p>

<ul>
  <li>The most expensive failures don’t live in <strong>single function calls</strong></li>
  <li>They live in <strong>sequences</strong>—valid operations that, composed under load, trigger liveness incidents</li>
  <li><strong>Stateless</strong> property-based testing catches bugs in the bricks</li>
  <li><strong>Stateful</strong> property-based testing catches bugs in how the bricks stack</li>
</ul>

<p>The December 2025 Prysm incident, the May 2023 finality delays—stacking failures. Every individual operation was valid. The trace was the problem.</p>

<p>If you want another non-Ethereum example of “trace-shaped” failures, the Stacks PoX-2 <code class="language-plaintext highlighter-rouge">stack-increase</code> bug is a good one to skim (<a href="https://forum.stacks.org/t/a-bug-in-stacks-increase-call-is-impacting-stacking-rewards-this-cycle/14867">thread</a>). I wasn’t at Stacks at the time, and I’m not claiming I would have caught it; the point is simply that these failures often emerge from sequences and accounting state, not single calls.</p>

<h2 id="what-agents-could-add">What agents could add</h2>

<p>Stateful property-based testing (described in <a href="/2026/01/31/the-bugs-between-calls/">The Bugs Between Calls</a>) generates <strong>random command sequences</strong>, runs them, checks invariants after each step.</p>

<p>The framework doesn’t know <em>why</em> a particular sequence might be interesting. It just tries many and hopes to stumble on something broken.</p>
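<p>To make that loop concrete, here is a small sketch of blind stateful testing: random command sequences, an invariant check after each step, and naive shrinking. It assumes a hypothetical <code class="language-plaintext highlighter-rouge">Ledger</code> with a deliberately planted bug; it is not proptest and does not reflect any real client.</p>

```python
import random

class Ledger:
    """Hypothetical system under test with a deliberately planted trace-shaped bug."""
    def __init__(self):
        self.seen = set()
        self.total = 0

    def import_block(self, block_id):
        if block_id not in self.seen:
            self.seen.add(block_id)
            self.total += 10          # each block carries a side effect of 10

    def prune(self):
        self.seen.clear()             # BUG: forgets which blocks were applied


def run(commands):
    """Replay a sequence; return the index of the first invariant violation, or None."""
    ledger, distinct = Ledger(), set()
    for i, (name, arg) in enumerate(commands):
        if name == "import":
            ledger.import_block(arg)
            distinct.add(arg)
        else:
            ledger.prune()
        # Invariant: importing the same block twice must not double-apply.
        if ledger.total != 10 * len(distinct):
            return i
    return None


def random_commands(rng, length):
    choices = [("import", 1), ("import", 2), ("prune", None)]
    return [rng.choice(choices) for _ in range(length)]


def shrink(commands):
    """Naive shrinking: drop one command at a time while the failure persists."""
    changed = True
    while changed:
        changed = False
        for i in range(len(commands)):
            smaller = commands[:i] + commands[i + 1:]
            if run(smaller) is not None:
                commands, changed = smaller, True
                break
    return commands


def find_failure(seed=0, attempts=100, length=8):
    """Try random sequences; return a shrunk failing trace, or None."""
    rng = random.Random(seed)
    for _ in range(attempts):
        commands = random_commands(rng, length)
        if run(commands) is not None:
            return shrink(commands)
    return None

print(find_failure())
```

<p>Every individual command here is valid. The failure only appears as import, prune, import of the same block, and shrinking reduces any failing sequence to that three-command trace.</p>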

<p>An AI agent is not blind:</p>

<ul>
  <li>It can <strong>read the specification</strong></li>
  <li>It can <strong>study past incidents</strong></li>
  <li>It can <strong>reason about which sequences</strong> are likely to trigger interesting states</li>
</ul>

<p>It wouldn’t replace the random search. It would augment it with <strong>intentional exploration</strong>.</p>

<h2 id="three-pieces">Three pieces</h2>

<h3 id="1-agent-skills-define-testing-norms">1. Agent skills define testing norms</h3>

<p>Following the <a href="/2026/02/01/agent-skills-and-claude-lint/">agent skills philosophy</a>, you wouldn’t give agents step-by-step procedures. You’d give them <strong>norms</strong>:</p>

<ul>
  <li><strong>Idempotent imports:</strong> importing the same block twice must not double-apply side effects</li>
  <li><strong>Epoch boundaries:</strong> boundary logic must not run twice across reorgs</li>
  <li><strong>Invariant preservation:</strong> state transitions must preserve declared invariants</li>
</ul>

<p>These go into <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> and agent skills. The agent figures out <em>how</em> to test them. At least, that’s the idea.</p>

<h3 id="2-agent-swarms-parallelize-exploration">2. Agent swarms parallelize exploration</h3>

<p>With <a href="/2026/02/08/agent-teams-and-claude-swarm/">claude-swarm</a>, you could run <strong>multiple agents against the same codebase</strong>, each exploring a different class of invariant:</p>

<ul>
  <li>One explores <strong>idempotence</strong></li>
  <li>Another targets <strong>epoch boundaries</strong></li>
  <li>Another maintains the <strong>test infrastructure</strong></li>
</ul>

<p>Each agent pushes to the same repo. When one discovers a failing trace, the others see it on the next fetch.</p>

<p>No message passing needed—the test failures <em>are</em> the messages.</p>

<h3 id="3-the-feedback-loop-tightens">3. The feedback loop tightens</h3>

<p>Today, you get a shrunk counterexample and figure out what it means yourself. With agents, the cycle could become:</p>

<ol>
  <li>Agent generates command sequences <strong>based on norms</strong></li>
  <li>proptest executes them, <strong>finds a failure</strong></li>
  <li>proptest <strong>shrinks</strong> the failure to a minimal trace</li>
  <li>Agent reads the trace, <strong>generates a root-cause hypothesis</strong></li>
  <li>Agent <strong>writes a regression test</strong></li>
</ol>

<p>Steps 4 and 5 are currently hours of manual work. They wouldn’t be free with agents—output still needs triage. But the iteration speed could be fundamentally different.</p>

<p>While this complete loop hasn’t been tested end-to-end yet, each component exists and has proven valuable in isolation.</p>

<h2 id="where-this-probably-wont-work">Where this probably won’t work</h2>

<ul>
  <li><strong>Enormous state spaces, simple invariants.</strong> If your oracle is “did it crash,” a traditional fuzzer wins. Agents are slow by comparison.</li>
  <li><strong>Precise mathematical constraints.</strong> When the goal is “find the exact input satisfying this formal constraint,” SMT solvers are more reliable. Agents reason about code, but they don’t exhaustively search a constraint space.</li>
</ul>

<p>The agent’s advantage: <strong>structured exploration</strong>—when invariants are rich, the state machine is complex, and interesting traces require understanding.</p>

<h2 id="the-combination">The combination</h2>

<table>
  <thead>
    <tr>
      <th>Piece</th>
      <th>Role</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="/2026/02/01/agent-skills-and-claude-lint/">Agent skills</a></td>
      <td>Encode <strong>what matters</strong>—norms, invariants, properties</td>
    </tr>
    <tr>
      <td><a href="/2026/02/08/agent-teams-and-claude-swarm/">Agent swarms</a></td>
      <td>Provide <strong>parallel exploration</strong>—multiple agents, different state-space regions</td>
    </tr>
    <tr>
      <td><a href="/2026/01/31/the-bugs-between-calls/">Stateful testing</a></td>
      <td>Provide the <strong>execution engine</strong>—command sequences, invariant checks, shrinking</td>
    </tr>
  </tbody>
</table>

<p>Each piece works alone. Together, they compound.</p>

<p>This hypothesis remains unproven at scale. However, the individual components have demonstrated value in isolation, and their integration appears promising.</p>

<p>The best testing infrastructure usually emerges that way—you notice the pieces reinforcing each other before you design the integration.</p>

<h2 id="related-posts">Related posts</h2>

<ul>
  <li><a href="/2026/01/25/oracles-traces-triage/">Oracles, Traces, Triage</a> (series index)</li>
  <li><a href="/2026/02/01/agent-skills-and-claude-lint/">Agent Skills and claude-lint</a></li>
  <li><a href="/2026/02/08/agent-teams-and-claude-swarm/">Agent Teams and claude-swarm</a></li>
  <li><a href="/2026/01/31/the-bugs-between-calls/">The Bugs Between Calls</a></li>
  <li><a href="/2025/03/10/chaos-testing-stacks-node/">Chaos Testing stacks-node with Model-Based Stateful Testing</a></li>
</ul>

<hr />

<p><strong>Next:</strong> <a href="/2026/02/16/agents-as-fuzzers/">Agents as Fuzzers</a></p>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Part 3 of the Oracles, Traces, Triage series. On combining agent swarms and agent skills with stateful property-based testing to find the consensus bugs that live in sequences, not single calls.]]></summary></entry><entry><title type="html">Agent Teams and claude-swarm</title><link href="https://blog.nikosbaxevanis.com/2026/02/08/agent-teams-and-claude-swarm/" rel="alternate" type="text/html" title="Agent Teams and claude-swarm" /><published>2026-02-08T00:00:00+00:00</published><updated>2026-02-08T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2026/02/08/agent-teams-and-claude-swarm</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2026/02/08/agent-teams-and-claude-swarm/"><![CDATA[<p><em>This article is part of the Oracles, Traces, Triage <a href="/2026/01/25/oracles-traces-triage/">series</a>.</em></p>

<h2 id="one-agent-hits-a-ceiling">One agent hits a ceiling</h2>

<p>A single Claude Code session can do one thing at a time. For small tasks—fix this function, write that test—that’s fine. But the work I care about is not small. Exploring multiple hypotheses in parallel, maintaining documentation while debugging, running specialized analysis while generating test harnesses.</p>

<p>One agent, one task, one context window. It doesn’t scale.</p>

<h2 id="the-agent-team-pattern">The agent-team pattern</h2>

<p>In early February 2026, Anthropic published <a href="https://www.anthropic.com/engineering/building-c-compiler">Building a C Compiler with Large Language Models</a>—a detailed account of 16 Claude instances working in parallel to produce a 100,000-line Rust-based C compiler capable of building the Linux kernel. The total: nearly 2,000 Claude Code sessions, 2 billion input tokens, 140 million output tokens.</p>

<p>The architecture was surprisingly simple. No orchestrator. No message bus. No shared memory. Just git.</p>

<p>Each agent ran in a Docker container. Each cloned a shared bare repo, worked on a task, and pushed. When two agents tried to claim the same task, git’s synchronization forced the second one to pick something else. Merge conflicts happened often; Claude was smart enough to resolve them.</p>

<p>The key insight: <strong>coordination through the codebase itself</strong>. The repo <em>is</em> the shared state. Commits <em>are</em> the messages. Locks <em>are</em> text files.</p>
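<p>The idea of locks as text files can be sketched in a few lines. This is an illustration under assumptions: a local directory stands in for the shared repo, and exclusive file creation stands in for git refusing the second claim. The task names and the <code class="language-plaintext highlighter-rouge">try_claim</code> and <code class="language-plaintext highlighter-rouge">pick_task</code> helpers are hypothetical.</p>

```python
import os
import tempfile

def try_claim(lock_dir, task, agent):
    """Claim a task by creating its lock file; return True if this agent won."""
    path = os.path.join(lock_dir, task + ".lock")
    try:
        # Mode "x" fails if the file already exists, so only one claimant wins.
        with open(path, "x") as handle:
            handle.write(agent + "\n")
        return True
    except FileExistsError:
        return False

def pick_task(lock_dir, tasks, agent):
    """Walk the task list and return the first task this agent manages to claim."""
    for task in tasks:
        if try_claim(lock_dir, task, agent):
            return task
    return None

with tempfile.TemporaryDirectory() as lock_dir:
    tasks = ["fix-parser", "fix-codegen"]
    first = pick_task(lock_dir, tasks, "agent-1")
    second = pick_task(lock_dir, tasks, "agent-2")
    third = pick_task(lock_dir, tasks, "agent-3")
    print(first, second, third)   # fix-parser fix-codegen None
```

<p>The losing agent needs no message from anyone. It finds the lock already taken and moves on to the next task.</p>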

<h2 id="claude-swarm">claude-swarm</h2>

<p>This pattern can be implemented through <a href="https://github.com/moodmosaic/claude-swarm">claude-swarm</a>—a reusable harness currently wired for running multiple Claude Code sessions in Docker containers, coordinating through git. The coordination pattern itself is tool-agnostic; <code class="language-plaintext highlighter-rouge">claude-swarm</code> is just one concrete implementation.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export ANTHROPIC_API_KEY="sk-ant-..."
export AGENT_PROMPT="path/to/prompt.md"
./tools/claude-swarm/launch.sh start
./tools/claude-swarm/launch.sh status
./tools/claude-swarm/launch.sh stop
</code></pre></div></div>

<p>The design is minimal by conviction, not by laziness:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Host                        /tmp (bare repos)
~/project/ ── git clone ──&gt; project-upstream.git (rw)
               --bare       project-mirror-*.git (ro)
                                     |
                                     | docker volumes
                                     |
               .---------------------+---------------------.
               |                     |                     |
           Container 1          Container 2          Container 3
           /upstream  (rw)      /upstream  (rw)      /upstream  (rw)
           /mirrors/* (ro)      /mirrors/* (ro)      /mirrors/* (ro)
               |                     |                     |
               v                     v                     v
           /workspace/          /workspace/          /workspace/
           (agent-work)         (agent-work)         (agent-work)
</code></pre></div></div>

<p>All containers mount the same bare repo. When one agent pushes, others see the changes on the next fetch. Each container runs <code class="language-plaintext highlighter-rouge">harness.sh</code>, which clones, resets to <code class="language-plaintext highlighter-rouge">origin/agent-work</code>, runs one Claude session, and loops. Agents stop after a configurable number of idle sessions with no commits.</p>
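<p>The stopping rule can be sketched as follows. This is not the actual <code class="language-plaintext highlighter-rouge">harness.sh</code>; it assumes that idle sessions are counted consecutively and that the counter resets on any commit, and <code class="language-plaintext highlighter-rouge">run_session</code> is a hypothetical stand-in for the clone, reset, and agent-session steps.</p>

```python
def run_harness(run_session, max_idle):
    """Loop sessions until max_idle consecutive sessions produce no commits."""
    idle = 0
    history = []
    while idle != max_idle:
        commits = run_session()    # stand-in for: clone, reset, run one agent session
        history.append(commits)
        idle = 0 if commits > 0 else idle + 1   # assumption: any commit resets the count
    return history

# Hypothetical session results: number of commits each session produced.
scripted = iter([2, 0, 1, 0, 0, 0, 5])
history = run_harness(lambda: next(scripted), max_idle=3)
print(history)   # [2, 0, 1, 0, 0, 0]
```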

<h2 id="why-no-orchestrator">Why no orchestrator</h2>

<p>The temptation is always to add a coordinator—something that assigns tasks, monitors progress, resolves conflicts. This approach avoids orchestration for the same reason it avoids workflow verbs in <a href="/2026/02/01/agent-skills-and-claude-lint/">CLAUDE.md</a>: <strong>centralized control tends to reduce agent autonomy and reasoning capabilities</strong>.</p>

<p>With no orchestrator, each agent must orient itself. It reads the README, checks the current state of the code, looks at what other agents have done, and decides what to work on next. This mirrors how good engineering teams actually function: shared context, local autonomy, coordination through artifacts.</p>

<p>Anthropic’s experience confirmed the pattern. Their agents maintained running docs of failed approaches. They took locks on tasks by writing text files. They specialized naturally—one agent coalescing duplicate code, another improving performance, another working on documentation.</p>

<h2 id="specialization-is-possible-not-required">Specialization is possible, not required</h2>

<p>Right now, all agents in claude-swarm share the same prompt. They self-organize by looking at the repo and picking different things to work on.</p>

<p>Anthropic’s experience suggests that per-agent prompts (one focused on code quality, another on test coverage, another on documentation) can help at scale. claude-swarm supports that: just point <code class="language-plaintext highlighter-rouge">AGENT_PROMPT</code> at different files per container. In practice, though, shared prompts often suffice for initial implementations, and agents typically self-organize effectively without specialized prompts.</p>

<p>This connects to the <a href="/2026/02/01/agent-skills-and-claude-lint/">agent skills philosophy</a>: the prompt shapes behavior. The harness just runs the loop.</p>

<h2 id="when-it-works-when-it-doesnt">When it works, when it doesn’t</h2>

<p>Agent swarms work best when the problem decomposes into independent sub-tasks—many distinct failing tests, different modules, separate components. Each agent picks a different piece, and parallelism is trivial.</p>

<p>They struggle when the problem is monolithic. Anthropic hit this when compiling the Linux kernel: every agent would hit the same bug, fix it independently, and then overwrite the others’ changes. Their solution was to use GCC as an oracle and randomly split compilation between GCC and their compiler, letting each agent work on different failing file subsets.</p>

<p>For testing work, the decomposition is usually natural. Different invariants to test. Different modules to fuzz. Different state-machine paths to explore. The swarm pattern fits.</p>

<h2 id="what-this-is-really-about">What this is really about</h2>

<p>claude-swarm is about 200 lines of shell. It’s not the point.</p>

<p>The point is that the agent-team pattern—N autonomous agents, shared codebase, no central control—is a genuine paradigm for how AI-assisted work can scale. It’s not about making one agent smarter. It’s about making many agents productive together, the same way you’d make a team of engineers productive: clear context, local ownership, shared truth in the repo.</p>

<p>The C compiler was the proof of concept. Fuzz testing is where I’m applying it.</p>

<h2 id="related-posts">Related posts</h2>

<ul>
  <li><a href="/2026/01/25/oracles-traces-triage/">Oracles, Traces, Triage</a> (series index)</li>
  <li><a href="/2026/02/01/agent-skills-and-claude-lint/">Agent Skills and claude-lint</a></li>
  <li><a href="/2026/01/31/the-bugs-between-calls/">The Bugs Between Calls</a></li>
</ul>

<hr />

<p><strong>Next:</strong> <a href="/2026/02/15/testing-between-calls-with-agents/">Testing the Bugs Between Calls</a></p>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Part 2 of the Oracles, Traces, Triage series. On running multiple agents in parallel through git—no orchestrator, no message passing—and why the pattern matters more than the tool.]]></summary></entry><entry><title type="html">Agent Skills and claude-lint</title><link href="https://blog.nikosbaxevanis.com/2026/02/01/agent-skills-and-claude-lint/" rel="alternate" type="text/html" title="Agent Skills and claude-lint" /><published>2026-02-01T00:00:00+00:00</published><updated>2026-02-01T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2026/02/01/agent-skills-and-claude-lint</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2026/02/01/agent-skills-and-claude-lint/"><![CDATA[<p><em>This article is part of the Oracles, Traces, Triage <a href="/2026/01/25/oracles-traces-triage/">series</a>.</em></p>

<h2 id="the-temptation">The temptation</h2>

<p>The first thing most people do with a <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> file is write a recipe. Step 1, do this. Step 2, do that. If you see an error, run this command. Here’s a code block you can paste.</p>

<p>It works. For about a week. Then the codebase shifts, the recipe goes stale, and the model follows outdated instructions with the confidence of someone who doesn’t know they’re wrong.</p>

<p>I’ve seen this pattern before. It’s the same failure mode as over-specified test fixtures: the more you hard-code the steps, the more brittle the system becomes. The test passes for the wrong reasons. The agent succeeds for the wrong reasons.</p>

<h2 id="context-should-shape-reasoning-not-script-behavior">Context should shape reasoning, not script behavior</h2>

<p>This distinction proves crucial in practice. When <code class="language-plaintext highlighter-rouge">.claude/</code> directories emphasize workflows over norms, models tend to follow outdated instructions rigidly. When structured around principles and facts, models demonstrate greater adaptability to changing contexts.</p>

<p>Whether this constitutes “reasoning from principles” in a deep sense remains an open question. However, the resulting outputs consistently demonstrate improved quality and relevance.</p>

<p>Think about it from a testing perspective. A unit test that asserts <code class="language-plaintext highlighter-rouge">f(3) == 7</code> checks one input. A property that asserts <code class="language-plaintext highlighter-rouge">for all x: f(f_inverse(x)) == x</code> checks the <em>relationship</em>.</p>

<p>Change how <code class="language-plaintext highlighter-rouge">f</code> computes internally and the property still holds, provided <code class="language-plaintext highlighter-rouge">f_inverse</code> is kept in step, because it only cares that the roundtrip works. The hard-coded assertion breaks the moment the mapping shifts.</p>
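<p>A small sketch of the difference, assuming a hypothetical linear mapping chosen so that <code class="language-plaintext highlighter-rouge">f(3) == 7</code> holds at first. The <code class="language-plaintext highlighter-rouge">make_pair</code> and <code class="language-plaintext highlighter-rouge">roundtrip_holds</code> helpers are illustrative names, not part of any library.</p>

```python
from fractions import Fraction
import random

def make_pair(slope):
    """Return a hypothetical (f, f_inverse) pair for the mapping x -> slope * x + 1."""
    def f(x):
        return slope * x + 1
    def f_inverse(y):
        return Fraction(y - 1, slope)   # exact arithmetic, no rounding
    return f, f_inverse

def roundtrip_holds(f, f_inverse, samples=200, seed=0):
    """Property: f(f_inverse(x)) == x for randomly drawn integers x."""
    rng = random.Random(seed)
    return all(f(f_inverse(x)) == x
               for x in (rng.randint(-1000, 1000) for _ in range(samples)))

old_f, old_inv = make_pair(2)
new_f, new_inv = make_pair(3)    # the mapping shifts

# Hard-coded assertion: holds before the change, breaks after it.
print(old_f(3) == 7, new_f(3) == 7)                                       # True False
# Property: holds before and after, because both halves changed together.
print(roundtrip_holds(old_f, old_inv), roundtrip_holds(new_f, new_inv))   # True True
```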

<p>Same idea. A <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> that says “run <code class="language-plaintext highlighter-rouge">cargo test</code> after every change” is a hard-coded assertion. A <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> that says “all changes must pass the existing test suite” is a property. The model can figure out <em>how</em> to run the tests. What it needs from you is <em>what matters</em>.</p>

<h2 id="the-layers">The layers</h2>

<p>Over time, I’ve settled on a layered structure for <code class="language-plaintext highlighter-rouge">.claude/</code> directories:</p>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>What belongs</th>
      <th>What doesn’t</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">CLAUDE.md</code></td>
      <td>Norms, facts, project conventions</td>
      <td>Workflow verbs (“step 1”, “then do”), code blocks</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">agents/*.md</code></td>
      <td>Perspective, values (≤120 lines)</td>
      <td>Procedures, code blocks</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">skills/*/SKILL.md</code></td>
      <td>Capabilities (≤500 lines)</td>
      <td>Success criteria, code blocks</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">references/*.md</code></td>
      <td>Playbooks, optional reference material</td>
      <td>Missing “optional” declaration</td>
    </tr>
  </tbody>
</table>

<p><strong>CLAUDE.md</strong> is the constitution. Short. Declarative. “This project uses Rust.” “Tests must pass before commits.” “Prefer explicit error handling over unwrap.” No instructions on <em>how</em> to do things—just <em>what matters</em>.</p>

<p><strong>Agents</strong> get a perspective. If you have a code-quality agent, it gets values like “favor readability over cleverness” and “flag any function longer than 40 lines.” It doesn’t get a checklist.</p>

<p><strong>Skills</strong> describe capabilities the model can use—not step-by-step procedures. A skill for “running fuzzers” says what the fuzzer does, what inputs it expects, what success looks like at a high level. It does <em>not</em> contain a bash script.</p>

<p><strong>References</strong> are the escape hatch. Sometimes you genuinely need a playbook—a deployment procedure, a migration guide. References hold those, but they must declare themselves as optional. The model should know these are reference material, not marching orders.</p>

<h2 id="claude-lint">claude-lint</h2>

<p>A Rust CLI tool called <a href="https://github.com/moodmosaic/claude-lint">claude-lint</a> helps enforce these patterns by checking <code class="language-plaintext highlighter-rouge">.claude/</code> directories for violations.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ claude-lint .claude
ok: .claude passes all checks

$ claude-lint /path/to/.claude
error: /path/to/.claude/CLAUDE.md: contains workflow verb 'step 1'
error: /path/to/.claude/skills/foo/SKILL.md: contains fenced code block
2 error(s)
</code></pre></div></div>

<p>It checks for:</p>

<ul>
  <li><strong>Workflow verbs</strong> in <code class="language-plaintext highlighter-rouge">CLAUDE.md</code> (e.g., “step 1”, “then run”, “next, do”)</li>
  <li><strong>Code blocks</strong> where they don’t belong (everywhere except references)</li>
  <li><strong>Line limits</strong> on agents (≤120) and skills (≤500)</li>
  <li><strong>Missing “optional” declarations</strong> in reference files</li>
</ul>
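<p>To give a feel for the first check, here is a minimal sketch of a workflow-phrase scan. It is not taken from claude-lint: the function name, the constant, and the matching rule (case-insensitive substring) are my assumptions, and the phrase list is just the three examples above.</p>

```rust
/// Example phrases from the list above; the real tool's list may differ.
const WORKFLOW_PHRASES: [&str; 3] = ["step 1", "then run", "next, do"];

/// Hypothetical helper: returns (1-based line number, phrase) for every
/// case-insensitive occurrence of a workflow phrase in `text`.
fn find_workflow_phrases(text: &str) -> Vec<(usize, &'static str)> {
    let mut hits = Vec::new();
    for (index, line) in text.lines().enumerate() {
        let lowered = line.to_lowercase();
        for phrase in WORKFLOW_PHRASES {
            if lowered.contains(phrase) {
                hits.push((index + 1, phrase));
            }
        }
    }
    hits
}
```

<p>A substring scan like this is crude and will produce false positives, which is in keeping with a deliberately strict linter.</p>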

<p>It’s deliberately strict. The point is not to make <code class="language-plaintext highlighter-rouge">.claude/</code> directories pleasant to read. The point is to keep them in the shape where I’ve seen the model produce the best results.</p>

<h2 id="why-this-matters-in-practice">Why this matters in practice</h2>

<p>Claude Code demonstrates this approach in practice, using structured context to explore edge cases, generate test harnesses, and reason about state-machine invariants. In my experience, the quality of the output tracks the quality of the context it is given.</p>

<p>When I embed workflows, the model sticks to them—even when they’re wrong for the current situation. When I embed norms (“never skip precondition checks”, “all state transitions must be tested for idempotence”), I get output that adapts to whatever the model finds in the codebase.</p>

<p>Whether that’s “reasoning from norms” or just the model having more room to draw on its training, I can’t say for certain. What I can say is that the parallel to property-based testing feels right. Properties tell the system <em>what must hold</em>. The system figures out <em>how to check it</em>. Norms tell the model <em>what matters</em>. The model figures out <em>how to act on it</em>.</p>

<p>Same shape. I’ll take it.</p>

<h2 id="related-posts">Related posts</h2>

<ul>
  <li><a href="/2026/01/25/oracles-traces-triage/">Oracles, Traces, Triage</a> (series index)</li>
  <li><a href="/2026/01/31/the-bugs-between-calls/">The Bugs Between Calls</a></li>
</ul>

<hr />

<p><strong>Next:</strong> <a href="/2026/02/08/agent-teams-and-claude-swarm/">Agent Teams and claude-swarm</a></p>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Part 1 of the Oracles, Traces, Triage series. On structuring .claude/ directories so the model adapts from norms instead of following stale scripts—and a Rust tool to enforce it.]]></summary></entry><entry><title type="html">The Bugs Between Calls</title><link href="https://blog.nikosbaxevanis.com/2026/01/31/the-bugs-between-calls/" rel="alternate" type="text/html" title="The Bugs Between Calls" /><published>2026-01-31T00:00:00+00:00</published><updated>2026-01-31T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2026/01/31/the-bugs-between-calls</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2026/01/31/the-bugs-between-calls/"><![CDATA[<p>Anthropic just showed how far property-based testing can go when you can express a property at a function boundary. Their agent <a href="https://red.anthropic.com/2026/property-based-testing/">generated Hypothesis tests for real-world libraries and validated/reported several bugs in NumPy, Pandas, and SciPy</a>.</p>

<p>One important gap is that, while we still see critical bugs in single calls (SSZ decoding, BLS edge cases), many of the most expensive recent failures live <em>between calls</em>.</p>

<p>In many cases, the protocol rules are fine; the failure is a valid-but-expensive trace that turns into a liveness incident under load.</p>

<p><strong>December 2025:</strong> Shortly after Fusaka activated (Dec 3, 2025), Prysm hit a resource-exhaustion path processing certain attestations, dropping network participation to ~75% and pushing voting participation as low as ~74.7% in some epochs—uncomfortably close to the 2/3 stake threshold required for finality. In this incident, attestations referencing a previous-epoch block root could trigger repeated state recreation, replay, and epoch-transition recomputation, exhausting node resources under load. (See the <a href="https://prysm.offchainlabs.com/docs/misc/mainnet-postmortems/">Prysm mainnet postmortems</a> for the primary write-up.)</p>

<p><strong>May 2023:</strong> Mainnet finality was delayed twice within ~24 hours (first ~4 epochs, then ~9). The trigger was valid old-target attestations that forced expensive beacon-state regeneration in some clients; client diversity helped the chain recover without intervention. (Postmortem: <a href="https://medium.com/offchainlabs/post-mortem-report-ethereum-mainnet-finality-05-11-2023-95e271dfd8b2">Ethereum Mainnet Finality Incident (May 2023)</a>.)</p>

<p><strong>April 2023:</strong> Stacks hit a PoX-2 bug in <code class="language-plaintext highlighter-rouge">stack-increase</code> that impacted Stacking rewards for a cycle. The details are different, but the shape is familiar: stateful logic where correctness is about how an accounting state evolves over a sequence of actions, not a single call in isolation. I wasn’t at Stacks at the time, and I’m not claiming “I would have caught it” — I mention it because it’s a clean example of why tests that exercise <em>traces</em> (not just inputs) matter. (Thread: <a href="https://forum.stacks.org/t/a-bug-in-stacks-increase-call-is-impacting-stacking-rewards-this-cycle/14867">A bug in stacks-increase call is impacting Stacking rewards this cycle</a>.)</p>

<p>Each operation was valid. The sequence proved problematic only under load.</p>

<p>Many expensive bugs live in <em>sequences</em> that look fine individually.</p>

<h2 id="stateless-properties-and-where-they-stop">Stateless properties (and where they stop)</h2>

<p>Stateless properties shine when:</p>

<ul>
  <li>the function boundary is the correctness boundary</li>
  <li>behavior is local to a single invocation</li>
  <li>invariants don’t depend on history</li>
</ul>

<p>This covers a lot of “pure-ish” code: parsing, formatting, serialization, numerical edge cases.</p>

<p>But consensus software is not primarily pure functions.</p>

<h2 id="consensus-clients-are-state-machines">Consensus clients are state machines</h2>

<p>Ethereum’s consensus clients (e.g., <a href="https://github.com/sigp/lighthouse">Lighthouse</a>, Prysm, Teku, Grandine, Nimbus, Lodestar) implement a long-lived state machine:</p>

<ul>
  <li><strong>per-slot processing</strong> (slots advance, duties change, messages arrive out of order)</li>
  <li><strong>data availability</strong> (verifying that required data is available; evolving toward Data Availability Sampling via <a href="https://eips.ethereum.org/EIPS/eip-7594">PeerDAS</a>)</li>
  <li><strong>fork choice</strong> (multiple competing branches, attestation-weighted via LMD-GHOST)</li>
  <li><strong>finality</strong> (justified/finalized checkpoints that must only move forward)</li>
  <li><strong>storage and replay</strong> (idempotence, witness caching, pruning, reorgs)</li>
</ul>

<p>Correctness is rarely “the output of one function call”.
It’s “the system’s behavior over a trace”.</p>

<h2 id="examples-of-stateful-invariants-in-consensus-clients">Examples of stateful invariants in consensus clients</h2>

<p>Here are a few invariants that are naturally <em>history-dependent</em>:</p>

<ul>
  <li><strong>Finality is monotonic</strong>: the finalized checkpoint’s epoch must never decrease. (Finality can stall; it must not regress.)</li>
  <li><strong>Fork choice respects finality</strong>: once a checkpoint is finalized, the selected head must be a descendant of it. (Heads can reorg; finalized history cannot.)</li>
  <li><strong>Data Availability gates what validators can accept/vote for</strong>: A block header is not enough in 2026. Availability is enforced via fork-choice/voting rules: validators should only accept and vote for blocks once sufficient data availability has been verified (today: all blobs; with <a href="https://eips.ethereum.org/EIPS/eip-7594">PeerDAS</a>: sampling cells/columns). Fork choice can only safely give full weight to blocks that validators can legally vote for. Testing the transition from “pending availability” to “available and valid” is a classic stateful trace.</li>
  <li><strong>Stake-weighted participation (MaxEB)</strong>: Participation was always stake-weighted, but <a href="https://eips.ethereum.org/EIPS/eip-7251">MaxEB</a> makes the variance visible by raising the cap from 32 ETH to 2048 ETH. Not every validator will immediately sit at the cap, but stake weight per validator can now vary widely, so a bug that affects a handful of high-effective-balance validators can represent outsized stake impact.</li>
  <li><strong>Idempotent imports</strong>: importing the <em>same block</em> twice (same root) must not double-apply side effects (DB indexes, caches, votes, metrics, etc.).</li>
  <li><strong>Equivocations must be handled, not assumed away</strong>: you can see multiple distinct blocks for the same slot. A client shouldn’t panic or corrupt state just because reality is adversarial.</li>
  <li><strong>Epoch-boundary logic must not run twice</strong>: “do X once per epoch” bugs are classic state-machine failures when reorgs, retries, and partial persistence meet.</li>
</ul>
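<p>To make “the selected head must be a descendant of the finalized checkpoint” concrete, here is a small sketch of a check a harness could run after every command. The <code>BlockId</code> alias, the parent map, and the function name are assumptions for illustration; real clients key blocks by 32-byte roots and keep this data in their fork-choice store.</p>

```rust
use std::collections::HashMap;

/// Hypothetical block identifier; real clients use 32-byte roots.
type BlockId = u64;

/// Walks parent links starting at `head` and reports whether `ancestor`
/// is reached. A block counts as its own descendant.
fn descends_from(
    parents: &HashMap<BlockId, BlockId>,
    head: BlockId,
    ancestor: BlockId,
) -> bool {
    let mut cursor = head;
    // Bound the walk so a malformed (cyclic) parent map cannot loop forever.
    for _ in 0..=parents.len() {
        if cursor == ancestor {
            return true;
        }
        match parents.get(&cursor) {
            Some(&parent) => cursor = parent,
            None => return false,
        }
    }
    false
}
```

<p>The invariant is then a single assertion over the whole trace: after each step, <code>descends_from(&amp;parents, head, finalized)</code> must hold.</p>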

<p>If you try to phrase these as “<em>f(x)</em> preserves <em>P</em>”, you end up smuggling “history” into <em>x</em> until it stops being a useful boundary.</p>

<p>Take “finality is monotonic.” You might try:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>for all (old_finalized, new_finalized):
    process_block(...) implies new_finalized &gt;= old_finalized
</code></pre></div></div>

<p>But now <code class="language-plaintext highlighter-rouge">old_finalized</code> is part of the input. Where does it come from? You have to generate it. And to generate a <em>valid</em> old state, you need to know what sequence of blocks led there. You’ve just reinvented traces—badly.</p>

<p>The honest framing is: “after any valid sequence of operations, the finalized epoch never decreases”. That’s a property over traces, not over inputs.</p>
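<p>As a sketch of that framing, the check below takes a whole trace and reports the first step at which finality regresses. The <code>Op</code> enum and the <code>step</code> callback are hypothetical stand-ins for a real client's operations; the only thing the checker carries between steps is the previously observed finalized epoch.</p>

```rust
/// Hypothetical operations; the names are illustrative, not a client API.
#[derive(Debug, Clone, Copy)]
enum Op {
    /// Time passes without a new finalized checkpoint (finality stalls).
    Stall,
    /// The system is told about a finalized checkpoint at this epoch.
    Finalize(u64),
}

/// Feeds `trace` to `step`, which applies one operation and returns the
/// finalized epoch afterwards. Returns the index of the first operation
/// after which the finalized epoch decreased, or `None` if it never did.
fn first_regression(trace: &[Op], mut step: impl FnMut(Op) -> u64) -> Option<usize> {
    let mut previous = 0u64;
    for (index, &op) in trace.iter().enumerate() {
        let current = step(op);
        if current < previous {
            return Some(index);
        }
        previous = current;
    }
    None
}
```

<p>Nothing here generates <code>old_finalized</code>: it falls out of running the trace, which is the point.</p>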

<h2 id="model-based-stateful-property-based-testing">Model-based, stateful property-based testing</h2>

<p>Stateful testing makes the history explicit:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>State --(Command)--&gt; State'
</code></pre></div></div>

<p>Instead of generating inputs for a single call, you generate <em>commands</em> and run them as a scenario.
The bug is often not in any single step, but in a <em>particular ordering</em> of steps.</p>

<p>This idea is old and battle-tested (<a href="https://hackage.haskell.org/package/quickcheck-state-machine">QuickCheck state machine testing</a>, <a href="https://github.com/hedgehogqa/haskell-hedgehog">Hedgehog</a>, <a href="https://crates.io/crates/proptest-state-machine">proptest-state-machine</a>), but to my knowledge still underused in many production systems.</p>

<p><a href="/2025/03/10/chaos-testing-stacks-node/">The same approach, built into madhouse-rs, caught a production bug in the Stacks blockchain</a> that traditional testing missed. A 533-line integration test failed to reproduce it. A chaotic command sequence succeeded.</p>

<p>Model-based, stateful testing has also been applied to production systems like the Stacks PoX contracts, where it proved practical for ongoing use and caught issues that traditional testing methods missed.</p>

<h2 id="a-minimal-rust-harness-the-boring-runner">A minimal Rust harness (the “boring runner”)</h2>

<p>The core trick is to keep the runner boring and put all the logic in commands.
This is the same shape that scales in practice.</p>

<h3 id="state-and-context">State and context</h3>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">trait</span> <span class="n">State</span><span class="p">:</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="n">Debug</span> <span class="p">{}</span>

<span class="k">pub</span> <span class="k">trait</span> <span class="n">TestContext</span><span class="p">:</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="n">Debug</span> <span class="o">+</span> <span class="nb">Clone</span> <span class="p">{}</span>
</code></pre></div></div>

<p>For the examples below, assume an empty context:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Debug,</span> <span class="nd">Clone,</span> <span class="nd">Default)]</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">BeaconContext</span><span class="p">;</span>
<span class="k">impl</span> <span class="n">TestContext</span> <span class="k">for</span> <span class="n">BeaconContext</span> <span class="p">{}</span>
</code></pre></div></div>

<h3 id="commands">Commands</h3>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">proptest</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nb">Arc</span><span class="p">;</span>

<span class="k">pub</span> <span class="k">trait</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span> <span class="n">State</span><span class="p">,</span> <span class="n">C</span><span class="p">:</span> <span class="n">TestContext</span><span class="o">&gt;</span><span class="p">:</span>
    <span class="nn">std</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="n">Debug</span> <span class="o">+</span> <span class="nb">Send</span> <span class="o">+</span> <span class="nb">Sync</span>
<span class="p">{</span>
    <span class="c1">// Precondition: is this command meaningful *now*?</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">S</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span><span class="p">;</span>

    <span class="c1">// Apply the transition and assert postconditions.</span>
    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">S</span><span class="p">);</span>

    <span class="c1">// For debugging and shrunk traces.</span>
    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span><span class="p">;</span>

    <span class="c1">// Generate commands.</span>
    <span class="k">fn</span> <span class="nf">build</span><span class="p">(</span><span class="n">ctx</span><span class="p">:</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="n">C</span><span class="o">&gt;</span><span class="p">)</span>
        <span class="k">-&gt;</span> <span class="k">impl</span> <span class="n">Strategy</span><span class="o">&lt;</span><span class="n">Value</span> <span class="o">=</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">C</span><span class="o">&gt;&gt;</span>
    <span class="k">where</span>
        <span class="k">Self</span><span class="p">:</span> <span class="nb">Sized</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="wrapper-for-heterogeneous-sequences">Wrapper for heterogeneous sequences</h3>

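<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Clone is written by hand: #[derive(Clone)] would also demand S: Clone.</span>
<span class="c1">// Debug prints the label; proptest requires Debug on generated values.</span>
<span class="k">pub</span> <span class="k">struct</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span> <span class="n">State</span><span class="p">,</span> <span class="n">C</span><span class="p">:</span> <span class="n">TestContext</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">command</span><span class="p">:</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="k">dyn</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">C</span><span class="o">&gt;&gt;</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">impl</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span> <span class="n">State</span><span class="p">,</span> <span class="n">C</span><span class="p">:</span> <span class="n">TestContext</span><span class="o">&gt;</span> <span class="nb">Clone</span> <span class="k">for</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">C</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">clone</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">Self</span> <span class="p">{</span>
        <span class="k">Self</span> <span class="p">{</span> <span class="n">command</span><span class="p">:</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">clone</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="py">.command</span><span class="p">)</span> <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">impl</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span> <span class="n">State</span><span class="p">,</span> <span class="n">C</span><span class="p">:</span> <span class="n">TestContext</span><span class="o">&gt;</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="n">Debug</span> <span class="k">for</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">C</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">fmt</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">f</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="n">Formatter</span><span class="o">&lt;</span><span class="k">'_</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nn">std</span><span class="p">::</span><span class="nn">fmt</span><span class="p">::</span><span class="n">Result</span> <span class="p">{</span>
        <span class="nd">write!</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="s">"{}"</span><span class="p">,</span> <span class="k">self</span><span class="py">.command</span><span class="nf">.label</span><span class="p">())</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">impl</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span> <span class="n">State</span><span class="p">,</span> <span class="n">C</span><span class="p">:</span> <span class="n">TestContext</span><span class="o">&gt;</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">C</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="k">fn</span> <span class="n">new</span><span class="o">&lt;</span><span class="n">T</span><span class="o">&gt;</span><span class="p">(</span><span class="n">t</span><span class="p">:</span> <span class="n">T</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">Self</span>
    <span class="k">where</span>
        <span class="n">T</span><span class="p">:</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">C</span><span class="o">&gt;</span> <span class="o">+</span> <span class="k">'static</span><span class="p">,</span>
    <span class="p">{</span>
        <span class="k">Self</span> <span class="p">{</span> <span class="n">command</span><span class="p">:</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>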

<h3 id="execution-loop">Execution loop</h3>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">pub</span> <span class="k">fn</span> <span class="n">execute_commands</span><span class="o">&lt;</span><span class="n">S</span><span class="p">:</span> <span class="n">State</span><span class="p">,</span> <span class="n">C</span><span class="p">:</span> <span class="n">TestContext</span><span class="o">&gt;</span><span class="p">(</span>
    <span class="n">commands</span><span class="p">:</span> <span class="o">&amp;</span><span class="p">[</span><span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">S</span><span class="p">,</span> <span class="n">C</span><span class="o">&gt;</span><span class="p">],</span>
    <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">S</span><span class="p">,</span>
<span class="p">)</span> <span class="p">{</span>
    <span class="k">for</span> <span class="n">cmd</span> <span class="k">in</span> <span class="n">commands</span> <span class="p">{</span>
        <span class="k">if</span> <span class="n">cmd</span><span class="py">.command</span><span class="nf">.check</span><span class="p">(</span><span class="n">state</span><span class="p">)</span> <span class="p">{</span>
            <span class="n">cmd</span><span class="py">.command</span><span class="nf">.apply</span><span class="p">(</span><span class="n">state</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The point is <em>locality</em>: generation, preconditions, transition logic, and invariants live together.
That design choice is exactly the “data-open” side of <a href="/2025/03/25/expression-problem-in-practice/">the expression problem</a>, and it’s why these harnesses survive contact with real systems.</p>
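<p>To see the loop run without any dependencies, here is a reduced, standard-library-only sketch. It is an assumption-laden variant, not the harness above: <code>SimpleCommand</code> drops <code>build</code> and the context, <code>Counter</code> is a toy model, and <code>run_and_record</code> additionally returns the labels of the commands whose precondition held.</p>

```rust
use std::fmt::Debug;

/// Reduced variant of the command trait: no generator, no context.
trait SimpleCommand<S>: Debug {
    fn check(&self, state: &S) -> bool;
    fn apply(&self, state: &mut S);
    fn label(&self) -> String;
}

/// Hypothetical toy model: a counter that must never go below zero.
#[derive(Debug, Default)]
struct Counter {
    value: u32,
}

#[derive(Debug)]
struct Increment;

#[derive(Debug)]
struct Decrement;

impl SimpleCommand<Counter> for Increment {
    fn check(&self, _state: &Counter) -> bool { true }
    fn apply(&self, state: &mut Counter) { state.value += 1; }
    fn label(&self) -> String { "INCREMENT".to_string() }
}

impl SimpleCommand<Counter> for Decrement {
    // Precondition: only meaningful when there is something to take away.
    fn check(&self, state: &Counter) -> bool { state.value > 0 }
    fn apply(&self, state: &mut Counter) { state.value -= 1; }
    fn label(&self) -> String { "DECREMENT".to_string() }
}

/// Same shape as the execution loop: skip commands whose precondition
/// fails, apply the rest, and record the labels of the ones that ran.
fn run_and_record<S>(commands: &[Box<dyn SimpleCommand<S>>], state: &mut S) -> Vec<String> {
    let mut executed = Vec::new();
    for cmd in commands {
        if cmd.check(state) {
            cmd.apply(state);
            executed.push(cmd.label());
        }
    }
    executed
}
```

<p>Given the sequence <code>Decrement, Increment, Decrement</code> on a fresh counter, the first command is skipped by its precondition and the other two run. The runner never learns what a counter is.</p>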

<h2 id="a-consensus-client-flavored-example-with-correct-slot-semantics">A consensus-client-flavored example (with correct slot semantics)</h2>

<p>One easy trap is to assume “there is only one block per slot”.
In the spec there is one <em>proposer</em> per slot, but on the network you can see:</p>

<ul>
  <li>equivocations (two blocks for the same slot from the proposer)</li>
  <li>different views due to propagation delays</li>
  <li>reorgs that temporarily make a “worse” chain the head</li>
</ul>

<p>So a stateful invariant should not be “reject a second block at slot <em>s</em>”.
That’s not how fork choice works.</p>

<p>Instead, here’s a deliberately small example that matches real failure modes: <strong>idempotence by block root</strong>.
If a client re-imports the same block (same root), it must not double-apply side effects.</p>

<h3 id="model-state">Model state</h3>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">collections</span><span class="p">::{</span><span class="n">HashMap</span><span class="p">,</span> <span class="n">HashSet</span><span class="p">};</span>

<span class="nd">#[derive(Debug,</span> <span class="nd">Default)]</span>
<span class="k">struct</span> <span class="n">BeaconModel</span> <span class="p">{</span>
    <span class="n">current_slot</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>

    <span class="c1">// Slot -&gt; set of known block roots at that slot (forks allowed).</span>
    <span class="n">known_by_slot</span><span class="p">:</span> <span class="n">HashMap</span><span class="o">&lt;</span><span class="nb">u64</span><span class="p">,</span> <span class="n">HashSet</span><span class="o">&lt;</span><span class="p">[</span><span class="nb">u8</span><span class="p">;</span> <span class="mi">32</span><span class="p">]</span><span class="o">&gt;&gt;</span><span class="p">,</span>

    <span class="c1">// Participation is stake-weighted; MaxEB (EIP-7251) widens the range per validator.</span>
    <span class="c1">// Total weight of unique blocks we've imported.</span>
    <span class="n">total_imported_weight</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>

    <span class="c1">// Track which block states are available to prevent the 2025 Prysm regression</span>
    <span class="c1">// (expensive state regeneration when validating attestations for uncached blocks).</span>
    <span class="n">state_cache</span><span class="p">:</span> <span class="n">HashSet</span><span class="o">&lt;</span><span class="p">[</span><span class="nb">u8</span><span class="p">;</span> <span class="mi">32</span><span class="p">]</span><span class="o">&gt;</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">State</span> <span class="k">for</span> <span class="n">BeaconModel</span> <span class="p">{}</span>
</code></pre></div></div>

<h3 id="command-tick-time">Command: tick time</h3>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Debug)]</span>
<span class="k">struct</span> <span class="n">TickSlot</span><span class="p">;</span>

<span class="k">impl</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">BeaconModel</span><span class="p">,</span> <span class="n">BeaconContext</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">TickSlot</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">_state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">BeaconModel</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span> <span class="k">true</span> <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">BeaconModel</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span><span class="py">.current_slot</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span> <span class="s">"TICK_SLOT"</span><span class="nf">.to_string</span><span class="p">()</span> <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h3 id="command-import-a-block-stake-weighted-duplicates-forbidden">Command: import a block (stake-weighted, idempotent on duplicates)</h3>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[derive(Debug)]</span>
<span class="k">struct</span> <span class="n">ImportBlock</span> <span class="p">{</span>
    <span class="n">slot</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
    <span class="n">root</span><span class="p">:</span> <span class="p">[</span><span class="nb">u8</span><span class="p">;</span> <span class="mi">32</span><span class="p">],</span>
    <span class="n">weight</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span> <span class="c1">// Stake-weighted via MaxEB.</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">BeaconModel</span><span class="p">,</span> <span class="n">BeaconContext</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">ImportBlock</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">BeaconModel</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
        <span class="k">self</span><span class="py">.slot</span> <span class="o">&lt;=</span> <span class="n">state</span><span class="py">.current_slot</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">BeaconModel</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">entry</span> <span class="o">=</span> <span class="n">state</span>
            <span class="py">.known_by_slot</span>
            <span class="nf">.entry</span><span class="p">(</span><span class="k">self</span><span class="py">.slot</span><span class="p">)</span>
            <span class="nf">.or_default</span><span class="p">();</span>
        <span class="k">let</span> <span class="n">is_new</span> <span class="o">=</span> <span class="n">entry</span><span class="nf">.insert</span><span class="p">(</span><span class="k">self</span><span class="py">.root</span><span class="p">);</span>

        <span class="c1">// This is the invariant: same root must not be "new" twice.</span>
        <span class="c1">// Stake-weighting means a duplicate root shouldn't double-count weight.</span>
        <span class="k">if</span> <span class="n">is_new</span> <span class="p">{</span>
            <span class="n">state</span><span class="py">.total_imported_weight</span> <span class="o">+=</span> <span class="k">self</span><span class="py">.weight</span><span class="p">;</span>
            <span class="n">state</span><span class="py">.state_cache</span><span class="nf">.insert</span><span class="p">(</span><span class="k">self</span><span class="py">.root</span><span class="p">);</span>
        <span class="p">}</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
        <span class="nd">format!</span><span class="p">(</span>
            <span class="s">"IMPORT_BLOCK(slot={}, root={:02x?}, weight={})"</span><span class="p">,</span>
            <span class="k">self</span><span class="py">.slot</span><span class="p">,</span> <span class="k">self</span><span class="py">.root</span><span class="p">,</span> <span class="k">self</span><span class="py">.weight</span>
        <span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// The bug from December 2025: attestations for stale blocks </span>
<span class="c1">// triggering expensive state regeneration.</span>
<span class="nd">#[derive(Debug)]</span>
<span class="k">struct</span> <span class="n">ProcessAttestation</span> <span class="p">{</span>
    <span class="n">slot</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
    <span class="n">block_root</span><span class="p">:</span> <span class="p">[</span><span class="nb">u8</span><span class="p">;</span> <span class="mi">32</span><span class="p">],</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">BeaconModel</span><span class="p">,</span> <span class="n">BeaconContext</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">ProcessAttestation</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">BeaconModel</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
        <span class="k">self</span><span class="py">.slot</span> <span class="o">&lt;=</span> <span class="n">state</span><span class="py">.current_slot</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">BeaconModel</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// Invariant: looking up state for an attestation must not </span>
        <span class="c1">// cause a "miss" that triggers an expensive re-play.</span>
        <span class="nd">assert!</span><span class="p">(</span>
            <span class="n">state</span><span class="py">.state_cache</span><span class="nf">.contains</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="py">.block_root</span><span class="p">),</span>
            <span class="s">"State cache miss for block {:?} at slot {}"</span><span class="p">,</span> 
            <span class="k">self</span><span class="py">.block_root</span><span class="p">,</span> <span class="k">self</span><span class="py">.slot</span>
        <span class="p">);</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
        <span class="nd">format!</span><span class="p">(</span><span class="s">"PROCESS_ATTESTATION(slot={})"</span><span class="p">,</span> <span class="k">self</span><span class="py">.slot</span><span class="p">)</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>If the implementation accidentally increments counters, updates indexes, or applies cached transitions twice on duplicate import, a failing trace usually shrinks to something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[
  TICK_SLOT,
  IMPORT_BLOCK(slot=1, root=R),
  IMPORT_BLOCK(slot=1, root=R),
]
</code></pre></div></div>

<p>If the system is not truly idempotent, stateful testing will reduce a complex failure down to the smallest possible sequence—often just “process the same thing twice”—making the bug obvious and undeniable.</p>
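
<p>A minimal, std-only sketch of that “process the same thing twice” oracle. Every name here (<code class="language-plaintext highlighter-rouge">Store</code>, <code class="language-plaintext highlighter-rouge">import_block</code>, <code class="language-plaintext highlighter-rouge">first_violation</code>) is hypothetical and stands in for whatever the real client exposes; the buggy variant exists only to show how the invariant trips on the second import.</p>

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a client's block store; not a real client type.
#[derive(Default)]
struct Store {
    roots: HashSet<[u8; 32]>,
    imported_count: u64,
}

impl Store {
    /// Idempotent import: a duplicate root leaves the store unchanged.
    fn import_block(&mut self, root: [u8; 32]) {
        if self.roots.insert(root) {
            self.imported_count += 1;
        }
    }

    /// Deliberately buggy variant: counts every call, even duplicates.
    fn import_block_buggy(&mut self, root: [u8; 32]) {
        self.roots.insert(root);
        self.imported_count += 1;
    }
}

/// Replays a trace of block roots and checks the invariant after each step:
/// the counter must equal the number of distinct roots.
/// Returns the index of the first violating step, if any.
fn first_violation(trace: &[[u8; 32]], buggy: bool) -> Option<usize> {
    let mut store = Store::default();
    for (i, root) in trace.iter().enumerate() {
        if buggy {
            store.import_block_buggy(*root);
        } else {
            store.import_block(*root);
        }
        if store.imported_count != store.roots.len() as u64 {
            return Some(i);
        }
    }
    None
}

fn main() {
    let r = [7u8; 32];
    let trace = [r, r]; // IMPORT_BLOCK(root=R) twice
    println!("idempotent: {:?}", first_violation(&trace, false));
    println!("buggy:      {:?}", first_violation(&trace, true));
}
```

<p>The invariant is checked after every step rather than once at the end, so a shrunk trace points at the exact step where the state first diverged.</p>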

<p>That is the shape of a lot of consensus-client failures: not “wrong return value”, but “the second time through a path, something subtle breaks”.</p>

<h2 id="stateless--stateful-is-the-real-combination">Stateless + stateful is the real combination</h2>

<p>You want both:</p>

<ul>
  <li><strong>Stateless property-based testing (PBT)</strong> for pure-ish components: SSZ encoding, BLS wrappers, serialization, bitfields. This is where <a href="https://red.anthropic.com/2026/property-based-testing/">Anthropic’s approach</a> shines.</li>
  <li><strong>Stateful PBT</strong> for the hard parts: fork choice, finality logic, DB/replay, reorg handling, epoch boundaries.</li>
</ul>

<p>Stateless PBT finds bugs in the bricks (SSZ, BLS). Stateful PBT finds bugs in how the bricks stack—especially in the high-stakes world of PeerDAS and stake-weighted participation.</p>

<p>Anthropic showed us how to check the bricks really well. This post is about the wall.</p>

<h2 id="further-reading">Further Reading</h2>

<ul>
  <li><a href="https://eips.ethereum.org/EIPS/eip-7594">EIP-7594: Peer Data Availability Sampling (PeerDAS)</a></li>
  <li><a href="https://eips.ethereum.org/EIPS/eip-7251">EIP-7251: Increase the MAX_EFFECTIVE_BALANCE (MaxEB)</a></li>
  <li><a href="https://ethereum.github.io/consensus-specs/">Ethereum Consensus Specifications</a></li>
  <li><a href="https://ethereum.org/en/developers/docs/consensus-mechanisms/pos/gasper/">Gasper: LMD-GHOST + Casper FFG</a></li>
  <li><a href="https://eth2book.info/latest/part2/consensus/lmd_ghost/">LMD GHOST Fork Choice (eth2book)</a></li>
  <li><a href="https://github.com/ethereum/execution-apis/blob/main/src/engine/common.md">Ethereum Engine API Specification</a></li>
  <li><a href="https://clientdiversity.org/">Client Diversity Dashboard</a></li>
  <li><a href="https://prysm.offchainlabs.com/docs/misc/mainnet-postmortems/">Prysm Mainnet Postmortems (includes Fusaka incident)</a></li>
  <li><a href="https://medium.com/offchainlabs/post-mortem-report-ethereum-mainnet-finality-05-11-2023-95e271dfd8b2">Post-mortem: Ethereum Mainnet Finality Incident (May 2023)</a></li>
  <li><a href="https://forum.stacks.org/t/a-bug-in-stacks-increase-call-is-impacting-stacking-rewards-this-cycle/14867">Stacks PoX-2 <code class="language-plaintext highlighter-rouge">stack-increase</code> bug (April 2023)</a></li>
</ul>

<h2 id="related-posts">Related posts</h2>

<ul>
  <li><a href="/2024/12/01/model-based-stateful-testing-with-madhouse-rs/">Model-Based Stateful Testing with madhouse-rs</a></li>
  <li><a href="/2025/01/10/state-machine-testing-proptest">Model-Based Stateful Testing with proptest-state-machine</a></li>
  <li><a href="/2025/02/10/scaling-with-madhouse-rs">Scaling Model-Based Stateful Testing with madhouse-rs</a></li>
  <li><a href="/2025/03/25/expression-problem-in-practice">The Expression Problem in Practice: A Trait-Based Testing Harness</a></li>
  <li><a href="/2025/03/10/chaos-testing-stacks-node/">Chaos Testing stacks-node with Model-Based Stateful Testing</a></li>
  <li><a href="/2023/03/03/property-tests-vs-fuzzing/">Property Tests Are Not A Fuzzer</a></li>
  <li><a href="/2023/12/15/fuzzing-meets-property-testing/">Fuzzing meets property testing</a></li>
</ul>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Anthropic's recent work shows how far property-based testing can go when you can express properties at a function boundary. This follow-up argues that consensus clients need model-based, stateful testing to catch failures that only appear across sequences of events.]]></summary></entry><entry><title type="html">Oracles, Traces, Triage</title><link href="https://blog.nikosbaxevanis.com/2026/01/25/oracles-traces-triage/" rel="alternate" type="text/html" title="Oracles, Traces, Triage" /><published>2026-01-25T00:00:00+00:00</published><updated>2026-01-25T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2026/01/25/oracles-traces-triage</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2026/01/25/oracles-traces-triage/"><![CDATA[<p>In 2026, “agentic” tooling is moving fast enough that yesterday’s workflow advice goes stale quickly. This series is my attempt to write down the parts that seem durable: how to give agents norms instead of scripts, how to coordinate multiple agents through the repo, and how those patterns connect back to fuzzing and stateful testing.</p>

<h2 id="why">Why</h2>

<p>Skepticism is healthy here. Agent outputs still need oracles, reproducibility, and a bar for correctness. The tools are changing quickly; the craft is not.</p>

<p>Agents can draft tests fast; the hard part is still choosing the right oracles and insisting on reproducible failures.</p>

<p>Examples use <a href="https://docs.anthropic.com/en/docs/claude-code/overview">Claude Code</a> because that’s what I run day-to-day, but the patterns are meant to travel to any agent that can read a codebase, run checks, and write down findings.</p>

<p>This is not a tutorial. It’s a practitioner’s notebook.</p>

<h2 id="the-lens-oracles-traces-triage">The lens: oracles, traces, triage</h2>

<p>If there’s a unifying theme here, it’s that most bug-finding systems succeed or fail on three things:</p>

<ul>
  <li><strong>Oracles</strong> — how you decide something is wrong. Not just “did it crash?”, but invariants, spec checks, and properties that reflect what you actually care about.</li>
  <li><strong>Traces</strong> — many expensive bugs live in sequences, not calls. Stateful testing is about generating and shrinking traces until the failure is undeniable.</li>
  <li><strong>Triage</strong> — search produces noise. The work is making findings reproducible, minimal, deduped, and actionable (ideally as regression tests).</li>
</ul>

<p>That’s also a useful way to read the series:</p>

<ul>
  <li>Part 1 (skills/norms) mostly expands the <strong>oracle surface</strong> (“what matters”).</li>
  <li>Part 2 (agent teams) scales the search and improves the artifacts that enable <strong>triage</strong>.</li>
  <li>Part 3 (between calls/stateful) is explicitly about <strong>traces</strong>.</li>
  <li>Part 4 (agents as fuzzers) argues agents are search tools too—and the hard parts are still <strong>oracles + triage</strong>.</li>
</ul>

<h2 id="articles">Articles</h2>

<ul>
  <li><a href="/2026/02/01/agent-skills-and-claude-lint/">Part 1</a> - Agent Skills and claude-lint<br />
How to structure <code class="language-plaintext highlighter-rouge">.claude/</code> context so agents adapt from norms instead of blindly following scripts (plus a linter to keep it from drifting).</li>
  <li><a href="/2026/02/08/agent-teams-and-claude-swarm/">Part 2</a> - Agent Teams and claude-swarm<br />
A practical pattern for parallel agents that coordinate through git, with no orchestrator—because the repo can be the shared state.</li>
  <li><a href="/2026/02/15/testing-between-calls-with-agents/">Part 3</a> - Testing the Bugs Between Calls<br />
How agent skills, agent swarms, and stateful testing could combine to find consensus bugs that live in traces.</li>
  <li><a href="/2026/02/16/agents-as-fuzzers/">Part 4</a> - Agents as Fuzzers<br />
A structural analogy: both fuzzers and AI agents search for failures that require triage and oracles.</li>
</ul>

<h2 id="companion-post">Companion post</h2>

<p>If you want the deeper motivation for “why traces,” start here:</p>

<ul>
  <li><a href="/2026/01/31/the-bugs-between-calls/">The Bugs Between Calls</a></li>
</ul>

<h2 id="next">Next</h2>

<ul>
  <li>
    <p>Part 5 — Self-hosted agents on Runpod (and friends)<br />
Turning inference into a reliable test service: latency/cost knobs, guardrails, artifacts, and how to run agent loops against real repos.</p>
  </li>
  <li>
    <p>Part 6 — Quantization as a feature: cheap tests when deep reasoning isn’t needed<br />
Using smaller/quantized models for throughput work (scaffolding, formatting, test expansion) and reserving big models for judgment-heavy steps.</p>
  </li>
  <li>
    <p>Part 7 — Corpus, shrink, triage: turning agent output into a fuzzing pipeline<br />
How to dedupe/minimize failures and turn “agent finds” into reproducible bug packets and long-lived regression corpora.</p>
  </li>
</ul>

<h2 id="background">Background</h2>

<p>Most of what I know about testing came from shipping production systems and learning in public through open source: contributing to <a href="https://github.com/AutoFixture/AutoFixture">AutoFixture</a> starting around 2011, then maintaining <a href="https://github.com/hedgehogqa">Hedgehog</a>, which once powered Echidna, an early and widely used property-based fuzzer for Ethereum smart contracts.</p>

<p>Along the way: <a href="https://github.com/moodmosaic/Fare">Fare</a> for regex-constrained test generation, a <a href="https://github.com/moodmosaic/splitmix">SplitMix</a> port for reproducible failure discovery, and consensus fuzzers at Stacks that <a href="/2025/03/10/chaos-testing-stacks-node/">caught a production bug</a> that a 533-line integration test couldn’t reproduce.</p>

<p>That background is why I’m interested in AI tooling—not as a replacement for any of this, but as a way to do more of it.</p>

<h2 id="feedback">Feedback</h2>

<p>The ideas in this series come from daily practice—shipping agent-assisted testing tools for real protocol security work. But daily practice has blind spots.</p>

<p>If you think I’m wrong about something, I’d like to hear it. If you think I’m right but missing a nuance, I’d <em>especially</em> like to hear that.</p>

<hr />

<p><strong>Next:</strong> <a href="/2026/02/01/agent-skills-and-claude-lint/">Agent Skills and claude-lint</a></p>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[A series on agentic testing in 2026: how to build bug-finding systems that work because the oracles are real, traces are exercised, and findings are triaged.]]></summary></entry><entry><title type="html">The Expression Problem in Practice: A Trait-Based Testing Harness</title><link href="https://blog.nikosbaxevanis.com/2025/03/25/expression-problem-in-practice/" rel="alternate" type="text/html" title="The Expression Problem in Practice: A Trait-Based Testing Harness" /><published>2025-03-25T00:00:00+00:00</published><updated>2025-03-25T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2025/03/25/expression-problem-in-practice</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2025/03/25/expression-problem-in-practice/"><![CDATA[<p><em>This post is part of the <a href="/2024/12/01/model-based-stateful-testing-with-madhouse-rs/">Model-Based Stateful Testing with madhouse-rs</a> series.</em></p>

<p>We started this series with a production bug that couldn’t be reproduced. We end with a framework that can not only catch that bug but also fundamentally change how we think about testing complex systems. The journey reveals practical lessons about the <a href="https://en.wikipedia.org/wiki/Expression_problem">expression problem</a> that extend far beyond testing.</p>

<h2 id="the-design-that-emerged">The Design That Emerged</h2>

<p>Through trial and error, madhouse-rs converged on a simple but powerful architecture, as described in the <a href="https://github.com/moodmosaic/stacks-core/commit/1cb033c39947d8b6c999fcf68fca3009db2e3263">whitepaper commit</a>:</p>

<p>Each <code class="language-plaintext highlighter-rouge">Command</code> follows a predictable lifecycle:</p>
<ol>
  <li><strong>Generated</strong> by a proptest <code class="language-plaintext highlighter-rouge">Strategy</code></li>
  <li><strong>Validated</strong> via <code class="language-plaintext highlighter-rouge">check</code> against current state</li>
  <li><strong>Applied</strong> via <code class="language-plaintext highlighter-rouge">apply</code>, mutating both model and real system</li>
  <li><strong>Verified</strong> through assertions and postconditions</li>
</ol>
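
<p>The lifecycle above can be sketched without any dependencies. This is a simplified stand-in, not the madhouse-rs API: <code class="language-plaintext highlighter-rouge">SketchCommand</code>, <code class="language-plaintext highlighter-rouge">Counter</code>, and <code class="language-plaintext highlighter-rouge">run</code> are hypothetical names, generation is replaced by a fixed list (no proptest <code class="language-plaintext highlighter-rouge">Strategy</code>), and I’m assuming that a command whose <code class="language-plaintext highlighter-rouge">check</code> fails is simply skipped.</p>

```rust
/// Simplified stand-in for the command trait. The trait described in this
/// post also takes a context type and has a `build` method for generation.
trait SketchCommand<S> {
    fn check(&self, state: &S) -> bool;
    fn apply(&self, state: &mut S);
    fn label(&self) -> String;
}

/// Toy model state: a counter that must never exceed `limit`.
struct Counter {
    value: u32,
    limit: u32,
}

struct Increment;
struct Reset;

impl SketchCommand<Counter> for Increment {
    fn check(&self, s: &Counter) -> bool {
        s.value < s.limit // precondition: there is room to increment
    }
    fn apply(&self, s: &mut Counter) {
        s.value += 1;
        // Postcondition, verified right after the mutation.
        assert!(s.value <= s.limit, "counter exceeded limit");
    }
    fn label(&self) -> String {
        "INCREMENT".to_string()
    }
}

impl SketchCommand<Counter> for Reset {
    fn check(&self, s: &Counter) -> bool {
        s.value > 0 // resetting a zero counter is not interesting
    }
    fn apply(&self, s: &mut Counter) {
        s.value = 0;
    }
    fn label(&self) -> String {
        "RESET".to_string()
    }
}

/// Runs commands in order, skipping any whose precondition fails, and
/// returns the labels of the commands that were actually applied.
fn run(commands: &[Box<dyn SketchCommand<Counter>>], state: &mut Counter) -> Vec<String> {
    let mut executed = Vec::new();
    for cmd in commands {
        if cmd.check(state) {
            cmd.apply(state);
            executed.push(cmd.label());
        }
    }
    executed
}

fn main() {
    let commands: Vec<Box<dyn SketchCommand<Counter>>> = vec![
        Box::new(Reset),
        Box::new(Increment),
        Box::new(Increment),
        Box::new(Increment),
        Box::new(Reset),
    ];
    let mut state = Counter { value: 0, limit: 2 };
    println!("{:?}", run(&commands, &mut state));
}
```

<p>The returned labels are the trace: they record what actually ran, which is what you want to see when a sequence fails.</p>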

<h2 id="why-traits-won-over-enums">Why Traits Won Over Enums</h2>

<p>The contrast with proptest-state-machine is instructive. Consider how each approach handles a new test operation:</p>

<p><strong>Enum approach (proptest-state-machine):</strong></p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// 1. Add to the central enum (affects everyone).</span>
<span class="k">enum</span> <span class="n">SystemTransition</span> <span class="p">{</span>
    <span class="n">ExistingOp1</span><span class="p">,</span>
    <span class="n">ExistingOp2</span><span class="p">,</span>
    <span class="nf">NewOperation</span><span class="p">(</span><span class="n">NewOpData</span><span class="p">),</span> <span class="c1">// &lt;- New variant.</span>
<span class="p">}</span>

<span class="c1">// 2. Update the central apply function (affects everyone).</span>
<span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="n">state</span><span class="p">:</span> <span class="n">State</span><span class="p">,</span> <span class="n">transition</span><span class="p">:</span> <span class="n">SystemTransition</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">State</span> <span class="p">{</span>
    <span class="k">match</span> <span class="n">transition</span> <span class="p">{</span>
        <span class="nn">SystemTransition</span><span class="p">::</span><span class="n">ExistingOp1</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="cm">/* existing logic */</span> <span class="p">}</span>
        <span class="nn">SystemTransition</span><span class="p">::</span><span class="n">ExistingOp2</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="cm">/* existing logic */</span> <span class="p">}</span>
        <span class="nn">SystemTransition</span><span class="p">::</span><span class="nf">NewOperation</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="k">=&gt;</span> <span class="p">{</span> <span class="c1">// &lt;- New arm.</span>
            <span class="c1">// New logic scattered across this central function.</span>
        <span class="p">}</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="c1">// 3. Update the transitions function (affects everyone).</span>
<span class="k">fn</span> <span class="nf">transitions</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="n">BoxedStrategy</span><span class="o">&lt;</span><span class="n">SystemTransition</span><span class="o">&gt;</span> <span class="p">{</span>
    <span class="nd">prop_oneof!</span><span class="p">[</span>
        <span class="nf">existing_strategy_1</span><span class="p">(),</span>
        <span class="nf">existing_strategy_2</span><span class="p">(),</span>
        <span class="nf">new_operation_strategy</span><span class="p">(),</span> <span class="c1">// &lt;- New generator.</span>
    <span class="p">]</span><span class="nf">.boxed</span><span class="p">()</span>
<span class="p">}</span>
</code></pre></div></div>

<p><strong>Trait approach (madhouse-rs):</strong></p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Self-contained - zero impact on existing code.</span>
<span class="k">struct</span> <span class="n">NewOperationCommand</span> <span class="p">{</span>
    <span class="n">data</span><span class="p">:</span> <span class="n">NewOpData</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">SystemState</span><span class="p">,</span> <span class="n">SystemContext</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">NewOperationCommand</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">SystemState</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
        <span class="c1">// Preconditions logic here.</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">SystemState</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// Application logic here.</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
        <span class="nd">format!</span><span class="p">(</span><span class="s">"NEW_OPERATION({:?})"</span><span class="p">,</span> <span class="k">self</span><span class="py">.data</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">build</span><span class="p">(</span><span class="n">ctx</span><span class="p">:</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="n">SystemContext</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">impl</span> <span class="n">Strategy</span><span class="o">&lt;</span><span class="n">Value</span> <span class="o">=</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">SystemState</span><span class="p">,</span> <span class="n">SystemContext</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
        <span class="c1">// Generation strategy here.</span>
        <span class="nf">new_operation_strategy</span><span class="p">()</span>
            <span class="nf">.prop_map</span><span class="p">(|</span><span class="n">data</span><span class="p">|</span> <span class="nn">CommandWrapper</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">NewOperationCommand</span> <span class="p">{</span> <span class="n">data</span> <span class="p">}))</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The difference is profound: <strong>trait-based commands are autonomous</strong>. All logic—generation, preconditions, application, and labeling—lives in one place. No coordination required.</p>

<h2 id="real-world-scale-the-pox-4-experience">Real-World Scale: The PoX-4 Experience</h2>

<p>Before madhouse-rs, we applied these principles with <a href="https://radubahmata.com/">Radu Bahmata</a> to test the Proof-of-Transfer (PoX-4) consensus using TypeScript and <a href="https://github.com/dubzzz/fast-check">fast-check</a>. The harness grew to include 20+ command types, each testing different aspects of the staking protocol:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">StackStxCommand</code> - Lock STX tokens for stacking</li>
  <li><code class="language-plaintext highlighter-rouge">DelegateStxCommand</code> - Delegate stacking rights to a pool</li>
  <li><code class="language-plaintext highlighter-rouge">StackAggregationCommitCommand</code> - Commit aggregated stacking transactions</li>
  <li><code class="language-plaintext highlighter-rouge">RevokeDelegateStxCommand</code> - Revoke previously delegated stacking rights</li>
  <li><code class="language-plaintext highlighter-rouge">StackExtendCommand</code> - Extend an existing stacking commitment</li>
  <li><code class="language-plaintext highlighter-rouge">GetStackerInfoCommand</code> - Query stacker information and verify state</li>
  <li>… and many, <em>many</em> more.</li>
</ul>

<p>The key insight: <strong>each command class was self-contained</strong>. A developer could add <code class="language-plaintext highlighter-rouge">StackExtendCommand</code> without understanding the internals of <code class="language-plaintext highlighter-rouge">DelegateStxCommand</code>. The framework composed them automatically.</p>

<p>When a test failed after 200+ operations, the shrinking algorithm would reduce it to something like:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Original sequence: [200+ operations]
Shrunk to: [
    DelegateStx(account, pool),
    StackAggregationCommit(pool, account),
    RevokeDelegateStx(account),
    StackAggregationCommit(pool, account)
]
</code></pre></div></div>

<p>This four-step sequence revealed a subtle bug: revoking delegation didn’t properly invalidate pending aggregation commits. Finding this manually would have taken weeks.</p>
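
<p>To make the shrinking step concrete, here is a naive greedy reducer. It is not the algorithm fast-check or proptest uses; it only illustrates the idea of dropping steps while the failure still reproduces. The <code class="language-plaintext highlighter-rouge">Op</code> enum and the <code class="language-plaintext highlighter-rouge">fails</code> oracle are toy assumptions that mimic the shape of the trace above, not the real PoX-4 logic.</p>

```rust
#[derive(Clone, Debug, PartialEq)]
enum Op {
    Delegate,
    Commit,
    Revoke,
    Query, // noise: irrelevant to the toy bug
}

/// Toy failure oracle: the trace "fails" if it contains
/// Delegate, Commit, Revoke, Commit as an in-order subsequence.
fn fails(trace: &[Op]) -> bool {
    let pattern = [Op::Delegate, Op::Commit, Op::Revoke, Op::Commit];
    let mut matched = 0;
    for op in trace {
        if matched < pattern.len() && *op == pattern[matched] {
            matched += 1;
        }
    }
    matched == pattern.len()
}

/// Greedy reduction: repeatedly drop any single step whose removal still
/// leaves a failing trace, until a full pass removes nothing.
fn shrink<T: Clone>(trace: &[T], still_fails: impl Fn(&[T]) -> bool) -> Vec<T> {
    let mut current = trace.to_vec();
    loop {
        let before = current.len();
        let mut i = 0;
        while i < current.len() {
            let mut candidate = current.clone();
            candidate.remove(i);
            if still_fails(&candidate) {
                current = candidate; // step i was irrelevant; retry same index
            } else {
                i += 1; // step i is needed to reproduce the failure
            }
        }
        if current.len() == before {
            return current;
        }
    }
}

fn main() {
    use Op::*;
    let trace = vec![
        Query, Delegate, Query, Commit, Commit, Revoke, Query, Delegate, Commit, Query,
    ];
    assert!(fails(&trace));
    println!("{:?}", shrink(&trace, fails));
}
```

<p>Real shrinkers are smarter (they also shrink command arguments and remove chunks at a time), but the contract is the same: the output must still fail, and no single step can be removed from it.</p>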

<h2 id="lessons-for-system-design">Lessons for System Design</h2>

<p>The expression problem appears everywhere in software design, not just testing frameworks:</p>

<h3 id="1-plugin-architectures">1. <strong>Plugin Architectures</strong></h3>
<p>Want users to extend your system with new functionality? Choose the “data-open” side—make plugins implement traits rather than forcing them to modify central enums.</p>

<h3 id="2-event-systems">2. <strong>Event Systems</strong></h3>
<p>Need to handle dozens of event types? Each event type should be its own struct implementing an Event <em>trait</em>, not variants in a central enum.</p>

<h3 id="3-command-patterns">3. <strong>Command Patterns</strong></h3>
<p>Building a command-line tool with subcommands? Each subcommand should be its own type, not a variant in a central enum.</p>

<h3 id="4-middleware-systems">4. <strong>Middleware Systems</strong></h3>
<p>Web frameworks often choose the “data-open” side: each middleware is its own type implementing a common trait.</p>

<h2 id="the-cost-of-getting-it-wrong">The Cost of Getting It Wrong</h2>

<p>We’ve seen both sides of this trade-off in practice:</p>

<p><strong>When the enum approach breaks down:</strong></p>
<ul>
  <li>Central files become merge conflict magnets.</li>
  <li>Adding new variants requires understanding the entire system.</li>
  <li>Logic becomes scattered across multiple functions.</li>
  <li>New contributors face a high barrier to entry.</li>
</ul>

<p><strong>When the trait approach breaks down:</strong></p>
<ul>
  <li>Adding new operations to the trait forces updates everywhere.</li>
  <li>Abstract operations are harder to optimize.</li>
  <li>Dynamic dispatch can impact performance.</li>
  <li>Trait objects introduce complexity.</li>
</ul>

<p>For madhouse-rs, the trade-off was clear: we needed to add new test operations constantly, but the core operations (<code class="language-plaintext highlighter-rouge">check</code>, <code class="language-plaintext highlighter-rouge">apply</code>, <code class="language-plaintext highlighter-rouge">label</code>, <code class="language-plaintext highlighter-rouge">build</code>) were stable. The “data-open” choice was correct.</p>

<h2 id="performance-considerations">Performance Considerations</h2>

<p>One concern with trait-based approaches is performance. <code class="language-plaintext highlighter-rouge">CommandWrapper</code> uses <code class="language-plaintext highlighter-rouge">Arc&lt;dyn Command&lt;S, C&gt;&gt;</code>, which involves heap allocation and dynamic dispatch. In our testing scenarios, this overhead was negligible compared to the actual blockchain operations being tested.</p>
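
<p>For readers who have not used this pattern, a rough sketch of what an <code class="language-plaintext highlighter-rouge">Arc&lt;dyn ...&gt;</code> wrapper buys you. The names <code class="language-plaintext highlighter-rouge">Op</code>, <code class="language-plaintext highlighter-rouge">Wrapper</code>, <code class="language-plaintext highlighter-rouge">Add</code>, and <code class="language-plaintext highlighter-rouge">Double</code> are hypothetical, and this is not the actual <code class="language-plaintext highlighter-rouge">CommandWrapper</code> definition: it only shows one heap allocation per command, pointer-copy clones, and a virtual call per operation.</p>

```rust
use std::sync::Arc;

/// Minimal stand-in for a command trait (hypothetical, two methods only).
trait Op<S> {
    fn apply(&self, state: &mut S);
    fn label(&self) -> String;
}

/// Type-erased handle. Different command types share this one type,
/// so they can live together in a single Vec.
struct Wrapper<S> {
    inner: Arc<dyn Op<S>>,
}

impl<S> Wrapper<S> {
    fn new<C: Op<S> + 'static>(cmd: C) -> Self {
        Wrapper { inner: Arc::new(cmd) } // one heap allocation per command
    }
}

impl<S> Clone for Wrapper<S> {
    fn clone(&self) -> Self {
        // Copies the pointer and bumps a refcount; the command is not copied.
        Wrapper { inner: Arc::clone(&self.inner) }
    }
}

struct Add(i64);
struct Double;

impl Op<i64> for Add {
    fn apply(&self, s: &mut i64) {
        *s += self.0;
    }
    fn label(&self) -> String {
        format!("ADD({})", self.0)
    }
}

impl Op<i64> for Double {
    fn apply(&self, s: &mut i64) {
        *s *= 2;
    }
    fn label(&self) -> String {
        "DOUBLE".to_string()
    }
}

/// Applies each command through dynamic dispatch and collects the labels.
fn run_all(cmds: &[Wrapper<i64>], state: &mut i64) -> Vec<String> {
    let mut labels = Vec::new();
    for c in cmds {
        c.inner.apply(state);
        labels.push(c.inner.label());
    }
    labels
}

fn main() {
    let cmds = vec![Wrapper::new(Add(3)), Wrapper::new(Double)];
    let mut state = 1;
    println!("{:?} -> {}", run_all(&cmds, &mut state), state);
}
```

<p>Cheap clones matter here because a shrinker copies and re-runs command sequences many times; the per-call dispatch cost is small next to the work each command does against a real node.</p>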

<h2 id="the-full-circle">The Full Circle</h2>

<p>We began with a simple question: how do you design systems that are easy to extend? The expression problem provided the theoretical framework, but the real learning came from building systems that needed to scale.</p>

<p>The Stacks blockchain bug that started this journey taught us that <strong>complexity is the enemy of correctness</strong>. Traditional testing assumes you can predict where bugs hide. Model-based testing with madhouse-rs assumes you can’t—so it generates the chaos systematically.</p>

<p>The trait-based design made this scalable. Instead of a monolithic test harness that becomes unmaintainable, we have an ecosystem of autonomous commands that compose naturally.</p>

<h2 id="practical-takeaways">Practical Takeaways</h2>

<ol>
  <li>
    <p><strong>Choose your trade-off consciously</strong>: The expression problem forces a choice. Understanding the trade-off helps you pick the right tool.</p>
  </li>
  <li>
    <p><strong>Favor autonomy at scale</strong>: When systems grow large, autonomous components (traits) usually scale better than centralized ones (enums).</p>
  </li>
  <li>
    <p><strong>Let chaos find the bugs</strong>: For complex systems, generated test scenarios often find bugs that manual tests miss.</p>
  </li>
  <li>
    <p><strong>Design for shrinking</strong>: When random tests fail, automatic reduction to minimal cases is invaluable.</p>
  </li>
  <li>
    <p><strong>Start simple, then scale</strong>: Both approaches work for small systems. The difference emerges at scale.</p>
  </li>
</ol>

<p>The expression problem isn’t academic theory—it’s a practical design constraint that affects every system you build. Understanding it helps you make better architectural choices, whether you’re building testing frameworks, plugin systems, or distributed applications.</p>

<p>In the end, good design isn’t about avoiding trade-offs. It’s about making them consciously, understanding their implications, and choosing the ones that align with how your system needs to grow.</p>

<h2 id="references-and-further-reading">References and Further Reading</h2>

<p>The ideas in this series draw from decades of research and practice:</p>

<ol>
  <li><a href="http://homepages.inf.ed.ac.uk/wadler/papers/expression/expression.txt">Philip Wadler’s original expression problem</a></li>
  <li><a href="https://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/quick.pdf">QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs</a></li>
  <li><a href="https://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/quviq-testing.pdf">Experiences with QuickCheck: Testing the Hard Stuff and Staying Sane</a></li>
  <li><a href="https://www.quviq.com/documentation/eqc/eqc_statem.html">eqc_statem documentation</a></li>
  <li><a href="https://blog.nikosbaxevanis.com/2022/03/15/clarity-clarity-model-based-testing-primer/">Clarity Model-Based Testing Primer</a></li>
  <li><a href="https://github.com/hedgehogqa/haskell-hedgehog/blob/master/hedgehog/src/Hedgehog/Internal/State.hs">Hedgehog State.hs - Haskell stateful testing implementation</a></li>
  <li><a href="https://github.com/hedgehogqa/haskell-hedgehog/blob/master/hedgehog-example/src/Test/Example/References.hs">Hedgehog References.hs - Practical stateful testing example</a></li>
  <li><a href="https://github.com/stacks-network/stacks-core/blob/2caa9bfc057aa7885422e9d6b178be6776812c54/contrib/boot-contracts-stateful-prop-tests/tests/pox-4/pox_Commands.ts">PoX-4 Commands TypeScript - Original disjointed command implementation</a></li>
  <li><a href="https://github.com/stacks-network/stacks-core/pull/5691#discussion_r1916145655">The original GitHub comment that sparked madhouse-rs</a></li>
  <li><a href="https://github.com/stacks-network/stacks-core/pull/6007">The pull request where both traditional and madhouse-rs approaches reproduced the production bug</a></li>
</ol>

<hr />

<p><strong>Series Complete:</strong> <a href="/2024/12/01/model-based-stateful-testing-with-madhouse-rs/">Model-Based Stateful Testing with madhouse-rs</a> series.</p>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Part of the Expression Problem in Rust series, concluding with real-world lessons from building a trait-based testing framework that scales from dozens to hundreds of test operations.]]></summary></entry><entry><title type="html">Chaos Testing stacks-node with Model-Based Stateful Testing</title><link href="https://blog.nikosbaxevanis.com/2025/03/10/chaos-testing-stacks-node/" rel="alternate" type="text/html" title="Chaos Testing stacks-node with Model-Based Stateful Testing" /><published>2025-03-10T00:00:00+00:00</published><updated>2025-03-10T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2025/03/10/chaos-testing-stacks-node</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2025/03/10/chaos-testing-stacks-node/"><![CDATA[<p><em>This post is part of the <a href="/2024/12/01/model-based-stateful-testing-with-madhouse-rs/">Model-Based Stateful Testing with madhouse-rs</a> series.</em></p>

<p>Theory is useful, but does the trait-based design actually work in practice? Can it scale to test a real, complex distributed system? The answer came from an unexpected place: a production bug in the Stacks blockchain that refused to be reproduced.</p>

<h2 id="the-bug-that-couldnt-be-caught">The Bug That Couldn’t Be Caught</h2>

<p>In early 2024, Stacks mainnet experienced a stall. After a reorg, miners would occasionally fail to build on their own blocks, disrupting the consensus mechanism. The behavior was intermittent and seemed to depend on precise timing and network conditions.</p>

<p>Core developer <a href="https://github.com/stacks-network/stacks-core/pull/5691">Brice Dobry attempted to write a traditional test</a>—a masterfully crafted 533-line integration test with sophisticated setup and coordination:</p>

<p><em>“In this test, I attempted to reproduce the scenario we saw in mainnet, in which the miner mines a tenure change block in this reorg scenario, but then fails to mine another block building off of that one. I was unable to reproduce that behavior, but this still seems like a useful test to have.”</em></p>

<p>The test included complex manual orchestration:</p>
<ul>
  <li>Detailed miner setup with specific configuration.</li>
  <li>Manual transaction submission and timing coordination.</li>
  <li>Explicit waiting periods and state verification.</li>
  <li>Hundreds of lines of boilerplate setup code.</li>
</ul>

<p>Yet even with Brice’s expertise and this carefully crafted test, the production bug remained elusive. This wasn’t a reflection of the test quality—it highlighted just how subtle and context-dependent the bug was. That’s when a <a href="https://github.com/stacks-network/stacks-core/pull/5691#discussion_r1916145655">radical idea emerged</a>:</p>

<p><em>“We could shift to a command-based model test. Each step, like ‘miner commits block’ or ‘signer accepts block,’ becomes its own command that updates a small state model and the actual chain. Then we run random sequences of these commands to reveal hidden corners.”</em></p>

<p>This comment became the genesis of <a href="https://github.com/stacks-network/madhouse-rs">madhouse-rs</a>.</p>

<h2 id="from-idea-to-implementation">From Idea to Implementation</h2>

<p>The insight was profound: instead of trying to predict where bugs might hide, <strong>let chaos find them</strong>. Model the entire blockchain testing scenario as a collection of autonomous commands, then generate thousands of random sequences.</p>

<p>Here’s how it looked in practice. The test harness included commands like:</p>

<p><em>Note: The following examples are conceptual illustrations that demonstrate the core patterns. The actual implementation uses more complex blockchain-specific types and operations.</em></p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Each command encapsulates one blockchain operation.</span>
<span class="k">struct</span> <span class="n">MineBitcoinBlockCommand</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">block_height</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">StacksState</span><span class="p">,</span> <span class="n">StacksContext</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">MineBitcoinBlockCommand</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">StacksState</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
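        <span class="c1">// Only mine above the current tip, and not too far ahead of it.</span>
        <span class="k">self</span><span class="py">.block_height</span> <span class="o">&gt;</span> <span class="n">state</span><span class="py">.tip_height</span>
            <span class="o">&amp;&amp;</span> <span class="k">self</span><span class="py">.block_height</span> <span class="o">&lt;=</span> <span class="n">state</span><span class="py">.tip_height</span> <span class="o">+</span> <span class="mi">10</span>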
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">StacksState</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// Update both the model state and the actual blockchain.</span>
        <span class="n">state</span><span class="py">.bitcoin_blocks</span><span class="nf">.push</span><span class="p">(</span><span class="k">self</span><span class="py">.block_height</span><span class="p">);</span>
        <span class="n">state</span><span class="py">.tip_height</span> <span class="o">=</span> <span class="k">self</span><span class="py">.block_height</span><span class="p">;</span>

        <span class="c1">// Actual blockchain interaction.</span>
        <span class="nf">mine_bitcoin_block</span><span class="p">(</span><span class="k">self</span><span class="py">.block_height</span><span class="p">);</span>

        <span class="c1">// Verify post-conditions.</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="nf">get_bitcoin_tip_height</span><span class="p">(),</span> <span class="k">self</span><span class="py">.block_height</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
        <span class="nd">format!</span><span class="p">(</span><span class="s">"MINE_BITCOIN_BLOCK({})"</span><span class="p">,</span> <span class="k">self</span><span class="py">.block_height</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">build</span><span class="p">(</span><span class="n">ctx</span><span class="p">:</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="n">StacksContext</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">impl</span> <span class="n">Strategy</span><span class="o">&lt;</span><span class="n">Value</span> <span class="o">=</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">StacksState</span><span class="p">,</span> <span class="n">StacksContext</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
        <span class="p">(</span><span class="n">ctx</span><span class="py">.current_height</span><span class="o">..</span><span class="n">ctx</span><span class="py">.current_height</span> <span class="o">+</span> <span class="mi">5</span><span class="p">)</span>
            <span class="nf">.prop_map</span><span class="p">(|</span><span class="n">height</span><span class="p">|</span> <span class="nn">CommandWrapper</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">MineBitcoinBlockCommand</span> <span class="p">{</span> <span class="n">block_height</span><span class="p">:</span> <span class="n">height</span> <span class="p">}))</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">SubmitBlockCommitCommand</span> <span class="p">{</span>
    <span class="k">pub</span> <span class="n">miner_id</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="k">pub</span> <span class="n">bitcoin_block_height</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">StacksState</span><span class="p">,</span> <span class="n">StacksContext</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">SubmitBlockCommitCommand</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">StacksState</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
        <span class="c1">// Can only commit if the Bitcoin block exists.</span>
        <span class="n">state</span><span class="py">.bitcoin_blocks</span><span class="nf">.contains</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="py">.bitcoin_block_height</span><span class="p">)</span> <span class="o">&amp;&amp;</span>
        <span class="n">state</span><span class="py">.miners</span><span class="nf">.contains_key</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="py">.miner_id</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">StacksState</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// Track the commit in model state.</span>
        <span class="n">state</span><span class="py">.block_commits</span><span class="nf">.push</span><span class="p">(</span><span class="n">BlockCommit</span> <span class="p">{</span>
            <span class="n">miner</span><span class="p">:</span> <span class="k">self</span><span class="py">.miner_id</span><span class="p">,</span>
            <span class="n">bitcoin_height</span><span class="p">:</span> <span class="k">self</span><span class="py">.bitcoin_block_height</span><span class="p">,</span>
        <span class="p">});</span>

        <span class="c1">// Submit to actual blockchain.</span>
        <span class="nf">submit_block_commit</span><span class="p">(</span><span class="k">self</span><span class="py">.miner_id</span><span class="p">,</span> <span class="k">self</span><span class="py">.bitcoin_block_height</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
        <span class="nd">format!</span><span class="p">(</span><span class="s">"SUBMIT_COMMIT(miner={}, btc_height={})"</span><span class="p">,</span> <span class="k">self</span><span class="py">.miner_id</span><span class="p">,</span> <span class="k">self</span><span class="py">.bitcoin_block_height</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">build</span><span class="p">(</span><span class="n">ctx</span><span class="p">:</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="n">StacksContext</span><span class="o">&gt;</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="k">impl</span> <span class="n">Strategy</span><span class="o">&lt;</span><span class="n">Value</span> <span class="o">=</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">StacksState</span><span class="p">,</span> <span class="n">StacksContext</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
        <span class="k">let</span> <span class="n">miners</span> <span class="o">=</span> <span class="n">ctx</span><span class="py">.miners</span><span class="nf">.clone</span><span class="p">();</span>
        <span class="k">let</span> <span class="n">heights</span> <span class="o">=</span> <span class="n">ctx</span><span class="py">.available_bitcoin_heights</span><span class="nf">.clone</span><span class="p">();</span>

        <span class="p">(</span><span class="nn">prop</span><span class="p">::</span><span class="nn">sample</span><span class="p">::</span><span class="nf">select</span><span class="p">(</span><span class="n">miners</span><span class="p">),</span> <span class="nn">prop</span><span class="p">::</span><span class="nn">sample</span><span class="p">::</span><span class="nf">select</span><span class="p">(</span><span class="n">heights</span><span class="p">))</span>
            <span class="nf">.prop_map</span><span class="p">(|(</span><span class="n">miner</span><span class="p">,</span> <span class="n">height</span><span class="p">)|</span> <span class="p">{</span>
                <span class="nn">CommandWrapper</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">SubmitBlockCommitCommand</span> <span class="p">{</span>
                    <span class="n">miner_id</span><span class="p">:</span> <span class="n">miner</span><span class="p">,</span>
                    <span class="n">bitcoin_block_height</span><span class="p">:</span> <span class="n">height</span><span class="p">,</span>
                <span class="p">})</span>
            <span class="p">})</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="the-breakthrough-real-chaos-real-bugs">The Breakthrough: Real Chaos, Real Bugs</h2>

<p>The scenario that finally reproduced the production bug is shown below. It comes from <a href="https://github.com/stacks-network/stacks-core/pull/6007">the PR that fixed it</a>, where both Brice’s traditional script and the madhouse-rs approach reproduced the issue:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">scenario!</span><span class="p">[</span>
    <span class="n">test_context</span><span class="p">,</span>
    <span class="n">SkipCommitOpMiner2</span><span class="p">,</span>
    <span class="n">BootToEpoch3</span><span class="p">,</span>
    <span class="n">SkipCommitOpMiner1</span><span class="p">,</span>
    <span class="n">PauseStacksMining</span><span class="p">,</span>
    <span class="n">MineBitcoinBlock</span><span class="p">,</span>
    <span class="n">VerifyMiner1WonSortition</span><span class="p">,</span>
    <span class="n">SubmitBlockCommitMiner2</span><span class="p">,</span>
    <span class="n">ResumeStacksMining</span><span class="p">,</span>
    <span class="n">WaitForTenureChangeBlockFromMiner1</span><span class="p">,</span>
    <span class="n">MineBitcoinBlock</span><span class="p">,</span>
    <span class="n">VerifyMiner2WonSortition</span><span class="p">,</span>
    <span class="n">VerifyLastSortitionWinnerReorged</span><span class="p">,</span>
    <span class="n">WaitForTenureChangeBlockFromMiner2</span><span class="p">,</span>
    <span class="n">ShutdownMiners</span>
<span class="p">]</span>
</code></pre></div></div>

<p>But here’s the key: <strong>this wasn’t the only sequence tested</strong>. When run with <code class="language-plaintext highlighter-rouge">MADHOUSE=1</code>, the framework generated thousands of variations:</p>

<ul>
  <li>What if <code class="language-plaintext highlighter-rouge">MineBitcoinBlock</code> happened before <code class="language-plaintext highlighter-rouge">SubmitBlockCommitMiner2</code>?</li>
  <li>What if <code class="language-plaintext highlighter-rouge">PauseStacksMining</code> occurred at different points?</li>
  <li>What if multiple miners competed in different orders?</li>
</ul>

<p>One of these chaotic permutations finally triggered the exact conditions that caused the production bug. The test failed, and <strong>the framework automatically shrank</strong> the failing sequence to a minimal reproduction case.</p>

<h2 id="the-power-of-shrinking">The Power of Shrinking</h2>

<p>When <code class="language-plaintext highlighter-rouge">madhouse-rs</code> found a failing test scenario, it didn’t just report a 120-step chaos sequence. The framework systematically removed operations until it found the minimal case that still triggered the bug:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Original failing sequence: [120 operations...]
Shrunk to: [
    MineBitcoinBlock,
    DisconnectNode(node_2),
    SubmitBlockCommit(miner_1),
    ReconnectNode(node_2)
]
</code></pre></div></div>

<p>This minimal reproduction became the foundation for understanding and fixing the bug. What could have taken weeks of manual debugging was reduced to a four-step reproduction script.</p>
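
<p>To make the idea of “removing operations until the failure disappears” concrete, here is a minimal sketch of greedy shrinking. It is not the algorithm madhouse-rs or proptest uses; the <code class="language-plaintext highlighter-rouge">Step</code> enum, the <code class="language-plaintext highlighter-rouge">triggers_bug</code> predicate, and the <code class="language-plaintext highlighter-rouge">shrink</code> function are all hypothetical names invented for illustration.</p>

```rust
// Hypothetical step type; the real harness uses much richer commands.
#[derive(Clone, Debug, PartialEq)]
enum Step {
    Mine,
    Disconnect,
    Commit,
    Reconnect,
    Pause,
}

// Stand-in for "run the sequence against the system and see if it fails".
// Assumption: the bug fires when these four steps occur in this order,
// with anything else allowed in between.
fn triggers_bug(steps: &[Step]) -> bool {
    let pattern = [Step::Mine, Step::Disconnect, Step::Commit, Step::Reconnect];
    let mut next = 0;
    for step in steps {
        if next < pattern.len() && *step == pattern[next] {
            next += 1;
        }
    }
    next == pattern.len()
}

// Greedy shrinking: try dropping one step at a time and keep the shorter
// sequence whenever it still fails. Repeat until a full pass removes nothing.
fn shrink<T: Clone>(failing: &[T], still_fails: impl Fn(&[T]) -> bool) -> Vec<T> {
    let mut current = failing.to_vec();
    loop {
        let before = current.len();
        let mut i = 0;
        while i < current.len() {
            let mut candidate = current.clone();
            candidate.remove(i);
            if still_fails(&candidate) {
                // The step at `i` was not needed; retry the same index.
                current = candidate;
            } else {
                // The step at `i` is needed to trigger the failure.
                i += 1;
            }
        }
        if current.len() == before {
            return current;
        }
    }
}
```

<p>Calling <code class="language-plaintext highlighter-rouge">shrink(&amp;noisy, triggers_bug)</code> on a failing sequence padded with irrelevant steps leaves only the steps the predicate depends on. Real shrinkers are smarter about which removals to try first, but the contract is the same: every candidate is re-run, and only candidates that still fail are kept.</p>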

<h2 id="why-traditional-testing-failed">Why Traditional Testing Failed</h2>

<p>The bug existed at the intersection of:</p>
<ul>
  <li>Network timing (when nodes reconnected).</li>
  <li>Blockchain state (which blocks were mined when).</li>
  <li>Miner behavior (who submitted commits and when).</li>
</ul>

<p>Traditional integration tests assume you can predict these intersections. They script specific scenarios: “First do X, then Y, then Z.” But production bugs don’t follow scripts—they emerge from the <strong>unexpected combinations</strong> that nobody thought to test.</p>

<p>Model-based testing with <code class="language-plaintext highlighter-rouge">madhouse-rs</code> reverses this assumption: instead of predicting where bugs live, <strong>generate the combinations and let the bugs reveal themselves</strong>.</p>
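
<p>The generate-and-check loop can be sketched in a few lines. This is a toy under stated assumptions, not the madhouse-rs implementation: the <code class="language-plaintext highlighter-rouge">Op</code> trait, the <code class="language-plaintext highlighter-rouge">Model</code> struct, and the hand-rolled <code class="language-plaintext highlighter-rouge">XorShift</code> generator are hypothetical stand-ins (the real framework builds on proptest strategies), chosen so the example needs no external crates.</p>

```rust
// Toy model state (hypothetical, far simpler than a real chain model).
#[derive(Debug, Default)]
struct Model {
    tip_height: u64,
    commits: Vec<u64>,
}

// Minimal stand-in for a command trait: a precondition plus a state update.
trait Op {
    fn check(&self, model: &Model) -> bool;
    fn apply(&self, model: &mut Model);
    fn label(&self) -> String;
}

struct MineBlock;
struct SubmitCommit;

impl Op for MineBlock {
    fn check(&self, _model: &Model) -> bool {
        true
    }
    fn apply(&self, model: &mut Model) {
        model.tip_height += 1;
    }
    fn label(&self) -> String {
        "MINE_BLOCK".to_string()
    }
}

impl Op for SubmitCommit {
    // A commit needs at least one mined block to point at.
    fn check(&self, model: &Model) -> bool {
        model.tip_height > 0
    }
    fn apply(&self, model: &mut Model) {
        model.commits.push(model.tip_height);
    }
    fn label(&self) -> String {
        "SUBMIT_COMMIT".to_string()
    }
}

// Tiny xorshift generator so the sketch stays dependency-free.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

// Draw `len` random ops; apply those whose precondition holds, skip the rest.
// Returns the labels of the ops that actually ran.
fn run_random_sequence(seed: u64, len: usize, model: &mut Model) -> Vec<String> {
    let pool: Vec<Box<dyn Op>> = vec![Box::new(MineBlock), Box::new(SubmitCommit)];
    let mut rng = XorShift(seed.max(1)); // xorshift must not start at zero
    let mut executed = Vec::new();
    for _ in 0..len {
        let op = &pool[(rng.next_u64() % pool.len() as u64) as usize];
        if op.check(model) {
            op.apply(model);
            executed.push(op.label());
        }
    }
    executed
}
```

<p>Running <code class="language-plaintext highlighter-rouge">run_random_sequence</code> across many seeds explores orderings nobody scripted, while the <code class="language-plaintext highlighter-rouge">check</code> preconditions keep each generated sequence meaningful. The recorded labels are what make a failing run replayable.</p>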

<h2 id="the-technical-architecture">The Technical Architecture</h2>

<p>The success of this approach depended on the trait-based design:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">collections</span><span class="p">::</span><span class="n">HashMap</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">time</span><span class="p">::</span><span class="n">Duration</span><span class="p">;</span>

<span class="c1">// The state and context types that the commands above are written against.</span>
<span class="nd">#[derive(Debug,</span> <span class="nd">Default)]</span>
<span class="k">struct</span> <span class="n">StacksState</span> <span class="p">{</span>
    <span class="n">bitcoin_blocks</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="nb">u64</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">stacks_blocks</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">StacksBlock</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">miners</span><span class="p">:</span> <span class="n">HashMap</span><span class="o">&lt;</span><span class="nb">u32</span><span class="p">,</span> <span class="n">MinerState</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="n">network_partitions</span><span class="p">:</span> <span class="nb">Vec</span><span class="o">&lt;</span><span class="n">Partition</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="c1">// ... dozens more fields tracking blockchain state.</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">State</span> <span class="k">for</span> <span class="n">StacksState</span> <span class="p">{}</span>

<span class="c1">// Context with test parameters.</span>
<span class="nd">#[derive(Debug,</span> <span class="nd">Clone)]</span>
<span class="k">struct</span> <span class="n">StacksContext</span> <span class="p">{</span>
    <span class="n">num_miners</span><span class="p">:</span> <span class="nb">u32</span><span class="p">,</span>
    <span class="n">bitcoin_block_time</span><span class="p">:</span> <span class="n">Duration</span><span class="p">,</span>
    <span class="n">network_delay_range</span><span class="p">:</span> <span class="p">(</span><span class="n">Duration</span><span class="p">,</span> <span class="n">Duration</span><span class="p">),</span>
    <span class="c1">// ... configuration parameters.</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">TestContext</span> <span class="k">for</span> <span class="n">StacksContext</span> <span class="p">{}</span>
</code></pre></div></div>

<p>Each command was self-contained. Adding a new blockchain operation—like <code class="language-plaintext highlighter-rouge">NetworkPartitionCommand</code> or <code class="language-plaintext highlighter-rouge">RestartNodeCommand</code>—required zero changes to existing commands. The trait-based design made it possible to build a test harness with 50+ distinct operations, each developed and tested independently.</p>
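
<p>The “zero changes to existing commands” property is easy to see in a stripped-down sketch. Everything here is hypothetical and simplified for illustration: the <code class="language-plaintext highlighter-rouge">Command</code> trait below is a two-method stand-in rather than the madhouse-rs trait, and <code class="language-plaintext highlighter-rouge">NodeState</code>, <code class="language-plaintext highlighter-rouge">StartNode</code>, <code class="language-plaintext highlighter-rouge">PartitionNetwork</code>, and <code class="language-plaintext highlighter-rouge">RestartNode</code> are invented names.</p>

```rust
use std::fmt::Debug;

// Hypothetical model state for a single node.
#[derive(Debug, Default)]
struct NodeState {
    running: bool,
    partitions: u32,
}

// Simplified stand-in for a command trait: precondition plus state update.
trait Command: Debug {
    fn check(&self, state: &NodeState) -> bool;
    fn apply(&self, state: &mut NodeState);
}

// --- Existing commands -------------------------------------------------

#[derive(Debug)]
struct StartNode;

impl Command for StartNode {
    fn check(&self, state: &NodeState) -> bool {
        !state.running
    }
    fn apply(&self, state: &mut NodeState) {
        state.running = true;
    }
}

#[derive(Debug)]
struct PartitionNetwork;

impl Command for PartitionNetwork {
    fn check(&self, state: &NodeState) -> bool {
        state.running
    }
    fn apply(&self, state: &mut NodeState) {
        state.partitions += 1;
    }
}

// --- Added later: nothing above this line had to change ----------------

#[derive(Debug)]
struct RestartNode;

impl Command for RestartNode {
    fn check(&self, state: &NodeState) -> bool {
        state.running
    }
    // Assumption for this toy model: a restart heals all partitions.
    fn apply(&self, state: &mut NodeState) {
        state.partitions = 0;
    }
}

// The runner only knows the trait, so it never needs updating either.
// Returns how many commands passed their precondition and were applied.
fn run(commands: &[Box<dyn Command>], state: &mut NodeState) -> usize {
    let mut applied = 0;
    for command in commands {
        if command.check(state) {
            command.apply(state);
            applied += 1;
        }
    }
    applied
}
```

<p>With an enum-based design, adding <code class="language-plaintext highlighter-rouge">RestartNode</code> would mean a new variant plus a new arm in every <code class="language-plaintext highlighter-rouge">match</code> over the enum. Here the new type and its <code class="language-plaintext highlighter-rouge">impl</code> are the whole change.</p>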

<h2 id="the-real-world-impact">The Real-World Impact</h2>

<p>This wasn’t just an academic exercise. The chaos testing approach:</p>

<ol>
  <li><strong>Reproduced the production bug</strong> that the first hand-written integration test missed.</li>
  <li><strong>Provided a minimal reproduction</strong> case for debugging.</li>
  <li><strong>Validated the fix</strong> by running thousands of variations to ensure the bug was truly resolved.</li>
  <li><strong>Enabled ongoing regression testing</strong> with the same chaos generation.</li>
</ol>

<p>The framework runs in CI, continuously generating new chaotic scenarios to catch regressions before they reach production.</p>

<h2 id="from-chaos-to-confidence">From Chaos to Confidence</h2>

<p>The lesson isn’t that traditional testing is worthless—it’s that <strong>certain classes of bugs only emerge from chaos</strong>. Race conditions, timing issues, and complex state interactions hide in the combinations that manual tests never explore.</p>

<p>Model-based testing with madhouse-rs turns chaos into a systematic testing strategy. The trait-based design makes it sustainable at scale. The automatic shrinking makes failures actionable.</p>

<p>This is how we can move from <em>“I was unable to reproduce that behavior”</em> to reproducible test cases that catch bugs before they reach production.</p>

<hr />

<p><strong>Next:</strong> <a href="/2025/03/25/expression-problem-in-practice/">The Expression Problem in Practice: A Trait-Based Testing Harness</a></p>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Part of the Expression Problem in Rust series, applying madhouse-rs to reproduce a real production bug in the Stacks blockchain that traditional testing couldn't catch.]]></summary></entry><entry><title type="html">Scaling Model-Based Stateful Testing with madhouse-rs</title><link href="https://blog.nikosbaxevanis.com/2025/02/10/scaling-with-madhouse-rs/" rel="alternate" type="text/html" title="Scaling Model-Based Stateful Testing with madhouse-rs" /><published>2025-02-10T00:00:00+00:00</published><updated>2025-02-10T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2025/02/10/scaling-with-madhouse-rs</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2025/02/10/scaling-with-madhouse-rs/"><![CDATA[<p><em>This post is part of the <a href="/2024/12/01/model-based-stateful-testing-with-madhouse-rs/">Model-Based Stateful Testing with madhouse-rs</a> series.</em></p>

<p>In the <a href="/2025/01/10/state-machine-testing-proptest">previous post</a>, we saw how proptest-state-machine’s enum-based design becomes a bottleneck when scaling to hundreds of operations. What if there were a different approach, one that embraced the “data-open” side of the <a href="https://en.wikipedia.org/wiki/Expression_problem">expression problem</a>?</p>

<p><a href="https://github.com/stacks-network/madhouse-rs"><code class="language-plaintext highlighter-rouge">madhouse-rs</code></a> was born from this exact frustration. When we tried to reproduce that <a href="https://github.com/stacks-network/stacks-core/pull/5691">elusive Stacks mainnet bug</a>, the enum-based approach could not scale to the complexity the test needed.</p>

<h2 id="the-trait-based-approach">The Trait-Based Approach</h2>

<p>Instead of a central enum, <code class="language-plaintext highlighter-rouge">madhouse-rs</code> makes each command its own type implementing a stable <code class="language-plaintext highlighter-rouge">Command</code> trait. There is no central bottleneck—no enum to extend, no monolithic match statement to update.</p>

<p>Let’s return to our counter example from the previous post to see how this trait-based approach works in practice:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">madhouse</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">proptest</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">sync</span><span class="p">::</span><span class="nb">Arc</span><span class="p">;</span>

<span class="c1">// Define your state and context.</span>
<span class="nd">#[derive(Debug,</span> <span class="nd">Default)]</span>
<span class="k">struct</span> <span class="n">CounterState</span> <span class="p">{</span>
    <span class="n">value</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
    <span class="n">max_value</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">State</span> <span class="k">for</span> <span class="n">CounterState</span> <span class="p">{}</span>

<span class="nd">#[derive(Debug,</span> <span class="nd">Clone,</span> <span class="nd">Default)]</span>
<span class="k">struct</span> <span class="n">CounterContext</span> <span class="p">{</span>
    <span class="n">increment_range</span><span class="p">:</span> <span class="p">(</span><span class="nb">u64</span><span class="p">,</span> <span class="nb">u64</span><span class="p">),</span>
<span class="p">}</span>
<span class="k">impl</span> <span class="n">TestContext</span> <span class="k">for</span> <span class="n">CounterContext</span> <span class="p">{}</span>

<span class="c1">// Each operation is its own self-contained type.</span>
<span class="k">struct</span> <span class="n">IncrementCommand</span> <span class="p">{</span>
    <span class="n">amount</span><span class="p">:</span> <span class="nb">u64</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">CounterState</span><span class="p">,</span> <span class="n">CounterContext</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">IncrementCommand</span> <span class="p">{</span>
    <span class="c1">// Check preconditions against the model state.</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">CounterState</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
        <span class="n">state</span><span class="py">.value</span> <span class="o">+</span> <span class="k">self</span><span class="py">.amount</span> <span class="o">&lt;=</span> <span class="n">state</span><span class="py">.max_value</span>
    <span class="p">}</span>

    <span class="c1">// Apply the command to both model and real system.</span>
    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">CounterState</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span><span class="py">.value</span> <span class="o">+=</span> <span class="k">self</span><span class="py">.amount</span><span class="p">;</span>
        <span class="c1">// In a real test, you'd also apply to the actual system here.</span>
        <span class="nd">println!</span><span class="p">(</span><span class="s">"Incremented counter by {}, now at {}"</span><span class="p">,</span> <span class="k">self</span><span class="py">.amount</span><span class="p">,</span> <span class="n">state</span><span class="py">.value</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="c1">// Human-readable label for debugging.</span>
    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
        <span class="nd">format!</span><span class="p">(</span><span class="s">"INCREMENT({})"</span><span class="p">,</span> <span class="k">self</span><span class="py">.amount</span><span class="p">)</span>
    <span class="p">}</span>

    <span class="c1">// Strategy for generating instances of this command.</span>
    <span class="k">fn</span> <span class="nf">build</span><span class="p">(</span>
        <span class="n">ctx</span><span class="p">:</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="n">CounterContext</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="p">)</span> <span class="k">-&gt;</span> <span class="k">impl</span> <span class="n">Strategy</span><span class="o">&lt;</span><span class="n">Value</span> <span class="o">=</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">CounterState</span><span class="p">,</span> <span class="n">CounterContext</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
        <span class="k">let</span> <span class="p">(</span><span class="n">min</span><span class="p">,</span> <span class="n">max</span><span class="p">)</span> <span class="o">=</span> <span class="n">ctx</span><span class="py">.increment_range</span><span class="p">;</span>
        <span class="p">(</span><span class="n">min</span><span class="o">..=</span><span class="n">max</span><span class="p">)</span><span class="nf">.prop_map</span><span class="p">(|</span><span class="n">amount</span><span class="p">|</span> <span class="nn">CommandWrapper</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">IncrementCommand</span> <span class="p">{</span> <span class="n">amount</span> <span class="p">}))</span>
    <span class="p">}</span>
<span class="p">}</span>

<span class="k">struct</span> <span class="n">ResetCommand</span><span class="p">;</span>

<span class="k">impl</span> <span class="n">Command</span><span class="o">&lt;</span><span class="n">CounterState</span><span class="p">,</span> <span class="n">CounterContext</span><span class="o">&gt;</span> <span class="k">for</span> <span class="n">ResetCommand</span> <span class="p">{</span>
    <span class="k">fn</span> <span class="nf">check</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="n">CounterState</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">bool</span> <span class="p">{</span>
        <span class="n">state</span><span class="py">.value</span> <span class="o">&gt;</span> <span class="mi">0</span>  <span class="c1">// Only reset if there's something to reset.</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">,</span> <span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="n">CounterState</span><span class="p">)</span> <span class="p">{</span>
        <span class="n">state</span><span class="py">.value</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
        <span class="nd">println!</span><span class="p">(</span><span class="s">"Counter reset to 0"</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">label</span><span class="p">(</span><span class="o">&amp;</span><span class="k">self</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="nb">String</span> <span class="p">{</span>
        <span class="s">"RESET"</span><span class="nf">.to_string</span><span class="p">()</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">build</span><span class="p">(</span>
        <span class="n">_ctx</span><span class="p">:</span> <span class="nb">Arc</span><span class="o">&lt;</span><span class="n">CounterContext</span><span class="o">&gt;</span><span class="p">,</span>
    <span class="p">)</span> <span class="k">-&gt;</span> <span class="k">impl</span> <span class="n">Strategy</span><span class="o">&lt;</span><span class="n">Value</span> <span class="o">=</span> <span class="n">CommandWrapper</span><span class="o">&lt;</span><span class="n">CounterState</span><span class="p">,</span> <span class="n">CounterContext</span><span class="o">&gt;&gt;</span> <span class="p">{</span>
        <span class="nf">Just</span><span class="p">(</span><span class="nn">CommandWrapper</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">ResetCommand</span><span class="p">))</span>
    <span class="p">}</span>
<span class="p">}</span>

</code></pre></div></div>

<h2 id="running-the-scenario">Running the Scenario</h2>

<p>With <code class="language-plaintext highlighter-rouge">madhouse-rs</code>, you compose test scenarios using the <code class="language-plaintext highlighter-rouge">scenario!</code> macro:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">#[test]</span>
<span class="k">fn</span> <span class="nf">test_counter_chaos</span><span class="p">()</span> <span class="p">{</span>
    <span class="k">let</span> <span class="n">test_context</span> <span class="o">=</span> <span class="nn">Arc</span><span class="p">::</span><span class="nf">new</span><span class="p">(</span><span class="n">CounterContext</span> <span class="p">{</span>
        <span class="n">increment_range</span><span class="p">:</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">),</span>
    <span class="p">});</span>

    <span class="c1">// Run the scenario - madhouse-rs handles the rest.</span>
    <span class="nd">scenario!</span><span class="p">[</span>
        <span class="n">test_context</span><span class="p">,</span>
        <span class="n">IncrementCommand</span><span class="p">,</span>
        <span class="n">ResetCommand</span><span class="p">,</span>
        <span class="p">(</span><span class="n">IncrementCommand</span> <span class="p">{</span> <span class="n">amount</span><span class="p">:</span> <span class="mi">42</span> <span class="p">})</span>  <span class="c1">// Fixed command instance.</span>
    <span class="p">];</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="the-power-of-data-open-design">The Power of Data-Open Design</h2>

<p>What makes this approach scale? <strong>Each command is autonomous</strong>:</p>

<ul>
  <li><strong>Self-contained logic</strong>: Generation, preconditions, and application logic all live together.</li>
  <li><strong>No central bottleneck</strong>: Adding <code class="language-plaintext highlighter-rouge">DecrementCommand</code> requires zero edits to existing commands; you only list it in the scenarios that should use it.</li>
  <li><strong>Composable</strong>: Mix and match commands freely in different test scenarios.</li>
  <li><strong>Maintainable</strong>: Each command can be developed, tested, and reviewed independently.</li>
</ul>
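
<p>To make the “no central bottleneck” point concrete, here is a minimal, std-only sketch. It is not the madhouse-rs API: the <code class="language-plaintext highlighter-rouge">Command</code> trait below is a simplified, hypothetical stand-in (no <code class="language-plaintext highlighter-rouge">build</code> strategy, no context type), and the <code class="language-plaintext highlighter-rouge">u64</code> state field, the labels, the <code class="language-plaintext highlighter-rouge">amount</code> on <code class="language-plaintext highlighter-rouge">DecrementCommand</code>, and the <code class="language-plaintext highlighter-rouge">run</code> helper are assumptions made for illustration. The point is that <code class="language-plaintext highlighter-rouge">DecrementCommand</code> arrives as a new struct plus a new <code class="language-plaintext highlighter-rouge">impl</code>, and nothing above it changes.</p>

```rust
/// Hypothetical, simplified model state (an assumption for this sketch).
#[derive(Debug, Default)]
struct CounterState {
    value: u64,
}

/// Simplified stand-in for a madhouse-rs-style command trait.
/// The real trait also has `build` and a context type; both are omitted here.
trait Command {
    /// Precondition: may this command run in the current state?
    fn check(&self, state: &CounterState) -> bool;
    /// Mutates the state; only called when `check` returned true.
    fn apply(&self, state: &mut CounterState);
    /// Human-readable name for logs and traces.
    fn label(&self) -> String;
}

struct IncrementCommand {
    amount: u64,
}

impl Command for IncrementCommand {
    fn check(&self, _state: &CounterState) -> bool {
        true
    }
    fn apply(&self, state: &mut CounterState) {
        state.value += self.amount;
    }
    fn label(&self) -> String {
        format!("INCREMENT({})", self.amount)
    }
}

// New command: a new struct and a new impl. No existing code is edited.
struct DecrementCommand {
    amount: u64,
}

impl Command for DecrementCommand {
    fn check(&self, state: &CounterState) -> bool {
        // Only decrement when the counter would not underflow.
        state.value >= self.amount
    }
    fn apply(&self, state: &mut CounterState) {
        state.value -= self.amount;
    }
    fn label(&self) -> String {
        format!("DECREMENT({})", self.amount)
    }
}

/// Runs commands in order, skipping any whose precondition fails.
/// Returns the labels of the commands that actually executed.
fn run(commands: &[Box<dyn Command>], state: &mut CounterState) -> Vec<String> {
    let mut executed = Vec::new();
    for cmd in commands {
        if cmd.check(state) {
            cmd.apply(state);
            executed.push(cmd.label());
        }
    }
    executed
}
```

<p>Starting from zero, running a decrement of 1, an increment of 5, and a decrement of 2 skips the first command (its precondition fails) and leaves the counter at 3.</p>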

<h2 id="real-world-impact">Real-World Impact</h2>

<p><strong>Update (June 14, 2025):</strong> This design has proved its worth in testing the Stacks blockchain. Consider this test scenario from <a href="https://github.com/stacks-network/stacks-core/pull/6007">stacks-core PR #6007</a>, merged the day before this update:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nd">scenario!</span><span class="p">[</span>
    <span class="n">test_context</span><span class="p">,</span>
    <span class="n">SkipCommitOpMiner2</span><span class="p">,</span>
    <span class="n">BootToEpoch3</span><span class="p">,</span>
    <span class="n">SkipCommitOpMiner1</span><span class="p">,</span>
    <span class="n">PauseStacksMining</span><span class="p">,</span>
    <span class="n">MineBitcoinBlock</span><span class="p">,</span>
    <span class="n">VerifyMiner1WonSortition</span><span class="p">,</span>
    <span class="n">SubmitBlockCommitMiner2</span><span class="p">,</span>
    <span class="n">ResumeStacksMining</span><span class="p">,</span>
    <span class="n">WaitForTenureChangeBlockFromMiner1</span><span class="p">,</span>
    <span class="n">MineBitcoinBlock</span><span class="p">,</span>
    <span class="n">VerifyMiner2WonSortition</span><span class="p">,</span>
    <span class="n">VerifyLastSortitionWinnerReorged</span><span class="p">,</span>
    <span class="n">WaitForTenureChangeBlockFromMiner2</span><span class="p">,</span>
    <span class="n">ShutdownMiners</span>
<span class="p">]</span>
</code></pre></div></div>

<p>Each of those 14 steps (13 distinct commands, since <code class="language-plaintext highlighter-rouge">MineBitcoinBlock</code> appears twice) is a self-contained <code class="language-plaintext highlighter-rouge">Command</code> implementation. No central enum to maintain. No monolithic match statement. No coordination between developers adding new test operations.</p>

<p>More importantly, when the framework runs with <code class="language-plaintext highlighter-rouge">MADHOUSE=1</code>, it generates <strong>random permutations</strong> of these operations, creating chaotic scenarios that manual tests could never explore. This is how the framework can reproduce production bugs that traditional testing might miss.</p>
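
<p>The general idea behind randomized sequencing can be sketched without any dependencies. This is not how madhouse-rs is implemented: the xorshift generator, the <code class="language-plaintext highlighter-rouge">run_random</code> helper, and the simplified <code class="language-plaintext highlighter-rouge">Command</code> trait are all hypothetical. The sketch only shows the shape of the loop: pick a command at random, run it if its precondition holds, and record a trace so that a given seed replays the same sequence.</p>

```rust
/// Hypothetical, simplified model state (an assumption for this sketch).
#[derive(Clone, Debug, Default, PartialEq)]
struct CounterState {
    value: u64,
}

/// Simplified stand-in for a command trait; not the madhouse-rs trait.
trait Command {
    fn check(&self, state: &CounterState) -> bool;
    fn apply(&self, state: &mut CounterState);
    fn label(&self) -> String;
}

struct Increment;

impl Command for Increment {
    fn check(&self, _state: &CounterState) -> bool {
        true
    }
    fn apply(&self, state: &mut CounterState) {
        state.value += 1;
    }
    fn label(&self) -> String {
        "INCREMENT".to_string()
    }
}

struct Reset;

impl Command for Reset {
    fn check(&self, state: &CounterState) -> bool {
        // Only reset if there is something to reset.
        state.value > 0
    }
    fn apply(&self, state: &mut CounterState) {
        state.value = 0;
    }
    fn label(&self) -> String {
        "RESET".to_string()
    }
}

/// Tiny xorshift PRNG so the sketch needs no external crate.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Picks `steps` commands at random from `pool`; a command runs only if its
/// precondition holds. Returns the final state and the executed labels.
fn run_random(
    pool: &[Box<dyn Command>],
    seed: u64,
    steps: usize,
) -> (CounterState, Vec<String>) {
    let mut state = CounterState::default();
    let mut trace = Vec::new();
    if pool.is_empty() {
        return (state, trace);
    }
    // Xorshift must not start from 0, so clamp the seed to at least 1.
    let mut rng = XorShift64(seed.max(1));
    for _ in 0..steps {
        let idx = (rng.next_u64() % pool.len() as u64) as usize;
        let cmd = &pool[idx];
        if cmd.check(&state) {
            cmd.apply(&mut state);
            trace.push(cmd.label());
        }
    }
    (state, trace)
}
```

<p>Because the seed fully determines the sequence, a failing run can be replayed exactly. The precondition does the rest: in this sketch a reset can never be the first executed command, and two resets can never run back to back.</p>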

<h2 id="the-expression-problem-solved">The Expression Problem Solved</h2>

<p>By choosing the “data-open” side, madhouse-rs makes it trivial to add new command types while keeping the core operations (<code class="language-plaintext highlighter-rouge">check</code>, <code class="language-plaintext highlighter-rouge">apply</code>, <code class="language-plaintext highlighter-rouge">label</code>, <code class="language-plaintext highlighter-rouge">build</code>) stable. The cost is that adding a new required operation to the trait would touch every command. This is exactly the opposite trade-off from proptest-state-machine, and for model-based testing at scale, it’s the right choice.</p>

<hr />

<p><strong>Next:</strong> <a href="/2025/03/10/chaos-testing-stacks-node/">Chaos Testing stacks-node with Model-Based Stateful Testing</a></p>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Part of the Expression Problem in Rust series, showing how madhouse-rs uses a trait-based design to escape the enum bottleneck.]]></summary></entry><entry><title type="html">Model-Based Stateful Testing with proptest-state-machine</title><link href="https://blog.nikosbaxevanis.com/2025/01/10/state-machine-testing-proptest/" rel="alternate" type="text/html" title="Model-Based Stateful Testing with proptest-state-machine" /><published>2025-01-10T00:00:00+00:00</published><updated>2025-01-10T00:00:00+00:00</updated><id>https://blog.nikosbaxevanis.com/2025/01/10/state-machine-testing-proptest</id><content type="html" xml:base="https://blog.nikosbaxevanis.com/2025/01/10/state-machine-testing-proptest/"><![CDATA[<p><em>This post is part of the <a href="/2024/12/01/model-based-stateful-testing-with-madhouse-rs/">Model-Based Stateful Testing with madhouse-rs</a> series.</em></p>

<p>Imagine trying to test a distributed system with dozens of operations, such as:</p>
<ul>
  <li>miners submitting blocks</li>
  <li>nodes joining and leaving</li>
  <li>transactions flooding the mempool</li>
  <li>network partitions</li>
</ul>

<p>Traditional unit tests can’t capture the chaotic, interleaved nature of these scenarios.</p>

<p><a href="https://en.wikipedia.org/wiki/Model-based_testing">Model-based testing</a> offers a solution: define all possible operations, let the framework generate random sequences, and check that your system behaves correctly. <em>But</em>… how you structure those operations determines whether your test harness scales from 5 commands to 500.</p>

<p><a href="https://github.com/proptest-rs/proptest/tree/main/proptest-state-machine">proptest-state-machine</a> sits firmly on the <strong>operations-open, data-closed</strong> side of the <a href="https://en.wikipedia.org/wiki/Expression_problem">expression problem</a>. You define a central <a href="https://proptest-rs.github.io/proptest/proptest/state-machine.html"><code class="language-plaintext highlighter-rouge">Transition</code></a> enum that lists every possible operation.</p>

<h2 id="the-enum-approach">The Enum Approach</h2>

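<p>Let’s start with a simple counter. Real-world blockchain scenarios are too complex to demonstrate up front, but the core design patterns are identical; we’ll get to the blockchain applications later in the series. The listing condenses the crate’s API into a single <code class="language-plaintext highlighter-rouge">StateMachine</code> trait for brevity: the published crate splits it into <code class="language-plaintext highlighter-rouge">ReferenceStateMachine</code> (the model) and <code class="language-plaintext highlighter-rouge">StateMachineTest</code> (the system under test).</p>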

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">proptest_state_machine</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="k">use</span> <span class="nn">proptest</span><span class="p">::</span><span class="nn">prelude</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>

<span class="c1">// The model state.</span>
<span class="nd">#[derive(Clone,</span> <span class="nd">Debug)]</span>
<span class="k">struct</span> <span class="n">CounterModel</span> <span class="p">{</span>
    <span class="n">value</span><span class="p">:</span> <span class="nb">i32</span><span class="p">,</span>
    <span class="n">max_value</span><span class="p">:</span> <span class="nb">i32</span><span class="p">,</span>
<span class="p">}</span>

<span class="c1">// The real system under test.</span>
<span class="k">struct</span> <span class="n">Counter</span> <span class="p">{</span>
    <span class="n">value</span><span class="p">:</span> <span class="nb">i32</span><span class="p">,</span>
    <span class="n">max_value</span><span class="p">:</span> <span class="nb">i32</span><span class="p">,</span>
<span class="p">}</span>

<span class="c1">// All possible operations in one enum.</span>
<span class="nd">#[derive(Clone,</span> <span class="nd">Debug)]</span>
<span class="k">enum</span> <span class="n">CounterTransition</span> <span class="p">{</span>
    <span class="n">Inc</span><span class="p">,</span>
    <span class="n">Dec</span><span class="p">,</span>
    <span class="n">Reset</span><span class="p">,</span>
<span class="p">}</span>

<span class="k">impl</span> <span class="n">StateMachine</span> <span class="k">for</span> <span class="n">CounterModel</span> <span class="p">{</span>
    <span class="k">type</span> <span class="n">State</span> <span class="o">=</span> <span class="n">CounterModel</span><span class="p">;</span>
    <span class="k">type</span> <span class="n">Sut</span> <span class="o">=</span> <span class="n">Counter</span><span class="p">;</span>
    <span class="k">type</span> <span class="n">Transition</span> <span class="o">=</span> <span class="n">CounterTransition</span><span class="p">;</span>

    <span class="k">fn</span> <span class="nf">init_state</span><span class="p">()</span> <span class="k">-&gt;</span> <span class="n">BoxedStrategy</span><span class="o">&lt;</span><span class="k">Self</span><span class="p">::</span><span class="n">State</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="nf">Just</span><span class="p">(</span><span class="n">CounterModel</span> <span class="p">{</span> <span class="n">value</span><span class="p">:</span> <span class="mi">0</span><span class="p">,</span> <span class="n">max_value</span><span class="p">:</span> <span class="mi">100</span> <span class="p">})</span><span class="nf">.boxed</span><span class="p">()</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">init_sut</span><span class="p">(</span><span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">Self</span><span class="p">::</span><span class="n">State</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">BoxedStrategy</span><span class="o">&lt;</span><span class="k">Self</span><span class="p">::</span><span class="n">Sut</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="nf">Just</span><span class="p">(</span><span class="n">Counter</span> <span class="p">{</span>
            <span class="n">value</span><span class="p">:</span> <span class="n">state</span><span class="py">.value</span><span class="p">,</span>
            <span class="n">max_value</span><span class="p">:</span> <span class="n">state</span><span class="py">.max_value</span>
        <span class="p">})</span><span class="nf">.boxed</span><span class="p">()</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">transitions</span><span class="p">(</span><span class="n">_state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">Self</span><span class="p">::</span><span class="n">State</span><span class="p">)</span> <span class="k">-&gt;</span> <span class="n">BoxedStrategy</span><span class="o">&lt;</span><span class="k">Self</span><span class="p">::</span><span class="n">Transition</span><span class="o">&gt;</span> <span class="p">{</span>
        <span class="nd">prop_oneof!</span><span class="p">[</span>
            <span class="nf">Just</span><span class="p">(</span><span class="nn">CounterTransition</span><span class="p">::</span><span class="n">Inc</span><span class="p">),</span>
            <span class="nf">Just</span><span class="p">(</span><span class="nn">CounterTransition</span><span class="p">::</span><span class="n">Dec</span><span class="p">),</span>
            <span class="nf">Just</span><span class="p">(</span><span class="nn">CounterTransition</span><span class="p">::</span><span class="n">Reset</span><span class="p">),</span>
        <span class="p">]</span><span class="nf">.boxed</span><span class="p">()</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">apply</span><span class="p">(</span>
        <span class="k">mut</span> <span class="n">state</span><span class="p">:</span> <span class="k">Self</span><span class="p">::</span><span class="n">State</span><span class="p">,</span>
        <span class="n">sut</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">mut</span> <span class="k">Self</span><span class="p">::</span><span class="n">Sut</span><span class="p">,</span>
        <span class="n">transition</span><span class="p">:</span> <span class="k">Self</span><span class="p">::</span><span class="n">Transition</span><span class="p">,</span>
    <span class="p">)</span> <span class="k">-&gt;</span> <span class="k">Self</span><span class="p">::</span><span class="n">State</span> <span class="p">{</span>
        <span class="k">match</span> <span class="n">transition</span> <span class="p">{</span>
            <span class="nn">CounterTransition</span><span class="p">::</span><span class="n">Inc</span> <span class="k">=&gt;</span> <span class="p">{</span>
                <span class="k">if</span> <span class="n">state</span><span class="py">.value</span> <span class="o">&lt;</span> <span class="n">state</span><span class="py">.max_value</span> <span class="p">{</span>
                    <span class="n">state</span><span class="py">.value</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
                    <span class="n">sut</span><span class="py">.value</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="p">}</span>
            <span class="p">}</span>
            <span class="nn">CounterTransition</span><span class="p">::</span><span class="n">Dec</span> <span class="k">=&gt;</span> <span class="p">{</span>
                <span class="n">state</span><span class="py">.value</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span>
                <span class="n">sut</span><span class="py">.value</span> <span class="o">-=</span> <span class="mi">1</span><span class="p">;</span>
            <span class="p">}</span>
            <span class="nn">CounterTransition</span><span class="p">::</span><span class="n">Reset</span> <span class="k">=&gt;</span> <span class="p">{</span>
                <span class="n">state</span><span class="py">.value</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
                <span class="n">sut</span><span class="py">.value</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
            <span class="p">}</span>
        <span class="p">}</span>
        <span class="n">state</span>
    <span class="p">}</span>

    <span class="k">fn</span> <span class="nf">postconditions</span><span class="p">(</span><span class="n">state</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">Self</span><span class="p">::</span><span class="n">State</span><span class="p">,</span> <span class="n">sut</span><span class="p">:</span> <span class="o">&amp;</span><span class="k">Self</span><span class="p">::</span><span class="n">Sut</span><span class="p">)</span> <span class="p">{</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="n">state</span><span class="py">.value</span><span class="p">,</span> <span class="n">sut</span><span class="py">.value</span><span class="p">);</span>
        <span class="nd">assert_eq!</span><span class="p">(</span><span class="n">state</span><span class="py">.max_value</span><span class="p">,</span> <span class="n">sut</span><span class="py">.max_value</span><span class="p">);</span>
        <span class="nd">assert!</span><span class="p">(</span><span class="n">state</span><span class="py">.value</span> <span class="o">&lt;=</span> <span class="n">state</span><span class="py">.max_value</span><span class="p">);</span>
    <span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="the-scalability-problem">The Scalability Problem</h2>

<p>This approach starts simple, but every new operation requires:</p>

<ol>
  <li><strong>Adding a variant</strong> to the central <code class="language-plaintext highlighter-rouge">CounterTransition</code> enum.</li>
  <li><strong>Updating the <code class="language-plaintext highlighter-rouge">apply</code> function</strong> with a new match arm.</li>
  <li><strong>Updating the <code class="language-plaintext highlighter-rouge">transitions</code> function</strong> to include the new operation.</li>
</ol>

<p>With 10 operations, this is manageable. With 100+ operations—like testing a blockchain node—it becomes unwieldy. The <code class="language-plaintext highlighter-rouge">apply</code> function grows into a monolithic match statement. Every developer adding a test command must touch this central file.</p>
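
<p>The three touch points are easy to see in a std-only sketch. This is not proptest-state-machine code: <code class="language-plaintext highlighter-rouge">all_transitions</code> stands in for the <code class="language-plaintext highlighter-rouge">transitions</code> strategy, the free <code class="language-plaintext highlighter-rouge">apply</code> function works on a bare <code class="language-plaintext highlighter-rouge">i32</code>, and the <code class="language-plaintext highlighter-rouge">Double</code> operation is a hypothetical addition. Each numbered comment marks a place in shared code that had to change.</p>

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum CounterTransition {
    Inc,
    Dec,
    Reset,
    Double, // (1) New variant in the central enum.
}

/// Stand-in for the `transitions` strategy: every operation must be listed.
fn all_transitions() -> Vec<CounterTransition> {
    vec![
        CounterTransition::Inc,
        CounterTransition::Dec,
        CounterTransition::Reset,
        CounterTransition::Double, // (3) Nothing forces this list to stay in sync.
    ]
}

/// Central transition function: one match arm per operation.
fn apply(value: i32, max_value: i32, transition: CounterTransition) -> i32 {
    match transition {
        CounterTransition::Inc => {
            if value < max_value {
                value + 1
            } else {
                value
            }
        }
        CounterTransition::Dec => value - 1,
        CounterTransition::Reset => 0,
        // (2) New match arm; the build fails until it exists.
        CounterTransition::Double => {
            let doubled = value.saturating_mul(2);
            if doubled <= max_value {
                doubled
            } else {
                value
            }
        }
    }
}
```

<p>The compiler’s exhaustiveness check catches a missing match arm, but a variant left out of the generator list compiles fine and is simply never exercised.</p>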

<h2 id="real-world-complexity">Real-World Complexity</h2>

<p>Consider testing the Stacks blockchain, where operations include:</p>

<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="n">StacksTransition</span> <span class="p">{</span>
    <span class="n">MineBitcoinBlock</span><span class="p">,</span>
    <span class="nf">MineStacksBlock</span><span class="p">(</span><span class="n">BlockData</span><span class="p">),</span>
    <span class="nf">SubmitTransaction</span><span class="p">(</span><span class="n">Transaction</span><span class="p">),</span>
    <span class="nf">ConnectPeer</span><span class="p">(</span><span class="n">PeerInfo</span><span class="p">),</span>
    <span class="nf">DisconnectPeer</span><span class="p">(</span><span class="n">PeerId</span><span class="p">),</span>
    <span class="nf">NetworkPartition</span><span class="p">(</span><span class="nb">Vec</span><span class="o">&lt;</span><span class="n">NodeId</span><span class="o">&gt;</span><span class="p">),</span>
    <span class="n">RestoreNetwork</span><span class="p">,</span>
    <span class="nf">RestartNode</span><span class="p">(</span><span class="n">NodeId</span><span class="p">),</span>
    <span class="c1">// ... 50+ more operations.</span>
<span class="p">}</span>
</code></pre></div></div>

<p>The <code class="language-plaintext highlighter-rouge">apply</code> function becomes hundreds of lines. Adding new test scenarios requires editing this central bottleneck. Worse, the logic for each command is scattered—generation in <code class="language-plaintext highlighter-rouge">transitions</code>, preconditions mixed into <code class="language-plaintext highlighter-rouge">apply</code>, and postconditions in a separate function.</p>

<p>This scaling limitation matters in complex scenarios like reproducing a <a href="https://github.com/stacks-network/stacks-core/pull/5691">Stacks mainnet bug</a>. That attempt used hand-written tests, but an enum-based harness would have hit the bottleneck described above: dozens of blockchain operations packed into one central enum and one <code class="language-plaintext highlighter-rouge">apply</code> function, which makes it hard to generate the chaotic scenarios needed to reveal subtle consensus issues.</p>

<hr />

<p><strong>Next:</strong> <a href="/2025/02/10/scaling-with-madhouse-rs/">Scaling Model-Based Stateful Testing with madhouse-rs</a></p>]]></content><author><name>Nikos Baxevanis</name><email>nikos.baxevanis@gmail.com</email></author><summary type="html"><![CDATA[Part of the Expression Problem in Rust series, demonstrating how proptest-state-machine's enum-based design becomes a bottleneck at scale.]]></summary></entry></feed>