This post is part of the Model-Based Stateful Testing with madhouse-rs series.
Theory is useful, but does the trait-based design actually work in practice? Can it scale to test a real, complex distributed system? The answer came from an unexpected place: a production bug in the Stacks blockchain that refused to be reproduced.
In early 2024, Stacks mainnet experienced a stall. After a reorg, miners would occasionally fail to build on their own blocks, disrupting the consensus mechanism. The behavior was intermittent and seemed to depend on precise timing and network conditions.
Core developer Brice Dobry first tried the traditional route, writing a carefully constructed 533-line integration test with sophisticated setup and coordination:
“In this test, I attempted to reproduce the scenario we saw in mainnet, in which the miner mines a tenure change block in this reorg scenario, but then fails to mine another block building off of that one. I was unable to reproduce that behavior, but this still seems like a useful test to have.”
The test involved complex manual orchestration of miners, block production, and timing.
Yet even with Brice’s expertise and this carefully crafted test, the production bug remained elusive. This wasn’t a reflection of the test quality—it highlighted just how subtle and context-dependent the bug was. That’s when a radical idea emerged:
“We could shift to a command-based model test. Each step, like ‘miner commits block’ or ‘signer accepts block,’ becomes its own command that updates a small state model and the actual chain. Then we run random sequences of these commands to reveal hidden corners.”
This comment became the genesis of madhouse-rs.
The insight was profound: instead of trying to predict where bugs might hide, let chaos find them. Model the entire blockchain testing scenario as a collection of autonomous commands, then generate thousands of random sequences.
Here’s how it looked in practice. The test harness included commands like:
Note: The following examples are conceptual illustrations that demonstrate the core patterns. The actual implementation uses more complex blockchain-specific types and operations.
```rust
// Each command encapsulates one blockchain operation.
struct MineBitcoinBlockCommand {
    pub block_height: u64,
}

impl Command<StacksState, StacksContext> for MineBitcoinBlockCommand {
    fn check(&self, state: &StacksState) -> bool {
        // Only mine if we're not too far ahead.
        self.block_height <= state.tip_height + 10
    }

    fn apply(&self, state: &mut StacksState) {
        // Update both the model state and the actual blockchain.
        state.bitcoin_blocks.push(self.block_height);
        state.tip_height = self.block_height;

        // Actual blockchain interaction.
        mine_bitcoin_block(self.block_height);

        // Verify post-conditions.
        assert_eq!(get_bitcoin_tip_height(), self.block_height);
    }

    fn label(&self) -> String {
        format!("MINE_BITCOIN_BLOCK({})", self.block_height)
    }

    fn build(
        ctx: Arc<StacksContext>,
    ) -> impl Strategy<Value = CommandWrapper<StacksState, StacksContext>> {
        (ctx.current_height..ctx.current_height + 5)
            .prop_map(|height| CommandWrapper::new(MineBitcoinBlockCommand { block_height: height }))
    }
}
```
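The split between `check` and `apply` is what makes random sequencing workable: `check` guards preconditions, so a generated command that doesn't fit the current state is simply skipped rather than failing spuriously, while `apply` advances the model and the real chain together and asserts that they agree. A second command follows the same shape: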
```rust
struct SubmitBlockCommitCommand {
    pub miner_id: u32,
    pub bitcoin_block_height: u64,
}

impl Command<StacksState, StacksContext> for SubmitBlockCommitCommand {
    fn check(&self, state: &StacksState) -> bool {
        // Can only commit if the Bitcoin block exists and the miner is known.
        state.bitcoin_blocks.contains(&self.bitcoin_block_height)
            && state.miners.contains_key(&self.miner_id)
    }

    fn apply(&self, state: &mut StacksState) {
        // Track the commit in model state.
        state.block_commits.push(BlockCommit {
            miner: self.miner_id,
            bitcoin_height: self.bitcoin_block_height,
        });

        // Submit to the actual blockchain.
        submit_block_commit(self.miner_id, self.bitcoin_block_height);
    }

    fn label(&self) -> String {
        format!(
            "SUBMIT_COMMIT(miner={}, btc_height={})",
            self.miner_id, self.bitcoin_block_height
        )
    }

    fn build(
        ctx: Arc<StacksContext>,
    ) -> impl Strategy<Value = CommandWrapper<StacksState, StacksContext>> {
        let miners = ctx.miners.clone();
        let heights = ctx.available_bitcoin_heights.clone();
        (prop::sample::select(miners), prop::sample::select(heights)).prop_map(
            |(miner, height)| {
                CommandWrapper::new(SubmitBlockCommitCommand {
                    miner_id: miner,
                    bitcoin_block_height: height,
                })
            },
        )
    }
}
```
The actual test scenario that finally reproduced the production bug looked like this (from the PR that fixed it, where both Brice’s traditional script and the madhouse-rs approach successfully reproduced the issue):
```rust
scenario![
    test_context,
    SkipCommitOpMiner2,
    BootToEpoch3,
    SkipCommitOpMiner1,
    PauseStacksMining,
    MineBitcoinBlock,
    VerifyMiner1WonSortition,
    SubmitBlockCommitMiner2,
    ResumeStacksMining,
    WaitForTenureChangeBlockFromMiner1,
    MineBitcoinBlock,
    VerifyMiner2WonSortition,
    VerifyLastSortitionWinnerReorged,
    WaitForTenureChangeBlockFromMiner2,
    ShutdownMiners
]
```
But here’s the key: this wasn’t the only sequence tested. When run with `MADHOUSE=1`, the framework generated thousands of variations. What if `MineBitcoinBlock` happened before `SubmitBlockCommitMiner2`? What if `PauseStacksMining` occurred at different points? One of these chaotic permutations finally triggered the exact conditions that caused the production bug. The test failed, and the framework automatically shrank the failing sequence to a minimal reproduction case.
When `madhouse-rs` found a failing scenario, it didn’t just dump the entire chaotic sequence. The framework systematically removed operations until it found the minimal case that still triggered the bug:
```text
Original failing sequence: [120 operations...]
Shrunk to: [
    MineBitcoinBlock,
    DisconnectNode(node_2),
    SubmitBlockCommit(miner_1),
    ReconnectNode(node_2)
]
```
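Under the hood this is standard property-testing shrinking: the framework repeatedly deletes and simplifies commands, re-runs the shorter sequence, and keeps a reduction only if the failure still occurs.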
This minimal reproduction became the foundation for understanding and fixing the bug. What would have taken weeks of manual debugging was reduced to a four-step reproduction script.
The bug existed at the intersection of several conditions: reorg handling, block-commit timing, and miner coordination.
Traditional integration tests assume you can predict these intersections. They script specific scenarios: “First do X, then Y, then Z.” But production bugs don’t follow scripts—they emerge from the unexpected combinations that nobody thought to test.
Model-based testing with `madhouse-rs` reverses this assumption: instead of predicting where bugs live, generate the combinations and let the bugs reveal themselves.
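To make that concrete, here is a minimal sketch of a chaos driver, assuming the conceptual `Command` trait, `CommandWrapper`, `StacksState`, and `StacksContext` from the examples above. The function names are illustrative; the real framework wires this up through its `scenario!` macro:

```rust
use proptest::prelude::*;
use std::sync::Arc;

// Illustrative only: a strategy producing random sequences built from the
// two commands defined earlier. prop_oneof! picks one command strategy at
// random for each element of the vector.
fn random_command_sequence(
    ctx: Arc<StacksContext>,
) -> impl Strategy<Value = Vec<CommandWrapper<StacksState, StacksContext>>> {
    let command = prop_oneof![
        MineBitcoinBlockCommand::build(ctx.clone()),
        SubmitBlockCommitCommand::build(ctx),
    ];
    // Sequences of 1 to 200 commands; on failure, proptest shrinks the
    // vector toward a shorter one automatically.
    prop::collection::vec(command, 1..200)
}

// Illustrative only: run a generated sequence against a fresh model state,
// assuming CommandWrapper forwards check/apply to the wrapped command.
fn execute(seq: &[CommandWrapper<StacksState, StacksContext>]) {
    let mut state = StacksState::default();
    for cmd in seq {
        // Commands whose preconditions don't hold are skipped, not failed.
        if cmd.check(&state) {
            cmd.apply(&mut state); // apply also asserts post-conditions
        }
    }
}
```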
The success of this approach depended on the trait-based design:
```rust
// The complete test setup.
#[derive(Debug, Default)]
struct StacksState {
    bitcoin_blocks: Vec<u64>,
    stacks_blocks: Vec<StacksBlock>,
    miners: HashMap<u32, MinerState>,
    network_partitions: Vec<Partition>,
    // ... dozens more fields tracking blockchain state.
}

impl State for StacksState {}

// Context with test parameters.
#[derive(Debug, Clone)]
struct StacksContext {
    num_miners: u32,
    bitcoin_block_time: Duration,
    network_delay_range: (Duration, Duration),
    // ... configuration parameters.
}

impl TestContext for StacksContext {}
```
Each command was self-contained. Adding a new blockchain operation, like `NetworkPartitionCommand` or `RestartNodeCommand`, required zero changes to existing commands. The trait-based design made it possible to build a test harness with 50+ distinct operations, each developed and tested independently.
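As a sketch of that extension story, here is a hypothetical `NetworkPartitionCommand`. The `partition_node` helper and the shape of `Partition` are invented for illustration, but the trait surface is exactly the one shown earlier:

```rust
// Hypothetical: extending the harness never touches existing commands.
// Implement the same trait, add the new strategy to the generator, done.
struct NetworkPartitionCommand {
    pub node_id: u32,
}

impl Command<StacksState, StacksContext> for NetworkPartitionCommand {
    fn check(&self, state: &StacksState) -> bool {
        // Only partition a node that is currently connected
        // (assumes Partition carries the node id).
        !state.network_partitions.iter().any(|p| p.node_id == self.node_id)
    }

    fn apply(&self, state: &mut StacksState) {
        // Record the partition in the model state.
        state.network_partitions.push(Partition { node_id: self.node_id });
        // Hypothetical harness helper that cuts the node off.
        partition_node(self.node_id);
    }

    fn label(&self) -> String {
        format!("NETWORK_PARTITION(node={})", self.node_id)
    }

    fn build(
        ctx: Arc<StacksContext>,
    ) -> impl Strategy<Value = CommandWrapper<StacksState, StacksContext>> {
        (0..ctx.num_miners)
            .prop_map(|node| CommandWrapper::new(NetworkPartitionCommand { node_id: node }))
    }
}
```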
This wasn’t just an academic exercise: the chaos testing approach reproduced a real production bug that a hand-crafted 533-line test could not.
The framework runs in CI, continuously generating new chaotic scenarios to catch regressions before they reach production.
The lesson isn’t that traditional testing is worthless—it’s that certain classes of bugs only emerge from chaos. Race conditions, timing issues, and complex state interactions hide in the combinations that manual tests never explore.
Model-based testing with madhouse-rs turns chaos into a systematic testing strategy. The trait-based design makes it sustainable at scale. The automatic shrinking makes failures actionable.
This is how we can move from “I was unable to reproduce that behavior” to reproducible test cases that can catch production bugs before they happen.
Next: The Expression Problem in Practice: A Trait-Based Testing Harness