The Expression Problem in Practice: A Trait-Based Testing Harness

This post is part of the Model-Based Stateful Testing with madhouse-rs series.

We started this series with a production bug that couldn’t be reproduced. We end with a framework that can not only catch that bug but also fundamentally change how we think about testing complex systems. The journey reveals practical lessons about the expression problem that extend far beyond testing.

The Design That Emerged

Through trial and error, madhouse-rs converged on a simple but powerful architecture, as described in the whitepaper commit:

Each Command follows a predictable lifecycle:

  1. Generated by a proptest Strategy
  2. Validated via check against current state
  3. Applied via apply, mutating both model and real system
  4. Verified through assertions and postconditions
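
In code, this lifecycle amounts to a small driver loop. The sketch below is illustrative only: it assumes the Command trait shown later in this post is object-safe (its build function bounded with where Self: Sized), that SystemState implements Default, and that the command sequence has already been generated; the executor that ships with madhouse-rs may differ in detail.

// Illustrative driver loop for a pre-generated command sequence.
fn run_sequence(commands: Vec<Box<dyn Command<SystemState, SystemContext>>>) {
    let mut state = SystemState::default();
    for cmd in commands {
        // Step 2: skip commands whose preconditions fail in the current state.
        if !cmd.check(&state) {
            continue;
        }
        // Step 3: apply the command, mutating the model (a real harness also
        // drives the system under test here).
        cmd.apply(&mut state);
        // Step 4: postconditions are asserted inside apply or right after it;
        // the label is what shows up in failure output.
        println!("applied {}", cmd.label());
    }
}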

Why Traits Won Over Enums

The contrast with proptest-state-machine is instructive. Consider how each approach handles a new test operation:

Enum approach (proptest-state-machine):

// 1. Add to the central enum (affects everyone).
enum SystemTransition {
    ExistingOp1,
    ExistingOp2,
    NewOperation(NewOpData), // <- New variant.
}

// 2. Update the central apply function (affects everyone).
fn apply(state: State, transition: SystemTransition) -> State {
    match transition {
        SystemTransition::ExistingOp1 => { /* existing logic */ }
        SystemTransition::ExistingOp2 => { /* existing logic */ }
        SystemTransition::NewOperation(data) => { // <- New arm.
            // New logic scattered across this central function.
        }
    }
}

// 3. Update the transitions function (affects everyone).
fn transitions() -> BoxedStrategy<SystemTransition> {
    prop_oneof![
        existing_strategy_1(),
        existing_strategy_2(),
        new_operation_strategy(), // <- New generator.
    ].boxed()
}

Trait approach (madhouse-rs):

// Self-contained - zero impact on existing code.
struct NewOperationCommand {
    data: NewOpData,
}

impl Command<SystemState, SystemContext> for NewOperationCommand {
    fn check(&self, state: &SystemState) -> bool {
        // Preconditions logic here.
    }

    fn apply(&self, state: &mut SystemState) {
        // Application logic here.
    }

    fn label(&self) -> String {
        format!("NEW_OPERATION({:?})", self.data)
    }

    fn build(ctx: Arc<SystemContext>) -> impl Strategy<Value = CommandWrapper<SystemState, SystemContext>> {
        // Generation strategy here.
        new_operation_strategy()
            .prop_map(|data| CommandWrapper::new(NewOperationCommand { data }))
    }
}

The difference is profound: trait-based commands are autonomous. All logic—generation, preconditions, application, and labeling—lives in one place. No coordination required.
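
To make that concrete, here is a hedged sketch of one way autonomous commands might be combined into a single generation strategy at the test entry point. combined_strategy and ExistingCommand are hypothetical names; madhouse-rs ships its own composition machinery, which may differ in shape.

use proptest::prelude::*;
use std::sync::Arc;

// Hypothetical composition point: each command contributes its own strategy.
fn combined_strategy(
    ctx: Arc<SystemContext>,
) -> impl Strategy<Value = CommandWrapper<SystemState, SystemContext>> {
    prop_oneof![
        ExistingCommand::build(ctx.clone()),
        NewOperationCommand::build(ctx), // One added line; no existing impl changes.
    ]
}

This one-line registration is the only shared touch point: generation, preconditions, application, and labeling all stay inside the new command’s own impl block.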

Real-World Scale: The PoX-4 Experience

Before madhouse-rs, we applied these principles with Radu Bahmata to test the Proof-of-Transfer (PoX-4) consensus using TypeScript and fast-check. The harness grew to include 20+ command types, each testing a different aspect of the staking protocol.

The key insight: each command class was self-contained. A developer could add StackExtendCommand without understanding the internals of DelegateStxCommand. The framework composed them automatically.

When a test failed after 200+ operations, the shrinking algorithm would reduce it to something like:

Original sequence: [200+ operations]
Shrunk to: [
    DelegateStx(account, pool),
    StackAggregationCommit(pool, account),
    RevokeDelegateStx(account),
    StackAggregationCommit(pool, account)
]

This four-step sequence revealed a subtle bug: revoking delegation didn’t properly invalidate pending aggregation commits. Finding this manually would have taken weeks.

Lessons for System Design

The expression problem appears everywhere in software design, not just testing frameworks:

1. Plugin Architectures

Want users to extend your system with new functionality? Choose the “data-open” side—make plugins implement traits rather than forcing them to modify central enums.

2. Event Systems

Need to handle dozens of event types? Each event type should be its own struct implementing an Event trait, not variants in a central enum.
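
A minimal sketch of that “data-open” shape, assuming an illustrative Event trait and event types that are not drawn from any particular library:

// Illustrative only: each event is its own type; dispatch needs no central enum.
trait Event {
    fn name(&self) -> &'static str;
    fn handle(&self);
}

struct UserRegistered {
    user_id: u64,
}

impl Event for UserRegistered {
    fn name(&self) -> &'static str {
        "user_registered"
    }

    fn handle(&self) {
        // Event-specific logic lives next to the event's own data.
        println!("registering user {}", self.user_id);
    }
}

fn dispatch(events: &[Box<dyn Event>]) {
    for event in events {
        // No match over event kinds: a new event type touches no code here.
        println!("handling {}", event.name());
        event.handle();
    }
}

The same shape recurs in the other three cases: the shared trait is the stable contract, and each concrete type stays self-contained.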

3. Command Patterns

Building a command-line tool with subcommands? Each subcommand should be its own type, not a variant in a central enum.

4. Middleware Systems

Web frameworks often choose the “data-open” side: each middleware is its own type implementing a common trait.

The Cost of Getting It Wrong

We’ve seen both sides of this trade-off in practice:

When the enum approach breaks down: new variants arrive constantly, so every addition touches the central enum, the apply match, and the strategy generator, and unrelated changes keep colliding in the same few files.

When the trait approach breaks down: the set of operations itself keeps changing, so adding a method to the shared trait forces every existing implementation to be updated at once.

For madhouse-rs, the trade-off was clear: we needed to add new test operations constantly, but the core operations (check, apply, label, build) were stable. The “data-open” choice was correct.

Performance Considerations

One concern with trait-based approaches is performance. CommandWrapper uses Arc<dyn Command<S, C>>, which involves heap allocation and dynamic dispatch. In our testing scenarios, this overhead was negligible compared to the actual blockchain operations being tested.
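
For readers curious what that wrapper looks like, here is a hedged sketch of the shape described above; it assumes build is kept off the trait object (bounded with where Self: Sized) so that Command<S, C> stays object-safe, and the real madhouse-rs definition may carry extra bounds or fields.

use std::sync::Arc;

// Hedged sketch: type-erased wrapper around any concrete command.
// The 'static bounds keep the trait object well-formed without extra lifetimes.
struct CommandWrapper<S: 'static, C: 'static> {
    command: Arc<dyn Command<S, C>>,
}

impl<S: 'static, C: 'static> CommandWrapper<S, C> {
    fn new<T: Command<S, C> + 'static>(command: T) -> Self {
        // One heap allocation per generated command.
        Self {
            command: Arc::new(command),
        }
    }
}

impl<S: 'static, C: 'static> Clone for CommandWrapper<S, C> {
    fn clone(&self) -> Self {
        // Cloning bumps a reference count; it does not deep-copy the command.
        Self {
            command: Arc::clone(&self.command),
        }
    }
}

impl<S: 'static, C: 'static> std::fmt::Debug for CommandWrapper<S, C> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // proptest requires generated values to be Debug; reusing the label
        // is one convenient way to satisfy that.
        write!(f, "{}", self.command.label())
    }
}

Each check, apply, or label call then goes through a vtable; in practice that cost disappears next to the blockchain operations under test.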

The Full Circle

We began with a simple question: how do you design systems that are easy to extend? The expression problem provided the theoretical framework, but the real learning came from building systems that needed to scale.

The Stacks blockchain bug that started this journey taught us that complexity is the enemy of correctness. Traditional testing assumes you can predict where bugs hide. Model-based testing with madhouse-rs assumes you can’t—so it generates the chaos systematically.

The trait-based design made this scalable. Instead of a monolithic test harness that becomes unmaintainable, we have an ecosystem of autonomous commands that compose naturally.

Practical Takeaways

  1. Choose your trade-off consciously: The expression problem forces a choice. Understanding the trade-off helps you pick the right tool.

  2. Favor autonomy at scale: When systems grow large, autonomous components (traits) usually scale better than centralized ones (enums).

  3. Let chaos find the bugs: For complex systems, generated test scenarios often find bugs that manual tests miss.

  4. Design for shrinking: When random tests fail, automatic reduction to minimal cases is invaluable.

  5. Start simple, then scale: Both approaches work for small systems. The difference emerges at scale.

The expression problem isn’t academic theory—it’s a practical design constraint that affects every system you build. Understanding it helps you make better architectural choices, whether you’re building testing frameworks, plugin systems, or distributed applications.

In the end, good design isn’t about avoiding trade-offs. It’s about making them consciously, understanding their implications, and choosing the ones that align with how your system needs to grow.

References and Further Reading

The ideas in this series draw from decades of research and practice:

  1. Philip Wadler’s original expression problem
  2. QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs
  3. Experiences with QuickCheck: Testing the Hard Stuff and Staying Sane
  4. eqc_statem documentation
  5. Clarity Model-Based Testing Primer
  6. Hedgehog State.hs - Haskell stateful testing implementation
  7. Hedgehog References.hs - Practical stateful testing example
  8. PoX-4 Commands TypeScript - Original disjointed command implementation
  9. The original GitHub comment that sparked madhouse-rs
  10. The pull request where both traditional and madhouse-rs approaches reproduced the production bug

Series Complete: Model-Based Stateful Testing with madhouse-rs.