Anthropic just showed how far property-based testing can go when you can express a property at a function boundary. Their agent generated Hypothesis tests for real-world libraries and found, validated, and reported several bugs in NumPy, Pandas, and SciPy.
But here’s the gap: while we still see critical bugs in single calls (SSZ decoding, BLS edge cases), the most expensive recent failures live between calls.
Often the protocol rules are fine; the failure is a valid-but-expensive trace that turns into a liveness incident under load.
December 2025: Shortly after Fusaka activated (Dec 3, 2025), Prysm hit a resource-exhaustion path processing certain attestations, dropping network participation to ~75% and pushing voting participation as low as ~74.7% in some epochs—uncomfortably close to the 2/3 stake threshold required for finality. The cause? Attestations referencing a previous-epoch block root forced repeated state recreation, replay, and epoch-transition recomputation, exhausting node resources under load. (See the Prysm mainnet postmortems for the primary write-up.)
May 2023: Mainnet finality was delayed twice within ~24 hours (first ~4 epochs, then ~9). The trigger was valid old-target attestations that forced expensive beacon-state regeneration in some clients; diversity helped the chain recover without intervention. (Postmortem: Ethereum Mainnet Finality Incident (May 2023).)
Each operation was valid. The sequence was catastrophic.
The expensive bugs live in sequences that look fine individually.
Stateless properties shine when the behavior under test sits at a single function boundary: the inputs fully determine the output, and there is no history to carry between calls.
This covers a lot of “pure-ish” code: parsing, formatting, serialization, numerical edge cases.
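A round-trip check is the canonical example. Here is a minimal sketch in proptest; u64::to_le_bytes / u64::from_le_bytes stand in for a real codec such as SSZ.

use proptest::prelude::*;

proptest! {
    // Stateless: the input alone determines the outcome; no history involved.
    #[test]
    fn uint64_roundtrip(x in any::<u64>()) {
        let encoded = x.to_le_bytes();
        prop_assert_eq!(u64::from_le_bytes(encoded), x);
    }
}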
But consensus software is not primarily pure functions.
Ethereum’s consensus clients (e.g., Lighthouse, Prysm, Teku, Grandine, Nimbus, Lodestar) implement a long-lived state machine: slots tick, blocks and attestations arrive (often late and out of order), fork choice picks a head, and justification and finality advance over epochs.
Execution clients have the same shape, with even more complexity at the Engine API boundary (e.g., asynchrony between newPayload and forkchoiceUpdated). I’m focusing on the consensus side here because it’s the primary arena for PeerDAS and stake-weighted participation.
Correctness is rarely “the output of one function call”. It’s “the system’s behavior over a trace”.
Here are a few invariants that are naturally history-dependent: the finalized epoch never decreases, re-importing an already-known block root has no side effects, and the fork-choice head is always a descendant of the latest finalized checkpoint.
If you try to phrase these as “f(x) preserves P”, you end up smuggling “history” into x until it stops being a useful boundary.
Take “finality is monotonic.” You might try:
for all (old_finalized, new_finalized):
process_block(...) implies new_finalized >= old_finalized
But now old_finalized is part of your input. Where does it come from? You have to generate it. And to generate a valid old state, you need to know what sequence of blocks led there. You’ve just reinvented traces—badly.
The honest framing is: “after any valid sequence of operations, the finalized epoch never decreases”. That’s a property over traces, not over inputs.
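Here is a deliberately tiny, self-contained sketch of that framing. ToyNode and its process_block are hypothetical stand-ins for a client; only the shape of the check matters.

// Hypothetical toy model, not a real client API.
#[derive(Default)]
struct ToyNode {
    finalized_epoch: u64,
}

impl ToyNode {
    fn process_block(&mut self, justified_epoch: u64) {
        // Deliberately simplified "finality" rule, for illustration only.
        self.finalized_epoch = self.finalized_epoch.max(justified_epoch);
    }
}

fn main() {
    // The property is checked after every step of a trace, not on one call.
    let trace = [2u64, 5, 3, 7, 6];
    let mut node = ToyNode::default();
    let mut last_finalized = node.finalized_epoch;
    for justified_epoch in trace {
        node.process_block(justified_epoch);
        assert!(node.finalized_epoch >= last_finalized, "finality regressed");
        last_finalized = node.finalized_epoch;
    }
}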
Stateful testing makes the history explicit:
State --(Command)--> State'
Instead of generating inputs for a single call, you generate commands and run them as a scenario. The bug is often not in any single step, but in a particular ordering of steps.
This idea is old and battle-tested (QuickCheck state machine testing, Hedgehog, proptest-state-machine), but underused everywhere.
The same approach, built into madhouse-rs, caught a production bug in the Stacks blockchain that traditional testing missed. A 533-line integration test failed to reproduce it. A chaotic command sequence succeeded.
The core trick is to keep the runner boring and put all the logic in commands. This is the same shape that scales in practice.
pub trait State: std::fmt::Debug {}
pub trait TestContext: std::fmt::Debug + Clone {}
For the examples below, assume an empty context:
#[derive(Debug, Clone, Default)]
pub struct BeaconContext;
impl TestContext for BeaconContext {}
use proptest::prelude::*;
use std::sync::Arc;
pub trait Command<S: State, C: TestContext>:
std::fmt::Debug + Send + Sync
{
// Precondition: is this command meaningful *now*.
fn check(&self, state: &S) -> bool;
// Apply the transition and assert postconditions.
fn apply(&self, state: &mut S);
// For debugging and shrunk traces.
fn label(&self) -> String;
// Generate commands.
fn build(ctx: Arc<C>)
-> impl Strategy<Value = CommandWrapper<S, C>>
where
Self: Sized;
}
pub struct CommandWrapper<S: State, C: TestContext> {
    pub command: Arc<dyn Command<S, C>>,
}

// Manual impls so the wrapper is Clone and Debug even when S and C are not.
// (Strategy::Value requires Debug, so the wrapper must implement it.)
impl<S: State, C: TestContext> Clone for CommandWrapper<S, C> {
    fn clone(&self) -> Self { Self { command: Arc::clone(&self.command) } }
}

// Debug goes through label() so shrunk traces print readable command names.
impl<S: State, C: TestContext> std::fmt::Debug for CommandWrapper<S, C> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.command.label())
    }
}

impl<S: State, C: TestContext> CommandWrapper<S, C> {
    pub fn new<T: Command<S, C> + 'static>(t: T) -> Self {
        Self { command: Arc::new(t) }
    }
}
pub fn execute_commands<S: State, C: TestContext>(
commands: &[CommandWrapper<S, C>],
state: &mut S,
) {
for cmd in commands {
if cmd.command.check(state) {
cmd.command.apply(state);
}
}
}
The point is locality: generation, preconditions, transition logic, and invariants live together. That design choice is exactly the “data-open” side of the expression problem, and it’s why these harnesses survive contact with real systems.
One easy trap is to assume “there is only one block per slot”. In the spec there is one proposer per slot, but on the network you can see equivocations (the same proposer publishing two conflicting blocks for one slot, which is slashable), distinct blocks for the same slot on competing forks, and duplicate deliveries of blocks you have already imported.
So a stateful invariant should not be “reject a second block at slot s”. That’s not how fork choice works.
Instead, here’s a deliberately small example that matches real failure modes: idempotence by block root. If a client re-imports the same block (same root), it must not double-apply side effects.
use std::collections::{HashMap, HashSet};
#[derive(Debug, Default)]
struct BeaconModel {
current_slot: u64,
// Slot -> set of known block roots at that slot (forks allowed).
known_by_slot: HashMap<u64, HashSet<[u8; 32]>>,
// In 2026, participation is stake-weighted (MaxEB / EIP-7251).
// Total weight of unique blocks we've imported.
total_imported_weight: u64,
// Track which block states are available to prevent the 2025 Prysm regression
// (expensive state regeneration when validating attestations for uncached blocks).
state_cache: HashSet<[u8; 32]>,
}
impl State for BeaconModel {}
#[derive(Debug)]
struct TickSlot;
impl Command<BeaconModel, BeaconContext> for TickSlot {
    fn check(&self, _state: &BeaconModel) -> bool { true }
    fn apply(&self, state: &mut BeaconModel) {
        state.current_slot += 1;
    }
    fn label(&self) -> String { "TICK_SLOT".to_string() }
    fn build(_ctx: Arc<BeaconContext>)
        -> impl Strategy<Value = CommandWrapper<BeaconModel, BeaconContext>>
    {
        Just(CommandWrapper::new(TickSlot))
    }
}
#[derive(Debug)]
struct ImportBlock {
slot: u64,
root: [u8; 32],
weight: u64, // Stake-weighted via MaxEB.
}
impl Command<BeaconModel, BeaconContext> for ImportBlock {
    fn check(&self, state: &BeaconModel) -> bool {
        self.slot <= state.current_slot
    }
    fn apply(&self, state: &mut BeaconModel) {
        let entry = state
            .known_by_slot
            .entry(self.slot)
            .or_default();
        let is_new = entry.insert(self.root);
        // This is the invariant: the same root must not be "new" twice.
        // Stake-weighting means a duplicate root must not double-count weight.
        if is_new {
            state.total_imported_weight += self.weight;
            state.state_cache.insert(self.root);
        }
    }
    fn label(&self) -> String {
        format!(
            "IMPORT_BLOCK(slot={}, root={:02x}.., weight={})",
            self.slot, self.root[0], self.weight
        )
    }
    fn build(_ctx: Arc<BeaconContext>)
        -> impl Strategy<Value = CommandWrapper<BeaconModel, BeaconContext>>
    {
        // Small slot/root alphabet so duplicate imports and forks occur often.
        (1u64..=8, 0u8..4, 1u64..=2048).prop_map(|(slot, tag, weight)| {
            CommandWrapper::new(ImportBlock { slot, root: [tag; 32], weight })
        })
    }
}
// The bug from December 2025: attestations for stale blocks
// triggering expensive state regeneration.
#[derive(Debug)]
struct ProcessAttestation {
slot: u64,
block_root: [u8; 32],
}
impl Command<BeaconModel, BeaconContext> for ProcessAttestation {
    fn check(&self, state: &BeaconModel) -> bool {
        // Only attest to blocks the model has imported, so the assertion
        // below checks the cache rather than failing on unknown roots.
        self.slot <= state.current_slot
            && state
                .known_by_slot
                .get(&self.slot)
                .is_some_and(|roots| roots.contains(&self.block_root))
    }
    fn apply(&self, state: &mut BeaconModel) {
        // Invariant: looking up state for an attestation to a known block
        // must not "miss" and trigger an expensive replay.
        assert!(
            state.state_cache.contains(&self.block_root),
            "State cache miss for block {:?} at slot {}",
            self.block_root, self.slot
        );
    }
    fn label(&self) -> String {
        format!("PROCESS_ATTESTATION(slot={})", self.slot)
    }
    fn build(_ctx: Arc<BeaconContext>)
        -> impl Strategy<Value = CommandWrapper<BeaconModel, BeaconContext>>
    {
        // Same small slot/root alphabet as ImportBlock, so attestations
        // often reference blocks the model has actually imported.
        (1u64..=8, 0u8..4).prop_map(|(slot, tag)| {
            CommandWrapper::new(ProcessAttestation { slot, block_root: [tag; 32] })
        })
    }
}
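Wiring it together: a small proptest harness draws a random vector of commands and runs it through execute_commands. This is a sketch; any_command, the test name, and the trace-length bound are mine, not from any particular library.

fn any_command() -> impl Strategy<Value = CommandWrapper<BeaconModel, BeaconContext>> {
    let ctx = Arc::new(BeaconContext);
    prop_oneof![
        TickSlot::build(ctx.clone()),
        ImportBlock::build(ctx.clone()),
        ProcessAttestation::build(ctx),
    ]
}

proptest! {
    #[test]
    fn random_traces_preserve_invariants(
        commands in prop::collection::vec(any_command(), 1..64)
    ) {
        let mut model = BeaconModel::default();
        execute_commands(&commands, &mut model);
    }
}

When an assertion fires, proptest shrinks the command vector, and because the wrapper's Debug goes through label(), the minimal counterexample reads as a list of command names.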
If your implementation accidentally increments counters, updates indexes, or applies cached transitions twice on duplicate import, a failing trace usually shrinks to something like:
[
TICK_SLOT,
IMPORT_BLOCK(slot=1, root=R),
IMPORT_BLOCK(slot=1, root=R),
]
Each step is valid. The bug is in the interaction.
That is the shape of a lot of consensus-client failures: not “wrong return value”, but “the second time through a path, something subtle breaks”.
You want both.
Stateless PBT finds bugs in the bricks (SSZ, BLS). Stateful PBT finds bugs in how the bricks stack—especially in the high-stakes world of PeerDAS and stake-weighted participation.
Anthropic showed us how to check the mortar really well. This post is about the wall.