Anthropic just showed how far property-based testing can go when you can express a property at a function boundary. Their agent generated Hypothesis tests for real-world libraries and found, validated, and reported several bugs in NumPy, Pandas, and SciPy.
But here’s the gap: while we still see critical bugs in single calls (SSZ decoding, BLS edge cases), the most expensive recent failures live between calls.
Often the protocol rules are fine; the failure is a valid-but-expensive trace that turns into a liveness incident under load.
December 2025: Shortly after Fusaka activated (Dec 3, 2025), Prysm hit a resource-exhaustion path processing certain attestations, dropping network participation to ~75% and pushing voting participation as low as ~74.7% in some epochs—uncomfortably close to the 2/3 stake threshold required for finality. The cause? Attestations referencing a previous-epoch block root forced repeated state recreation, replay, and epoch-transition recomputation, exhausting node resources under load. (See the Prysm mainnet postmortems for the primary write-up.)
May 2023: Mainnet finality was delayed twice within ~24 hours (first ~4 epochs, then ~9). The trigger was valid old-target attestations that forced expensive beacon-state regeneration in some clients; diversity helped the chain recover without intervention. (Postmortem: Ethereum Mainnet Finality Incident (May 2023).)
Each operation was valid. The sequence was catastrophic.
The expensive bugs live in sequences that look fine individually.
Stateless properties shine when the behavior under test sits at a single function boundary: the inputs fully determine the output, and there is no history to carry between calls.
This covers a lot of “pure-ish” code: parsing, formatting, serialization, numerical edge cases.
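A round-trip check is the canonical example. Here is a minimal sketch in proptest; u64::to_le_bytes / u64::from_le_bytes stand in for a real codec such as SSZ.

use proptest::prelude::*;

proptest! {
    // Stateless: the input alone determines the outcome; no history involved.
    #[test]
    fn uint64_roundtrip(x in any::<u64>()) {
        let encoded = x.to_le_bytes();
        prop_assert_eq!(u64::from_le_bytes(encoded), x);
    }
}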
But consensus software is not primarily pure functions.
Ethereum’s consensus clients (e.g., Lighthouse, Prysm, Teku, Grandine, Nimbus, Lodestar) implement a long-lived state machine: slots tick, blocks and attestations arrive (often late and out of order), fork choice picks a head, and justification and finality advance over epochs.
Execution clients have the same shape, with even more complexity at the Engine API boundary (e.g., asynchrony between newPayload and forkchoiceUpdated). I’m focusing on the consensus side here because it’s the primary arena for PeerDAS and stake-weighted participation.
Correctness is rarely “the output of one function call”. It’s “the system’s behavior over a trace”.
Here are a few invariants that are naturally history-dependent: the finalized epoch never decreases, re-importing an already-known block root has no side effects, and the fork-choice head is always a descendant of the latest finalized checkpoint.
If you try to phrase these as “f(x) preserves P”, you end up smuggling “history” into x until it stops being a useful boundary.
Take “finality is monotonic.” You might try:
for all (old_finalized, new_finalized):
process_block(...) implies new_finalized >= old_finalized
But now old_finalized is part of your input. Where does it come from? You have to generate it. And to generate a valid old state, you need to know what sequence of blocks led there. You’ve just reinvented traces—badly.
The honest framing is: “after any valid sequence of operations, the finalized epoch never decreases”. That’s a property over traces, not over inputs.
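Here is a deliberately tiny, self-contained sketch of that framing. ToyNode and its process_block are hypothetical stand-ins for a client; only the shape of the check matters.

// Hypothetical toy model, not a real client API.
#[derive(Default)]
struct ToyNode {
    finalized_epoch: u64,
}

impl ToyNode {
    fn process_block(&mut self, justified_epoch: u64) {
        // Deliberately simplified "finality" rule, for illustration only.
        self.finalized_epoch = self.finalized_epoch.max(justified_epoch);
    }
}

fn main() {
    // The property is checked after every step of a trace, not on one call.
    let trace = [2u64, 5, 3, 7, 6];
    let mut node = ToyNode::default();
    let mut last_finalized = node.finalized_epoch;
    for justified_epoch in trace {
        node.process_block(justified_epoch);
        assert!(node.finalized_epoch >= last_finalized, "finality regressed");
        last_finalized = node.finalized_epoch;
    }
}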
Stateful testing makes the history explicit:
State --(Command)--> State'
Instead of generating inputs for a single call, you generate commands and run them as a scenario. The bug is often not in any single step, but in a particular ordering of steps.
This idea is old and battle-tested (QuickCheck state machine testing, Hedgehog, proptest-state-machine), but underused everywhere.
The same approach, built into madhouse-rs, caught a production bug in the Stacks blockchain that traditional testing missed. A 533-line integration test failed to reproduce it. A chaotic command sequence succeeded.
The core trick is to keep the runner boring and put all the logic in commands. This is the same shape that scales in practice.
pub trait State: std::fmt::Debug {}
pub trait TestContext: std::fmt::Debug + Clone {}
For the examples below, assume an empty context:
#[derive(Debug, Clone, Default)]
pub struct BeaconContext;
impl TestContext for BeaconContext {}
use proptest::prelude::*;
use std::sync::Arc;
pub trait Command<S: State, C: TestContext>:
std::fmt::Debug + Send + Sync
{
// Precondition: is this command meaningful *now*.
fn check(&self, state: &S) -> bool;
// Apply the transition and assert postconditions.
fn apply(&self, state: &mut S);
// For debugging and shrunk traces.
fn label(&self) -> String;
// Generate commands.
fn build(ctx: Arc<C>)
-> impl Strategy<Value = CommandWrapper<S, C>>
where
Self: Sized;
}
pub struct CommandWrapper<S: State, C: TestContext> {
    pub command: Arc<dyn Command<S, C>>,
}

// Manual impls so the wrapper is Clone and Debug even when S and C are not.
// (Strategy::Value requires Debug, so the wrapper must implement it.)
impl<S: State, C: TestContext> Clone for CommandWrapper<S, C> {
    fn clone(&self) -> Self { Self { command: Arc::clone(&self.command) } }
}

// Debug goes through label() so shrunk traces print readable command names.
impl<S: State, C: TestContext> std::fmt::Debug for CommandWrapper<S, C> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "{}", self.command.label())
    }
}

impl<S: State, C: TestContext> CommandWrapper<S, C> {
    pub fn new<T: Command<S, C> + 'static>(t: T) -> Self {
        Self { command: Arc::new(t) }
    }
}
pub fn execute_commands<S: State, C: TestContext>(
commands: &[CommandWrapper<S, C>],
state: &mut S,
) {
for cmd in commands {
if cmd.command.check(state) {
cmd.command.apply(state);
}
}
}
The point is locality: generation, preconditions, transition logic, and invariants live together. That design choice is exactly the “data-open” side of the expression problem, and it’s why these harnesses survive contact with real systems.
One easy trap is to assume “there is only one block per slot”. In the spec there is one proposer per slot, but on the network you can see equivocations (the same proposer publishing two conflicting blocks for one slot, which is slashable), distinct blocks for the same slot on competing forks, and duplicate deliveries of blocks you have already imported.
So a stateful invariant should not be “reject a second block at slot s”. That’s not how fork choice works.
Instead, here’s a deliberately small example that matches real failure modes: idempotence by block root. If a client re-imports the same block (same root), it must not double-apply side effects.
use std::collections::{HashMap, HashSet};
#[derive(Debug, Default)]
struct BeaconModel {
current_slot: u64,
// Slot -> set of known block roots at that slot (forks allowed).
known_by_slot: HashMap<u64, HashSet<[u8; 32]>>,
// In 2026, participation is stake-weighted (MaxEB / EIP-7251).
// Total weight of unique blocks we've imported.
total_imported_weight: u64,
// Track which block states are available to prevent the 2025 Prysm regression
// (expensive state regeneration when validating attestations for uncached blocks).
state_cache: HashSet<[u8; 32]>,
}
impl State for BeaconModel {}
#[derive(Debug)]
struct TickSlot;
impl Command<BeaconModel, BeaconContext> for TickSlot {
    fn check(&self, _state: &BeaconModel) -> bool { true }
    fn apply(&self, state: &mut BeaconModel) {
        state.current_slot += 1;
    }
    fn label(&self) -> String { "TICK_SLOT".to_string() }
    fn build(_ctx: Arc<BeaconContext>)
        -> impl Strategy<Value = CommandWrapper<BeaconModel, BeaconContext>>
    {
        Just(CommandWrapper::new(TickSlot))
    }
}
#[derive(Debug)]
struct ImportBlock {
slot: u64,
root: [u8; 32],
weight: u64, // Stake-weighted via MaxEB.
}
impl Command<BeaconModel, BeaconContext> for ImportBlock {
    fn check(&self, state: &BeaconModel) -> bool {
        self.slot <= state.current_slot
    }
    fn apply(&self, state: &mut BeaconModel) {
        let entry = state
            .known_by_slot
            .entry(self.slot)
            .or_default();
        let is_new = entry.insert(self.root);
        // This is the invariant: the same root must not be "new" twice.
        // Stake-weighting means a duplicate root must not double-count weight.
        if is_new {
            state.total_imported_weight += self.weight;
            state.state_cache.insert(self.root);
        }
    }
    fn label(&self) -> String {
        format!(
            "IMPORT_BLOCK(slot={}, root={:02x}.., weight={})",
            self.slot, self.root[0], self.weight
        )
    }
    fn build(_ctx: Arc<BeaconContext>)
        -> impl Strategy<Value = CommandWrapper<BeaconModel, BeaconContext>>
    {
        // Small slot/root alphabet so duplicate imports and forks occur often.
        (1u64..=8, 0u8..4, 1u64..=2048).prop_map(|(slot, tag, weight)| {
            CommandWrapper::new(ImportBlock { slot, root: [tag; 32], weight })
        })
    }
}
// The bug from December 2025: attestations for stale blocks
// triggering expensive state regeneration.
#[derive(Debug)]
struct ProcessAttestation {
slot: u64,
block_root: [u8; 32],
}
impl Command<BeaconModel, BeaconContext> for ProcessAttestation {
    fn check(&self, state: &BeaconModel) -> bool {
        // Only attest to blocks the model has imported, so the assertion
        // below checks the cache rather than failing on unknown roots.
        self.slot <= state.current_slot
            && state
                .known_by_slot
                .get(&self.slot)
                .is_some_and(|roots| roots.contains(&self.block_root))
    }
    fn apply(&self, state: &mut BeaconModel) {
        // Invariant: looking up state for an attestation to a known block
        // must not "miss" and trigger an expensive replay.
        assert!(
            state.state_cache.contains(&self.block_root),
            "State cache miss for block {:?} at slot {}",
            self.block_root, self.slot
        );
    }
    fn label(&self) -> String {
        format!("PROCESS_ATTESTATION(slot={})", self.slot)
    }
    fn build(_ctx: Arc<BeaconContext>)
        -> impl Strategy<Value = CommandWrapper<BeaconModel, BeaconContext>>
    {
        // Same small slot/root alphabet as ImportBlock, so attestations
        // often reference blocks the model has actually imported.
        (1u64..=8, 0u8..4).prop_map(|(slot, tag)| {
            CommandWrapper::new(ProcessAttestation { slot, block_root: [tag; 32] })
        })
    }
}
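Wiring it together: a small proptest harness draws a random vector of commands and runs it through execute_commands. This is a sketch; any_command, the test name, and the trace-length bound are mine, not from any particular library.

fn any_command() -> impl Strategy<Value = CommandWrapper<BeaconModel, BeaconContext>> {
    let ctx = Arc::new(BeaconContext);
    prop_oneof![
        TickSlot::build(ctx.clone()),
        ImportBlock::build(ctx.clone()),
        ProcessAttestation::build(ctx),
    ]
}

proptest! {
    #[test]
    fn random_traces_preserve_invariants(
        commands in prop::collection::vec(any_command(), 1..64)
    ) {
        let mut model = BeaconModel::default();
        execute_commands(&commands, &mut model);
    }
}

When an assertion fires, proptest shrinks the command vector, and because the wrapper's Debug goes through label(), the minimal counterexample reads as a list of command names.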
If your implementation accidentally increments counters, updates indexes, or applies cached transitions twice on duplicate import, a failing trace usually shrinks to something like:
[
TICK_SLOT,
IMPORT_BLOCK(slot=1, root=R),
IMPORT_BLOCK(slot=1, root=R),
]
Each step is valid. The bug is in the interaction.
That is the shape of a lot of consensus-client failures: not “wrong return value”, but “the second time through a path, something subtle breaks”.
You want both.
Stateless PBT finds bugs in the bricks (SSZ, BLS). Stateful PBT finds bugs in how the bricks stack—especially in the high-stakes world of PeerDAS and stake-weighted participation.
Anthropic showed us how to check the mortar really well. This post is about the wall.