When a Timeout Is Not a NotFound: Reproducing Zebra's Sync Stall in TLA+

A bug that takes fourteen hours to reproduce on mainnet is a bug you cannot iterate on. Zebra's genesis-to-tip sync stall — where the block download pipeline froze for minutes in the sandblasting region around two million blocks, resetting over and over and never reaching tip — was exactly that kind of bug. It was diagnosed and fixed in ZcashFoundation/zebra#10679 (symptom thread #5709). This post shows that the same stall can be reproduced in seconds as a model-checker counterexample — and that the fix can be shown to close it — with a small TLA+ spec, sync_scheduler.tla.


The bug

During initial sync a Zebra node downloads blocks from a pool of peers, routed by an inventory registry that records which peer is believed to hold which block. Two defects combined to turn ordinary network hiccups into a stall the node could not sync past:

  1. Registry poisoning. The router marked a block "missing on peer P" on any request failure — timeouts and dropped connections included, not just an explicit notfound. Once every peer holding a block had been poisoned by a transient error, getdata could no longer be routed to a peer that had it.
  2. Silent drop. A download that failed with NotFound was discarded instead of re-queued, so a single frontier block stuck at the checkpoint boundary was never re-fetched — wedging the verify pipeline until an eight-minute timeout fired and the node reset, only to wedge again.

These only bite on a degraded peer set, where only a small minority of peers are at tip; on a healthy network the redundancy hides them. The fix marks inventory missing only on an explicit notfound, and re-queues NotFound blocks under a bounded retry count, routed away from the peer that said notfound. (The fix carries a third change — tolerating a duplicate-queued block in a batch instead of dropping the rest — that is a batching detail the model leaves out.)

This is not a wire-protocol bug, so ZIP-204 needs no change. It is an implementation-level inference error — treating "no response" as "notfound" — and it lives one layer above the per-connection protocol, in the download scheduler and its routing registry.

The model

A single syncer downloads blocks 1..MaxBlock from a set of Peers. Ground truth — which peer actually holds which block — is fixed, but the scheduler never reads it directly; it learns a peer's contents only through deliveries and notfound, exactly as a real node does. Two boolean constants select pre- or post-fix Zebra:

BuggyRouting = TRUE   transient error marks the peer "missing"  (the poisoning bug)
             = FALSE  transient error leaves the entry "unknown"

BuggyDrop    = TRUE   notfound silently drops the block         (the wedge bug)
             = FALSE  notfound re-queues with bounded retries

The whole poisoning bug fits in one line — the registry update on a transient error:

/\ avail' = [ avail EXCEPT ![p][b] = IF BuggyRouting THEN "missing" ELSE "unknown" ]

When BuggyRouting is true, a timeout on a peer that does hold the block writes "missing" into the registry: a lie the router will believe forever.

The model checks two properties. RegistryHonest (safety) says the registry never marks a peer missing for a block it holds — the formal statement of "a timeout is not a notfound." EventuallyAllVerified (liveness, <>(verified = Blocks)) says every block is eventually verified; the stall is a violation of this. Strong fairness covers the progress actions, but TransientError gets none — so the stall must arise from a finite, unlucky burst of errors, not from errors happening forever.

The counterexample

With the buggy switches on and the smallest interesting peer set — p_tip holds both blocks, p_lag holds only block 1 — TLC produces a five-state stall:

1. Request block 1 from p_tip
2. Deliver block 1                   verified = {1}
3. Request block 2 from p_lag        (which does not have it)
4. p_lag returns notfound for block 2 — silently dropped (BuggyDrop)
5. ... stutter forever, verified = {1}, never {1, 2}

The genesis-to-tip stall, in five steps, in under a second:

ConfigSwitchesChecksResult
sync_scheduler_poison.cfgbuggyRegistryHonestviolated — a timeout poisons a holder
sync_scheduler_buggy.cfgbuggyEventuallyAllVerifiedviolated — the stall
sync_scheduler_fixed.cfgfixedinvariants + livenessholds

Flip both switches to FALSE and the same model, peer set, and adversarial errors complete the frontier every time. The model distinguishes the buggy and fixed designs — which is the whole point.

Why bother, since it's already fixed

The value is that the properties generalize. RegistryHonest plus a download-pipeline progress property — no transient failure can permanently remove a block from the set of fetchable blocks — is a reusable harness against the whole class of "transient error poisons routing / silently drops work" regressions.

More broadly: in a peer-to-peer stack the wire protocol is the easy part. The subtle liveness bugs live in the inference a node makes from incomplete, adversarial information — "no response, therefore the peer doesn't have it." Those hide on healthy networks and surface only under load, and they are exactly what a model checker is good at flushing out, because it will happily find the one unlucky interleaving that production took fourteen hours to stumble into.

Source