When a Timeout Is Not a NotFound: Reproducing Zebra's Sync Stall in TLA+
A bug that takes fourteen hours to reproduce on mainnet is a bug you cannot
iterate on. Zebra's genesis-to-tip sync stall — where the block download pipeline
froze for minutes in the sandblasting region around two million blocks, resetting
over and over and never reaching tip — was exactly that kind of bug. It was
diagnosed and fixed in
ZcashFoundation/zebra#10679
(symptom thread #5709).
This post shows that the same stall can be reproduced in seconds as a
model-checker counterexample — and that the fix can be shown to close it — with a
small TLA+ spec, sync_scheduler.tla.
The bug
During initial sync a Zebra node downloads blocks from a pool of peers, routed by an inventory registry that records which peer is believed to hold which block. Two defects combined to turn ordinary network hiccups into a stall the node could not sync past:
- Registry poisoning. The router marked a block "missing on peer P" on
any request failure — timeouts and dropped connections included, not just an
explicit
notfound. Once every peer holding a block had been poisoned by a transient error,getdatacould no longer be routed to a peer that had it. - Silent drop. A download that failed with
NotFoundwas discarded instead of re-queued, so a single frontier block stuck at the checkpoint boundary was never re-fetched — wedging the verify pipeline until an eight-minute timeout fired and the node reset, only to wedge again.
These only bite on a degraded peer set, where only a small minority of peers
are at tip; on a healthy network the redundancy hides them. The fix marks
inventory missing only on an explicit notfound, and re-queues NotFound
blocks under a bounded retry count, routed away from the peer that said
notfound. (The fix carries a third change — tolerating a duplicate-queued block
in a batch instead of dropping the rest — that is a batching detail the model
leaves out.)
This is not a wire-protocol bug, so ZIP-204 needs no change. It is an
implementation-level inference error — treating "no response" as "notfound" —
and it lives one layer above the per-connection protocol, in the download
scheduler and its routing registry.
The model
A single syncer downloads blocks 1..MaxBlock from a set of Peers. Ground
truth — which peer actually holds which block — is fixed, but the scheduler
never reads it directly; it learns a peer's contents only through deliveries
and notfound, exactly as a real node does. Two boolean constants select pre- or
post-fix Zebra:
BuggyRouting = TRUE transient error marks the peer "missing" (the poisoning bug)
= FALSE transient error leaves the entry "unknown"
BuggyDrop = TRUE notfound silently drops the block (the wedge bug)
= FALSE notfound re-queues with bounded retries
The whole poisoning bug fits in one line — the registry update on a transient error:
/\ avail' = [ avail EXCEPT ![p][b] = IF BuggyRouting THEN "missing" ELSE "unknown" ]
When BuggyRouting is true, a timeout on a peer that does hold the block writes
"missing" into the registry: a lie the router will believe forever.
The model checks two properties. RegistryHonest (safety) says the registry
never marks a peer missing for a block it holds — the formal statement of "a
timeout is not a notfound." EventuallyAllVerified (liveness,
<>(verified = Blocks)) says every block is eventually verified; the stall is a
violation of this. Strong fairness covers the progress actions, but
TransientError gets none — so the stall must arise from a finite, unlucky
burst of errors, not from errors happening forever.
The counterexample
With the buggy switches on and the smallest interesting peer set — p_tip holds
both blocks, p_lag holds only block 1 — TLC produces a five-state stall:
1. Request block 1 from p_tip
2. Deliver block 1 verified = {1}
3. Request block 2 from p_lag (which does not have it)
4. p_lag returns notfound for block 2 — silently dropped (BuggyDrop)
5. ... stutter forever, verified = {1}, never {1, 2}
The genesis-to-tip stall, in five steps, in under a second:
| Config | Switches | Checks | Result |
|---|---|---|---|
sync_scheduler_poison.cfg | buggy | RegistryHonest | violated — a timeout poisons a holder |
sync_scheduler_buggy.cfg | buggy | EventuallyAllVerified | violated — the stall |
sync_scheduler_fixed.cfg | fixed | invariants + liveness | holds |
Flip both switches to FALSE and the same model, peer set, and adversarial
errors complete the frontier every time. The model distinguishes the buggy and
fixed designs — which is the whole point.
Why bother, since it's already fixed
The value is that the properties generalize. RegistryHonest plus a
download-pipeline progress property — no transient failure can permanently
remove a block from the set of fetchable blocks — is a reusable harness against
the whole class of "transient error poisons routing / silently drops work"
regressions.
More broadly: in a peer-to-peer stack the wire protocol is the easy part. The subtle liveness bugs live in the inference a node makes from incomplete, adversarial information — "no response, therefore the peer doesn't have it." Those hide on healthy networks and surface only under load, and they are exactly what a model checker is good at flushing out, because it will happily find the one unlucky interleaving that production took fourteen hours to stumble into.
Source
- Symptom: zebra#5709 · Fix: zebra#10679
- Model: zcash-p2p-spec#3 —
sync_scheduler.tlaand the full write-up