Relay-BP qLDPC Decoder on FPGA: Design Report

1. The problem and the algorithm

A quantum error-correcting code is decoded every measurement cycle: the hardware reads a syndrome (which checks fired) and must infer the most likely set of errors fast enough that the backlog never grows. For superconducting qubits that budget is on the order of a microsecond per round, so the decoder is a hard real-time classical machine, not an offline solver.

Plain belief propagation, the workhorse of classical LDPC decoding, stalls on quantum codes: the stabilizer structure creates symmetric trapping sets where beliefs oscillate and never settle. Relay-BP (IBM's "Relay-ensembling with locally-averaged memory") adds three ideas that break that symmetry.

Algorithm mechanics: Tanner graph and the three Relay-BP ideas — The three ideas layered on belief propagation: a disordered memory bias, relaying between legs, and ensembling the best converged solution.

The memory bias blends each variable's prior with its previous belief by a per-node strength that may be negative; relaying carries beliefs from one leg to the next while redrawing those strengths; ensembling keeps the lowest-weight valid solution and stops once one converges. Empirically this beats general-purpose post-processing decoders by about an order of magnitude in logical error rate on the gross code [IBM, arXiv:2506.01779].

2. The hardware architecture

The reference silicon decoder is fully parallel: one compute unit per graph node and one wire per edge, so an entire belief-propagation iteration completes in two clock cycles. There is no central message memory; the messages live in the wiring fabric itself.

Fully-parallel decoder architecture, 2-cycle iteration — Two-cycle loop: check-node units emit a compressed per-check output; variable-node units reconstruct the exclusive minimum, accumulate, and apply the memory bias; a syndrome check drives early termination.

Two encodings keep that wiring affordable. Messages are carried in sign-magnitude form, so the magnitude is free of a sign bit on every edge. And each check node emits a compressed output (a sign, a selector, and the two smallest magnitudes) rather than a value per edge; the per-edge exclusive minimum is finished inside the variable node. Both shrink the bus that dominates the design.

3. The four compute units, up close

Everything that follows (the wiring ceiling, the lever map, the early-termination win) is a property of four small units repeated across the graph. This section is the foundation: each unit is shown at the datapath level, exactly as it is generated, so the later analysis can point back to a concrete circuit rather than a metaphor.

3.1 Check-node unit: the min-sum core

One per check. It splits each incoming sign-magnitude message, feeds the magnitudes into a two-minimum merge tree that finds the smallest and second-smallest plus the index of the smallest, and XORs the signs (against the measured syndrome bit) to a parity. Only that compressed tuple (parity, selector, two magnitudes) leaves the node. The exclusive-minimum-per-edge is deliberately not finished here; deferring it is what keeps the output bus narrow, and it is the single biggest reason the wiring stays affordable.

Check-node micro-architecture: two-min merge tree and compressed output — Check-node unit: sign-magnitude split, a two-minimum merge tree (min1, min2, selector), sign XOR to parity, and a compressed four-field output to the variable nodes.

3.2 Variable-node unit: reconstruct and update

One per error bit. It reconstructs each check-to-variable message from the compressed tuple (picking the second-smallest magnitude when this edge is the selected minimum and the smallest otherwise, with the sign rebuilt from the parity), then sums all incoming messages with the prior, saturates, and emits a forward-backward update per edge (total minus this edge) plus a hard decision. The reconstruct MUX is the piece that pays back the check node's compression.

Variable-node micro-architecture: reconstruct MUX, forward-backward accumulate, hard decision — Variable-node unit: a selector MUX reconstructs the deferred exclusive-minimum, a sum-and-saturate accumulates with the prior, and a forward-backward subtract produces each outgoing message plus the hard decision.

3.3 Memory-bias unit: the disordered-memory prior

This is the Relay-BP innovation, one instance per variable. It blends the channel prior with the last belief through a single signed multiply and an arithmetic shift (prior + strength × (belief − prior) >> M), and a pinned node passes its prior straight through. The per-node strength can be negative; that is precisely what lets the decoder escape the symmetric stalls that trap plain belief propagation, and it is why §1's "memory bias" is cheap in hardware.

Memory-bias micro-architecture: single multiply, arithmetic shift, pinned passthrough — Memory-bias unit: difference, one signed multiply by the per-node strength, arithmetic shift, add to the prior, and a pinned-node passthrough MUX.

3.4 Relay and early-termination control: the outer loop

The control wraps the BP core. A leg counter indexes a memory-strength table (one row of per-node strengths per leg) that feeds the memory-bias unit; the belief carries forward from one leg to the next. A syndrome popcount counts the failing checks and drives two registers: a keep-best register that remembers the lowest-weight attempt (the ensembling), and a stall counter that ends a leg the moment progress stops for a patience window (the early termination). A cleared syndrome ends decoding immediately. This unit is where the optimization in §5 lives: the latency lever is the stall path drawn here.

Relay/ensemble and early-termination control micro-architecture — Control: a per-leg memory-strength table feeds the bias unit; a syndrome popcount drives a keep-best register (ensembling) and a stall counter (early termination), with a feedback path that starts the next leg carrying the belief.

Read the rest of the report against these four. The wiring ceiling (§4) is the cost of carrying the check node's compressed bus across the die; the lever map (§5) acts on the control unit's stall path and the widths inside the merge tree; correctness (§6) is bit-exactness checked at every one of these datapaths.

4. Results and the wiring ceiling

Measured on the mid-scale distance-5 problem (a synthesizable proxy for the gross code) and compared against the published gross-code silicon:

Configuration	LUT	FF	Clock	Provenance
This work, distance-5, fully parallel	476,318	105,457	~85.7 MHz	MEASURED post-route, place-directive median
This work, gross extrapolation	~2.2M	-	-	ESTIMATE linear scale from distance-5
Published reference, gross, in silicon	2,106,738	540,767	~83 MHz	arXiv:2510.21600 (XCVU19P, 58 W, 24 ns/iter)

Critical path is mostly routing; lands on the published envelope — About 70% of each critical path is moving messages across the die, not computing. The clock and resource use land on the published silicon envelope.

The clock is bounded by wiring, not logic: roughly 70% of the critical path is route delay across the die, an intrinsic property of the expander-style code graph. That is why the result matches the reference envelope rather than beating it: the wiring is the wall, and it is the same wall for everyone building this fully-parallel.

So what: at a fixed, wiring-bound clock, decode latency is set by iteration count, not by squeezing the clock. That reframes where the real optimization lives.

5. The optimization: real levers vs dead ends

Decode latency is iterations multiplied by cycles-per-iteration, divided by clock. With the clock pinned by routing, the honest map of what moved latency is short, and several intuitive levers actively hurt.

Optimization lever map: which levers moved latency — Measured outcome of each lever on the distance-5 decoder. Two helped; three were dead ends for a latency-first design.

Choosing the right number precision cut iterations to a quarter (a smaller width converged far slower, so its area saving was a latency trap). Pipelining the comparison tree raised the clock but added a cycle per iteration, making per-decode latency worse. Time-multiplexing the compute could not meet the microsecond budget. Floorplanning the layout into fewer die regions traded crossing delay for congestion and regressed timing.

The lever that worked is algorithmic: end an attempt as soon as it stops making progress.

Early termination: worst-case iterations down 32%, accuracy improves — Early termination on a stalled check count cuts worst-case (p95) iterations by about a third, and logical accuracy improves because each fresh relay explores a new path.

Early termination cut worst-case iterations by ~32% with no accuracy loss. In fact the logical error rate improved (each fresh relay explores a new solution path). In hardware it costs a popcount and a small counter: about +0.5% area and ~2 MHz, a clear trade for a latency-first decoder. [iteration counts from the cycle-accurate model; RTL bit-exact; +0.47% LUT / ~-2 MHz measured post-route]

6. How correctness was assured

Every layer is held to the one below by bit-exact equality, so an optimization can never silently change the math.

Math reference → cycle-accurate model → RTL. The math model is validated against the independent reference; the cycle-accurate model is validated against the math; the RTL is checked cycle-for-cycle against the model. All algorithm changes are made in the model first, then translated.
Measure, do not compute. Pipeline depths and iteration counts are read from simulation, never estimated, because estimates were repeatedly wrong by a few cycles in ways that corrupt convergence.
Hardest input first. Low-precision and convergence claims are tested on the hardest realistic inputs, not the easy all-zero case that hides sign and saturation bugs.
Negative results are first-class. A fully-parallel gross build is not synthesizable on a generic flow; that was recorded, not hidden, and motivated the time-multiplexed path reserved for that scale.

7. Guidelines that transfer

On a wiring-bound design, the clock is a property of the graph, not the logic; spend effort on fewer iterations or fewer wires, not on pipelining or floorplanning.
Lower precision is only a win if it does not cost iterations; validate the iteration count, not just the area.
Clock and per-decode latency are different objectives. A change that raises the clock can make latency worse if it adds a cycle per iteration.
Validate any early-stopping policy on logical error rate at scale, not just convergence: the effect is subtle and can go either way.

8. Conclusion

The decoder, generated from a model proven equal to the math at every layer, sits on the published silicon envelope for the gross code: same clock regime, same resource scale. It is not faster because the wall is the code graph's wiring, which is the same for any fully-parallel implementation. The transferable contribution is the early-termination lever, a measured ~32% cut in worst-case latency at almost no area cost, validated to improve rather than degrade accuracy. Beyond this point, the levers are algorithmic (fewer iterations) or platform-level (a looser latency budget unlocking time-multiplexed area savings), not RTL tuning.