Putting a quantum error-correction decoder on an FPGA

AlgoSilicon engineering · 2026

A fault-tolerant quantum computer reads an error signal every measurement cycle and must infer the most likely physical errors before the next cycle arrives. For superconducting qubits that budget is on the order of a microsecond per round. The decoder is therefore a hard real-time classical machine, not an offline solver, and it is one of the things standing between today's lab machines and a useful one.

This is the story of re-building IBM's Relay-BP decoder for the [[144,12,12]] gross code, from the algorithm down to placed-and-routed silicon, with our algorithm-to-hardware flow. The honest result up front: it lands on the same silicon envelope as the published reference, and the genuine contribution is a clean map of which optimizations move the needle and one algorithmic lever that does.

What the decoder actually does

Plain belief propagation is the workhorse of classical error-correction decoding: every check shares what it knows with every error bit, the bits update their beliefs, and after a few rounds the most likely error pattern falls out. On quantum codes it stalls. The structure of the code creates symmetric traps where the beliefs oscillate and never settle. Relay-BP adds three ideas that break the symmetry.

Belief propagation on the code graph plus the three Relay-BP ideas: a memory bias that blends each bit's prior with its last belief by a per-node strength, relaying that carries beliefs from one attempt to the next with fresh strengths, and ensembling that keeps the best valid solution and stops on convergence — The three ideas layered on belief propagation: a memory bias, relaying between attempts, and ensembling the best converged solution.

The memory bias blends each bit's prior with its previous belief by a per-node strength that is allowed to be negative. Relaying runs the decode as several legs, each starting from the last leg's beliefs with freshly drawn strengths, which is like trying a new path to the answer. Ensembling keeps the lowest-weight valid solution and stops the moment one leg converges. On the gross code this beats general-purpose post-processing decoders by about an order of magnitude in logical error rate (IBM, arXiv:2506.01779).

The machine, up close

Everything that follows is a property of four small units repeated across the code graph. It is worth seeing them at the circuit level first, because the cost, the wall, and the win all point back to one of these four.

The check unit is one per check. It finds the two smallest incoming magnitudes and the position of the smallest, combines the signs into a parity, and sends out only that compressed tuple. Finishing the per-edge detail is deferred to the bit side, which is what keeps the output bus narrow and the wiring affordable.

Check unit micro-architecture: split sign-magnitude messages, a two-minimum merge tree producing the smallest and second-smallest magnitude plus the selector, a sign XOR into a parity bit, and a compressed four-field output to the bit units — Check unit: a two-minimum merge tree plus a parity sign, emitting one compressed tuple instead of a value per edge.

The bit unit is one per error bit. A small selector rebuilds each incoming message from the compressed tuple, the bit sums them with its prior and saturates, and a forward-backward subtract produces each outgoing message and the current hard decision. That rebuild step is what pays back the check unit's compression.

Bit unit micro-architecture: a selector multiplexer rebuilds each check message from the compressed tuple, a sum-and-saturate accumulates with the prior, and a forward-backward subtract produces each outgoing message and the hard decision — Bit unit: a selector rebuilds the deferred minimum, then accumulate, saturate, and a forward-backward update per edge.

The memory-bias unit is the Relay-BP idea in hardware, one per bit. It blends the prior with the last belief in a single multiply and a shift, and a pinned bit passes its prior straight through. The per-node strength can be negative, and that is precisely what lets a decode escape the symmetric stall.

Memory-bias unit micro-architecture: a difference, one signed multiply by the per-node strength, an arithmetic shift, an add back to the prior, and a pinned-bit passthrough multiplexer — Memory-bias unit: one multiply, one shift, add to the prior, with a pinned-bit passthrough.

The control wraps the loop. A leg counter indexes a table of per-leg strengths feeding the memory-bias unit; the belief carries from one leg to the next. A count of the still-failing checks drives two registers: one remembers the best attempt so far, the other counts how long progress has stalled. A cleared error signal ends the decode at once. This is where the one real optimization lives.

Control micro-architecture: a per-leg strength table feeds the memory-bias unit, a count of failing checks drives a keep-best register for ensembling and a stall counter for early termination, and a feedback path starts the next leg carrying the belief forward — Control: a per-leg strength table, a keep-best register for ensembling, and a stall counter for early termination.

Why the hardware is large, and where the clock goes

The reference silicon decoder is fully parallel: one compute unit per graph node, one wire per edge, a whole round in two clock cycles. There is no central message memory. The messages live in the wiring itself, which is why the design is dominated by routing rather than logic.

Fully-parallel decoder: a check stage and a bit stage exchange messages over the wiring in a two-cycle loop, with a syndrome check driving early termination, and no central message memory — Fully parallel: a two-cycle loop with messages carried in the wiring, not a memory.

We measured a synthesizable distance-5 build, a faithful proxy for the gross code, and compared it against the published gross-code silicon. About 70% of every critical path is wire delay moving messages across the die, an intrinsic property of this kind of code graph. The clock is bounded by wiring, not by logic. That is why the result lands on the reference envelope rather than beating it: the wiring is the wall, and it is the same wall for anyone building this fully parallel.

About 70% of each critical path is routing across the die rather than logic, so the clock and resource use land on the published silicon envelope rather than beating it — Roughly 70% of each path is routing across the die. The clock and resource use land on the published envelope.

What moved the latency, and what did not

Decode latency is iterations times cycles-per-iteration divided by clock. With the clock pinned by routing, the honest map of what moves latency is short, and several intuitive levers actively hurt.

Lever map: choosing the right number precision cut iterations to a quarter and helped, early termination cut worst-case latency and helped, while pipelining the compare tree, time-multiplexing the compute, and compacting the floorplan were dead ends for a latency-first design — The measured outcome of each lever. Two helped; three were dead ends for a latency-first design.

Choosing the right number precision cut iterations to a quarter, a real win: a narrower format saved area but converged far slower, so its saving was a latency trap. Pipelining the compare tree raised the clock but added a cycle per iteration, making per-decode latency worse. Time-multiplexing the compute could not meet the microsecond budget. Compacting the layout into fewer die regions traded crossing delay for congestion and regressed timing. The lever that works is algorithmic: end an attempt the moment it stops making progress.

Early termination on a stalled count of failing checks cuts worst-case iterations by about a third, and the logical error rate improves because each fresh relay explores a new solution path, at a cost of about half a percent of area — Early termination on a stalled check count cuts worst-case iterations by about a third, and accuracy improves.

Early termination cut worst-case iterations by about 32% with no accuracy loss. The logical error rate actually improved, because each fresh relay explores a new path to a valid solution. In hardware it costs a count of the failing checks and a small counter, about half a percent of area and a couple of megahertz of clock, a clear trade for a latency-first decoder.

What this says about the method

The decoder, generated from a model proven equal to the math at every layer, sits on the published silicon envelope for the gross code: the same clock regime, the same resource scale. It is not faster, because the wall is the code graph's wiring, which is the same for any fully parallel build. The transferable result is the loop and the honest map: on a wiring-bound design the clock is a property of the graph, so spend effort on fewer iterations or fewer wires, not on pipelining or floorplanning, and validate any early-stopping policy on the logical error rate at scale, not just on convergence. Every number here traces to a measured run, and every layer is held to the one below it bit-for-bit.

For the full technical detail (the circuit-level micro-architecture of all four units, the measured resource and clock tables, and the verification chain) see the technical report.