A 5G Error-Correction Decoder,
Generated From Python and Proven on Silicon

A specialised hardware decoder for the 5G New Radio low-density parity-check code, built not by hand but emitted from an executable model that is held bit-exact to the mathematics at every step.

Every 5G base station decodes millions of these blocks per second. The decoder sets the latency floor for the radio link and a large share of the silicon bill, so its speed, area, and above all its correctness are not academic - they ship.

463MHz
Clock, closed
timing signed off, setup + hold
2.01Gbps
Throughput
measured in simulation, 20 seeds
90BRAM
Block memories
vs the reference IP's 109
0DSP
Hard multipliers
none required
1.6×
vs reference IP
throughput, single engine
The claim this report defends

Generate the hardware from an executable model proven equal to the mathematics at every layer, and a single-engine decoder can beat a configurable, massively-parallel commercial core on throughput, clock, and area at once.

Results at a glance

Fast, small, and verified - on the reference part

Every number below is read directly from place-and-route sign-off and cycle-accurate simulation on xcku13p-ffve900-2-e - the exact device the commercial reference core is benchmarked on - so the comparison is like-for-like.

MetricMeasuredHow it was obtained
Clock frequency463.2 MHzworst negative slack -0.059 ns, setup and hold both closed
Throughput (deployed)2011 Mbps20 random channels, early-stop on, ~2.55 passes per block
Latency per block1.92 usmean over 20 seeds (range 1.64 - 2.16 us)
Look-up tables43,39239,295 logic + 4,097 as distributed memory
Flip-flops35,496pipeline registers
Block RAM90+ 3 half-blocks (91.5 tiles)
Hard multipliers0the scaling is a shift-and-add in fabric

Those are the raw figures. The interesting question is what they look like next to the part's incumbent: a configurable commercial LDPC core.

MetricThis designReference IPReading
Clock463.2 MHz459 MHzmatched on the same part, single engine
Throughput (as deployed)2011 Mbps1225 Mbps1.6× - from a leaner engine, not a wider one
Block RAM90~10917% less on-chip memory
Look-up tables~43 K~49 Kspecialised vs configurable
Flip-flops~35 K~57 Kspecialised vs configurable
So what?

The reference core wins raw compute density by brute force - it instantiates 128 check engines in parallel. This design runs one engine, then wins the deployed race anyway by doing far less work per block (more on that below) and stopping the moment the answer is correct. Same clock, less memory, higher delivered throughput. That is the area-and-clock half of the thesis, measured.

Where the silicon goes. The forward and back-end halves of the datapath dominate; control and the early-stop check are nearly free.

BlockLUTs% of logicFlip-flopsRole
Forward half20,77948%7,034build messages, find the two smallest
Back-end half16,32938%18,597check-node maths + belief update
Memory + glue4,60411%9,125holds all 90 block RAMs
Early-stop check1,2243%392free-running, ends decoding early
Overlap delay line315<1%301shift-register bridge
Schedule controller51<1%15one address counter, no more
The tension

Two ways to build hardware, and both give something up

A digital-radio block like this is traditionally built one of two ways, and each carries a cost that shows up exactly where you cannot afford it - in a part that ships at volume and must be provably correct.

Hand-written hardware
Fast and lean, because a human controls every wire - but slow to write, easy to get subtly wrong, and almost impossible to review line-by-line for a safety case. A one-cycle misalignment can hide for weeks.
High-level synthesis
Safe and quick, because a compiler writes the wires - but it cedes control of the exact silicon, and the bit-level behaviour is whatever the tool decided, which is the one thing a standard-compliant decoder cannot leave to chance.
The unmet need

How do you get hand-written speed and area, compiler-grade safety, and bit-exact, reviewable correctness - all three, with no trade?

The trade nobody wants to make Correctness Speed & area Productivity hand RTL drops productivity HLS drops control this work aims for the centre
The three goals pull against each other; the approach below targets the middle.

The resolution is a single idea about where the hardware comes from.

The approach · part 1, the idea

The hardware is generated from a model, not written by hand

The decoder is the output of a framework that turns a Python description of an algorithm into synthesisable hardware through a chain of models, where each model is proven to behave identically to the one above it. Correctness is not tested in at the end; it is carried down by construction.

Three models, each proven equal to the one above Mathematical reference plain Python - the published algorithm, no notion of time or hardware Cycle-accurate model Python with explicit pipeline stages - the architecture, clock by clock Synthesisable hardware must match the cycle model wire-for-wire, every clock proven equal proven equal adapts to adapts to The golden rule change the model first, re-prove every test, then regenerate the RTL. never edit the hardware directly. Five proofs of equality maths = cycle model cycle model = hardware throughput + latency met place & route signed off two simulators agree
A change starts at the top and is proven downward; the hardware is never the source of truth.
Generate where known
When the algorithm matches a pattern the framework already understands, the hardware is emitted from a recipe - correct by construction, no hand-coding.
Explore where novel
When a block is new, it is built under a strict checklist - reference-grounded, debugged against the model, then folded back into the library for next time.
Right silicon block first
Every operation is mapped to the dedicated resource that fits it - small delays to shift registers, tables to distributed RAM, bulk storage to block RAM - before any tuning.
Do the necessary work only
Before making anything faster, the framework asks whether the work is even needed. For this code, that question alone removed most of it - see the algorithm section.

That is the philosophy. To see why it pays off here, you have to understand what the decoder is actually computing.

The approach · part 2, the problem it solves

Decoding a 5G low-density parity-check code

The 5G New Radio standard protects data with a parity-check code: a large, sparse set of parity equations the transmitted bits must satisfy. The decoder receives noisy soft estimates of each bit and iteratively reconciles them against the equations until every parity check is satisfied. The structure of those equations is what makes efficient hardware possible.

Bits and parity checks form a sparse graph bit estimates (variable nodes) b0 b1 b2 b3 b4 message parity equations (check nodes) c0 c1 c2 each edge = one bit appearing in one parity equation
Belief flows back and forth along the edges until every check is satisfied.

Belief propagation, one equation at a time

The decoder sweeps the parity equations in layers. For each equation it gathers the current belief of every participating bit, decides how strongly that equation wants each bit to flip, and writes the updated belief straight back before moving on. Updating immediately - rather than at the end of a full sweep - lets the answer converge in just two or three passes.

read beliefsfrom memory form messagessubtract old reply check nodescaled min-sum update beliefwrite back repeat for every equation, then check if all parities hold
The check-node maths, in plain terms

A parity equation's reply to a bit is the smallest incoming message from the other bits, scaled down slightly, with a sign set by their combined parity. No multiplications, no look-up of a transcendental - just compare, select, and a fixed three-quarters scale done as a shift-and-add.

The structure that makes it parallel: lifting

The 5G code is built from a small template of equations, then "lifted" 384-fold: every entry of the template becomes a 384-wide block in which all 384 copies do the same operation, merely rotated by a fixed amount. That is why a single engine, 128 lanes wide, can stream the whole code - and why the only data-shuffling primitive needed is a cyclic rotation.

One template entry expands into a 384 × 384 rotation block small template 17 . . entry = "rotate by 17" lift ×384 lifted 384 × 384 block: a shifted identity every "1" sits 17 positions along the diagonal - all 384 lanes rotate by the same 17 = one barrel rotate
Because every block is just a rotation, the hardware needs one barrel shifter, not a custom permutation per equation.

The biggest win came from doing less, not going faster

At the rate this decoder targets, the code's template has 42 layers of equations - but only 9 of them carry transmitted bits. The other 33 each touch a single bit that the standard never sends, so their parity reply is always zero: real work, scheduled, executed, with no effect on the answer.

Skipping those dead layers is provably identical to grinding through all 42 - same decoded bits, same convergence - but it cuts the work per pass by roughly three times. That single observation, found by asking "which operations actually change the output?", is most of the throughput lead over the reference core, and it dropped the block-RAM count too.

So what?

The reference core processes the full template and wins by sheer parallelism. This decoder processes a third of the template on one engine and arrives first. Necessary-work-first beat brute force.

42 scheduled layers, 9 that matter 9 live 33 dead work per pass: 3× less   ·   decoded result: identical a dead layer's parity reply is always zero - removing it changes nothing
Proven bit-identical to the full computation over thousands of trials.
Code parameterValueMeaning
Lifting size384lanes per template entry (a single engine streams 128 at a time, three sub-blocks per entry)
Information bits / block3,840payload protected by the code
Transmitted bits / block6,528payload plus parity actually sent
Code rate0.588fraction of transmitted bits that are payload
Live equation layers9 of 42the rest carry untransmitted bits and are skipped
Soft-input precision6 bitschannel estimate per bit; beliefs held to 10 bits, messages to 7
Check-node scale3/4normalised min-sum, done as a shift-and-add

A rotation-based engine that skips dead work and stops early - now, how is it actually wired?

The approach · part 3, the machine

One streaming engine, two overlapping halves

The decoder is a single pipeline driven by one address counter. A schedule memory plays out the list of edges; a forward half reads beliefs, builds messages, and finds the two smallest; a back-end half turns those into parity replies and writes updated beliefs back. A short delay line lets the two halves run on top of each other so a new edge can enter every clock.

Top-level datapath - new edge accepted every clock schedule one counter drives all belief + message RAM block RAM ×90 FORWARD HALF form msgsaturating rotatebarrel two-smallest finder+ sign accumulate 48% of the logic overlap delay line shift register, depth 31 BACK-END HALF check nodescaled min-sum un-rotatebarrel update beliefsaturating add 38% of the logic early-stop check free-running, ends the decode address read control updated beliefs written back signs
The forward half (blue) and back-end half (green) overlap through the delay line so the engine never stalls; the cyan checker watches the signs and ends decoding the instant all parities hold.
Design principle on display: one timeline

There is exactly one counter in the whole decoder - the schedule address. Every other timing reference is derived from it by a fixed delay. No block keeps its own counter. This is what makes a streaming iterative decoder debuggable: there is a single clock to reason about, so a misalignment has one place to be.

Watch one edge flow through the pipeline

The grid below traces a single edge as it moves stage by stage. Hover or tap any cell to follow that edge: the highlight steps diagonally because each stage receives the edge one clock later than the last. The long gap before the back-end is the overlap delay line doing its job - while this edge waits, later edges keep pouring in behind it.

schedule read belief form msg rotate two-smallest delay line check node un-rotate update belief write back
Hover a cell to follow one edge through the pipeline and see which clock each stage receives it on.
forward half memory / delay line back-end half

That is the architecture. The next question is what the heavy blocks look like inside - and where the silicon actually goes.

Inside the engine

The modules that matter

Six blocks account for nearly all the logic. Click any block in the map to open its internals; each panel shows the pipeline, the bus widths, the arithmetic, and its real measured cost from place-and-route.

Where the look-up tables go (click a block) decoder engine forward half back-end half two-smallest 8,921 LUT · 21% barrel rotate 5,236 LUT · 12% message form 6,622 LUT · 15% belief update 5,389 LUT · 12% un-rotate 5,237 LUT · 12% check node 3,961 LUT · 9% overlap delay line shift register bridge early-stop check free-running side channel

Two-smallest finder - the largest single block

A parity equation's reply needs only the smallest and second-smallest incoming message magnitude, plus which lane held the smallest, plus the running sign parity. This block streams the edges of an equation through a compare-and-keep network, holding the running winners in distributed RAM, and snapshots them when the equation ends so the back-end can read a complete equation while the next one is still arriving.

Compare-and-keep, per lane, across an equation's edges T+1 commit magnitude128 × 6b < smallest ?keep + demote < second ?keep running winners smallest / second lane index / sign parity distributed RAM, read-modify-write per-equation snapshot latched at last edge back-end reads here while next equation streams running value fed back for next edge snapshot on last edge
Holding running winners in distributed RAM, with a separate snapshot per equation, is what lets the engine accept a new edge every clock.
ParameterValueNotes
Look-up tables8,9216,137 logic + 2,784 distributed RAM
Flip-flops2,470commit registers
Bus in / out128 × 6bmagnitude per lane
Memorydistributed RAMrunning winners + per-equation snapshot

Barrel rotate - cyclic shift for a non-power-of-two width

Lifting demands a cyclic rotation by an arbitrary amount across the lanes. Because 384 is not a power of two, the rotation is staged: each stage shifts by a fixed power of four selected from two bits of the rotation amount, built as a balanced four-to-one multiplexer on the raw select bits rather than a chain of equality compares. The balanced form is roughly 2.4 times smaller than the naive cascade, and the same module is reused in reverse on the way out.

Staged four-to-one rotation - two select bits per stage lanes in256 × 7b stage 0 shift 0/1/2/3 sel = bits[1:0] stage 1 shift 0/4/8/12 sel = bits[3:2] register cut stage 2 shift 0/16/32/48 sel = bits[5:4] stage 3 shift 0/64 sel = bit[6] rotated128 × 7b a mid-stage register cut splits the rotation to help timing - more on that in the journey
Four small stages compose any rotation up to 127; the register cut between them is the timing lever, not a redesign.
ParameterValueNotes
Look-up tables5,236each direction (forward + reverse instance)
Flip-flops1,702 / 3,047forward / reverse (reverse holds the cut)
Buildbalanced 4:12.4× smaller than a compare cascade

Message form - belief minus old reply, made safe

A bit's message to an equation is its current belief minus the reply that equation last sent it. The belief is wider than the message bus, so the result is narrowed - and crucially it is saturated, never truncated. Truncating a value that overflows flips its sign, which would silently corrupt the decode; saturation clamps it to the largest representable magnitude instead. The same care appears in the cycle model and the hardware, identically.

Subtract, detect overflow, clamp - per lane belief10b old reply 7b overflow?check high bits clampsaturate, not wrap sign + magnitude1b + 6b message out128 × 7b
Combinational and fully parallel across 128 lanes; the clamp is the difference between a correct decoder and a subtly broken one.
ParameterValueNotes
Look-up tables6,622two instances (3,331 + 3,291)
Flip-flops0purely combinational
Width in / out10b → 7bsaturating narrow

Check node - the parity equation's reply

For each lane the equation replies with the smallest other message - so the lane that held the global smallest gets the second-smallest instead. That selected magnitude is scaled by three-quarters (a subtract-and-shift, no multiplier) and signed by the combined parity. A register sits at the natural midpoint of this short chain; moving exactly which value it captures was one of the timing levers that reached the final clock.

Select the right minimum, scale by 3/4, apply sign smallest / second this lane the min? selectother-minimum register × 3/4subtract + shift capclamp magnitude ^ parity sign reply out128 × 7b
No multiplier and no look-up table: the entire check-node maths is compares, a select, a shift-subtract, and an exclusive-or.
ParameterValueNotes
Look-up tables3,961fully parallel, 128 lanes
Flip-flops998midpoint register
Multipliers0scale is shift-and-add

Belief update - old belief, minus old reply, plus new reply

The new belief for a bit is its old belief with the equation's previous reply removed and the fresh reply added in. Both steps saturate. The two additions were originally one long carry chain that limited the clock; splitting them with a register at the midpoint roughly halved the chain depth and was the single largest timing gain of the whole campaign.

Two saturating adds, split by a register old belief 10b old reply clamp register (the split) partial belief + new reply clamp new belief 10b
The register at the midpoint turned an 8-deep carry chain into two shorter ones - worth roughly 39 MHz on its own.
ParameterValueNotes
Look-up tables5,389128 lanes, two saturating adds
Flip-flops2,176midpoint + companion delays
Belief width10bsigned, saturating

Un-rotate - the reverse barrel

The parity replies come out in the equation's rotated frame and must be rotated back before they update the beliefs. This is the same staged four-to-one rotation as the forward barrel, run with the complementary shift, and it carries an extra register stage because the reverse path needed the deeper pipeline to close timing. Reusing one rotation module in both directions is the kind of saving the model-to-silicon library is built to capture.

ParameterValueNotes
Look-up tables5,237same primitive as the forward barrel
Flip-flops3,047extra register stage for timing
Buildbalanced 4:1complementary shift amount
Reuse in action

The forward and reverse barrels are one parameterised module. In the build's hierarchy report they even carry a name inherited from an earlier satellite-communications decoder in the same library - concrete evidence that the primitives travel across applications, not just across instances.

Overlap delay line - a shift-register bridge

For the two halves to overlap, the back-end must process equation N while the forward half is already streaming equation N+1. The delay line carries the forward half's per-edge metadata and signs forward by a fixed depth so the back-end sees a complete equation at exactly the right clock. It is built from the device's hardware shift registers, so a 31-deep delay across 128-plus signals costs almost nothing - a few hundred look-up tables.

ParameterValueNotes
Look-up tables315plus 187 shift-register primitives
Depth31 clocksmatches the forward-to-back latency
Carriesmetadata + signseverything the back-end needs, aligned

Early-stop check - a free-running side channel

A small block watches the bit signs as they stream past and accumulates each parity equation's result. The instant every equation is satisfied, it raises a flag and decoding stops - typically after just two or three passes instead of a fixed maximum. It runs entirely off the existing data stream with no back-pressure and no extra memory traffic, which is why it costs only three percent of the logic yet roughly doubles the delivered throughput.

ParameterValueNotes
Look-up tables1,224about 3% of logic
Flip-flops392per-equation accumulators
Effect~2.55 passesvs a fixed iteration count
Why it is a correctness feature, not just a speed trick

In an overlapped pipeline, simply stopping is not safe: a few in-flight edges from the next pass would overwrite the converged answer. The stop is gated so those drain harmlessly. Treating early termination as part of correctness - not an optional optimisation - is what lets it be trusted.

Those are the parts. Getting them from a working decoder to a 463 MHz one was a campaign in its own right - and the dead-ends are as instructive as the wins.

Reflection · the optimisation campaign

From 222 to 463 MHz, including the wrong turns

The first working build ran at 222 MHz. Reaching the reference part's clock took a sequence of register placements, each validated against the model before it was kept. Two of the most useful lessons came from attempts that failed, and they reframed how the whole problem was understood.

Clock frequency across the campaign 250 350 450 reference IP: 459 MHz 222 243 246 279 301 307 346 343↓ 348 371 399 400 449 462 463 the dip is a logic fix that routing ate back - the clue that routing, not logic, was binding final, signed off
Each bar is a sign-off measurement, not a prediction. The dashed line is the reference core's clock; the dashed-red bar is a step that regressed.

Register inside the long path, not at its edges

Early attempts placed registers at module boundaries and did nothing - they only moved the endpoints of the slow path. The gains came once registers were cut inside the long combinational chains, splitting them in two.
first real win: 265 → 279 MHz, then 279 → 301 MHz

Split the carry chain in the belief update

The two saturating additions were one deep carry chain. A register at the midpoint halved it - the single biggest jump of the campaign.
307 → 346 MHz, +39 MHz from one register

A perfect logic fix that made things slower

One change cut a slow path's logic cleanly, passed every correctness check bit-for-bit, and even saved resources - yet the clock dropped. The path was limited by wire length, not logic depth, so shortening the logic just let routing spread out and give the time back. This wrong turn revealed that the remaining gap was a routing problem, which redirected the whole approach.
346 → 343 MHz (regression), kept as evidence, not merged

Registers at the memory write, six times over

Once routing was understood as the limiter, the fix was to register each memory write so the long wire from compute to memory got its own clock stage. Six such cuts - each re-validated against the model and against convergence - carried the design the rest of the way.
348 → 463 MHz, setup and hold both closed
An honest counter-intuition

Conventional wisdom says do not over-tighten the timing goal. Here, tightening it past the apparent ceiling drove the placer to work harder and actually raised the achieved clock, because this design had placement headroom rather than a logic wall. The lesson is not "always over-constrain" - it is "measure which resource is actually binding before trusting a rule of thumb."

The campaign closed the clock. Stepping back, what does the whole exercise actually demonstrate?

Significance & outlook

What this unlocks beyond one decoder

The headline is a single-engine 5G decoder that beats a configurable commercial core on throughput, clock, and memory at once, on the core's own benchmark part. But the transferable result is the method that produced it.

Because the hardware is generated from a model that is proven equal to the mathematics, the same flow re-targets a different code rate, a different block size, or a different standard by changing parameters, not by re-verifying hand-written wires. The primitives - the rotation, the two-smallest finder, the saturating arithmetic - already travel between this decoder and others in the library. And the optimisation lessons (register inside the path, find the binding resource before fixing, treat early-stop as correctness) are not specific to error-correction at all.

The thesis, now earned

Start from a model proven equal to the maths, do only the necessary work, map every operation to the right silicon, and a specialised single engine wins on throughput, clock, and area together - measured on the incumbent's own part. The hook asked whether you can have speed, safety, and correctness without trading. The answer, on real silicon, is yes.

Retargetable
New rate, size, or standard is a parameter change - the proof chain regenerates, it is not re-derived by hand.
Reusable primitives
Rotation, two-smallest, saturating maths already shared across multiple decoders in the library.
Portable lessons
"Register inside the path", "find the binding resource first", "early-stop is correctness" generalise far beyond LDPC.
Verifiable
Five independent equalities, from maths down to two simulators, make the result reviewable line by line.

The full scorecard

DimensionThis designReference IPVerdict
Clock (same part)463.2 MHz459 MHzmatch
Delivered throughput2011 Mbps1225 Mbps1.6×
Block RAM90~109leaner
Look-up tables~43 K~49 Kleaner
Flip-flops~35 K~57 Kleaner
Hard multipliers00match
Engines1128-wayspecialised vs configurable
Correctnessbit-exactstandardcross-checked to a reference toolkit

Reference-core figures are the published datasheet envelope for the same device; this design's figures are place-and-route sign-off and cycle-accurate simulation from the current build. The pure architecture-to-architecture comparison (same clock, same fixed iteration count) favours the 128-way core on raw compute density - the win above comes from doing less work per block and stopping early, which is the point.

APPENDIXDesign methodology, in depth