From Python to Silicon - A Verified 5G LDPC Decoder

Results at a glance

Fast, small, and verified - on the reference part

Every number below is read directly from place-and-route sign-off and cycle-accurate simulation on xcku13p-ffve900-2-e - the exact device the commercial reference core is benchmarked on - so the comparison is like-for-like.

Metric	Measured	How it was obtained
Clock frequency	463.2 MHz	worst negative slack -0.059 ns, setup and hold both closed
Throughput (deployed)	2011 Mbps	20 random channels, early-stop on, ~2.55 passes per block
Latency per block	1.92 µs	mean over 20 seeds (range 1.64 - 2.16 µs)
Look-up tables	43,392	39,295 logic + 4,097 as distributed memory
Flip-flops	35,496	pipeline registers
Block RAM	90	+ 3 half-blocks (91.5 tiles)
Hard multipliers	0	the scaling is a shift-and-add in fabric

Those are the raw figures. The interesting question is what they look like next to the part's incumbent: a configurable commercial LDPC core.

Metric	This design	Reference IP	Reading
Clock	463.2 MHz	459 MHz	matched on the same part, single engine
Throughput (as deployed)	2011 Mbps	1225 Mbps	1.6× - from a leaner engine, not a wider one
Block RAM	90	~109	17% less on-chip memory
Look-up tables	~43 K	~49 K	specialised vs configurable
Flip-flops	~35 K	~57 K	specialised vs configurable

So what?

The reference core wins raw compute density by brute force - it instantiates 128 check engines in parallel. This design runs one engine, then wins the deployed race anyway by doing far less work per block (more on that below) and stopping the moment the answer is correct. Same clock, less memory, higher delivered throughput. That is the area-and-clock half of the thesis, measured.

Where the silicon goes. The forward and back-end halves of the datapath dominate; control and the early-stop check are nearly free.

Block	LUTs	% of logic	Flip-flops	Role
Forward half	20,779	48%	7,034	build messages, find the two smallest
Back-end half	16,329	38%	18,597	check-node maths + belief update
Memory + glue	4,604	11%	9,125	holds all 90 block RAMs
Early-stop check	1,224	3%	392	free-running, ends decoding early
Overlap delay line	315	<1%	301	shift-register bridge
Schedule controller	51	<1%	15	one address counter, no more

The approach · part 2, the problem it solves

Decoding a 5G low-density parity-check code

The 5G New Radio standard protects data with a parity-check code: a large, sparse set of parity equations the transmitted bits must satisfy. The decoder receives noisy soft estimates of each bit and iteratively reconciles them against the equations until every parity check is satisfied. The structure of those equations is what makes efficient hardware possible.

Belief flows back and forth along the edges until every check is satisfied.

Belief propagation, one equation at a time

The decoder sweeps the parity equations in layers. For each equation it gathers the current belief of every participating bit, decides how strongly that equation wants each bit to flip, and writes the updated belief straight back before moving on. Updating immediately - rather than at the end of a full sweep - lets the answer converge in just two or three passes.
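The layered sweep can be sketched in a few lines of Python. This is a minimal illustration on a toy code with three equations and seven bits, chosen only for the example; it is not the 5G code and not the generated hardware. The names `layered_min_sum` and `scale_3_4`, the sign convention (positive belief means bit 0), the rounding of the three-quarters scale, and the absence of fixed-point saturation are all assumptions.

```python
# Toy layered min-sum decoder (illustrative only).
# Assumed convention: positive belief -> bit 0, negative belief -> bit 1.

def scale_3_4(mag):
    """Three-quarters scale of a non-negative integer as shift-and-add."""
    return (mag >> 1) + (mag >> 2)

def layered_min_sum(equations, channel, max_passes=10):
    beliefs = list(channel)
    # One stored reply per (equation, bit) edge, initially zero.
    replies = [{b: 0 for b in eq} for eq in equations]
    bits = [1 if v < 0 else 0 for v in beliefs]
    for n_pass in range(1, max_passes + 1):
        for eq, rep in zip(equations, replies):
            # Message = current belief minus this equation's previous reply.
            msgs = {b: beliefs[b] - rep[b] for b in eq}
            for b in eq:
                others = [msgs[o] for o in eq if o != b]
                mag = scale_3_4(min(abs(m) for m in others))
                negative = sum(m < 0 for m in others) % 2 == 1
                rep[b] = -mag if negative else mag
                # Layered schedule: write the new belief back immediately.
                beliefs[b] = msgs[b] + rep[b]
        bits = [1 if v < 0 else 0 for v in beliefs]
        # Early stop: every parity equation is satisfied.
        if all(sum(bits[b] for b in eq) % 2 == 0 for eq in equations):
            return bits, n_pass
    return bits, max_passes

equations = [[0, 1, 2, 4], [0, 1, 3, 5], [0, 2, 3, 6]]
channel = [8, 8, 8, -2, 8, 8, 8]   # all-zero codeword, bit 3 arrives weakly wrong
bits, passes = layered_min_sum(equations, channel)
```

In this toy case the weak error on bit 3 is corrected inside the first pass, because the second equation's reply is written back before the third equation reads it.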

The check-node maths, in plain terms

A parity equation's reply to a bit is the smallest incoming message from the other bits, scaled down slightly, with a sign set by their combined parity. No multiplications, no look-up of a transcendental - just compare, select, and a fixed three-quarters scale done as a shift-and-add.

The structure that makes it parallel: lifting

The 5G code is built from a small template of equations, then "lifted" 384-fold: every entry of the template becomes a 384-wide block in which all 384 copies do the same operation, merely rotated by a fixed amount. That is why a single engine, 128 lanes wide, can stream the whole code - and why the only data-shuffling primitive needed is a cyclic rotation.

Because every block is just a rotation, the hardware needs one barrel shifter, not a custom permutation per equation.
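To make the lifting idea concrete, here is a small sketch showing that the gather performed by one lifted template entry is just a cyclic rotation. The width of 8 is for readability only, and the rotation direction (check lane i reads bit lane (i + s) mod Z) is an assumed convention, not one taken from the source.

```python
# Lifting sketch: a template entry with shift s becomes a Z-wide block in
# which check lane i reads bit lane (i + s) mod Z. Z = 8 here for readability.

def rotate_left(lanes, s):
    """Cyclic rotation: output lane i takes input lane (i + s) mod Z."""
    s %= len(lanes)
    return lanes[s:] + lanes[:s]

def gather_by_index(lanes, s):
    """The same gather written lane by lane, as a general permutation would."""
    z = len(lanes)
    return [lanes[(i + s) % z] for i in range(z)]

beliefs = [10, 11, 12, 13, 14, 15, 16, 17]
rotated = rotate_left(beliefs, 3)
# Rotating by the complementary amount (Z - s) restores the original order.
restored = rotate_left(rotated, len(beliefs) - 3)
```

The complementary rotation at the end is the same operation the un-rotate stage performs on the way back to belief memory.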

The biggest win came from doing less, not going faster

At the rate this decoder targets, the code's template has 42 layers of equations - but only 9 of them carry transmitted bits. The other 33 each touch a single bit that the standard never sends, so their parity reply is always zero: real work, scheduled, executed, with no effect on the answer.

Skipping those dead layers is provably identical to grinding through all 42 - same decoded bits, same convergence - but it cuts the work per pass roughly threefold. That single observation, found by asking "which operations actually change the output?", is most of the throughput lead over the reference core, and it dropped the block-RAM count too.

So what?

The reference core processes the full template and wins by sheer parallelism. This decoder processes a third of the template on one engine and arrives first. Necessary-work-first beat brute force.

Confirmed bit-identical to the full computation over thousands of trials.
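A short sketch shows why a dead layer cannot change the answer. It assumes, as the text describes, that the layer's extra bit is never transmitted (so its channel estimate is zero) and that the bit appears in no other equation. The helper names and the rounding of the three-quarters scale are illustrative assumptions.

```python
def scale_3_4(mag):
    return (mag >> 1) + (mag >> 2)

def check_replies(msgs):
    """Normalised min-sum reply to each participant of one parity equation."""
    out = []
    for i in range(len(msgs)):
        others = msgs[:i] + msgs[i + 1:]
        mag = scale_3_4(min(abs(m) for m in others))
        negative = sum(m < 0 for m in others) % 2 == 1
        out.append(-mag if negative else mag)
    return out

# Three transmitted bits plus one never-sent bit (message 0: no information).
msgs = [9, -5, 12, 0]
replies = check_replies(msgs)

# The unsent bit is the only one whose belief moves, and on the next pass its
# message is (belief - old reply) = 0 again, so the situation repeats.
unsent_belief = 0 + replies[3]
next_message = unsent_belief - replies[3]
```

Every transmitted bit sees the zero-magnitude message among its "others", so its reply is zero and its belief is untouched. The unsent bit does receive a non-zero reply, but nothing else ever reads it.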

Code parameter	Value	Meaning
Lifting size	384	lanes per template entry (a single engine streams 128 at a time, three sub-blocks per entry)
Information bits / block	3,840	payload protected by the code
Transmitted bits / block	6,528	payload plus parity actually sent
Code rate	0.588	fraction of transmitted bits that are payload
Live equation layers	9 of 42	the rest carry untransmitted bits and are skipped
Soft-input precision	6 bits	channel estimate per bit; beliefs held to 10 bits, messages to 7
Check-node scale	3/4	normalised min-sum, done as a shift-and-add
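
The table's figures can be cross-checked with a few lines of arithmetic. This is only a consistency sketch using numbers quoted in this document; the variable names are illustrative, and the derived cycle counts are estimates from mean values, not measurements.

```python
Z = 384                 # lifting size
LANES = 128             # engine width
INFO_BITS = 3840
SENT_BITS = 6528
LATENCY_US = 1.92       # mean latency per block, from the results table
CLOCK_MHZ = 463.2
PASSES = 2.55           # mean passes per block with early-stop on

sub_blocks = Z // LANES                         # sub-blocks per template entry
info_columns = INFO_BITS // Z                   # template columns carrying payload
sent_columns = SENT_BITS // Z                   # template columns actually sent
rate = INFO_BITS / SENT_BITS
mbps_at_mean_latency = INFO_BITS / LATENCY_US   # bits per microsecond == Mbps
cycles_per_block = LATENCY_US * CLOCK_MHZ
cycles_per_pass = cycles_per_block / PASSES
```

Throughput at the mean latency comes out near 2000 Mbps, close to the quoted 2011 Mbps. The small gap would be expected if the quoted figure averages per-seed throughputs rather than dividing by the mean latency, though the source does not say which was done.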

A rotation-based engine that skips dead work and stops early - now, how is it actually wired?

The approach · part 3, the machine

One streaming engine, two overlapping halves

The decoder is a single pipeline driven by one address counter. A schedule memory plays out the list of edges; a forward half reads beliefs, builds messages, and finds the two smallest; a back-end half turns those into parity replies and writes updated beliefs back. A short delay line lets the two halves run on top of each other so a new edge can enter every clock.

The forward half (blue) and back-end half (green) overlap through the delay line so the engine never stalls; the cyan checker watches the signs and ends decoding the instant all parities hold.

Design principle on display: one timeline

There is exactly one counter in the whole decoder - the schedule address. Every other timing reference is derived from it by a fixed delay. No block keeps its own counter. This is what makes a streaming iterative decoder debuggable: there is a single timeline to reason about, so a misalignment has one place to be.

Watch one edge flow through the pipeline

The trace grid follows a single edge as it moves stage by stage. The highlight steps diagonally because each stage receives the edge one clock later than the last. The long gap before the back-end is the overlap delay line doing its job - while this edge waits, later edges keep pouring in behind it.

Grid legend: forward half, memory / delay line, back-end half.

That is the architecture. The next question is what the heavy blocks look like inside - and where the silicon actually goes.

Inside the engine

The modules that matter

Six blocks account for nearly all the logic. Each panel below summarises a block's pipeline, bus widths, and arithmetic, along with its measured cost from place-and-route.

Two-smallest finder - the largest single block

A parity equation's reply needs only the smallest and second-smallest incoming message magnitude, plus which lane held the smallest, plus the running sign parity. This block streams the edges of an equation through a compare-and-keep network, holding the running winners in distributed RAM, and snapshots them when the equation ends so the back-end can read a complete equation while the next one is still arriving.

Holding running winners in distributed RAM, with a separate snapshot per equation, is what lets the engine accept a new edge every clock.

Parameter	Value	Notes
Look-up tables	8,921	6,137 logic + 2,784 distributed RAM
Flip-flops	2,470	commit registers
Bus in / out	128 × 6b	magnitude per lane
Memory	distributed RAM	running winners + per-equation snapshot
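
The compare-and-keep behaviour can be sketched as a single streaming loop. This models one lane of one equation only. The function name, the use of `None` for "nothing seen yet", and the tie-breaking rule (first arrival wins) are assumptions, and the distributed-RAM storage and per-equation snapshot are not modelled.

```python
def two_smallest(msgs):
    """Stream the signed messages of one equation and keep the running
    smallest magnitude, second-smallest magnitude, position of the smallest,
    and sign parity."""
    min1 = min2 = None   # None stands for "nothing seen yet"
    idx = -1
    parity = 0
    for pos, m in enumerate(msgs):
        mag = abs(m)
        parity ^= 1 if m < 0 else 0
        if min1 is None or mag < min1:
            # New winner: the old winner becomes the runner-up.
            min1, min2, idx = mag, min1, pos
        elif min2 is None or mag < min2:
            min2 = mag
    return min1, min2, idx, parity
```

For example, `two_smallest([9, -5, 12, -3, 7])` keeps magnitudes 3 and 5, remembers that position 3 held the smallest, and reports even sign parity.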

Barrel rotate - cyclic shift for a non-power-of-two width

Lifting demands a cyclic rotation by an arbitrary amount across the lanes. Because 384 is not a power of two, the rotation is staged: each stage shifts by zero to three times a fixed power of four, chosen by two bits of the rotation amount, and is built as a balanced four-to-one multiplexer on the raw select bits rather than a chain of equality compares. The balanced form is roughly 2.4 times smaller than the naive cascade, and the same module is reused in reverse on the way out.

Four small stages compose any rotation up to 127; the register cut between them is the timing lever, not a redesign.

Parameter	Value	Notes
Look-up tables	5,236	forward instance (the reverse instance measures 5,237)
Flip-flops	1,702 / 3,047	forward / reverse (reverse holds the cut)
Build	balanced 4:1	2.4× smaller than a compare cascade
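
The staged rotation can be modelled as follows. This sketch assumes a 128-lane width and four radix-4 stages, matching the "rotation up to 127" caption. How the full 384-wide rotation maps onto three 128-lane sub-blocks is not described in the source and is not modelled. Function names are illustrative.

```python
def rotate_left(lanes, s):
    s %= len(lanes)
    return lanes[s:] + lanes[:s]

def staged_rotate(lanes, amount, stages=4):
    """Rotate by `amount` using radix-4 stages: stage k shifts by d * 4**k,
    where d in {0, 1, 2, 3} is the k-th pair of bits of `amount`."""
    out = list(lanes)
    for k in range(stages):
        digit = (amount >> (2 * k)) & 0b11   # raw select bits, no equality compares
        # The four inputs of this stage's 4:1 multiplexer.
        choices = [rotate_left(out, d * 4 ** k) for d in range(4)]
        out = choices[digit]
    return out

lanes = list(range(128))
matches_direct = all(
    staged_rotate(lanes, a) == rotate_left(lanes, a) for a in range(128)
)
```

Because the per-stage shifts add up to the requested amount, the four small multiplexers reproduce any single-step rotation, and a register can be placed between any two stages without changing the result.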

Message form - belief minus old reply, made safe

A bit's message to an equation is its current belief minus the reply that equation last sent it. The belief is wider than the message bus, so the result is narrowed - and crucially it is saturated, never truncated. Truncating a value that overflows flips its sign, which would silently corrupt the decode; saturation clamps it to the largest representable magnitude instead. The same care appears in the cycle model and the hardware, identically.

Combinational and fully parallel across 128 lanes; the clamp is the difference between a correct decoder and a subtly broken one.

Parameter	Value	Notes
Look-up tables	6,622	two instances (3,331 + 3,291)
Flip-flops	0	purely combinational
Width in / out	10b → 7b	saturating narrow
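
The difference between saturating and truncating is easy to see in a small model. The sketch below assumes a symmetric clamp (for 7 bits, -63 to +63, consistent with a sign plus 6-bit magnitude); the exact clamp range used in the hardware is not stated in the source, and the function names are illustrative.

```python
def saturate(value, bits):
    """Clamp a signed integer to the symmetric range of a `bits`-wide word."""
    limit = (1 << (bits - 1)) - 1
    return max(-limit, min(limit, value))

def truncate(value, bits):
    """Keep only the low `bits` bits and reinterpret them as two's complement."""
    low = value & ((1 << bits) - 1)
    return low - (1 << bits) if low & (1 << (bits - 1)) else low

def form_message(belief, old_reply, bits=7):
    """Message to an equation: belief minus that equation's last reply."""
    return saturate(belief - old_reply, bits)

strong = 100 - (-20)              # 120 does not fit in 7 signed bits
safe = form_message(100, -20)     # clamps to the largest magnitude
broken = truncate(strong, 7)      # wraps around and changes sign
```

Here the saturated message stays strongly positive, while the truncated one becomes a weak negative value, which is exactly the silent sign flip the text warns about.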

Check node - the parity equation's reply

For each lane the equation replies with the smallest other message - so the lane that held the global smallest gets the second-smallest instead. That selected magnitude is scaled by three-quarters (a subtract-and-shift, no multiplier) and signed by the combined parity. A register sits at the natural midpoint of this short chain; moving exactly which value it captures was one of the timing levers that reached the final clock.

No multiplier and no function table: the entire check-node maths is compares, a select, a shift-subtract, and an exclusive-or.

Parameter	Value	Notes
Look-up tables	3,961	fully parallel, 128 lanes
Flip-flops	998	midpoint register
Multipliers	0	scale is shift-and-add
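
A sketch of the reply computation from the two-smallest summary follows. The three-quarters scale is written here as x/2 + x/4; the text also describes it as a subtract-and-shift (x - x/4), and the two round differently on small integers. The source does not say which rounding the hardware uses, so the choice below is an assumption, as are the function names.

```python
def scale_3_4(mag):
    """Three-quarters scale as shift-and-add (assumed rounding)."""
    return (mag >> 1) + (mag >> 2)

def check_reply(msgs):
    """Replies for one equation using only min1, min2, argmin, and parity."""
    mags = [abs(m) for m in msgs]
    idx = min(range(len(mags)), key=mags.__getitem__)   # position of the smallest
    min1 = mags[idx]
    min2 = min(mags[:idx] + mags[idx + 1:])             # second smallest
    parity = sum(m < 0 for m in msgs) % 2               # combined sign parity
    out = []
    for i, m in enumerate(msgs):
        # The position that held the smallest gets the second-smallest instead.
        mag = scale_3_4(min2 if i == idx else min1)
        # Sign of the others = total parity XOR this message's own sign.
        negative = parity ^ (1 if m < 0 else 0)
        out.append(-mag if negative else mag)
    return out
```

The exclusive-or trick avoids recomputing the sign product for every participant: removing a message's own sign from the total parity leaves the parity of everyone else.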

Belief update - old belief, minus old reply, plus new reply

The new belief for a bit is its old belief with the equation's previous reply removed and the fresh reply added in. Both steps saturate. The two additions were originally one long carry chain that limited the clock; splitting them with a register at the midpoint roughly halved the chain depth and was the single largest timing gain of the whole campaign.

The register at the midpoint turned an 8-deep carry chain into two shorter ones - worth roughly 39 MHz on its own.

Parameter	Value	Notes
Look-up tables	5,389	128 lanes, two saturating adds
Flip-flops	2,176	midpoint + companion delays
Belief width	10b	signed, saturating
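
The two-step update can be written as two saturating operations, with the midpoint register sitting between them. As before, the symmetric clamp (-511 to +511 for 10 bits) and the function names are assumptions for illustration.

```python
def saturate(value, bits=10):
    limit = (1 << (bits - 1)) - 1
    return max(-limit, min(limit, value))

def update_belief(old_belief, old_reply, new_reply):
    # Step 1: remove the equation's previous reply (the midpoint register
    # would capture this intermediate value).
    stripped = saturate(old_belief - old_reply)
    # Step 2: add the fresh reply.
    return saturate(stripped + new_reply)

two_step = update_belief(500, -40, -60)
one_step = saturate(500 - (-40) + (-60))
```

Saturating at both steps is not the same as saturating once at the end: in this example the two-step result is 451 while a single final clamp would give 480. That is one reason the model has to mirror the hardware's step order exactly for the two to stay bit-identical.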

Un-rotate - the reverse barrel

The parity replies come out in the equation's rotated frame and must be rotated back before they update the beliefs. This is the same staged four-to-one rotation as the forward barrel, run with the complementary shift, and it carries an extra register stage because the reverse path needed the deeper pipeline to close timing. Reusing one rotation module in both directions is the kind of saving the model-to-silicon library is built to capture.

Parameter	Value	Notes
Look-up tables	5,237	same primitive as the forward barrel
Flip-flops	3,047	extra register stage for timing
Build	balanced 4:1	complementary shift amount

Reuse in action

The forward and reverse barrels are one parameterised module. In the build's hierarchy report they even carry a name inherited from an earlier satellite-communications decoder in the same library - concrete evidence that the primitives travel across applications, not just across instances.

Overlap delay line - a shift-register bridge

For the two halves to overlap, the back-end must process equation N while the forward half is already streaming equation N+1. The delay line carries the forward half's per-edge metadata and signs forward by a fixed depth so the back-end sees a complete equation at exactly the right clock. It is built from the device's hardware shift registers, so a 31-deep delay across 128-plus signals costs almost nothing - a few hundred look-up tables.

Parameter	Value	Notes
Look-up tables	315	plus 187 shift-register primitives
Depth	31 clocks	matches the forward-to-back latency
Carries	metadata + signs	everything the back-end needs, aligned
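
A fixed-depth delay line is simple to model, which is part of why it is a good place to concentrate all deliberate delay. The class below is a behavioural sketch using `collections.deque`; the class name and the idea of passing one bundled value per clock are illustrative, not taken from the source.

```python
from collections import deque

class DelayLine:
    """Fixed-depth delay: a bundle written on clock t is read on clock t + depth."""

    def __init__(self, depth, fill=None):
        self.regs = deque([fill] * depth, maxlen=depth)

    def clock(self, bundle):
        out = self.regs[0]         # oldest entry leaves
        self.regs.append(bundle)   # newest enters; maxlen drops the oldest
        return out

line = DelayLine(depth=31)
outs = [line.clock(t) for t in range(40)]   # feed the clock number as the bundle
```

Feeding metadata and signs through as one bundle per clock means they cannot drift apart: whatever comes out on a given clock went in together exactly 31 clocks earlier.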

Early-stop check - a free-running side channel

A small block watches the bit signs as they stream past and accumulates each parity equation's result. The instant every equation is satisfied, it raises a flag and decoding stops - typically after just two or three passes instead of a fixed maximum. It runs entirely off the existing data stream with no back-pressure and no extra memory traffic, which is why it costs only three percent of the logic yet roughly doubles the delivered throughput.

Parameter	Value	Notes
Look-up tables	1,224	about 3% of logic
Flip-flops	392	per-equation accumulators
Effect	~2.55 passes	vs a fixed iteration count
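
The accumulate-as-it-streams idea can be sketched with one accumulator bit per equation. This simplified model consumes one pass of (equation, sign bit) pairs and ignores the fact that beliefs keep changing during a pass in the real decoder; the function name and stream format are assumptions.

```python
def early_stop(edge_stream, n_equations):
    """Consume (equation, sign_bit) pairs from one pass and report whether
    every parity equation is satisfied."""
    acc = [0] * n_equations        # one accumulator bit per equation
    for eq, sign in edge_stream:
        acc[eq] ^= sign            # exclusive-or of the bit signs seen so far
    return not any(acc)

converged = early_stop([(0, 0), (0, 1), (0, 1), (1, 0), (1, 0)], 2)
not_yet = early_stop([(0, 0), (0, 1), (1, 0), (1, 0)], 2)
```

The checker only reads signs that are already flowing past, which is why it adds no memory traffic of its own.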

Why it is a correctness feature, not just a speed trick

In an overlapped pipeline, simply stopping is not safe: a few in-flight edges from the next pass would overwrite the converged answer. The stop is gated so those drain harmlessly. Treating early termination as part of correctness - not an optional optimisation - is what lets it be trusted.

Those are the parts. Getting them from a working decoder to a 463 MHz one was a campaign in its own right - and the dead-ends are as instructive as the wins.

Reflection · the optimisation campaign

From 222 to 463 MHz, including the wrong turns

The first working build ran at 222 MHz. Reaching the reference part's clock took a sequence of register placements, each validated against the model before it was kept. Two of the most useful lessons came from attempts that failed, and they reframed how the whole problem was understood.

Each bar is a sign-off measurement, not a prediction. The dashed line is the reference core's clock; the dashed-red bar is a step that regressed.

Register inside the long path, not at its edges

Early attempts placed registers at module boundaries and gained nothing, because they only moved the endpoints of the slow path. The gains came once registers were cut inside the long combinational chains, splitting them in two.

first real win: 265 → 279 MHz, then 279 → 301 MHz

Split the carry chain in the belief update

The two saturating additions were one deep carry chain. A register at the midpoint halved it - the single biggest jump of the campaign.

307 → 346 MHz, +39 MHz from one register

A perfect logic fix that made things slower

One change cut a slow path's logic cleanly, passed every correctness check bit-for-bit, and even saved resources - yet the clock dropped. The path was limited by wire length, not logic depth, so shortening the logic just let routing spread out and give the time back. This wrong turn revealed that the remaining gap was a routing problem, which redirected the whole approach.

346 → 343 MHz (regression), kept as evidence, not merged

Registers at the memory write, six times over

Once routing was understood as the limiter, the fix was to register each memory write so the long wire from compute to memory got its own clock stage. Six such cuts - each re-validated against the model and against convergence - carried the design the rest of the way.

348 → 463 MHz, setup and hold both closed

An honest counter-intuition

Conventional wisdom says do not over-tighten the timing goal. Here, tightening it past the apparent ceiling drove the placer to work harder and actually raised the achieved clock, because this design had placement headroom rather than a logic wall. The lesson is not "always over-constrain" - it is "measure which resource is actually binding before trusting a rule of thumb."
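
When a design is deliberately over-constrained, the achieved clock follows from the target period and the worst negative slack. The sketch below uses the standard relation (achieved period = target period minus slack). The 2.1 ns target is purely a hypothetical value: the source does not state the constraint, and this number is used only because it happens to reproduce the quoted clock.

```python
def achieved_mhz(target_period_ns, wns_ns):
    """Clock the worst path can actually sustain, given the constraint and
    the worst slack against it (negative slack means the target was missed)."""
    return 1000.0 / (target_period_ns - wns_ns)

ASSUMED_TARGET_NS = 2.1    # hypothetical constraint, not stated in the source
fmax = achieved_mhz(ASSUMED_TARGET_NS, -0.059)
```

Under that assumption a slack of -0.059 ns corresponds to roughly 463.2 MHz, which would be consistent with a build that misses a tightened goal slightly while still landing above the reference core's clock.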

The campaign closed the clock. Stepping back, what does the whole exercise actually demonstrate?

Significance & outlook

What this unlocks beyond one decoder

The headline is a single-engine 5G decoder that beats a configurable commercial core on delivered throughput and memory while matching its clock, on the core's own benchmark part. But the transferable result is the method that produced it.

Because the hardware is generated from a model that is proven equal to the mathematics, the same flow re-targets a different code rate, a different block size, or a different standard by changing parameters, not by re-verifying hand-written wires. The primitives - the rotation, the two-smallest finder, the saturating arithmetic - already travel between this decoder and others in the library. And the optimisation lessons (register inside the path, find the binding resource before fixing, treat early-stop as correctness) are not specific to error-correction at all.

The thesis, now earned

Start from a model proven equal to the maths, do only the necessary work, map every operation to the right silicon, and a specialised single engine wins on throughput, clock, and area together - measured on the incumbent's own part. The hook asked whether you can have speed, safety, and correctness without trading. The answer, on real silicon, is yes.

Retargetable

New rate, size, or standard is a parameter change - the proof chain regenerates, it is not re-derived by hand.

Reusable primitives

Rotation, two-smallest, saturating maths already shared across multiple decoders in the library.

Portable lessons

"Register inside the path", "find the binding resource first", "early-stop is correctness" generalise far beyond LDPC.

Verifiable

Five independent equalities, from maths down to two simulators, make the result reviewable line by line.

The full scorecard

Dimension	This design	Reference IP	Verdict
Clock (same part)	463.2 MHz	459 MHz	match
Delivered throughput	2011 Mbps	1225 Mbps	1.6×
Block RAM	90	~109	leaner
Look-up tables	~43 K	~49 K	leaner
Flip-flops	~35 K	~57 K	leaner
Hard multipliers	0	0	match
Engines	1	128-way	specialised vs configurable
Correctness	bit-exact	standard	cross-checked to a reference toolkit

Reference-core figures are the published datasheet envelope for the same device; this design's figures are place-and-route sign-off and cycle-accurate simulation from the current build. The pure architecture-to-architecture comparison (same clock, same fixed iteration count) favours the 128-way core on raw compute density - the win above comes from doing less work per block and stopping early, which is the point.

Appendix · Design methodology, in depth

The principles that made the result reproducible, for the reader who wants the engineering underneath.

Aligned-by-construction, not patched after the fact

The hardest bugs in a streaming pipeline come from data and its control travelling on separate delay chains: a fault in one chain shows up cycles later at a block that is actually innocent. The discipline here is to carry data and its control together as one bundle down a single path, so a wrong value can only have come from that path. The overlap delay line is the only place signals are deliberately delayed, and it delays everything together.

One bundle down one path turns "where did this go wrong?" into a question with a single answer.

Debug by comparing to the model, stage by stage

Because a cycle-accurate model exists with the same stage boundaries, any hardware bug is found by running both on the same input and comparing at every boundary: the first mismatch is the bug, and its size hints at the cause - about half the lanes wrong means a broken datapath, a few percent means a misalignment, a single lane means an indexing error. This turns a multi-hour hunt into a few minutes.
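
A stage-by-stage comparison can be as small as the helper below. It assumes each trace is a dictionary from stage name to a list of lane values, listed in pipeline order; the stage names and trace format are hypothetical, not the project's actual tooling.

```python
def first_mismatch(model_trace, hw_trace):
    """Return (stage, fraction_of_lanes_wrong) for the first stage where the
    hardware trace differs from the model, or None if they agree everywhere."""
    for stage, expected in model_trace.items():   # dicts keep insertion order
        got = hw_trace[stage]
        wrong = sum(e != g for e, g in zip(expected, got))
        if wrong:
            return stage, wrong / len(expected)
    return None

model = {"rotate": [1, 2, 3, 4], "message": [5, 6, 7, 8], "check": [1, 1, 1, 1]}
hw = {"rotate": [1, 2, 3, 4], "message": [5, 0, 7, 8], "check": [0, 0, 0, 0]}
report = first_mismatch(model, hw)
```

Only the first differing stage is reported, since everything downstream of it is expected to be wrong as a consequence. The fraction of bad lanes is the hint the text describes.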

Measure pipeline depth, do not compute it

Register placement at module boundaries makes hand-computed latencies wrong by a cycle or two - enough to silently corrupt an iterative decoder. The delay-line depth and every alignment constant are taken from a simulation that prints the actual arrival cycle, never from arithmetic on a block diagram.

Test with the hardest realistic input

All-zero and uniform inputs hide three whole classes of bug: a wrong rotation direction, an arithmetic overflow that never triggers, and a sign error that uniform parity masks. The test set deliberately includes high-dynamic-range channels and known slow-to-converge patterns so those classes cannot hide.

Saturate, never truncate

Narrowing a wide signed value by dropping high bits flips its sign on overflow - a strong negative belief becomes a strong positive one, and the decoder diverges. Every narrowing in the design clamps to the largest representable magnitude instead, in the model and the hardware identically. This is not a detail; it is the difference between a decoder that works and one that occasionally, unreproducibly, does not.

The priority order that governed every decision

Correctness first (match the model exactly), then timing (meet the clock after place-and-route), then throughput (the real figure of merit), then resources. Every candidate change reported all four together, and a change was only kept if measurement - never prediction - confirmed it helped.
