AI-driven FPGA Resource Optimization of a Viterbi Decoder

Convolutional forward error correction sits in the data path of nearly every wireless standard, and the hard engineering is not the decoding but fitting the decoder onto a cheap, small FPGA: an FPGA resource-optimization problem in LUTs, flip-flops and block RAMs. This report covers a soft-decision Viterbi decoder built the model-first way (a Python reference proven against the math, a cycle-accurate model proven against the reference, and Verilog proven bit-exact against the model), then handed to an AI to optimize its FPGA resources the way an engineer would: read the reports, pick a lever, measure, tell the right fix from the wrong one, back out of dead ends, and verify every change bit-exact. The headline result is that folding cuts the area roughly fourfold while the clock holds, once the read micro-architecture is right.

WHY FOLDING IS THE RIGHT LEVER (motivation)

A fixed application has a throughput ceiling, and the decoder only has to meet it. 802.11a tops out at 54 Mbps, while a full-rate Viterbi at 160 MHz emits one bit per clock (near 160 Mbps), almost 3x more than the standard asks. That spare clock is exactly what folding spends: time-multiplex the compute over several cycles instead of laying down a fully parallel engine, and as long as throughput stays above 54 Mbps the reclaimed silicon is free. Size the hardware to the required throughput, not the maximum achievable one.

THE CENTRAL CLAIM

Folding is an efficient area architecture, not a clock penalty. Whether a folded decoder keeps its clock is decided by the FPGA micro-architecture: place the reorder a fold imposes where it is free, by rotating the buffer and reading a fixed tap, and the folded engine keeps a single engine's clock at a folded engine's area.

1

Results at a Glance

▼

160.2MHz

Fmax (deployed)

K=7 fold 1, xc7z010-1, post-route, target 160 met

1,918LUT

Logic

740 FF, 2 BRAM, 0 DSP

160Mbps

Throughput

II = 1, one decoded bit per clock

433MHz

UltraScale+ reference

xczu28dr-2, same RTL, 2.4 ns target

0LSB

RTL vs model

bit-exact across K=7/9 x fold 1/2/4/8

THE TENSION THIS REPORT RESOLVES

Decoding throughput scales with the trellis: doubling the constraint length doubles the butterflies. To fit a small part you fold (share one butterfly bank across several clocks), but folding a butterfly forces a data reorder, and the obvious way to do the reorder (a phase-indexed read multiplexer) sits squarely on the timing path and caps the clock. So how do you fold for area without paying for it in megahertz?

Three views of the same generated RTL. The deployed target is the 7-series Zynq; the UltraScale+ column is the same source at a tighter clock target, and the deep-rate column shows the family stretched to rate 1/6.

Fold	Read engine	Fmax	LUT	Throughput	Use when
1	unfolded	160.2 MHz	1,918	160 Mbps	clock + throughput bound
2	rotating-read	~160 MHz	1,368	80 Mbps	balanced
4	demux	147 MHz	980	~37 Mbps	area bound
8	demux	147 MHz	681	~18 Mbps	many slow channels

xc7z010clg225-1, post-route. fold-1 from fold1_ws_final; fold 2/4/8 LUT post-route from the fold finding.

Fold	Read engine	Fmax (median of 3-place sweep)	LUT	FF
1	unfolded	(published)	10,885	405
2	rotating-read	350.8 MHz (312.8 / 350.8 / 371.2)	9,113	4,536
4	demux	322.8 MHz (289.9 / 322.8 / 336.5)	5,116	5,241
8	demux	322.0 MHz (316.9 / 322.0 / 347.3)	2,627	5,601

xczu28dr-ffvg1517-2-e, 2.4 ns target, K=9 (256 states). Fmax is the median of an AltSpreadLogic / Explore / ExtraPostPlacementOpt place sweep.

Soft / metric width	Fmax	LUT	FF	BER @ 2.5 dB	Note
4b / 11b (baseline)	150.1 MHz	2,700	797	1.04e-3	rate 1/6, fold 1
4b / 10b (free)	144.4 MHz	2,573	731	bit-exact	metric-width floor
3b / 9b	161.6 MHz	2,381	676	1.52e-3	~1.4x worse BER, -12% LUT

xc7z010-1, rate-1/6 K=7, post-route. Narrowing the datapath RAISES Fmax and cuts LUT; folding does the opposite.

SO WHAT

Every row is the same correct decoder. The fold and width knobs walk a Pareto front from "fast and large" (fold 1, 160 Mbps) to "small and slow" (fold 8, 681 LUT). The rest of the report is how that front was found, and why the fold-2 row reads ~160 MHz instead of the 147 MHz it started at.

2

Scope and Parameters

▼

What it decodes

Soft-decision Viterbi for rate 1/2 to 1/7 convolutional codes, constraint length K = 3..9 (flagship K=7, 64 states). Standards: 802.11a/g, LTE, DVB, CCSDS, Intelsat.

Interface

Streaming, not block. II = 1 at fold 1: one trellis step and one decoded bit per clock, no handshake stall. Continuous-input, self re-arming.

Knobs

fold {1,2,4,8} trades butterfly fabric for clocks-per-step; soft / metric width trades coding gain for area; traceback {windowed, block, register-exchange} trades BRAM for latency.

Reference

The reference math is ka9q / libfec viterbi27, read line by line. Polynomials 0o133 / 0o171. Survivor traceback depth 42 (the BER-cliff floor).

Parameter	Value	Meaning
K (constraint length)	7 (3..9)	2^(K-1) = 64 trellis states
SOFT_W	4 (1..8)	soft-decision input bits per symbol
METRIC_W	9	path-metric register width (re-normalized)
TB_DEPTH	42	traceback depth, ~6x K (BER-cliff safe)
POLY_A / POLY_B	0o133 / 0o171	generator polynomials (industry standard)
fold	1 (deployed), 2/4/8	clocks per trellis step = fold
FPGA target	xc7z010clg225-1	Zynq-7010 (ADALM-Pluto class); UltraScale+ for reference

3

Algorithm Fundamentals

▼

A convolutional encoder is a small finite-state machine: K-1 memory bits, one input bit per step, n coded outputs. The decoder must find the single most likely path through the resulting trellis given noisy soft observations. Viterbi does this exactly with one recurrence repeated per step, Add, Compare, Select (ACS), then walks the surviving decisions backwards to recover the bits.

branch metric   bm(s,b)   = sum over outputs of |soft_in-expected_bit|   (Hamming-ish distance)
ACS recurrence  m_new[s]  = min( m_old[s0] + bm0 , m_old[s1] + bm1 )       (the feedback loop)
survivor        dec[s]    = argmin of the two                              (1 bit written to memory)
traceback       bit[t]    = follow dec[] backwards TB_DEPTH steps from the best state

The ACS line is both the core and the constraint: m_new depends on m_old, so it is a feedback loop. You cannot pipeline a register into the middle of it without changing throughput, which makes the ACS cone a "recurrence" class timing problem, and it is why every clock-rate win in this project came from restructuring the work around the loop, never inside it.

One ACS unit, one per state: 64 run every step (256 for K=9), paired into 32 radix-2 butterflies. The red loop is the recurrence that bounds the clock.

Algorithm to architecture

Math operation	Hardware	Silicon block
branch metric bm(s,b)	small adder tree (n soft inputs)	fabric LUT (feed-forward, pipelinable)
ACS min(m+bm)	add-compare-select cone	fabric LUT + CARRY (the recurrence)
survivor decisions	2^(K-1) x TB_DEPTH bit array	write-mirrored BRAM (2 read ports)
traceback	counter-driven windowed read-back	BRAM read + counter timeline
parallelism dial	fold = clocks per step	time-multiplex one butterfly bank

No multiplies anywhere: soft metrics are absolute differences, so the whole decoder maps to LUT and BRAM with zero DSP. That is the bridge into the architecture: a feed-forward metric front end, a recurrent ACS core, and a survivor memory with a traceback reader.

4

Architecture Overview

▼

Four composed leaves. Soft symbols enter the butterfly bank, the path-state register holds the 64 metrics and feeds them back, decisions stream into a write-mirrored survivor memory, and a counter-driven windowed-traceback controller reads them out as bits. Control travels in the same registered struct as its data (aligned-by-design), so a misbehaving stage is the bug, not a delay-chain artifact two stages downstream.

Top-level composition. The bfly + path_state pair is the recurrent ACS core; everything else is feed-forward.

Three traceback styles ship as Pareto variants on the same ACS core; pick BRAM vs latency:

Traceback	BRAM	Latency	LUT / FF (xczu28dr)	Fmax	Role
windowed	2	4·TB+1	2,604 / 667	433.3 MHz	deployed 802.11a (min logic)
block argmin	1	2·TB+4	3,846 / 1,287	422.5 MHz	half the BRAM
register exchange	0	TB+1	fabric survivors	-	lowest latency, short frames

post-route xczu28dr-2, soft4 / metric9 / TB=42. The windowed variant is the deployed flagship.

5

Micro-Architecture: the folded read engine

▼

Click a module to inspect it. The two that matter for the clock are the recurrent ACS cone and the folded read engine that feeds it.

ACS recurrence cone (the clock-bounding loop)

Add the branch metric, compare the two candidates, select the minimum, write it back as next step's metric. The loop is m+bm -> min -> register -> m. A register dropped inside it lands in the loop and cuts throughput, so it cannot be pipelined. The levers are: feed-forward the branch-metric add out of the loop (branch-metric pipelining), and keep the loop's all-to-all route physically compact.

Parameter	Value	Notes
Class	recurrence	feedback loop, not feed-forward
Width	9 b/state	path metric, re-normalized each step
Instances	64 (K=7) / 256 (K=9)	one per state, fold-shared
Pipelinable	no (inside loop)	only feed-forward parts can be cut

Folded read engine: the centerpiece

When you fold, one butterfly bank serves several trellis positions across fold clocks, so each clock must read a different slice of the metric bank. The naive way is a phase-indexed read multiplexer: bank[cur_ph*stride + lane]. That multiplexer's select (cur_ph) fans out across the whole wide bank and sits on the read path: it caps the clock. The fix is rotating-read: rotate the buffer by the stride each phase and read a fixed tap. The phase counter no longer touches the wide read.

Same bank; only the read pattern changes. The indexed read (top) keeps cur_ph on the wide read and caps the clock; the fixed tap on a rotating bank (bottom) takes it off and lifts it.

fold 2, K=7 (Zynq-7010)	Fmax	LUT	FF
demux	~143 MHz	1,345	1,074
rotating-read	~160 MHz (+12%)	1,368	1,072

Design rule extracted

A fixed-stride windowed read must rotate a buffer and read a fixed tap, never a runtime phase-indexed read mux. That holds when the rotation mux is no larger than the mux it replaces. At fold 2 it replaces the commit mux exactly, so it is free; at fold ≥ 4 the rotation mux scales with the bank and the demux is the area-optimal choice. Hence the per-fold policy.

Windowed traceback controller

A DO-254 counter-driven controller reads the survivor memory back over a window of TB_DEPTH=42 decisions, emitting one decoded bit per step at steady state. A single counter timeline (position within the window, a two-bit rolling region, and a saturating startup phase) drives every control signal as a decode; there is no enumerated state machine, which keeps branchy next-state logic off the recurrence read path. Uniform latency 4·TB+1; no data-dependent timing. Outputs are registered.

Parameter	Value	Notes
Latency	4·TB+1 = 169 cy	cold start; steady state 1 bit/clk
Style	counter-driven timeline	single counter, control derived as decodes
BRAM reads/clock	2 (mirrored)	one per chase engine: merge + decode

Write-mirrored survivor memory

2^(K-1) decision bits per step, inferred as block RAM via UG901 templates (no UNISIM instantiation). The windowed traceback issues two reads per clock (one for the merge engine, one for the decode engine) plus one write, which a single block RAM cannot serve, so the memory is two identical mirrors, both written every cycle, each supplying one chase read port. Addressing rolls through four regions so a write and a read normally target different regions; the one same-cycle, same-address case (a region's newest entry, read by the merge engine the cycle it is written) is forwarded by a 1-deep write-first bypass in the core, so a read-during-write never returns stale data.

Parameter	Value	Notes
Geometry	64 b x 256 (4 regions)	K=7; 256 b wide for K=9
Block	2x RAMB (write-mirrored)	two copies = two read ports
Collision	same-addr R/W	1-deep write-first bypass forwards the new word

6

The Optimization Journey

▼

This is the loop AI ran on the design: folding is the headline area lever, but it appears to cost clock until AI reads the critical path and pins the loss on the read micro-architecture, not on folding. The wins came from restructuring that read, from per-lane metric select, and from bit-width, each measured rather than guessed; one alternative regressed and was reverted. Every number below is post-route on the named part.

The clock across the journey: from naive folding to the rotating-read micro-architecture, the folded fold-2 clock climbs back to the full-rate class; folding alone did not.

Step 1: Folding alone is a weak area lever

Sharing the butterfly across 8 clocks cut LUT 31% but doubled FF and dropped Fmax 34%. The wholesale metric-bank control (a fanout-675 reset and fanout-640 commit net spanning the whole bank) floored folded Fmax around 100-120 MHz on 7-series. The fixed costs (shared branch-metric table, full-width bank) are exactly what folding cannot shrink.

Fmax -34% | LUT -31% | FF +144% | throughput /8

Step 2: Per-lane branch metric (remove the shared 64:1 mux)

A feasibility-ceiling probe showed removing the shared metric mux was worth ~14%, so instead of chasing the mux, each lane keeps only the few metrics it needs, selected locally. Fanout collapses and the selects shrink.

fold8 100.5 -> 123.5 MHz (+23%) | LUT -38% | FF flat

The one structural change that broke the fold-2 ceiling, and the central claim of this report.

Step 4: Pipeline the branch metric for deep rates

At rate 1/6 the branch metric sums six soft inputs, a tree that sat combinationally in front of the ACS add and lengthened the recurrence cone. Registering that sum (a feed-forward cut, +1 cycle latency) took the tree out of the loop.

rate 1/6: 150.1 -> 163.9 MHz (+9.2%) | LUT flat | FF +200

Step 5: bit-width, the real deep-rate winner

Narrowing soft 4b/metric 11b to 3b/9b shrank both the metric table and the bank, the fixed costs folding could not touch. It raised Fmax (shorter datapath) while folding lowers it. The cost is a ~1.4x BER penalty that is still ~2x better than the rate-1/2 baseline.

rate 1/6: 150.1 -> 161.6 MHz | LUT -12% | BER 1.04e-3 -> 1.52e-3

Step 6: The dead-end (reported, not hidden)

A dual-bank ping-pong read was tried as an alternative reorder. It added a spare bank and a full-width toggle mux and still needed the parallel-window read. Measured on the target part it regressed: +86 LUT and -27 MHz versus rotating-read (bigger AND slower: the spare bank costs the area). Reverted. The negative result is what confirmed rotating-read was the right reorder placement, not a lucky one.

+86 LUT, -27 MHz vs rotating-read | reverted

The unifying principle (codified back into the framework)

Time-folding any radix-2 butterfly (Viterbi, FFT, a sorting network) forces a perfect-shuffle reorder: one side reads sequentially, the other strided. A reorder element is therefore unavoidable. The only question is where you pay for it. The three engines below are the three placements; the win is choosing the one whose reorder is free at your fold.

The 3-engine taxonomy. The deployed policy: fold 2 rotating-read (clock-first), fold ≥ 4 demux (density-first).

Step 7: Deployment reality check (full 802.11a RX)

The fold-2 rotating-read engine was dropped into a full 802.11a receiver and run on raw samples. All 8 modulation/coding schemes decoded error-free, byte-exact to the soft reference, across three back-to-back packets with no reset between them (continuous re-arm). A fold-4 probe was rejected on a throughput-feasibility check: the hardest scheme needs 54 M trellis-steps/s and fold 4 at 147 MHz delivers only 37 M; fold 2 sustains 80 M. The standalone +13 MHz banks as slack in the receiver, whose binding path is the equalizer, not the decoder.

8/8 MCS error-free | +0.6 MHz on the worst path | -95 failing endpoints | resources flat

7

Resource and Trade-off Comparison

▼

Versus the Xilinx PG027 Viterbi IP at comparable settings (soft 3, TB=42): the generated windowed decoder is competitive on logic and uses far fewer flip-flops, at zero DSP.

Design	LUT	FF	BRAM	DSP	Notes
Xilinx PG027 (soft3, TB=42)	2,260	1,728	2	0	configurable IP
This work, windowed (soft4, TB=42)	2,604	667	2	0	post-route xczu28dr-2

Median-of-sweep Fmax vs post-route LUT. Each point is a valid shipped configuration.

8

Verification

▼

Correctness is a chain of bit-exact equalities, each layer proven against the one above it. The RTL is never the reference: it adapts to a cycle model that adapts to a math model that is proven against an independent library.

Three-model hierarchy. Each arrow is a bit-exact (zero-bit) equality on hardest-input vectors.

Math = model

The cycle-accurate model is bit-exact to the maximum-likelihood reference (libfec / ka9q) on impulse, all-zero, and hardest random codewords.

Model = RTL

0-LSB across K=7/9 x fold 1/2/4/8; value-transparent through the fold reorder.

Continuous re-arm

3 back-to-back packets, init held inactive after cold start, full bit count, 0 errors.

DO-254

Strict RTL coding discipline: zero violations across every generated module and fold factor.

SNR (BPSK / AWGN)	K=7 BER	K=9 BER	Coding gain
1 dB	6.24e-2	6.19e-2	-
2 dB	1.12e-2	5.39e-3	K=9 ~2x better than K=7
3 dB	8.54e-4	< measurement floor	-

BER waterfall, soft 4b, rate 1/2. K=9 buys coding gain at 4x the states: the constraint-length knob.

9

Design Guidelines

▼

1: Measure the ceiling, do not guess it

A feasibility-ceiling probe told us the shared-mux removal was worth ~14%, so we restructured per-lane instead and got 23%. Probes redirect effort to the real lever.

2: A folded butterfly owes a reorder

You cannot avoid the perfect shuffle. Choose where to pay: rotate (free at fold 2) or read-mux (free at fold ≥ 4). Never a fixed-stride runtime index mux on the read.

3: Restructure around a recurrence, not inside it

The ACS loop cannot take a pipeline register. Every clock win came from feed-forwarding work out of the loop (branch-metric pipelining) or compacting its route.

4: Bit-width before folding for area

Narrowing shrinks the fixed costs folding cannot touch AND raises Fmax. Folding raises FF and lowers Fmax. Reach for width first.

5: Fmax is a distribution

Report the median of a place-directive sweep for route-bound cones, not a single lucky run. The K=9 fold-2 point spans 313-371 MHz across directives.

6: Publish the dead-ends

The dual-bank regression is what proved rotating-read was the right placement. A negative result is a first-class output.

10

Conclusion and Outlook

▼

The recommended deployment is the windowed, fold-1 K=7 decoder at 160 MHz / 160 Mbps on Zynq-7010, the configuration shipped in a full 802.11a receiver, error-free across all eight schemes, at 1,918 LUT, 2 BRAM and zero DSP. Where area is the binding constraint, the fold-2 rotating-read variant holds the same clock class at smaller area; for many slow channels, fold 8 reaches 681 LUT. The same RTL closes at 433 MHz on UltraScale+.

The reusable part is wider than this one decoder, and wider than the one fix. It is the AI optimization loop: read the physical reports, separate a folding problem from a micro-architecture one, place the unavoidable reorder where it is free, build-measure-revert the dead ends, and prove every change bit-exact. The same reorder governs any time-folded radix-2 butterfly, so rotating a buffer and reading a fixed tap instead of indexing a runtime mux is the same move that helps a folded FFT or a folded sorting network. This decoder is where that loop was run and proven on real silicon, and the rule now lives in the framework for the next design.

In one line

Folding is an efficient area architecture; whether it keeps its clock is a micro-architecture decision. AI cut the area roughly fourfold and held the clock by placing the fold's reorder where it is free. Measured: fold 2, 147 to ~160 MHz, flat area, shipped.

▼

APPENDIX ADesign Methodology

The principles that made the journey repeatable, with the failure each one prevents.

A. The model is the reference (model-first)

All structural changes (per-lane metric select, rotating read, branch-metric pipelining) were made first in the cycle-accurate model, proven bit-exact there, then translated to Verilog and re-proven. The RTL is never edited to chase a number; the model is the hardware truth. Prevents: model divergence that hides fold reorder bugs.

B. Feasibility ceiling before build

Before implementing any cited mechanism, measure a baseline, a ceiling, and the simpler alternatives first. The shared-mux ceiling probe (worth ~14%) redirected the effort to the per-lane restructure (worth 23%). A probe is representative, not exhaustive: it measures an ideal structure, so it can under-estimate; build the real thing when the target structure is clean.

C. Fmax is a distribution (A/B at the same target)

Route-bound cones vary 80-200 ps across place directives. Every fold Fmax in this report is the median of a 3-directive sweep at the same clock target. Comparing two engines means median-vs-median, same part, same target, same session: never a single run against a differently-measured baseline.

D. Recurrence cone levers

A feedback-loop cone (ACS, a non-restoring divider) cannot be fixed by a mid-route register (it lands in the loop) or a floorplan alone. The levers, in order: re-read the SOTA reference (the loop may be a wrong-algorithm artifact); feed-forward the non-recurrent parts around it (branch-metric pipelining); or split it into a slower clock domain. This project used the first two.

E. Hardest input, continuous operation

Test vectors include the hardest realistic codewords and valid-gap stimulus, and a continuous-input design must self-re-arm with no reset between operations. The 802.11a integration ran three back-to-back packets with init held inactive to prove the re-arm: a reset-between-packets harness would have hidden a re-arm dependency.

F. Inherit and compose verified IP

Each RTL module is generated from a parameterized, separately verified building block, and the fold engines are checked bit-exact against the reference. A correction to one engine propagates to every design that composes it: the fold-2 engine the 802.11a receiver inherited is the same one proven in the standalone decoder.