AI-driven FPGA resource optimization of a Viterbi decoder

A soft-decision Viterbi decoder, generated from a model proven equal to the math, then handed to an AI to optimize its FPGA resources: fold the area down, diagnose that the lost clock is a micro-architecture problem, not a folding one, and prove every change bit-exact. Deployed in a full 802.11a receiver.

160 MHz
Zynq-7010, 1 bit/clock
1,918
LUT (K=7, fold 1)
2 BRAM
0 DSP
8 / 8 MCS
error-free in RX PHY

Convolutional forward error correction sits in the data path of nearly every wireless standard, and the hard engineering is not the decoding but fitting the decoder onto a cheap, small FPGA: an FPGA resource-optimization problem in LUTs, flip-flops and block RAMs. This report covers a soft-decision Viterbi decoder built the model-first way (a Python reference proven against the math, a cycle-accurate model proven against the reference, and Verilog proven bit-exact against the model), then handed to an AI to optimize its FPGA resources the way an engineer would: read the reports, pick a lever, measure, tell the right fix from the wrong one, back out of dead ends, and verify every change bit-exact. The headline result is that folding cuts the area roughly fourfold while the clock holds, once the read micro-architecture is right.

WHY FOLDING IS THE RIGHT LEVER (motivation)
A fixed application has a throughput ceiling, and the decoder only has to meet it. 802.11a tops out at 54 Mbps, while a full-rate Viterbi at 160 MHz emits one bit per clock (near 160 Mbps), almost 3x more than the standard asks. That spare clock is exactly what folding spends: time-multiplex the compute over several cycles instead of laying down a fully parallel engine, and as long as throughput stays above 54 Mbps the reclaimed silicon is free. Size the hardware to the required throughput, not the maximum achievable one.
THE CENTRAL CLAIM
Folding is an efficient area architecture, not a clock penalty. Whether a folded decoder keeps its clock is decided by the FPGA micro-architecture: place the reorder a fold imposes where it is free, by rotating the buffer and reading a fixed tap, and the folded engine keeps a single engine's clock at a folded engine's area.
1

Results at a Glance

160.2MHz
Fmax (deployed)
K=7 fold 1, xc7z010-1, post-route, target 160 met
1,918LUT
Logic
740 FF, 2 BRAM, 0 DSP
160Mbps
Throughput
II = 1, one decoded bit per clock
433MHz
UltraScale+ reference
xczu28dr-2, same RTL, 2.4 ns target
0LSB
RTL vs model
bit-exact across K=7/9 x fold 1/2/4/8
THE TENSION THIS REPORT RESOLVES
Decoding throughput scales with the trellis: doubling the constraint length doubles the butterflies. To fit a small part you fold (share one butterfly bank across several clocks), but folding a butterfly forces a data reorder, and the obvious way to do the reorder (a phase-indexed read multiplexer) sits squarely on the timing path and caps the clock. So how do you fold for area without paying for it in megahertz?

Three views of the same generated RTL. The deployed target is the 7-series Zynq; the UltraScale+ column is the same source at a tighter clock target, and the deep-rate column shows the family stretched to rate 1/6.

FoldRead engineFmaxLUTThroughputUse when
1unfolded160.2 MHz1,918160 Mbpsclock + throughput bound
2rotating-read~160 MHz1,36880 Mbpsbalanced
4demux147 MHz980~37 Mbpsarea bound
8demux147 MHz681~18 Mbpsmany slow channels

xc7z010clg225-1, post-route. fold-1 from fold1_ws_final; fold 2/4/8 LUT post-route from the fold finding.

FoldRead engineFmax (median of 3-place sweep)LUTFF
1unfolded(published)10,885405
2rotating-read350.8 MHz (312.8 / 350.8 / 371.2)9,1134,536
4demux322.8 MHz (289.9 / 322.8 / 336.5)5,1165,241
8demux322.0 MHz (316.9 / 322.0 / 347.3)2,6275,601

xczu28dr-ffvg1517-2-e, 2.4 ns target, K=9 (256 states). Fmax is the median of an AltSpreadLogic / Explore / ExtraPostPlacementOpt place sweep.

Soft / metric widthFmaxLUTFFBER @ 2.5 dBNote
4b / 11b (baseline)150.1 MHz2,7007971.04e-3rate 1/6, fold 1
4b / 10b (free)144.4 MHz2,573731bit-exactmetric-width floor
3b / 9b161.6 MHz2,3816761.52e-3~1.4x worse BER, -12% LUT

xc7z010-1, rate-1/6 K=7, post-route. Narrowing the datapath RAISES Fmax and cuts LUT; folding does the opposite.

SO WHAT
Every row is the same correct decoder. The fold and width knobs walk a Pareto front from "fast and large" (fold 1, 160 Mbps) to "small and slow" (fold 8, 681 LUT). The rest of the report is how that front was found, and why the fold-2 row reads ~160 MHz instead of the 147 MHz it started at.
2

Scope and Parameters

What it decodes
Soft-decision Viterbi for rate 1/2 to 1/7 convolutional codes, constraint length K = 3..9 (flagship K=7, 64 states). Standards: 802.11a/g, LTE, DVB, CCSDS, Intelsat.
Interface
Streaming, not block. II = 1 at fold 1: one trellis step and one decoded bit per clock, no handshake stall. Continuous-input, self re-arming.
Knobs
fold {1,2,4,8} trades butterfly fabric for clocks-per-step; soft / metric width trades coding gain for area; traceback {windowed, block, register-exchange} trades BRAM for latency.
Reference
The reference math is ka9q / libfec viterbi27, read line by line. Polynomials 0o133 / 0o171. Survivor traceback depth 42 (the BER-cliff floor).
ParameterValueMeaning
K (constraint length)7 (3..9)2^(K-1) = 64 trellis states
SOFT_W4 (1..8)soft-decision input bits per symbol
METRIC_W9path-metric register width (re-normalized)
TB_DEPTH42traceback depth, ~6x K (BER-cliff safe)
POLY_A / POLY_B0o133 / 0o171generator polynomials (industry standard)
fold1 (deployed), 2/4/8clocks per trellis step = fold
FPGA targetxc7z010clg225-1Zynq-7010 (ADALM-Pluto class); UltraScale+ for reference
3

Algorithm Fundamentals

A convolutional encoder is a small finite-state machine: K-1 memory bits, one input bit per step, n coded outputs. The decoder must find the single most likely path through the resulting trellis given noisy soft observations. Viterbi does this exactly with one recurrence repeated per step, Add, Compare, Select (ACS), then walks the surviving decisions backwards to recover the bits.

branch metric   bm(s,b)   = sum over outputs of |soft_in-expected_bit|   (Hamming-ish distance)
ACS recurrence  m_new[s]  = min( m_old[s0] + bm0 , m_old[s1] + bm1 )       (the feedback loop)
survivor        dec[s]    = argmin of the two                              (1 bit written to memory)
traceback       bit[t]    = follow dec[] backwards TB_DEPTH steps from the best state

The ACS line is both the core and the constraint: m_new depends on m_old, so it is a feedback loop. You cannot pipeline a register into the middle of it without changing throughput, which makes the ACS cone a "recurrence" class timing problem, and it is why every clock-rate win in this project came from restructuring the work around the loop, never inside it.

ACS recurrence: add, compare by MSB, select viterbi_bfly.sv (comb) + viterbi_path_state.sv (registers the loop) path metrics metrics_o add branch metric old + bm_w[C] compare (MSB) (m1_w-m0_w)[MSB] select min dec ? m1_w : m0_w to survivor RAM decision_o = dec recurrence: new metrics feed back, cannot be pipelined
One ACS unit, one per state: 64 run every step (256 for K=9), paired into 32 radix-2 butterflies. The red loop is the recurrence that bounds the clock.

Algorithm to architecture

Math operationHardwareSilicon block
branch metric bm(s,b)small adder tree (n soft inputs)fabric LUT (feed-forward, pipelinable)
ACS min(m+bm)add-compare-select conefabric LUT + CARRY (the recurrence)
survivor decisions2^(K-1) x TB_DEPTH bit arraywrite-mirrored BRAM (2 read ports)
tracebackcounter-driven windowed read-backBRAM read + counter timeline
parallelism dialfold = clocks per steptime-multiplex one butterfly bank

No multiplies anywhere: soft metrics are absolute differences, so the whole decoder maps to LUT and BRAM with zero DSP. That is the bridge into the architecture: a feed-forward metric front end, a recurrent ACS core, and a survivor memory with a traceback reader.

4

Architecture Overview

Four composed leaves. Soft symbols enter the butterfly bank, the path-state register holds the 64 metrics and feeds them back, decisions stream into a write-mirrored survivor memory, and a counter-driven windowed-traceback controller reads them out as bits. Control travels in the same registered struct as its data (aligned-by-design), so a misbehaving stage is the bug, not a delay-chain artifact two stages downstream.

Top-level dataflow (viterbi_wtb_top.sv) A recurrent ACS core feeds a write-mirrored survivor RAM and a two-engine traceback. soft in u_bfly ACS (comb) u_path_state metrics (reg) recurrence u_mem (write-mirrored RAM) mem0 (rd0) mem1 (rd1) u_core (windowed traceback) timeline counter u_q / blk_q / ph_q 2 chase engines merge + decode reversal buf rev_buf write rd0/rd1 bits out (bit_o)
Top-level composition. The bfly + path_state pair is the recurrent ACS core; everything else is feed-forward.

Three traceback styles ship as Pareto variants on the same ACS core; pick BRAM vs latency:

TracebackBRAMLatencyLUT / FF (xczu28dr)FmaxRole
windowed24·TB+12,604 / 667433.3 MHzdeployed 802.11a (min logic)
block argmin12·TB+43,846 / 1,287422.5 MHzhalf the BRAM
register exchange0TB+1fabric survivors-lowest latency, short frames

post-route xczu28dr-2, soft4 / metric9 / TB=42. The windowed variant is the deployed flagship.

Next
That is the static architecture. Most of the optimization work happened in one module, the folded read path, covered next.
5

Micro-Architecture: the folded read engine

Click a module to inspect it. The two that matter for the clock are the recurrent ACS cone and the folded read engine that feeds it.

ACS recurrence cone the feedback loop folded read engine demux vs rotating-read traceback control windowed read-back survivor memory write-mirrored BRAM

ACS recurrence cone (the clock-bounding loop)

Add the branch metric, compare the two candidates, select the minimum, write it back as next step's metric. The loop is m+bm -> min -> register -> m. A register dropped inside it lands in the loop and cuts throughput, so it cannot be pipelined. The levers are: feed-forward the branch-metric add out of the loop (branch-metric pipelining), and keep the loop's all-to-all route physically compact.

ParameterValueNotes
Classrecurrencefeedback loop, not feed-forward
Width9 b/statepath metric, re-normalized each step
Instances64 (K=7) / 256 (K=9)one per state, fold-shared
Pipelinableno (inside loop)only feed-forward parts can be cut

Folded read engine: the centerpiece

When you fold, one butterfly bank serves several trellis positions across fold clocks, so each clock must read a different slice of the metric bank. The naive way is a phase-indexed read multiplexer: bank[cur_ph*stride + lane]. That multiplexer's select (cur_ph) fans out across the whole wide bank and sits on the read path: it caps the clock. The fix is rotating-read: rotate the buffer by the stride each phase and read a fixed tap. The phase counter no longer touches the wide read.

The folded read: the mux that capped the clock Same metric bank bank_q; only the read pattern changes. Before: cur_ph indexes the read (a mux) metric bank bank_q . 576 b read MUX bank_q[2*s_w] to ACS phase cur_ph s_w = cur_ph*BPP+l cur_ph sits on the read path, caps the clock ~143 MHz After: fixed tap, the bank rotates bank_q rotates 2·BPP / phase fixed tap bank_q[2*l] to ACS cur_ph bmc_ln[cur_ph] cur_ph only on branch-metric, ~160 MHz, LUT flat
Same bank; only the read pattern changes. The indexed read (top) keeps cur_ph on the wide read and caps the clock; the fixed tap on a rotating bank (bottom) takes it off and lifts it.
fold 2, K=7 (Zynq-7010)FmaxLUTFF
demux~143 MHz1,3451,074
rotating-read~160 MHz (+12%)1,3681,072
Design rule extracted
A fixed-stride windowed read must rotate a buffer and read a fixed tap, never a runtime phase-indexed read mux. That holds when the rotation mux is no larger than the mux it replaces. At fold 2 it replaces the commit mux exactly, so it is free; at fold ≥ 4 the rotation mux scales with the bank and the demux is the area-optimal choice. Hence the per-fold policy.

Windowed traceback controller

A DO-254 counter-driven controller reads the survivor memory back over a window of TB_DEPTH=42 decisions, emitting one decoded bit per step at steady state. A single counter timeline (position within the window, a two-bit rolling region, and a saturating startup phase) drives every control signal as a decode; there is no enumerated state machine, which keeps branchy next-state logic off the recurrence read path. Uniform latency 4·TB+1; no data-dependent timing. Outputs are registered.

ParameterValueNotes
Latency4·TB+1 = 169 cycold start; steady state 1 bit/clk
Stylecounter-driven timelinesingle counter, control derived as decodes
BRAM reads/clock2 (mirrored)one per chase engine: merge + decode

Write-mirrored survivor memory

2^(K-1) decision bits per step, inferred as block RAM via UG901 templates (no UNISIM instantiation). The windowed traceback issues two reads per clock (one for the merge engine, one for the decode engine) plus one write, which a single block RAM cannot serve, so the memory is two identical mirrors, both written every cycle, each supplying one chase read port. Addressing rolls through four regions so a write and a read normally target different regions; the one same-cycle, same-address case (a region's newest entry, read by the merge engine the cycle it is written) is forwarded by a 1-deep write-first bypass in the core, so a read-during-write never returns stale data.

ParameterValueNotes
Geometry64 b x 256 (4 regions)K=7; 256 b wide for K=9
Block2x RAMB (write-mirrored)two copies = two read ports
Collisionsame-addr R/W1-deep write-first bypass forwards the new word
6

The Optimization Journey

This is the loop AI ran on the design: folding is the headline area lever, but it appears to cost clock until AI reads the critical path and pins the loss on the read micro-architecture, not on folding. The wins came from restructuring that read, from per-lane metric select, and from bit-width, each measured rather than guessed; one alternative regressed and was reverted. Every number below is post-route on the named part.

The clock across the journey From naive folding, micro-architecture climbs back to full rate. 90 130 170 MHz full rate 160 MHz 100.5 fold 8 naive 147 fold 2 indexed 160 fold 2 rotating 160.2 fold 1 full The rotating read brings folded fold-2 back to the full-rate clock.
The clock across the journey: from naive folding to the rotating-read micro-architecture, the folded fold-2 clock climbs back to the full-rate class; folding alone did not.

Step 1: Folding alone is a weak area lever

Sharing the butterfly across 8 clocks cut LUT 31% but doubled FF and dropped Fmax 34%. The wholesale metric-bank control (a fanout-675 reset and fanout-640 commit net spanning the whole bank) floored folded Fmax around 100-120 MHz on 7-series. The fixed costs (shared branch-metric table, full-width bank) are exactly what folding cannot shrink.

Fmax -34%  |  LUT -31%  |  FF +144%  |  throughput /8

Step 2: Per-lane branch metric (remove the shared 64:1 mux)

A feasibility-ceiling probe showed removing the shared metric mux was worth ~14%, so instead of chasing the mux, each lane keeps only the few metrics it needs, selected locally. Fanout collapses and the selects shrink.

fold8 100.5 -> 123.5 MHz (+23%)  |  LUT -38%  |  FF flat
Step 3: indexed read mux to rotating fixed tap (fold 2) Before: indexed read mux phase counter on the read path After: rotating fixed tap rotate mux replaces commit mux +9% Fmax, flat LUT, −46 FF
The one structural change that broke the fold-2 ceiling, and the central claim of this report.

Step 4: Pipeline the branch metric for deep rates

At rate 1/6 the branch metric sums six soft inputs, a tree that sat combinationally in front of the ACS add and lengthened the recurrence cone. Registering that sum (a feed-forward cut, +1 cycle latency) took the tree out of the loop.

rate 1/6: 150.1 -> 163.9 MHz (+9.2%)  |  LUT flat  |  FF +200

Step 5: bit-width, the real deep-rate winner

Narrowing soft 4b/metric 11b to 3b/9b shrank both the metric table and the bank, the fixed costs folding could not touch. It raised Fmax (shorter datapath) while folding lowers it. The cost is a ~1.4x BER penalty that is still ~2x better than the rate-1/2 baseline.

rate 1/6: 150.1 -> 161.6 MHz  |  LUT -12%  |  BER 1.04e-3 -> 1.52e-3

Step 6: The dead-end (reported, not hidden)

A dual-bank ping-pong read was tried as an alternative reorder. It added a spare bank and a full-width toggle mux and still needed the parallel-window read. Measured on the target part it regressed: +86 LUT and -27 MHz versus rotating-read (bigger AND slower: the spare bank costs the area). Reverted. The negative result is what confirmed rotating-read was the right reorder placement, not a lucky one.

+86 LUT, -27 MHz vs rotating-read  |  reverted
The unifying principle (codified back into the framework)
Time-folding any radix-2 butterfly (Viterbi, FFT, a sorting network) forces a perfect-shuffle reorder: one side reads sequentially, the other strided. A reorder element is therefore unavoidable. The only question is where you pay for it. The three engines below are the three placements; the win is choosing the one whose reorder is free at your fold.
One reorder, three places to pay Pick the engine whose reorder is free at your fold. Indexed read mux reorder on the read wins at fold 4 and up best density area-optimal Rotating read reorder on the rotate wins at fold 2 free, no extra mux the deployed choice Dual-bank ping-pong reorder + wide mux never wins bigger and slower −27 MHz, reverted
The 3-engine taxonomy. The deployed policy: fold 2 rotating-read (clock-first), fold ≥ 4 demux (density-first).

Step 7: Deployment reality check (full 802.11a RX)

The fold-2 rotating-read engine was dropped into a full 802.11a receiver and run on raw samples. All 8 modulation/coding schemes decoded error-free, byte-exact to the soft reference, across three back-to-back packets with no reset between them (continuous re-arm). A fold-4 probe was rejected on a throughput-feasibility check: the hardest scheme needs 54 M trellis-steps/s and fold 4 at 147 MHz delivers only 37 M; fold 2 sustains 80 M. The standalone +13 MHz banks as slack in the receiver, whose binding path is the equalizer, not the decoder.

8/8 MCS error-free  |  +0.6 MHz on the worst path  |  -95 failing endpoints  |  resources flat
7

Resource and Trade-off Comparison

Versus the Xilinx PG027 Viterbi IP at comparable settings (soft 3, TB=42): the generated windowed decoder is competitive on logic and uses far fewer flip-flops, at zero DSP.

DesignLUTFFBRAMDSPNotes
Xilinx PG027 (soft3, TB=42)2,2601,72820configurable IP
This work, windowed (soft4, TB=42)2,60466720post-route xczu28dr-2
The same decoder, a Pareto front K=9 on UltraScale+: smaller fold is faster and larger. Fmax (MHz) logic (LUT) 300 340 380 fold 8 indexed 2,627 LUT / 322 fold 4 indexed 5,116 LUT / 323 fold 2 rotating 9,113 LUT / 351 Each clock is the median of a placement sweep. Every point is verified.
Median-of-sweep Fmax vs post-route LUT. Each point is a valid shipped configuration.
8

Verification

Correctness is a chain of bit-exact equalities, each layer proven against the one above it. The RTL is never the reference: it adapts to a cycle model that adapts to a math model that is proven against an independent library.

Three models, proven equal Each arrow is a bit-exact (0-LSB) match on the hardest inputs. Math reference maximum-likelihood (libfec / ka9q) Cycle-accurate model per clock, mirrors fold + saturate Verilog RTL K=7/9 x fold 1/2/4/8 validates validates
Three-model hierarchy. Each arrow is a bit-exact (zero-bit) equality on hardest-input vectors.
Math = model
The cycle-accurate model is bit-exact to the maximum-likelihood reference (libfec / ka9q) on impulse, all-zero, and hardest random codewords.
Model = RTL
0-LSB across K=7/9 x fold 1/2/4/8; value-transparent through the fold reorder.
Continuous re-arm
3 back-to-back packets, init held inactive after cold start, full bit count, 0 errors.
DO-254
Strict RTL coding discipline: zero violations across every generated module and fold factor.
SNR (BPSK / AWGN)K=7 BERK=9 BERCoding gain
1 dB6.24e-26.19e-2-
2 dB1.12e-25.39e-3K=9 ~2x better than K=7
3 dB8.54e-4< measurement floor-

BER waterfall, soft 4b, rate 1/2. K=9 buys coding gain at 4x the states: the constraint-length knob.

9

Design Guidelines

1: Measure the ceiling, do not guess it
A feasibility-ceiling probe told us the shared-mux removal was worth ~14%, so we restructured per-lane instead and got 23%. Probes redirect effort to the real lever.
2: A folded butterfly owes a reorder
You cannot avoid the perfect shuffle. Choose where to pay: rotate (free at fold 2) or read-mux (free at fold ≥ 4). Never a fixed-stride runtime index mux on the read.
3: Restructure around a recurrence, not inside it
The ACS loop cannot take a pipeline register. Every clock win came from feed-forwarding work out of the loop (branch-metric pipelining) or compacting its route.
4: Bit-width before folding for area
Narrowing shrinks the fixed costs folding cannot touch AND raises Fmax. Folding raises FF and lowers Fmax. Reach for width first.
5: Fmax is a distribution
Report the median of a place-directive sweep for route-bound cones, not a single lucky run. The K=9 fold-2 point spans 313-371 MHz across directives.
6: Publish the dead-ends
The dual-bank regression is what proved rotating-read was the right placement. A negative result is a first-class output.
10

Conclusion and Outlook

The recommended deployment is the windowed, fold-1 K=7 decoder at 160 MHz / 160 Mbps on Zynq-7010, the configuration shipped in a full 802.11a receiver, error-free across all eight schemes, at 1,918 LUT, 2 BRAM and zero DSP. Where area is the binding constraint, the fold-2 rotating-read variant holds the same clock class at smaller area; for many slow channels, fold 8 reaches 681 LUT. The same RTL closes at 433 MHz on UltraScale+.

The reusable part is wider than this one decoder, and wider than the one fix. It is the AI optimization loop: read the physical reports, separate a folding problem from a micro-architecture one, place the unavoidable reorder where it is free, build-measure-revert the dead ends, and prove every change bit-exact. The same reorder governs any time-folded radix-2 butterfly, so rotating a buffer and reading a fixed tap instead of indexing a runtime mux is the same move that helps a folded FFT or a folded sorting network. This decoder is where that loop was run and proven on real silicon, and the rule now lives in the framework for the next design.

In one line
Folding is an efficient area architecture; whether it keeps its clock is a micro-architecture decision. AI cut the area roughly fourfold and held the clock by placing the fold's reorder where it is free. Measured: fold 2, 147 to ~160 MHz, flat area, shipped.

APPENDIX ADesign Methodology