AI-driven FPGA resource optimization of a Viterbi decoder
A soft-decision Viterbi decoder, generated from a model proven equal to the math, then handed to an AI to
optimize its FPGA resources: fold the area down, diagnose that the lost clock is a micro-architecture problem,
not a folding one, and prove every change bit-exact. Deployed in a full 802.11a receiver.
160 MHz
Zynq-7010, 1 bit/clock
1,918
LUT (K=7, fold 1)
2 BRAM
0 DSP
8 / 8 MCS
error-free in RX PHY
Convolutional forward error correction sits in the data path of nearly every wireless standard, and the hard
engineering is not the decoding but fitting the decoder onto a cheap, small FPGA: an FPGA resource-optimization
problem in LUTs, flip-flops and block RAMs. This report covers a soft-decision Viterbi decoder built the
model-first way (a Python reference proven against the math, a cycle-accurate model proven against the reference,
and Verilog proven bit-exact against the model), then handed to an AI to optimize its FPGA resources the way an
engineer would: read the reports, pick a lever, measure, tell the right fix from the wrong one, back out of dead
ends, and verify every change bit-exact. The headline result is that folding cuts the area roughly fourfold while
the clock holds, once the read micro-architecture is right.
WHY FOLDING IS THE RIGHT LEVER (motivation)
A fixed application has a throughput ceiling, and the decoder only has to meet it. 802.11a tops out at 54 Mbps,
while a full-rate Viterbi at 160 MHz emits one bit per clock (near 160 Mbps), almost 3x more than the standard
asks. That spare clock is exactly what folding spends: time-multiplex the compute over several cycles instead of
laying down a fully parallel engine, and as long as throughput stays above 54 Mbps the reclaimed silicon is free.
Size the hardware to the required throughput, not the maximum achievable one.
THE CENTRAL CLAIM
Folding is an efficient area architecture, not a clock penalty. Whether a folded decoder keeps its clock is
decided by the FPGA micro-architecture: place the reorder a fold imposes where it is free, by rotating the buffer
and reading a fixed tap, and the folded engine keeps a single engine's clock at a folded engine's area.
1
Results at a Glance
▼
160.2MHz
Fmax (deployed)
K=7 fold 1, xc7z010-1, post-route, target 160 met
1,918LUT
Logic
740 FF, 2 BRAM, 0 DSP
160Mbps
Throughput
II = 1, one decoded bit per clock
433MHz
UltraScale+ reference
xczu28dr-2, same RTL, 2.4 ns target
0LSB
RTL vs model
bit-exact across K=7/9 x fold 1/2/4/8
THE TENSION THIS REPORT RESOLVES
Decoding throughput scales with the trellis: doubling the constraint length doubles the butterflies. To fit a
small part you fold (share one butterfly bank across several clocks), but folding a butterfly forces a data
reorder, and the obvious way to do the reorder (a phase-indexed read multiplexer) sits squarely on the timing
path and caps the clock. So how do you fold for area without paying for it in megahertz?
Three views of the same generated RTL. The deployed target is the 7-series Zynq; the UltraScale+
column is the same source at a tighter clock target, and the deep-rate column shows the family stretched to
rate 1/6.
Fold
Read engine
Fmax
LUT
Throughput
Use when
1
unfolded
160.2 MHz
1,918
160 Mbps
clock + throughput bound
2
rotating-read
~160 MHz
1,368
80 Mbps
balanced
4
demux
147 MHz
980
~37 Mbps
area bound
8
demux
147 MHz
681
~18 Mbps
many slow channels
xc7z010clg225-1, post-route. fold-1 from fold1_ws_final; fold 2/4/8 LUT post-route from the fold finding.
Fold
Read engine
Fmax (median of 3-place sweep)
LUT
FF
1
unfolded
(published)
10,885
405
2
rotating-read
350.8 MHz (312.8 / 350.8 / 371.2)
9,113
4,536
4
demux
322.8 MHz (289.9 / 322.8 / 336.5)
5,116
5,241
8
demux
322.0 MHz (316.9 / 322.0 / 347.3)
2,627
5,601
xczu28dr-ffvg1517-2-e, 2.4 ns target, K=9 (256 states). Fmax is the median of an AltSpreadLogic / Explore / ExtraPostPlacementOpt place sweep.
Soft / metric width
Fmax
LUT
FF
BER @ 2.5 dB
Note
4b / 11b (baseline)
150.1 MHz
2,700
797
1.04e-3
rate 1/6, fold 1
4b / 10b (free)
144.4 MHz
2,573
731
bit-exact
metric-width floor
3b / 9b
161.6 MHz
2,381
676
1.52e-3
~1.4x worse BER, -12% LUT
xc7z010-1, rate-1/6 K=7, post-route. Narrowing the datapath RAISES Fmax and cuts LUT; folding does the opposite.
SO WHAT
Every row is the same correct decoder. The fold and width knobs walk a Pareto front from "fast and large"
(fold 1, 160 Mbps) to "small and slow" (fold 8, 681 LUT). The rest of the report is how that front was found,
and why the fold-2 row reads ~160 MHz instead of the 147 MHz it started at.
2
Scope and Parameters
▼
What it decodes
Soft-decision Viterbi for rate 1/2 to 1/7 convolutional codes, constraint length
K = 3..9 (flagship K=7, 64 states). Standards: 802.11a/g, LTE, DVB, CCSDS, Intelsat.
Interface
Streaming, not block. II = 1 at fold 1: one trellis step and one decoded bit per clock,
no handshake stall. Continuous-input, self re-arming.
Knobs
fold {1,2,4,8} trades butterfly fabric for clocks-per-step; soft / metric width
trades coding gain for area; traceback {windowed, block, register-exchange} trades BRAM for latency.
Reference
The reference math is ka9q / libfec viterbi27, read line by line. Polynomials 0o133 / 0o171.
Survivor traceback depth 42 (the BER-cliff floor).
Parameter
Value
Meaning
K (constraint length)
7 (3..9)
2^(K-1) = 64 trellis states
SOFT_W
4 (1..8)
soft-decision input bits per symbol
METRIC_W
9
path-metric register width (re-normalized)
TB_DEPTH
42
traceback depth, ~6x K (BER-cliff safe)
POLY_A / POLY_B
0o133 / 0o171
generator polynomials (industry standard)
fold
1 (deployed), 2/4/8
clocks per trellis step = fold
FPGA target
xc7z010clg225-1
Zynq-7010 (ADALM-Pluto class); UltraScale+ for reference
3
Algorithm Fundamentals
▼
A convolutional encoder is a small finite-state machine: K-1 memory bits, one input bit per step,
n coded outputs. The decoder must find the single most likely path through the resulting trellis given noisy
soft observations. Viterbi does this exactly with one recurrence repeated per step, Add, Compare,
Select (ACS), then walks the surviving decisions backwards to recover the bits.
branch metric bm(s,b) = sum over outputs of |soft_in-expected_bit| (Hamming-ish distance)
ACS recurrence m_new[s] = min( m_old[s0] + bm0 , m_old[s1] + bm1 ) (the feedback loop)
survivor dec[s] = argmin of the two (1 bit written to memory)
traceback bit[t] = follow dec[] backwards TB_DEPTH steps from the best state
The ACS line is both the core and the constraint: m_new depends on m_old,
so it is a feedback loop. You cannot pipeline a register into the middle of it without changing throughput, which
makes the ACS cone a "recurrence" class timing problem, and it is why every clock-rate win in this project came
from restructuring the work around the loop, never inside it.
One ACS unit, one per state: 64 run every step (256 for K=9), paired into 32 radix-2 butterflies. The red loop is the recurrence that bounds the clock.
Algorithm to architecture
Math operation
Hardware
Silicon block
branch metric bm(s,b)
small adder tree (n soft inputs)
fabric LUT (feed-forward, pipelinable)
ACS min(m+bm)
add-compare-select cone
fabric LUT + CARRY (the recurrence)
survivor decisions
2^(K-1) x TB_DEPTH bit array
write-mirrored BRAM (2 read ports)
traceback
counter-driven windowed read-back
BRAM read + counter timeline
parallelism dial
fold = clocks per step
time-multiplex one butterfly bank
No multiplies anywhere: soft metrics are absolute differences, so the
whole decoder maps to LUT and BRAM with zero DSP. That is the bridge into the architecture: a
feed-forward metric front end, a recurrent ACS core, and a survivor memory with a traceback reader.
4
Architecture Overview
▼
Four composed leaves. Soft symbols enter the butterfly bank, the path-state register holds the 64
metrics and feeds them back, decisions stream into a write-mirrored survivor memory, and a counter-driven
windowed-traceback controller reads them out as bits. Control travels in the same registered struct as its data
(aligned-by-design), so a misbehaving stage is the bug, not a delay-chain artifact two stages downstream.
Top-level composition. The bfly + path_state pair is the recurrent ACS core; everything else is feed-forward.
Three traceback styles ship as Pareto variants on the same ACS core; pick BRAM vs latency:
Traceback
BRAM
Latency
LUT / FF (xczu28dr)
Fmax
Role
windowed
2
4·TB+1
2,604 / 667
433.3 MHz
deployed 802.11a (min logic)
block argmin
1
2·TB+4
3,846 / 1,287
422.5 MHz
half the BRAM
register exchange
0
TB+1
fabric survivors
-
lowest latency, short frames
post-route xczu28dr-2, soft4 / metric9 / TB=42. The windowed variant is the deployed flagship.
Next
That is the static architecture. Most of the
optimization work happened in one module, the folded read path, covered next.
5
Micro-Architecture: the folded read engine
▼
Click a module to inspect it. The two that matter for the clock are the recurrent ACS cone and the
folded read engine that feeds it.
ACS recurrence cone (the clock-bounding loop)
Add the branch metric, compare the two candidates, select the minimum, write it back as next
step's metric. The loop is m+bm -> min -> register -> m. A register dropped inside it lands in
the loop and cuts throughput, so it cannot be pipelined. The levers are: feed-forward the branch-metric add
out of the loop (branch-metric pipelining), and keep the loop's all-to-all route physically compact.
Parameter
Value
Notes
Class
recurrence
feedback loop, not feed-forward
Width
9 b/state
path metric, re-normalized each step
Instances
64 (K=7) / 256 (K=9)
one per state, fold-shared
Pipelinable
no (inside loop)
only feed-forward parts can be cut
Folded read engine: the centerpiece
When you fold, one butterfly bank serves several trellis positions across fold
clocks, so each clock must read a different slice of the metric bank. The naive way is a
phase-indexed read multiplexer: bank[cur_ph*stride + lane]. That multiplexer's
select (cur_ph) fans out across the whole wide bank and sits on the read path: it caps the clock.
The fix is rotating-read: rotate the buffer by the stride each phase and read a
fixed tap. The phase counter no longer touches the wide read.
Same bank; only the read pattern changes. The indexed read (top) keeps cur_ph on the wide read and caps the clock; the fixed tap on a rotating bank (bottom) takes it off and lifts it.
fold 2, K=7 (Zynq-7010)
Fmax
LUT
FF
demux
~143 MHz
1,345
1,074
rotating-read
~160 MHz (+12%)
1,368
1,072
Design rule extracted
A fixed-stride windowed read must rotate a buffer and read a fixed tap, never a runtime
phase-indexed read mux. That holds when the rotation mux is no larger than the mux it replaces. At fold 2 it
replaces the commit mux exactly, so it is free; at fold ≥ 4 the rotation mux scales with the bank and the
demux is the area-optimal choice. Hence the per-fold policy.
Windowed traceback controller
A DO-254 counter-driven controller reads the survivor memory back over a window of TB_DEPTH=42
decisions, emitting one decoded bit per step at steady state. A single counter timeline (position within the
window, a two-bit rolling region, and a saturating startup phase) drives every control signal as a decode;
there is no enumerated state machine, which keeps branchy next-state logic off the recurrence read path.
Uniform latency 4·TB+1; no data-dependent timing. Outputs are registered.
Parameter
Value
Notes
Latency
4·TB+1 = 169 cy
cold start; steady state 1 bit/clk
Style
counter-driven timeline
single counter, control derived as decodes
BRAM reads/clock
2 (mirrored)
one per chase engine: merge + decode
Write-mirrored survivor memory
2^(K-1) decision bits per step, inferred as block RAM via UG901 templates (no UNISIM
instantiation). The windowed traceback issues two reads per clock (one for the merge engine, one for the
decode engine) plus one write, which a single block RAM cannot serve, so the memory is two identical
mirrors, both written every cycle, each supplying one chase read port. Addressing rolls through four regions
so a write and a read normally target different regions; the one same-cycle, same-address case (a region's
newest entry, read by the merge engine the cycle it is written) is forwarded by a 1-deep write-first bypass
in the core, so a read-during-write never returns stale data.
Parameter
Value
Notes
Geometry
64 b x 256 (4 regions)
K=7; 256 b wide for K=9
Block
2x RAMB (write-mirrored)
two copies = two read ports
Collision
same-addr R/W
1-deep write-first bypass forwards the new word
6
The Optimization Journey
▼
This is the loop AI ran on the design: folding is the headline area lever, but it appears to cost
clock until AI reads the critical path and pins the loss on the read micro-architecture, not on folding. The
wins came from restructuring that read, from per-lane metric select, and from bit-width, each measured rather
than guessed; one alternative regressed and was reverted. Every number below is post-route on the named part.
The clock across the journey: from naive folding to the rotating-read micro-architecture, the folded fold-2 clock climbs back to the full-rate class; folding alone did not.
Step 1: Folding alone is a weak area lever
Sharing the butterfly across 8 clocks cut LUT 31% but doubled FF and dropped Fmax 34%. The
wholesale metric-bank control (a fanout-675 reset and fanout-640 commit net spanning the whole bank) floored
folded Fmax around 100-120 MHz on 7-series. The fixed costs (shared branch-metric table, full-width bank) are
exactly what folding cannot shrink.
Fmax -34% | LUT -31% | FF +144% | throughput /8
Step 2: Per-lane branch metric (remove the shared 64:1 mux)
A feasibility-ceiling probe showed removing the shared metric mux was worth ~14%, so instead of
chasing the mux, each lane keeps only the few metrics it needs, selected locally. Fanout collapses and the
selects shrink.
The one structural change that broke the fold-2 ceiling, and the central claim of this report.
Step 4: Pipeline the branch metric for deep rates
At rate 1/6 the branch metric sums six soft inputs, a tree that sat combinationally in front of
the ACS add and lengthened the recurrence cone. Registering that sum (a feed-forward cut, +1 cycle latency)
took the tree out of the loop.
Narrowing soft 4b/metric 11b to 3b/9b shrank both the metric table and the bank, the fixed costs
folding could not touch. It raised Fmax (shorter datapath) while folding lowers it. The cost is a
~1.4x BER penalty that is still ~2x better than the rate-1/2 baseline.
A dual-bank ping-pong read was tried as an alternative reorder. It added a spare bank and a
full-width toggle mux and still needed the parallel-window read. Measured on the target part it
regressed: +86 LUT and -27 MHz versus rotating-read (bigger AND slower: the spare bank
costs the area). Reverted. The negative result is what
confirmed rotating-read was the right reorder placement, not a lucky one.
+86 LUT, -27 MHz vs rotating-read | reverted
The unifying principle (codified back into the framework)
Time-folding any radix-2 butterfly (Viterbi, FFT, a sorting network) forces a perfect-shuffle
reorder: one side reads sequentially, the other strided. A reorder element is therefore unavoidable.
The only question is where you pay for it. The three engines below are the three placements; the win is
choosing the one whose reorder is free at your fold.
The 3-engine taxonomy. The deployed policy: fold 2 rotating-read (clock-first), fold ≥ 4 demux (density-first).
The fold-2 rotating-read engine was dropped into a full 802.11a receiver and run on raw samples.
All 8 modulation/coding schemes decoded error-free, byte-exact to the soft reference, across three back-to-back
packets with no reset between them (continuous re-arm). A fold-4 probe was rejected on a throughput-feasibility
check: the hardest scheme needs 54 M trellis-steps/s and fold 4 at 147 MHz delivers only 37 M; fold 2 sustains
80 M. The standalone +13 MHz banks as slack in the receiver, whose binding path is the equalizer, not the
decoder.
8/8 MCS error-free | +0.6 MHz on the worst path | -95 failing endpoints | resources flat
7
Resource and Trade-off Comparison
▼
Versus the Xilinx PG027 Viterbi IP at comparable settings (soft 3, TB=42): the generated windowed
decoder is competitive on logic and uses far fewer flip-flops, at zero DSP.
Design
LUT
FF
BRAM
DSP
Notes
Xilinx PG027 (soft3, TB=42)
2,260
1,728
2
0
configurable IP
This work, windowed (soft4, TB=42)
2,604
667
2
0
post-route xczu28dr-2
Median-of-sweep Fmax vs post-route LUT. Each point is a valid shipped configuration.
8
Verification
▼
Correctness is a chain of bit-exact equalities, each layer proven against the one above it. The RTL
is never the reference: it adapts to a cycle model that adapts to a math model that is proven against an
independent library.
Three-model hierarchy. Each arrow is a bit-exact (zero-bit) equality on hardest-input vectors.
Math = model
The cycle-accurate model is bit-exact to the maximum-likelihood reference (libfec / ka9q) on impulse, all-zero, and hardest random codewords.
Model = RTL
0-LSB across K=7/9 x fold 1/2/4/8; value-transparent through the fold reorder.
Continuous re-arm
3 back-to-back packets, init held inactive after cold start, full bit count, 0 errors.
DO-254
Strict RTL coding discipline: zero violations across every generated module and fold factor.
SNR (BPSK / AWGN)
K=7 BER
K=9 BER
Coding gain
1 dB
6.24e-2
6.19e-2
-
2 dB
1.12e-2
5.39e-3
K=9 ~2x better than K=7
3 dB
8.54e-4
< measurement floor
-
BER waterfall, soft 4b, rate 1/2. K=9 buys coding gain at 4x the states: the constraint-length knob.
9
Design Guidelines
▼
1: Measure the ceiling, do not guess it
A feasibility-ceiling probe told us the shared-mux removal was worth ~14%, so we restructured per-lane instead and got 23%. Probes redirect effort to the real lever.
2: A folded butterfly owes a reorder
You cannot avoid the perfect shuffle. Choose where to pay: rotate (free at fold 2) or read-mux (free at fold ≥ 4). Never a fixed-stride runtime index mux on the read.
3: Restructure around a recurrence, not inside it
The ACS loop cannot take a pipeline register. Every clock win came from feed-forwarding work out of the loop (branch-metric pipelining) or compacting its route.
4: Bit-width before folding for area
Narrowing shrinks the fixed costs folding cannot touch AND raises Fmax. Folding raises FF and lowers Fmax. Reach for width first.
5: Fmax is a distribution
Report the median of a place-directive sweep for route-bound cones, not a single lucky run. The K=9 fold-2 point spans 313-371 MHz across directives.
6: Publish the dead-ends
The dual-bank regression is what proved rotating-read was the right placement. A negative result is a first-class output.
10
Conclusion and Outlook
▼
The recommended deployment is the windowed, fold-1 K=7 decoder at 160 MHz / 160 Mbps on Zynq-7010,
the configuration shipped in a full 802.11a receiver, error-free across all eight schemes, at 1,918 LUT, 2 BRAM
and zero DSP. Where area is the binding constraint, the fold-2 rotating-read variant holds the same clock class
at smaller area; for many slow channels, fold 8 reaches 681 LUT. The same RTL closes at 433 MHz on UltraScale+.
The reusable part is wider than this one decoder, and wider than the one fix. It is the AI
optimization loop: read the physical reports, separate a folding problem from a micro-architecture one, place
the unavoidable reorder where it is free, build-measure-revert the dead ends, and prove every change bit-exact.
The same reorder governs any time-folded radix-2 butterfly, so rotating a buffer and reading a fixed tap instead
of indexing a runtime mux is the same move that helps a folded FFT or a folded sorting network. This decoder is
where that loop was run and proven on real silicon, and the rule now lives in the framework for the next design.
In one line
Folding is an efficient area architecture; whether it keeps its clock is a micro-architecture decision. AI cut
the area roughly fourfold and held the clock by placing the fold's reorder where it is free. Measured: fold 2,
147 to ~160 MHz, flat area, shipped.
▼
APPENDIX ADesign Methodology
The principles that made the journey repeatable, with the failure each one prevents.
A. The model is the reference (model-first)
All structural changes (per-lane metric select, rotating read, branch-metric pipelining) were made first in the cycle-accurate model,
proven bit-exact there, then translated to Verilog and re-proven. The RTL is never edited to chase a number; the
model is the hardware truth. Prevents: model divergence that hides fold reorder bugs.
B. Feasibility ceiling before build
Before implementing any cited mechanism, measure a baseline, a ceiling, and the simpler
alternatives first. The shared-mux ceiling probe (worth ~14%) redirected the effort to the per-lane
restructure (worth 23%). A probe is representative, not exhaustive: it measures an ideal structure, so it can
under-estimate; build the real thing when the target structure is clean.
C. Fmax is a distribution (A/B at the same target)
Route-bound cones vary 80-200 ps across place directives. Every fold Fmax in this report is the
median of a 3-directive sweep at the same clock target. Comparing two engines means median-vs-median, same part,
same target, same session: never a single run against a differently-measured baseline.
D. Recurrence cone levers
A feedback-loop cone (ACS, a non-restoring divider) cannot be fixed by a mid-route register (it
lands in the loop) or a floorplan alone. The levers, in order: re-read the SOTA reference (the loop may be a
wrong-algorithm artifact); feed-forward the non-recurrent parts around it (branch-metric pipelining); or split it into a slower
clock domain. This project used the first two.
E. Hardest input, continuous operation
Test vectors include the hardest realistic codewords and valid-gap stimulus, and a continuous-input
design must self-re-arm with no reset between operations. The 802.11a integration ran three back-to-back packets
with init held inactive to prove the re-arm: a reset-between-packets harness would have hidden a re-arm
dependency.
F. Inherit and compose verified IP
Each RTL module is generated from a parameterized, separately verified building block, and the fold
engines are checked bit-exact against the reference. A correction to one engine
propagates to every design that composes it: the fold-2 engine the 802.11a receiver inherited is the same one
proven in the standalone decoder.