IP CORE · AI-OPTIMIZED

Viterbi Decoder

A soft-decision Viterbi decoder for the convolutional codes that front almost every wireless and satellite link: Wi-Fi, DVB, 3GPP/LTE, WiMAX, and CCSDS deep-space. The core is generated from a Python algorithm and resource-optimized by AI: folding shrinks the logic to fit a small FPGA, and the right micro-architecture preserves the clock rate. The result is a family of resource-versus-clock operating points, from full rate down to one quarter of the logic. Each point is verified bit-for-bit, and none uses a DSP block.

FLAGSHIP

K=7 rate-1/2, placed and routed on UltraScale+

The flagship configuration is the constraint-length-7, rate-1/2 soft-decision decoder, the industry-standard code used by 802.11a and CCSDS. It was placed and routed on a Zynq UltraScale+ RFSoC device. Every figure below comes from Vivado's Design Timing Summary and post-route utilization reports.
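The decoder's input comes from a 7-bit shift-register encoder with two parity taps. This sketch makes two assumptions. First, it uses the generator pair commonly quoted for this code, 133 and 171 in octal; the page does not list the polynomials. Second, it puts the newest input bit in the MSB of the register. `conv_encode` and `parity` are illustrative names, not part of the core's deliverables.

```python
# Sketch of a K=7, rate-1/2 convolutional encoder.
# ASSUMPTION: generators 133/171 (octal) and newest-bit-in-MSB ordering.
K = 7
G = (0o133, 0o171)


def parity(x: int) -> int:
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1


def conv_encode(bits, k=K, polys=G):
    """Encode a list of 0/1 bits; returns len(bits) * len(polys) coded bits."""
    reg = 0
    out = []
    for b in bits:
        # Shift the new bit into the MSB of a k-bit window; the oldest bit falls off.
        reg = (reg >> 1) | (b << (k - 1))
        for g in polys:
            out.append(parity(reg & g))  # one parity bit per generator
    return out


if __name__ == "__main__":
    # Impulse response: a single 1 followed by K-1 zeros.
    print(conv_encode([1, 0, 0, 0, 0, 0, 0]))
```

Feeding in a single 1 followed by six 0s returns the two generators interleaved, MSB first: `[1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1]`.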

527 MHz
Honest-ceiling clock (median of a place-directive sweep), Zynq UltraScale+ (xczu28dr) MEASURED
527 Mbps
One decoded bit per clock at the closed clock, continuous streaming MEASURED
0 DSP
Zero DSP48 blocks: the whole decoder lives in logic and block RAM MEASURED
2,604 LUT
667 flip-flops, 2 block RAMs: a compact, predictable footprint MEASURED

Measured on two devices, both real silicon

The same generated core was implemented on a high-performance UltraScale+ part and on the low-cost Zynq-7010 (the ADALM-Pluto SDR device). We publish both, including the speed the low-cost part actually reaches:

Result | Value | Provenance
K=7 closed clock, Zynq UltraScale+ (xczu28dr-2) | 527 MHz | Vivado P&R honest ceiling, median of place sweep MEASURED
K=7 closed clock, low-cost Zynq-7010 (-1) | 162 MHz | Vivado P&R honest ceiling, clears 160 MHz MEASURED
K=7 footprint, Zynq-7010 | 1,918 LUT / 740 FF / 2 BRAM / 0 DSP | Post-route utilization MEASURED
Half-hardware layout (fold 2), Zynq-7010 | 1,367 LUT @ 160 MHz | -29% logic at the full-rate clock MEASURED
K=9 256-state decoder, UltraScale+ | 405 MHz / 11,326 LUT / 8 BRAM | Vivado P&R honest ceiling, stronger coding gain MEASURED

Open vs encrypted: whereas the common commercial Viterbi core ships as encrypted RTL, our decoder ships as readable, synthesizable source. It is verified bit-for-bit against a reference implementation (ka9q/libfec, the decoder behind gr-ieee802-11), so you can audit, port, and trust every gate.

Fold knob: trade logic for throughput

A single fold parameter time-multiplexes the add-compare-select engine across F clocks, shrinking the butterfly logic for channels that do not need full line rate. The 256-state K=9 core, on UltraScale+:

K=9 configuration | LUT | Clock | Throughput
fold 1, full rate (1 bit/clock) | 10,885 | 405 MHz | 405 Mbps MEASURED
fold 2, rotating-read (clock-first) | 9,113 | 351 MHz | 175 Mbps MEASURED
fold 4 | 5,116 | 323 MHz | 81 Mbps MEASURED
fold 8, smallest area | 2,627 | 322 MHz | 40 Mbps MEASURED
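The Throughput column is the clock divided by the fold factor, because a fold of F spends F clocks on each decoded bit. Here is a minimal sketch of that arithmetic. It assumes one decoded bit per trellis step, as stated for the full-rate core, and `throughput_mbps` is a hypothetical helper. The sketch returns unrounded values (175.5 Mbps for fold 2, for example); the table lists whole numbers.

```python
def throughput_mbps(clock_mhz, fold, bits_per_step=1):
    """Decoded throughput for a core folded by `fold`.

    Assumes bits_per_step decoded bits per trellis step and `fold` clocks
    per step (fold 1 = full rate).
    """
    if fold < 1:
        raise ValueError("fold must be >= 1")
    return clock_mhz * bits_per_step / fold


# Measured clocks (MHz) taken from the K=9 fold table.
K9_CLOCKS = {1: 405, 2: 351, 4: 323, 8: 322}

for fold, clock in K9_CLOCKS.items():
    print(f"fold {fold}: {throughput_mbps(clock, fold):.2f} Mbps")
```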

Two fold engines are used, chosen per fold factor:

- Fold 2 uses the rotating-read engine (no select mux on the metric read) to hold a near-full-rate clock: 351 MHz with half the butterfly hardware.
- Folds 4 and 8 use the area-optimal engine for density. At fold 8 it shrinks the 256-state core by 76% (10,885 to 2,627 LUT), so many low-rate channels fit on one device.

The K=7 (64-state) core folds the same way. From fold 1 to fold 8, its logic drops from 2,301 to 737 LUT on Zynq-7010, and its clock goes from 527 to 347 MHz on UltraScale+. Every fold point is verified bit-for-bit against the reference decoder. Clock is the median of a place-directive sweep; LUT and clock figures are both post-place-and-route.
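The schedule behind the fold knob can be modeled behaviorally. In this hypothetical sketch, one step of the 64-state K=7 trellis is computed as 32 add-compare-select butterflies. With fold F, only 32/F butterflies run per clock, and the same butterfly logic is reused F times. Reads come from the previous step's metrics and writes go to a separate buffer, so the result does not depend on the fold.

Several pieces are assumptions: the trellis convention, the 133/171 generators, and the 4-bit distance metric. The model does not distinguish the rotating-read engine from the area-optimal one.

```python
# Hypothetical behavioral model of one ACS trellis step, unfolded vs folded.
# ASSUMED trellis: register = (bit << (K-1)) | state, next = (bit << (K-2)) | (state >> 1).
K = 7
N_STATES = 1 << (K - 1)      # 64 states for K=7
N_BFLY = N_STATES // 2       # 32 butterflies
G = (0o133, 0o171)           # assumed generator pair
QMAX = 15                    # 4-bit soft values: 0 = confident '0', 15 = confident '1'


def parity(x):
    return bin(x).count("1") & 1


def branch_metric(state, bit, rx):
    """Soft distance between the branch's coded bits and the received pair."""
    reg = (bit << (K - 1)) | state
    return sum(r if parity(reg & g) == 0 else QMAX - r for g, r in zip(G, rx))


def butterfly(j, old, new, rx):
    """ACS for butterfly j: predecessors 2j, 2j+1 -> successors j, j + N_BFLY."""
    for bit, succ in ((0, j), (1, j + N_BFLY)):
        new[succ] = min(old[2 * j] + branch_metric(2 * j, bit, rx),
                        old[2 * j + 1] + branch_metric(2 * j + 1, bit, rx))


def acs_step(old, rx, fold=1):
    """One trellis step; with fold F the butterflies are spread over F clocks."""
    if N_BFLY % fold:
        raise ValueError("fold must divide the butterfly count")
    new = [0] * N_STATES
    per_clock = N_BFLY // fold            # butterfly units a folded design would build
    for clk in range(fold):               # F clocks per trellis step
        for j in range(clk * per_clock, (clk + 1) * per_clock):
            butterfly(j, old, new, rx)    # the same logic, reused on each clock
    return new
```

Because reads and writes use separate buffers here, any grouping of butterflies gives the same metrics. The difficulty in real hardware is meeting timing on the read path, which is what the two engines address.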

Folding without the clock penalty. On the low-cost Zynq-7010, the half-hardware layout (fold 2) now closes at 160 MHz, the same clock as the unfolded rate-1/2 decoder. A folded channel therefore pays no clock penalty while spending half the add-compare-select logic (1,367 LUT / 1,073 FF). A streaming read pattern (a fixed window over a rotating metric bank) removes the per-step selection that used to hold a time-folded clock about 13 MHz below the full decoder. The layout is bit-exact against the reference. It has also been validated in a full Wi-Fi receiver across every modulation and coding rate, packet after packet with no reset between them. MEASURED Read the case study.
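One way to picture the difference assumes the metric memory is split into F banks and the engine visits one bank per clock. A mux-based read needs a select signal that changes every clock. A rotating read always taps the same window position and moves the bank contents instead. The model below is a hypothetical illustration of that idea, not the core's RTL. Both functions return the same sequence, so the change affects timing, not results.

```python
from collections import deque


def mux_read(banks):
    """Select-based read: clock t picks banks[t] through an F:1 multiplexer."""
    return [banks[t] for t in range(len(banks))]


def rotating_read(banks):
    """Fixed-window read: always take position 0, then rotate the ring by one."""
    ring = deque(banks)
    seen = []
    for _ in range(len(banks)):
        seen.append(ring[0])   # fixed wiring: no select signal needed
        ring.rotate(-1)        # the next bank slides into the window
    return seen
```

After F rotations, the ring is back in its starting order, ready for the next trellis step.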

Code rates 1/2 through 1/7, one core

The decoder spans the full convolutional-FEC rate range. Lower rates add redundancy for more coding gain (a larger free distance); the same architecture serves them all at a near-constant clock. Higher rates (2/3, 3/4) are reached by puncturing the rate-1/2 base, so they reuse the rate-1/2 core unchanged.
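Puncturing deletes some coded bits from the rate-1/2 stream according to a repeating pattern. The sketch assumes the 802.11a patterns:

- Rate 2/3 drops every second B bit.
- Rate 3/4 drops B on the second step and A on the third.

Other standards define their own patterns. `PATTERNS` and `puncture` are illustrative names.

```python
# Puncture patterns per output stream (A = first generator, B = second).
# ASSUMPTION: 802.11a patterns; 1 = transmit, 0 = delete.
PATTERNS = {
    "2/3": ([1, 1], [1, 0]),
    "3/4": ([1, 1, 0], [1, 0, 1]),
}


def puncture(coded, rate):
    """Drop bits from an interleaved A, B, A, B, ... rate-1/2 stream."""
    pa, pb = PATTERNS[rate]
    period = len(pa)
    out = []
    for i in range(0, len(coded), 2):
        t = (i // 2) % period          # position within the puncture period
        if pa[t]:
            out.append(coded[i])       # keep A bit
        if pb[t]:
            out.append(coded[i + 1])   # keep B bit
    return out
```

At rate 3/4, twelve coded bits (six input bits) become eight. At rate 2/3, eight coded bits (four input bits) become six. That is where the rate names come from.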

Code rate | Free distance | Clock, Zynq-7010 | Clock, UltraScale+
1/2 (base code) | 10 | 160 MHz MEASURED | 527 MHz MEASURED
1/3 | 15 | 167 MHz MEASURED | 449 MHz MEASURED
1/4 | 20 | 162 MHz MEASURED | 449 MHz MEASURED
1/5 | 24 | 163 MHz MEASURED | 433 MHz MEASURED
1/6 | 30 | 164 MHz MEASURED | 454 MHz MEASURED
1/7 | 33 | 163 MHz MEASURED | 442 MHz MEASURED

Higher free distance means stronger error correction: at the same signal level, the 1/3 code roughly halves the residual errors of 1/2, and the lower rates push further. Each rate is verified bit-for-bit against the reference decoder and uses zero DSP blocks and two block RAMs. A branch-metric pipeline keeps the deep rates (1/4 and below) at full clock on the low-cost part. The Zynq-7010 deep-rate clocks are for the deployable pipelined configuration. The UltraScale+ figures are single-run place-and-route results; the exception is rate 1/2, whose 527 MHz is the place-sweep ceiling.
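For a rate-1/n code, every trellis branch carries n coded bits, so there are only 2^n distinct branch metrics per step, each a sum of n soft terms. That growing sum is one plausible reason deep rates benefit from the branch-metric pipeline mentioned above: the adder tree gets deeper as n rises. The sketch assumes 4-bit soft inputs (0 = confident 0, 15 = confident 1) and a distance-style metric where smaller is better. The function names are hypothetical.

```python
QMAX = 15  # ASSUMED 4-bit soft encoding: 0 = confident '0', 15 = confident '1'


def branch_metric_table(rx):
    """All 2**n branch metrics for one trellis step of a rate-1/n code.

    Index c encodes a candidate codeword: bit i of c is the coded bit
    compared against soft input rx[i]. Smaller metric = better match.
    """
    n = len(rx)
    table = []
    for c in range(1 << n):
        m = 0
        for i, r in enumerate(rx):
            bit = (c >> i) & 1
            m += (QMAX - r) if bit else r   # distance from the soft value to the bit
        table.append(m)
    return table


def adder_tree_depth(n):
    """Levels of two-input adders needed to sum n terms, i.e. ceil(log2 n)."""
    return (n - 1).bit_length()
```

Rate 1/2 needs 4 metrics through a one-level adder. Rate 1/7 needs 128 metrics, each through a three-level adder tree.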

CATALOG

Configurations

K=7 rate-1/2 (the standard code)

Wi-Fi · DVB · LTE · CCSDS

The constraint-length-7, 64-state soft-decision decoder. Higher code rates (2/3, 3/4) are supported by a depuncture front-end, so the core stays rate-agnostic. One decoded bit per clock, continuous streaming.

Clock (UltraScale+): 527 MHz MEASURED
Logic / DSP: 2,604 LUT / 0 DSP MEASURED
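A depuncture front-end restores the rate-1/2 framing. It puts a neutral soft value wherever the transmitter deleted a bit, so the decoder core sees the same two values per step at every rate. The sketch assumes the 802.11a puncture patterns and signed soft values in which 0 carries no information. `depuncture` and `ERASURE` are hypothetical names.

```python
# ASSUMPTIONS: 802.11a puncture patterns; signed soft values where 0 = no information.
PATTERNS = {"2/3": ([1, 1], [1, 0]), "3/4": ([1, 1, 0], [1, 0, 1])}
ERASURE = 0


def depuncture(soft, rate, n_steps):
    """Rebuild an interleaved A, B stream of 2 * n_steps soft values."""
    pa, pb = PATTERNS[rate]
    period = len(pa)
    received = iter(soft)
    out = []
    for step in range(n_steps):
        t = step % period
        out.append(next(received) if pa[t] else ERASURE)  # A position
        out.append(next(received) if pb[t] else ERASURE)  # B position
    return out
```

With a correlation-style branch metric, an inserted 0 adds nothing to either hypothesis for that coded bit, so it cannot bias the decision.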

K=9 256-state

stronger coding gain

The constraint-length-9, 256-state code for links that need roughly an extra decibel of coding gain. Generated from the same architecture: the state count is a single parameter, verified bit-exact end to end.

Clock (UltraScale+): 405 MHz MEASURED
Block RAM: 8 MEASURED
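The state count follows directly from the constraint length as 2^(K-1), which is why K=7 gives 64 states and K=9 gives 256. Here is a sketch of a K-parameterized trellis. It assumes a shift-register convention in which the newest input bit enters at the top of the state. The helper names are illustrative.

```python
def trellis(k):
    """Trellis shape for constraint length k (assumed newest-bit-in-MSB convention)."""
    n_states = 1 << (k - 1)

    def next_state(state, bit):
        # Drop the oldest bit and insert the new one at the top.
        return (bit << (k - 2)) | (state >> 1)

    return n_states, n_states // 2, next_state


def predecessor_counts(k):
    """How many (state, bit) branches enter each state; 2 for every state."""
    n_states, _, nxt = trellis(k)
    counts = [0] * n_states
    for s in range(n_states):
        for b in (0, 1):
            counts[nxt(s, b)] += 1
    return counts
```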

Three latency / area layouts

pick your trade

The survivor-path memory ships in three forms: a low-latency register-exchange layout, a uniform-latency windowed layout (2 block RAMs), and a half-block-RAM layout, covering minimum latency, minimum area, and the balanced middle.

Latency (windowed): 4×traceback + 1
Initiation interval: II = 1
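This minimal behavioral sketch of a soft-decision K=7 decoder shows the two survivor-memory styles the layouts are built on:

- Traceback stores one decision bit per state per step and walks back at the end.
- Register exchange copies each state's whole decoded path every step, trading more data movement for no traceback pass.

The sketch assumes the 133/171 generator pair, 4-bit soft values, zero-tailed frames, and a whole-frame traceback. The windowed hardware layout instead traces back over a fixed depth, which this sketch does not model. All names are hypothetical.

```python
K = 7
G = (0o133, 0o171)            # ASSUMED generator pair (802.11a style)
N_STATES = 1 << (K - 1)       # 64 states
QMAX = 15                     # 4-bit soft values: 0 = confident '0', 15 = confident '1'


def parity(x):
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def encode(bits):
    """Rate-1/2 encoder with the newest bit in the MSB of a K-bit register."""
    reg, out = 0, []
    for b in bits:
        reg = (reg >> 1) | (b << (K - 1))
        out += [parity(reg & g) for g in G]
    return out


def branch_metric(state, bit, rx):
    """Soft distance between the branch's coded bits and the received pair."""
    reg = (bit << (K - 1)) | state
    return sum(r if parity(reg & g) == 0 else QMAX - r for g, r in zip(G, rx))


def viterbi(soft, survivor="traceback"):
    """Decode one zero-started, zero-tailed frame of interleaved soft pairs.

    survivor="traceback": store one decision bit per state per step, walk back.
    survivor="register_exchange": carry each state's full decoded path forward.
    """
    inf = float("inf")
    metric = [0] + [inf] * (N_STATES - 1)      # encoder starts in state 0
    decisions = []                             # traceback memory
    paths = [[] for _ in range(N_STATES)]      # register-exchange memory
    for i in range(0, len(soft), 2):
        rx = soft[i:i + 2]
        new_metric = [inf] * N_STATES
        new_paths = [None] * N_STATES
        dec = [0] * N_STATES
        for ns in range(N_STATES):
            bit = ns >> (K - 2)                  # input bit that leads into ns
            p0 = (ns << 1) & (N_STATES - 1)      # the two predecessor states
            p1 = p0 | 1
            m0 = metric[p0] + branch_metric(p0, bit, rx)   # add
            m1 = metric[p1] + branch_metric(p1, bit, rx)
            dec[ns] = int(m1 < m0)               # compare
            new_metric[ns] = min(m0, m1)         # select
            new_paths[ns] = paths[p1 if dec[ns] else p0] + [bit]
        metric, paths = new_metric, new_paths
        decisions.append(dec)
    if survivor == "register_exchange":
        return paths[0]                        # zero tail: frame ends in state 0
    bits, s = [], 0                            # traceback starts from state 0
    for dec in reversed(decisions):
        bits.append(s >> (K - 2))              # input bit that produced state s
        s = ((s << 1) & (N_STATES - 1)) | dec[s]
    return bits[::-1]
```

Both survivor styles apply the same decisions, so they return the same bits. They differ only in how the survivor memory is organized.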

Continuous free-running

no per-frame reset

One cold start, then the decoder runs a continuous packet stream with no reset between frames. It re-anchors on each frame's tail bits on its own. It also tolerates idle gaps between packets, freezing and resuming with no loss. Simpler to integrate: no per-frame init handshake.

Frames back-to-back, no reset: bit-exact MEASURED
Idle gaps between packets: tolerated MEASURED
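Re-anchoring on tail bits works because a frame terminated with K-1 zero bits always drives the encoder back to the all-zero state, whatever the data were. The decoder can therefore assume state 0 at every frame boundary without being reset. The sketch assumes zero-tail termination, as 802.11a uses, and a newest-bit-in-MSB state convention. The function names are hypothetical.

```python
K = 7  # constraint length; the tail is K-1 zero bits


def encoder_state_after(bits, k=K, start=0):
    """State (the last k-1 input bits) of the encoder after shifting in `bits`."""
    state = start
    for b in bits:
        state = (b << (k - 2)) | (state >> 1)
    return state


def frames_with_tail(frames, k=K):
    """Concatenate frames back to back, appending k-1 zero tail bits to each."""
    stream = []
    for f in frames:
        stream += list(f) + [0] * (k - 1)
    return stream
```

After each frame's tail the encoder state is 0 again. The next frame therefore starts from the same state as a cold start, which is what lets frames run back to back with no reset.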

K=9 pulls ahead of K=7, and the lead grows with signal quality

The longer constraint length gives the K=9 code a larger free distance, so its error-rate waterfall is steeper: as the signal improves, the K=9 decoder's margin over K=7 widens. Measured over matched traffic, rate-1/2 BPSK on an AWGN channel with 4-bit soft decisions:

Eb/N0 | K=7 bit-error rate | K=9 bit-error rate | K=9 advantage
2.0 dB | 1.1×10⁻² | 6.2×10⁻³ | 1.8× fewer errors MEASURED
2.5 dB | 3.1×10⁻³ | 1.2×10⁻³ | 2.7× fewer errors MEASURED

Both codes already beat uncoded transmission by a wide margin: the classic rate-1/2 waterfall. K=9 then roughly halves K=7's residual errors at 2 dB, and does better still as the channel improves: the extra coding gain its constraint length is meant to deliver, confirmed bit by bit. Need a different constraint length, polynomial, soft-bit width, or traceback depth? See IP customization.
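The measurement conditions above (rate-1/2 BPSK, AWGN, 4-bit soft decisions) can be approximated in simulation along these lines. The sketch assumes:

- unit-energy BPSK with bit 1 mapped to +1;
- a uniform 4-bit quantizer with step 0.25, centred on zero;
- Python's `random.gauss` as the noise source.

The quantizer step and all names are assumptions, not the settings behind the published numbers.

```python
import math
import random


def noise_sigma(ebn0_db, rate):
    """Noise std-dev per real dimension for unit-energy BPSK at code rate `rate`."""
    ebn0 = 10 ** (ebn0_db / 10)
    return math.sqrt(1 / (2 * rate * ebn0))   # sigma^2 = N0/2 with Es = 1, Es/N0 = rate * Eb/N0


def quantize4(y, step=0.25):
    """Map a received sample to a 4-bit soft value 0..15 (0 = strong '0').

    ASSUMPTION: uniform quantizer centred on 0 with the given step size.
    """
    q = math.floor(y / step) + 8
    return min(15, max(0, q))


def channel(coded_bits, ebn0_db, rate=0.5, rng=random):
    """BPSK-modulate, add Gaussian noise, and quantize to 4-bit soft values."""
    sigma = noise_sigma(ebn0_db, rate)
    soft = []
    for c in coded_bits:
        x = 1.0 if c else -1.0                 # bit 1 -> +1, bit 0 -> -1
        soft.append(quantize4(x + rng.gauss(0.0, sigma)))
    return soft
```

Passing `rng=random.Random(seed)` makes a run repeatable.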

Verified, not asserted

Every decoded bit is checked against a reference model and a cycle-accurate model before it is checked in hardware. The clock and resource figures here are post-route Vivado results on the stated devices, not estimates. The full RTL micro-architecture and the measured trade space are in the technical report.
