Design retrospective

A Wi-Fi receiver, generated from a model and split across three processors to fit a low-cost radio

An 802.11a receiver was generated as bit-exact hardware from an executable model, then deployed on a low-cost software-defined radio. Because the radio's FPGA is small, only the rate-critical synchronization front-end (packet detection, frequency-offset correction, the FFT, and a sample buffer) runs on the FPGA; channel estimation, equalization, soft demapping, and convolutional decoding run in software on the host. The work is split across three tiers: the FPGA fabric, the radio's embedded processor, and the host. End to end it recovers a standard waveform with zero bit errors, runs every data rate from BPSK to 64-QAM, and decodes an unbroken stream of packets over the air. This report is the honest engineering story: how the work was split across three processors, how it was brought up one rung at a time, and the two bugs that showed up only on real silicon, one of which no behavioral simulation could see.

Results at a glance

Fast, small, and verified

0 errors: bit-exact recovery (standard waveform, 38,400 bits). MEASURED
162.7 MHz: receive clock (place-and-route closed, Zynq-7010). MEASURED
8 / 8: data rates bit-exact (BPSK through 64-QAM). MEASURED
100%: continuous yield (every packet, 90 s run). MEASURED
3 tiers: split to fit the chip (FPGA, processor, host)

The central claim: generate the receiver from a model that is proven equal to the standard, split it across three tiers (the FPGA, the radio's embedded processor, and the host) so it fits a tiny low-cost part, and bring it up rung by rung, so the two bugs that survive to silicon stay findable, not fatal.

Result | Value | Where it was measured
Standard waveform recovered bit-for-bit | 0 errors / 38,400 bits | bit-level cross-validation MEASURED
Receive clock, Zynq-7010 (the radio's chip) | 162.7 MHz | place-and-route closure; 100 MHz on the board MEASURED
Receiver PHY, synthesized (would not fit the fabric with the radio IP) | 8,512 LUT / 8,233 FF | utilization report MEASURED
Radio interface IP vs device budget | 8,459 of 17,600 LUT | forces the sync-only split MEASURED
Delivered image over the air, every data rate | 100% bit-exact | retransmit + checksum, all 8 modes MEASURED
Raw per-packet over the air (a channel hit, not a host drop) | 99.88% | 831 / 832, 0 host stalls MEASURED
Three percentages, kept apart. The delivered image is 100% bit-exact on every rate (each fragment is sent many times and a checksum picks a clean copy). The 99.88% is the raw per-packet rate (831/832); the one miss was a real wireless-channel hit, logged with zero host stalls, so it was not a host drop, and the protocol made it up. Over a cable even the raw rate is 100% (1390/1390). The host data-path loss below (7.5-15.9% to 0.2-1.1%) is a different thing again: dropped sample blocks under load, not an RF error, and with retransmission it never reaches the delivered image.
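The "sent many times, checksum picks a clean copy" step can be sketched in a few lines. This is a minimal illustration, not the report's protocol: the framing (a payload paired with a checksum field), the use of CRC-32, and the name `pick_clean_copy` are all assumptions made for the example.

```python
import zlib

def pick_clean_copy(copies):
    """Return the first received copy whose checksum matches, or None.

    Each copy is a (payload, checksum_field) pair. CRC-32 stands in for
    whatever checksum the real protocol uses (an assumption).
    """
    for payload, checksum_field in copies:
        if zlib.crc32(payload) == checksum_field:
            return payload
    return None

fragment = b"image fragment 7"
crc = zlib.crc32(fragment)

# Three received copies of one fragment; the first took a channel hit.
received = [(b"image fragmenX 7", crc), (fragment, crc), (fragment, crc)]
recovered = pick_clean_copy(received)

raw_rate = 831 / 832  # the raw per-packet figure quoted above
print(recovered == fragment, f"{raw_rate:.2%}")  # True 99.88%
```

A single damaged copy lowers the raw per-packet rate but not the delivered result, as long as at least one copy of every fragment passes its checksum.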

Those are the numbers. Landing them meant fitting the receiver onto a chip that is too small to hold the full design.

The approach

One design, split across three processors

The radio's chip pairs a small FPGA (the programmable logic, PL) with an embedded ARM processor running Linux (the processing system, PS), and the host computer sits on the other end of a USB link. The full receiver plus the radio's own interface logic does not fit the FPGA: the receiver PHY alone takes 8,512 LUTs and the radio interface another 8,459, against 17,600 on the chip. So the design was partitioned across the three tiers. The rate-critical streaming front-end (packet detection, frequency-offset correction, the FFT, and a sample buffer) stays in the FPGA fabric, where wide parallel hardware runs at the sample rate. The embedded processor moves the data: the DMA engine, the radio driver, and the USB bridge to the host. The heavier, data-dependent back end (channel estimation, equalization, soft demapping, and the decoder) runs in software on the host. The FPGA hands the processor frequency-domain symbols over DMA, and the processor streams them to the host over USB; both ends run the same reference model, so the split stays bit-exact end to end.
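The budget arithmetic behind that decision is easy to check from the reported LUT counts. The sketch below only adds up the figures in the results table; the report does not itemize what else shares the fabric, so the headroom number is the raw difference and nothing more.

```python
# LUT counts taken from the results table.
DEVICE_LUTS = 17_600
RECEIVER_PHY_LUTS = 8_512
RADIO_IP_LUTS = 8_459

combined = RECEIVER_PHY_LUTS + RADIO_IP_LUTS
headroom = DEVICE_LUTS - combined
utilization = combined / DEVICE_LUTS

print(combined, headroom, f"{utilization:.1%}")  # 16971 629 96.4%
```

The two blocks alone would take about 96% of the device's LUTs, which leaves almost nothing for any other logic or for place-and-route to work with.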

[Figure: block diagram of the three tiers. FPGA fabric (PL): sync front-end (detect, offset, timing, sample ring + FFT). Embedded processor (PS): data movement (DMA, radio driver, USB bridge). Host CPU (over USB): receiver back end (channel estimate, equalize, soft demap, decode). PL connects to PS over DMA, PS to host over USB. Annotation: the whole receiver will not fit; only the sync front-end is in the FPGA, the rest runs on the host.]
The receiver split across three tiers: the rate-critical sync front-end in the FPGA, data movement on the embedded processor, the flexible back end on the host. The same reference model validates the cut, so it stays bit-exact end to end.

A split that fits is only useful if you can bring it up and trust it. That started one rung at a time.

The approach

Brought up one rung at a time

A higher-level failure is impossible to debug until the levels beneath it are proven, so the system was built up as a ladder, each rung a hardware gate that had to pass before the next was attempted. The lowest rung proves the entire toolchain with no design logic at all; the middle rung proves the plumbing that carries live samples; only then does the receiver itself go on. Each rung was rebuilt and re-checked on the actual radio.

[Figure: the bring-up ladder. 1. Toolchain proof: read back a known ramp. 2. Plumbing: live samples, clocks, buffers. 3. The receiver: detect, decode, deliver. Each rung is marked PASS on silicon.]
The bring-up ladder. Each rung is a separate build flashed to the radio and checked on hardware; the toolchain and the plumbing are proven before the receiver goes on.
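The first rung's check, reading back a known ramp, amounts to comparing a captured buffer against a counter. A minimal sketch follows; the 16-bit counter width, the start value, and the function name `ramp_errors` are assumptions for illustration, not details from the report.

```python
def ramp_errors(samples, start=0, step=1, modulus=1 << 16):
    """Count positions where a read-back buffer departs from the expected ramp."""
    errors = 0
    for i, value in enumerate(samples):
        expected = (start + i * step) % modulus
        if value != expected:
            errors += 1
    return errors

# A clean capture that wraps once, and the same capture with one corrupted word.
good = [i % (1 << 16) for i in range(70_000)]
bad = list(good)
bad[1234] = 0

print(ramp_errors(good), ramp_errors(bad))  # 0 1
```

Because the expected data is fully known, any nonzero count points at the toolchain or the data path, never at receiver logic, which is what makes this a useful lowest rung.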
The hard part

The bug no behavioral simulation could see

Over a cable the receiver decoded only 2-3% of packets, while the same design passed every behavioral simulation with zero errors. The cause was a bias in the carrier-frequency-offset estimate that the source design does not have.

The estimator runs a sliding correlation: a running sum that each cycle adds the newest product and subtracts the oldest (from a few dozen samples back). To get the oldest, it stored the recent products in a small cache and read one out each cycle with a moving pointer, a runtime-computed address into the cache. That addressed read is where the two simulations diverge. In the source, reading the pointed-at slot is an exact, instant array lookup, so behavioral simulation is always right. On the chip, "read whichever slot the pointer names" must be built as a real selection circuit (a wide multiplexer plus address decode, with timing for when the pointer settles and the read latches); the synthesis tool builds and times that its own way, and the read it built does not return the same value as the source's lookup, so the subtracted "oldest" is wrong, the running sum drifts, and the estimate is biased. Behavioral simulation runs the ideal source array and never sees it; only a simulation of the synthesized netlist reproduces it: 218 mismatched cycles before the fix, 0 after.
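The running sum described above can be written out directly. This is a behavioral sketch only: the window length is a parameter (the report says only "a few dozen samples"), integers stand in for the complex correlation products, and `sliding_sum` is a name chosen for the example.

```python
from collections import deque

def sliding_sum(products, window):
    """Running sum over the last `window` products: add newest, subtract oldest."""
    # Fixed-length, zero-initialized delay line holding the recent products.
    delay = deque([0] * window, maxlen=window)
    total = 0
    out = []
    for p in products:
        oldest = delay[0]   # the product from `window` cycles ago
        delay.append(p)     # shifts the line by one and drops `oldest`
        total += p - oldest
        out.append(total)
    return out

print(sliding_sum([1, 2, 3, 4, 5], window=3))  # [1, 3, 6, 9, 12]
```

The sum is only as good as the "oldest" value: if the delayed product is ever wrong, the error is never subtracted back out and stays in `total`, which is how one bad read turns into a persistent bias.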

Confirmed, not guessed. Reruns ruled out a lost zero-init (the netlist memory is zero-initialized), the DSP (disabling it reproduces the exact bias), and a same-slot read-while-writing clash (separating the read and write addresses still diverges). The only thing that makes the netlist match the source is removing the address entirely.
[Figure: before and after. Before: an addressed memory read on the feedback path into the offset estimator; a stale value feeds back and biases the estimate. After: a fixed-length shift-register delay (fixed tap) into the offset estimator; a deterministic delay that synthesizes correctly.] Netlist simulation: 218 mismatched cycles before versus 0 after. Wired-link yield rose from 2-3% to about 96%.
The fix: replace the addressed memory on the feedback path with a fixed-length shift register, which the synthesis tool maps deterministically. The netlist simulation that found it now passes with zero mismatches.
The transferable rule. For a fixed delay use a fixed-length shift register, not a runtime-computed address into a memory: a fixed tap has only one possible structure, identical in source and netlist, so there is nothing for the tool to build differently. And this class of defect is invisible to behavioral simulation; check the synthesized netlist.
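The two delay structures can be put side by side in software, along with a cycle-by-cycle mismatch count of the kind used to compare traces. This is a sketch under assumptions: the function names and the read-then-overwrite pointer scheme are illustrative, not the report's source.

```python
def delay_addressed(samples, depth):
    """Delay by `depth` cycles using a memory and a moving pointer."""
    mem = [0] * depth
    ptr = 0
    out = []
    for s in samples:
        out.append(mem[ptr])     # read the slot the pointer names (the oldest)
        mem[ptr] = s             # overwrite it with the newest sample
        ptr = (ptr + 1) % depth  # runtime-computed address for the next cycle
    return out

def delay_shift_register(samples, depth):
    """Delay by `depth` cycles using a fixed-length shift register, one fixed tap."""
    reg = [0] * depth
    out = []
    for s in samples:
        out.append(reg[-1])      # fixed tap: always the last stage
        reg = [s] + reg[:-1]     # shift every stage by one
    return out

def mismatched_cycles(a, b):
    """Count cycles on which two traces disagree."""
    return sum(1 for x, y in zip(a, b) if x != y)

samples = list(range(1, 101))
print(mismatched_cycles(delay_addressed(samples, 32),
                        delay_shift_register(samples, 32)))  # 0
```

In software the two are indistinguishable, which mirrors why behavioral simulation never flagged the addressed version. The divergence reported here appeared only in the synthesized netlist, so a model like this cannot reproduce it; it can only show that the shift register is a drop-in replacement at the behavioral level.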
The hard part

Separating a radio problem from a plumbing problem

The second silicon-only failure was on the host side: captures came up short for no visible reason. The instinct is to blame the radio, but the real cause was the data path between the device and the host dropping sample blocks under load, with no error raised. The way to tell the difference is to capture the live stream and classify it, rather than guess.

What the captured stream shows | What it means | The fix
Whole packets that fail their checksum | radio or calibration | set the input level to a known-good reference
Truncated or out-of-sequence packets | host data-path overflow | drain in its own thread; size the buffer
No data at all, a refill timeout | a hard stall | reset the link, escalate to a replug
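The table translates naturally into a small classifier over a captured stream. The sketch below is hypothetical: the `Packet` fields, the `refill_timed_out` flag, and the order in which the cases are tested are assumptions, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    seq: int           # sequence number (hypothetical header field)
    complete: bool     # False when the packet arrived truncated
    checksum_ok: bool  # result of the packet's checksum test

FIX = {
    "hard stall": "reset the link, escalate to a replug",
    "host data-path overflow": "drain in its own thread; size the buffer",
    "radio or calibration": "set the input level to a known-good reference",
    "clean": "nothing to fix",
}

def classify(packets, refill_timed_out=False):
    """Map a captured stream to one of the causes in the table."""
    if refill_timed_out or not packets:
        return "hard stall"
    in_order = all(b.seq == a.seq + 1 for a, b in zip(packets, packets[1:]))
    if not in_order or any(not p.complete for p in packets):
        return "host data-path overflow"
    if any(not p.checksum_ok for p in packets):
        return "radio or calibration"
    return "clean"

capture = [Packet(0, True, True), Packet(2, True, True)]  # packet 1 is missing
cause = classify(capture)
print(cause, "->", FIX[cause])
```

Testing for gaps and truncation before checksums is a design choice in this sketch: a dropped block usually breaks the sequence as well as any checksum that spans it, so the data-path explanation is checked first.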

The overflow turned out to be dominated not by buffer size but by the consumer starving the thread that empties the link. Moving the drain into its own thread, and enlarging the buffer, took the measured loss from 7.5-15.9% down to 0.2-1.1%. It never mattered to the delivered image anyway: each fragment is sent many times, and a checksum picks a clean copy, which is exactly why the picture comes out whole.
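Decoupling the drain from the heavy work is a standard producer-consumer arrangement. A minimal sketch, assuming a hypothetical `read_block` callable that stands in for the real link read and a `process` callable that stands in for the receiver back end:

```python
import queue
import threading

def run_capture(read_block, n_blocks, process, maxsize=0):
    """Drain the link in its own thread; do the heavy work in the caller's thread."""
    buf = queue.Queue(maxsize=maxsize)  # maxsize=0 means unbounded (sketch only)

    def drain():
        for _ in range(n_blocks):
            buf.put(read_block())       # never waits on the consumer's processing
        buf.put(None)                   # sentinel: no more blocks

    worker = threading.Thread(target=drain, daemon=True)
    worker.start()

    results = []
    while True:
        block = buf.get()
        if block is None:
            break
        results.append(process(block))
    worker.join()
    return results

counter = iter(range(50))               # stand-in for blocks arriving from the radio
results = run_capture(lambda: next(counter), 50, lambda b: b * 2)
print(len(results), results[:3])        # 50 [0, 2, 4]
```

One difference from real hardware matters: here a full bounded queue would simply make `put` wait, whereas on a live link a drain that falls behind loses sample blocks. That is why the buffer still has to be sized for the worst burst even after the drain has its own thread.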

Reflection

The dead-ends, kept on purpose

The wins are more believable next to the attempts that did not work.

The memory that passed every test

The addressed buffer behind the offset bug passed behavioral simulation with zero errors for as long as it existed. Only simulating the synthesized result exposed it.

Blaming the buffer first

The host overflow looked like a too-small buffer. Enlarging it helped a little; the real lever was decoupling the drain from the heavy work, which a buffer change alone would never have revealed.

The correlator already at its floor

The preamble correlator was a tempting target for shrinking. Measured, it was already at the arithmetic floor (2,564 down to 1,583 logic cells with a fold), and further effort only traded one resource for another.

Significance

What generalizes

The receiver is one design, but the method is the point. Generating hardware from a model that is proven equal to the standard means correctness is built in, not chased afterward. Splitting a design across three tiers (the FPGA fabric, an embedded processor, and a host) at narrow, rate-matched interfaces is how a large function fits a small, cheap part. Bringing a system up rung by rung makes the inevitable silicon-only bugs findable. And the two bugs that did survive (a synthesis result that disagreed with its source, and a data path that dropped blocks silently) are not specific to radios; they recur anywhere a feedback path meets a synthesis tool, or a host captures a fast stream. Those are the parts worth carrying to the next design.