An 802.11a receiver was generated as bit-exact hardware from an executable model, then deployed on a low-cost software-defined radio. Because the radio's chip is small, only the rate-critical synchronization front-end (packet detection, frequency-offset correction, the transform, and a sample buffer) runs on the FPGA; channel estimation, equalization, soft demapping, and convolutional decoding run in software on the host. The work is split across three tiers: the FPGA fabric, the radio's embedded processor, and the host. End to end it recovers a standard waveform with zero bit errors, runs every data rate from BPSK to 64-QAM, and decodes an unbroken stream of packets over the air. This report is the honest engineering story: how the work was split across three processors, how it was brought up one rung at a time, and the two bugs that only showed up on real silicon, one of which no behavioral simulation could see.
| Result | Value | Where it was measured |
|---|---|---|
| Standard waveform recovered bit-for-bit | 0 errors / 38,400 bits | bit-level cross-validation (measured) |
| Receive clock, Zynq-7010 (the radio's chip) | 162.7 MHz | place-and-route closure; 100 MHz on the board (measured) |
| Receiver PHY, synthesized (would not fit the fabric with the radio IP) | 8,512 LUT / 8,233 FF | utilization report (measured) |
| Radio interface IP vs device budget | 8,459 of 17,600 LUT | forces the sync-only split (measured) |
| Delivered image over the air, every data rate | 100% bit-exact | retransmit + checksum, all 8 modes (measured) |
| Raw per-packet over the air (a channel hit, not a host drop) | 99.88% | 831 / 832, 0 host stalls (measured) |
Those are the numbers. Landing them meant fitting the receiver onto a chip that cannot hold the full design on its own.
The radio's chip pairs a small FPGA (the programmable logic, PL) with an embedded ARM processor running Linux (the processing system, PS), and the host computer sits on the other end of a USB link. The full receiver plus the radio's own interface logic does not fit the FPGA: the receiver PHY alone is about 8,500 LUTs and the radio interface about another 8,500, roughly 17,000 together against 17,600 on the chip, which leaves almost no margin. So the design was partitioned across the three tiers. The rate-critical streaming front-end (packet detection, frequency-offset correction, the transform, and a sample buffer) stays in the FPGA fabric, where wide parallel hardware runs at the sample rate. The embedded processor moves the data: the DMA engine, the radio driver, and the USB bridge to the host. The heavier, data-dependent back end (channel estimation, equalization, soft demapping, and the decoder) runs in software on the host. The FPGA hands the processor frequency-domain symbols over DMA, and the processor streams them to the host over USB; both ends run the same reference model, so the split stays bit-exact end to end.
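The budget arithmetic behind that split can be checked in a few lines. This is only a sketch using the LUT counts from the results table; the constant and function names are illustrative, and the script does not model routing, block RAM, or DSP usage.

```python
# LUT figures taken from the utilization numbers quoted in this report.
DEVICE_LUTS = 17_600           # Zynq-7010 programmable logic
RECEIVER_PHY_LUTS = 8_512      # full receiver PHY, synthesized
RADIO_INTERFACE_LUTS = 8_459   # the radio's own interface IP


def utilization(used: int, available: int = DEVICE_LUTS) -> float:
    """Return the fraction of the device's LUTs that `used` consumes."""
    return used / available


full_design = RECEIVER_PHY_LUTS + RADIO_INTERFACE_LUTS
left_for_receiver = DEVICE_LUTS - RADIO_INTERFACE_LUTS

print(f"radio interface alone:      {utilization(RADIO_INTERFACE_LUTS):.1%}")
print(f"full receiver + interface:  {full_design} LUTs, {utilization(full_design):.1%}")
print(f"LUTs left beside interface: {left_for_receiver}")
```

On these figures the combined design would occupy about 96% of the LUTs. The report states that it does not fit; the exact ceiling at which place-and-route fails is not given here, so the script only shows how little room the radio interface leaves.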
A split that fits is only useful if you can bring it up and trust it. That started one rung at a time.
A higher-level failure is impossible to debug until the levels beneath it are proven, so the system was built up as a ladder, each rung a hardware gate that had to pass before the next was attempted. The lowest rung proves the entire toolchain with no design logic at all; the middle rung proves the plumbing that carries live samples; only then does the receiver itself go on. Each rung was rebuilt and re-checked on the actual radio.
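The ladder can be pictured as an ordered list of gates where the first failure stops the climb. The sketch below is a hypothetical harness, not the project's actual bring-up scripts: the rung names paraphrase the text, and the lambda checks are stand-ins for whatever hardware test each rung really ran.

```python
from typing import Callable, List, Tuple

# A rung is a name plus a check that returns True when the gate passes.
Gate = Tuple[str, Callable[[], bool]]


def climb(ladder: List[Gate]) -> List[str]:
    """Run gates in order and stop at the first failure.

    Returns the names of the rungs that passed, so a failure is always
    localized to the first rung that is not in the list.
    """
    passed = []
    for name, check in ladder:
        if not check():
            print(f"FAIL at rung: {name}")
            break
        print(f"pass: {name}")
        passed.append(name)
    return passed


# Hypothetical checks: the third rung is made to fail for illustration.
ladder = [
    ("toolchain only, no design logic", lambda: True),
    ("plumbing that carries live samples", lambda: True),
    ("receiver front-end", lambda: False),
]
passed = climb(ladder)
```

Because nothing above a failing rung is attempted, a failure points at the newest layer rather than at everything beneath it.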
Over a cable the receiver decoded only a few percent of packets, while the same design passed every behavioral simulation with zero errors. The cause was a biased carrier-frequency-offset estimate that did not exist in the source design.
The estimator runs a sliding correlation: a running sum that each cycle adds the newest product and subtracts the oldest (from a few dozen samples back). To get the oldest, it stored the recent products in a small cache and read one out each cycle with a moving pointer, a runtime-computed address into the cache. That addressed read is where the source and the synthesized hardware diverge. In the source, reading the pointed-at slot is an exact, instant array lookup, so behavioral simulation is always right. On the chip, "read whichever slot the pointer names" must be built as a real selection circuit: a wide multiplexer plus address decode, with timing for when the pointer settles and the read latches. The synthesis tool builds and times that circuit its own way, and the read it built did not return the same value as the source's lookup. The subtracted "oldest" was therefore wrong, the running sum drifted, and the estimate came out biased. Behavioral simulation runs the ideal source array and never sees this; only a simulation of the synthesized netlist reproduces it: 218 mismatched cycles before the fix, 0 after.
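A small software model makes the mechanism concrete. The sketch below is an illustration under stated assumptions, not the generated hardware: the window length (32), the lag (16), and the test signal are assumed, and the `faulty_read` path is a purely hypothetical fault model (occasionally reading the neighbouring slot) standing in for whatever the synthesized read actually returned. It shows that a running sum never recovers from one wrong "oldest" value, and that comparing against a reference cycle by cycle is what exposes it.

```python
import cmath
import random

WINDOW = 32  # assumed window length ("a few dozen" products)
LAG = 16     # assumed correlation lag, in samples


def training_burst(n_samples, cfo, seed=1):
    """A LAG-periodic random sequence rotated by `cfo` cycles/sample, plus light noise."""
    rng = random.Random(seed)
    period = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(LAG)]
    return [
        period[n % LAG] * cmath.exp(2j * cmath.pi * cfo * n)
        + complex(rng.gauss(0, 0.05), rng.gauss(0, 0.05))
        for n in range(n_samples)
    ]


def sliding_correlation(samples, faulty_read=False):
    """Running sum: add the newest product, subtract the oldest from a circular cache."""
    cache = [0j] * WINDOW  # recent products
    ptr = 0                # moving pointer: slot holding the oldest product
    acc = 0j
    out = []
    for n in range(LAG, len(samples)):
        product = samples[n] * samples[n - LAG].conjugate()
        read_addr = ptr
        if faulty_read and n % 7 == 0:
            # HYPOTHETICAL fault model: the read returns the neighbouring slot.
            read_addr = (ptr + 1) % WINDOW
        acc += product - cache[read_addr]
        cache[ptr] = product
        ptr = (ptr + 1) % WINDOW
        out.append(acc)
    return out


def direct_correlation(samples):
    """Reference: recompute every window sum from scratch (no state to corrupt)."""
    products = [samples[n] * samples[n - LAG].conjugate() for n in range(LAG, len(samples))]
    return [sum(products[max(0, i - WINDOW + 1): i + 1]) for i in range(len(products))]


def count_mismatches(a, b, tol=1e-9):
    """Number of cycles on which two traces disagree by more than `tol`."""
    return sum(1 for x, y in zip(a, b) if abs(x - y) > tol)


def cfo_estimate(acc):
    """Frequency offset in cycles/sample from the angle of the correlation sum."""
    return cmath.phase(acc) / (2 * cmath.pi * LAG)


samples = training_burst(400, cfo=0.002)
reference = direct_correlation(samples)
clean = sliding_correlation(samples)
faulty = sliding_correlation(samples, faulty_read=True)

clean_mismatches = count_mismatches(clean, reference)
faulty_mismatches = count_mismatches(faulty, reference)
print("mismatched cycles, exact read: ", clean_mismatches)
print("mismatched cycles, faulty read:", faulty_mismatches)
print("CFO estimate, exact read: ", cfo_estimate(clean[-1]))
print("CFO estimate, faulty read:", cfo_estimate(faulty[-1]))
```

The design point is the comparison itself: the exact-read model plays the role of behavioral simulation and always agrees with the reference, so only a trace taken from the thing that was actually built can show a nonzero mismatch count.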
The second silicon-only failure was on the host side: captures came up short for no visible reason. The instinct is to blame the radio, but the real cause was the data path between the device and the host dropping sample blocks under load, with no error raised. The way to tell the difference is to capture the live stream and classify it, rather than guess.
| What the captured stream shows | What it means | The fix |
|---|---|---|
| Whole packets that fail their checksum | radio or calibration | set the input level to a known-good reference |
| Truncated or out-of-sequence packets | host data-path overflow | drain in its own thread; size the buffer |
| No data at all, a refill timeout | a hard stall | reset the link, escalate to a replug |
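The table reads naturally as a decision procedure. The sketch below is one hypothetical way to encode it: the `Packet` fields (a sequence number, a received length, an expected length, a checksum flag) are assumptions made for the example, not the capture format used in this work.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Diagnosis(Enum):
    CLEAN = "no fault seen"
    RADIO_OR_CALIBRATION = "set the input level to a known-good reference"
    HOST_OVERFLOW = "drain in its own thread; size the buffer"
    HARD_STALL = "reset the link, escalate to a replug"


@dataclass
class Packet:
    seq: int              # hypothetical per-packet sequence number
    length: int           # bytes actually captured
    expected_length: int  # bytes the header promised
    checksum_ok: bool


def classify(packets: List[Packet], refill_timed_out: bool) -> Diagnosis:
    """Map what a captured stream shows onto the three failure classes."""
    # No data at all, or a refill timeout: a hard stall.
    if refill_timed_out or not packets:
        return Diagnosis.HARD_STALL
    # Truncated or out-of-sequence packets: blocks were dropped on the host path.
    truncated = any(p.length < p.expected_length for p in packets)
    out_of_sequence = any(b.seq != a.seq + 1 for a, b in zip(packets, packets[1:]))
    if truncated or out_of_sequence:
        return Diagnosis.HOST_OVERFLOW
    # Whole packets that fail their checksum: look at the radio or calibration.
    if any(not p.checksum_ok for p in packets):
        return Diagnosis.RADIO_OR_CALIBRATION
    return Diagnosis.CLEAN


whole_but_bad = [Packet(0, 100, 100, True), Packet(1, 100, 100, False)]
gap_in_sequence = [Packet(0, 100, 100, True), Packet(2, 100, 100, True)]
print(classify(whole_but_bad, refill_timed_out=False).name)
print(classify(gap_in_sequence, refill_timed_out=False).name)
print(classify([], refill_timed_out=True).name)
```

Overflow is tested before checksum failure on purpose: a truncated packet will usually fail its checksum too, and blaming the radio for it is exactly the wrong instinct described above.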
The overflow turned out to be dominated not by buffer size but by the consumer starving the thread that empties the link. Separating the two, and enlarging the buffer, took the loss from a measured 7.5 to 15.9 percent down to 0.2 to 1.1 percent. It never mattered to the delivered image anyway: each part is sent many times, and a checksum picks a clean copy, which is exactly why the picture comes out whole.
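The reason residual loss never reaches the picture can be sketched in a few lines. This is a minimal illustration, assuming CRC-32 as the checksum and a made-up `(index, payload, crc)` framing; the report does not say which checksum or framing the real link uses.

```python
import zlib
from typing import Dict, Iterable, Optional, Tuple

Frame = Tuple[int, bytes, int]  # (part index, payload, checksum) -- assumed framing


def frame(part_index: int, payload: bytes) -> Frame:
    """Attach a CRC-32 to one part of the image."""
    return (part_index, payload, zlib.crc32(payload))


def reassemble(received: Iterable[Frame], n_parts: int) -> Optional[bytes]:
    """Keep the first copy of each part whose checksum verifies.

    Returns the whole image, or None if some part never arrived clean.
    """
    clean: Dict[int, bytes] = {}
    for index, payload, crc in received:
        if index not in clean and zlib.crc32(payload) == crc:
            clean[index] = payload
    if len(clean) < n_parts:
        return None
    return b"".join(clean[i] for i in range(n_parts))


parts = [b"row-0 pixels", b"row-1 pixels"]
good = [frame(i, p) for i, p in enumerate(parts)]
# One copy of part 0 is hit in the channel: the payload changes, the CRC does not.
damaged = (0, b"row-0 pixelz", good[0][2])
image = reassemble([damaged, good[1], good[0]], n_parts=2)
print(image)
```

A damaged or dropped copy costs nothing as long as one clean copy of every part arrives, which is why a sub-percent raw loss and a 100% delivered image are not in conflict.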
The wins are more believable next to the attempts that did not work.
The addressed buffer behind the offset bug passed behavioral simulation with zero errors for as long as it existed. Only simulating the synthesized result exposed it.
The host overflow looked like a too-small buffer. Enlarging it helped a little; the real lever was decoupling the drain from the heavy work, which a buffer change alone would never have revealed.
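Decoupling the drain from the heavy work is a standard producer/consumer arrangement. The sketch below shows the shape of it with Python's `threading` and `queue`; `read_block` and `process_block` are hypothetical callables standing in for the device read and the decode step, and the queue depth is an arbitrary example value.

```python
import queue
import threading
from typing import Callable, Tuple


def run_capture(
    read_block: Callable[[int], bytes],     # hypothetical: pulls one block off the link
    process_block: Callable[[bytes], None],  # hypothetical: the heavy decode work
    n_blocks: int,
    depth: int = 64,
) -> Tuple[int, int]:
    """Drain the link in its own thread; return (blocks processed, blocks dropped)."""
    buf: "queue.Queue" = queue.Queue(maxsize=depth)
    dropped = 0

    def drain() -> None:
        nonlocal dropped
        for i in range(n_blocks):
            block = read_block(i)  # never waits on the consumer
            try:
                buf.put_nowait(block)
            except queue.Full:
                dropped += 1       # count the loss instead of hiding it
        buf.put(None)              # sentinel: tells the consumer to stop

    drainer = threading.Thread(target=drain, daemon=True)
    drainer.start()

    processed = 0
    while True:
        block = buf.get()
        if block is None:
            break
        process_block(block)       # slow work happens off the drain thread
        processed += 1
    drainer.join()
    return processed, dropped


processed, dropped = run_capture(
    read_block=lambda i: bytes([i % 256]) * 16,
    process_block=lambda block: None,
    n_blocks=50,
)
print(f"processed={processed} dropped={dropped}")
```

The buffer depth only sets how long a stall the consumer can have before blocks are lost; the thread split is what keeps the link being emptied at all, which matches the observation that enlarging the buffer alone helped only a little.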
The preamble correlator was a tempting target for shrinking. A fold took it from 2,564 to 1,583 logic cells, and measurement showed that result was already at the arithmetic floor; further effort only traded one resource for another.
The receiver is one design, but the method is the point. Generating hardware from a model that is proven equal to the standard means correctness is built in, not chased afterward. Splitting a design across three tiers (the FPGA fabric, an embedded processor, and a host) at narrow, rate-matched interfaces is how a large function fits a small, cheap part. Bringing a system up rung by rung makes the inevitable silicon-only bugs findable. And the two bugs that did survive (a synthesis result that disagreed with its source, and a data path that dropped blocks silently) are not specific to radios; they recur anywhere a feedback path meets a synthesis tool, or a host captures a fast stream. Those are the parts worth carrying to the next design.