The principles that made the result reproducible, for the reader who wants the engineering underneath.
Aligned-by-construction, not patched after the fact
The hardest bugs in a streaming pipeline come from data and its control travelling on separate delay chains: a fault in one chain shows up cycles later at a block that is actually innocent. The discipline here is to carry data and its control together as one bundle down a single path, so a wrong value can only have come from that path. The overlap delay line is the only place signals are deliberately delayed, and it delays everything together.
Debug by comparing to the model, stage by stage
Because a cycle-accurate model exists with the same stage boundaries, any hardware bug is found by running both on the same input and comparing at every boundary: the first mismatch is the bug, and its size hints at the cause - about half the lanes wrong means a broken datapath, a few percent means a misalignment, a single lane means an index. This turns a multi-hour hunt into a few minutes.
Measure pipeline depth, do not compute it
Register placement at module boundaries makes hand-computed latencies wrong by a cycle or two - enough to silently corrupt an iterative decoder. The delay-line depth and every alignment constant are taken from a simulation that prints the actual arrival cycle, never from arithmetic on a block diagram.
Test with the hardest realistic input
All-zero and uniform inputs hide three whole classes of bug: a wrong rotation direction, an arithmetic overflow that never triggers, and a sign error that uniform parity masks. The test set deliberately includes high-dynamic-range channels and known slow-to-converge patterns so those classes cannot hide.
Saturate, never truncate
Narrowing a wide signed value by dropping high bits flips its sign on overflow - a strong negative belief becomes a strong positive one, and the decoder diverges. Every narrowing in the design clamps to the largest representable magnitude instead, in the model and the hardware identically. This is not a detail; it is the difference between a decoder that works and one that occasionally, unreproducibly, does not.
Correctness first (match the model exactly), then timing (meet the clock after place-and-route), then throughput (the real figure of merit), then resources. Every candidate change reported all four together, and a change was only kept if measurement - never prediction - confirmed it helped.