We processed 85 million market events across 4 database architectures. Now we're putting the order book on an FPGA — 4,500 instruments, always live, queryable in under a microsecond.
Every time you need an order book snapshot, software must replay all updates from storage.
| | Software Reconstruct | FPGA Query (target) |
|---|---|---|
| What happens | Replay 48,860 updates from storage | Read hardware registers via bus |
| Latency | ~200ms per instrument | <1µs |
| Query 50 stocks | 50 × 200ms = 10 seconds | 50 × 1µs = 50µs |
| Book always current? | No — stale when built | Yes — updated on every tick |
Data path:

- Feed input: binary protocol, SET ITCH format, encoded into a custom fixed-width protocol format
- Host link: PCIe Gen3 ×16, 12 GB/s bandwidth
- Device: UltraScale+ accelerator

Pipeline stages:

- Parser: binary to structured fields. 1 clock cycle per message.
- Order book engine: 4,500 books in registers. 10 levels × bid + ask. N/C/D shifting in parallel.
- Query interface: host reads any book via control bus. Full 10-level snapshot in <1µs.
- Analytics stream: streams best bid/ask/spread to host for analytics.
"The FPGA does one thing: keep all order books correct, always. Analytics, surveillance, and visualization run on the host — where software excels."
"The entire exchange order book state fits in 0.003% of available register memory."
Normal software stores price levels in an array. When a new level inserts at position 3, the CPU must shift every deeper level down one slot, one element per iteration; in the worst case, an insert at the top of a 10-level book, that is 10 sequential iterations, one after another.
On an FPGA, the HLS pragma ARRAY_PARTITION complete tells the compiler to map each array element to a physical flip-flop register. The "loop" becomes parallel wiring — all 10 levels shift simultaneously in a single clock cycle.
The result: what takes software 10 iterations takes hardware 1 cycle. Not 10x faster — fundamentally different.
10 iterations, one after another. Each shift depends on the previous. Total: 10 clock cycles minimum.
All 10 levels shift simultaneously in 1 cycle. Physical wiring replaces sequential logic. Total: 1 clock cycle.
Before building hardware, we established a rigorous software baseline. These are actual measured results from our High-Frequency Data Analytics study.
| Query | DB A (measured) | DB B (measured) | DB C (measured) | DB D (measured) |
|---|---|---|---|---|
| OHLCV 1-min | 7ms | 65ms | 33ms | 122ms |
| VWAP calculation | 7ms | 67ms | 22ms | 76ms |
| Spread analysis (55M rows) | 13ms | 172ms | 52ms | 106ms |
| LOB batch reconstruction | 845ms | 485ms | 267ms | 881ms |
"Software results are real measurements from our High-Frequency Data Analytics study. FPGA targets are design specifications. Hardware results will be published upon completion."
Design specifications for the hardware implementation. These targets will be validated during hardware integration testing.
| Metric | Software (best measured) | FPGA Target | Status |
|---|---|---|---|
| Single LOB update | 2µs (A) | <200ns | Coming Soon |
| Book query latency | 200ms (reconstruct) | <1µs | Coming Soon |
| Simultaneous books | 1 (per query) | 4,500 (all live) | Coming Soon |
| Feed processing | 244K updates/s | ~300M updates/s | Coming Soon |
| Jitter (p99/p50) | ~10× | ~1× (deterministic) | Coming Soon |
85M records parsed, 4 databases benchmarked, LOB validated against independent reference.
Custom fixed-width protocol designed. Encoder built. Test vectors generated and validated.
Parser, order book engine, query interface — all written in C++ with HLS pragmas.
Validate FPGA logic matches software golden reference. Bit-accurate functional verification.
Resource utilization analysis, timing closure at 300MHz target frequency.
DMA transfers + block design integration for the FPGA accelerator card.
Full trading day replay. All 4,500 instruments. Latency measurement and correctness verification.