Real-World Data Engineering Study

From Raw Data to
Real Intelligence.

85 million market events. 4 database architectures. 1 question — what does it take to turn high-volume data into actionable analytics?

End-to-end data engineering — from raw protocol parsing to production analytics. Not theory. Not slides. Working code, real data, measured results.

85M
Records
~4,500
Instruments
16
Message Types
<7ms
Query Latency
The Data

Real production data. Not synthetic. Not simulated.

SET ITCH tick data from the Stock Exchange of Thailand. Every order, every trade, every cancellation — captured with nanosecond precision.

85M
Total Records
22
Trading Days
16
Message Types
~4,500
Instruments
3.6M messages per trading day
55.4M order book updates
10.4M trade transactions
Nanosecond precision timestamps

"This study uses real production market data from the Stock Exchange of Thailand — not synthetic or simulated data. All benchmark results are from actual measurements on actual code."

The Architecture

End-to-end pipeline — from raw binary to production analytics.

Every component built, tested, and measured. From raw exchange feed to interactive dashboard — fully containerized, deployable with one command.

Raw Feed

Binary protocol
SET ITCH format

Protocol Parser

16 message types
Nanosecond decode

Columnar Storage

Parquet format
Compressed output

4x Database Loading

A / B / C / D
Same schema

Analytics Engine

Benchmarked queries
Surveillance scoring

Dashboard

Interactive UI
Real-time alerts

Protocol Parser · Columnar Storage · Docker · Python · Interactive Dashboard
# Pseudocode — analytics pipeline
for each database in [A, B, C, D]:
  load(trades, orders) into database
  for each query in benchmark_suite:
    latencies = [time(database.execute(query)) for run in 1..5]
    record(database, query, p50(latencies))
The Results — Trade Analytics

10.4 million trades. How fast can each architecture answer?

Six queries. Four databases. Same data, same machine, same schema. p50 latency over 5 runs.
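The measurement loop behind these numbers can be sketched in a few lines. This is an illustrative harness, not the study's benchmark code — `p50_latency_ms` and the dummy query are made-up names:

```python
import statistics
import time

def p50_latency_ms(run_query, runs=5):
    """Median (p50) wall-clock latency of a query, in milliseconds.

    `run_query` is any zero-argument callable that executes one query;
    here it is a stand-in for a real database call.
    """
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(latencies)

# Example with a dummy "query" that sleeps ~2 ms
latency = p50_latency_ms(lambda: time.sleep(0.002))
```

Taking the median rather than the mean keeps a single cold-cache or GC-pause outlier from distorting the result.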

A — Columnar In-Memory · B — Columnar On-Disk · C — Embedded Analytical · D — Traditional RDBMS
Trade Analytics — Query Latency (ms), lower is better
Query                | A     | B   | C   | D
OHLCV 1-min          | 7     | 65  | 33  | 122
Range scan           | 21    | 104 | 47  | 196
Market-wide agg      | 47    | 70  | 23  | 145
Top-N by volume      | 38    | 69  | 23  | 137
Surveillance scoring | 65    | 82  | 34  | 365
VWAP calculation     | 7     | 67  | 22  | 76

"A dominates single-instrument time-series queries — 7ms OHLCV means your traders see candles before competitors refresh their screen. C wins market-wide aggregations. Both under 100ms. D is 3-10x slower."
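The VWAP row is the easiest query to reason about: volume-weighted average price is sum(price × qty) / sum(qty) over the window. A minimal sketch — the `vwap` helper and its tuple input are illustrative, not the benchmarked SQL:

```python
def vwap(trades):
    """Volume-weighted average price: sum(price*qty) / sum(qty).

    `trades` is a list of (price, qty) tuples standing in for one
    instrument's trades in the query window.
    """
    notional = sum(price * qty for price, qty in trades)
    volume = sum(qty for _, qty in trades)
    return notional / volume if volume else None

# Three trades: 100 @ 34.50, 300 @ 34.75, 100 @ 35.00
print(vwap([(34.50, 100), (34.75, 300), (35.00, 100)]))  # 34.75
```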

The Results — Order Book

55.4 million updates. Same question, 5x more data.

Order book data is harder — more rows, more joins, more state. Here is where architectures reveal their limits.

Order Book Analytics — Query Latency (ms), lower is better
Query              | A     | B   | C   | D
Spread analysis    | 13    | 172 | 52  | 106
Best bid/ask range | 30    | 141 | 57  | 210
Market-wide L1     | 149   | 62  | 35  | 6,257
Top-N activity     | 181   | 80  | 45  | 690
Bid/Ask imbalance  | 1,449 | 516 | 170 | 11,283
Depth analysis     | 333   | 186 | 115 | 9,982

"Pattern holds at scale. A wins targeted queries — 13ms spread analysis means real-time order book monitoring. C wins broad scans. D collapses on 55M rows — 6 to 11 seconds per query, unusable for production surveillance."

The Hard Problem

Getting the order book right — where shortcuts fail.

Limit Order Book (LOB) reconstruction is not a query — it is a state machine. Every New, Change, and Delete message alters the book. New inserts and shifts levels down. Change updates price or quantity. Delete removes and shifts levels up. Getting it wrong means your analytics are wrong.

What LOB reconstruction requires

  • Process each message incrementally — New inserts, Change updates, Delete removes
  • Maintain price-level aggregation — quantities shift as orders arrive and leave
  • Handle level displacement — when a level empties, deeper levels move up
  • Track state across the full session — no shortcuts, no approximations
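The rules above can be sketched as a small state machine. The class below is a deliberately simplified one-sided (bid) book keyed by price level — real ITCH reconstruction tracks individual order IDs, so every name here is illustrative:

```python
class PriceLevelBook:
    """Minimal one-sided (bid) price-level book — a sketch of the state
    machine described above, not the study's implementation."""

    def __init__(self):
        self.levels = {}  # price -> aggregate quantity at that level

    def apply(self, msg_type, price, qty=0):
        if msg_type == "New":        # insert a level, or add quantity to it
            self.levels[price] = self.levels.get(price, 0) + qty
        elif msg_type == "Change":   # replace the level's aggregate quantity
            self.levels[price] = qty
        elif msg_type == "Delete":   # remove the level; deeper levels shift up
            self.levels.pop(price, None)

    def top(self, n=10):
        """Best n bid levels, highest price first — re-sorting makes level
        displacement automatic when a level empties."""
        return sorted(self.levels.items(), reverse=True)[:n]

book = PriceLevelBook()
book.apply("New", 34.75, 100)
book.apply("New", 34.50, 200)
book.apply("Change", 34.75, 150)  # in-place update at the best bid
book.apply("Delete", 34.75)       # 34.50 is displaced upward to best bid
```

Note that the final book depends on the full message sequence, not just the last message per level — which is exactly why the SQL shortcut below fails.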

Why SQL shortcuts fail

The tempting approach — SELECT last(qty) GROUP BY level — runs fast but returns wrong results. It captures the last state of each level without processing the incremental updates that define the correct book. Fast-and-wrong is worse than slow-and-right.

Batch Reconstruction — Correctness & Speed

Method                | Time    | Correct?
C fetch + processing  | 267 ms  | Yes
B fetch + processing  | 485 ms  | Yes
A native processing   | 845 ms  | Yes
D fetch + processing  | 881 ms  | Yes
SQL shortcut (any DB) | ~165 ms | WRONG

Real-Time Reconstruction

4.1 µs per update
244K updates/sec
~4,000 concurrent instruments
Reconstructed Order Book — 10 Bid/Ask Levels

"The database matters less than correctness. Fast-and-wrong is worse than slow-and-right. Results validated against independent reference implementation — 100% match on all comparable timestamps."

The Infrastructure

Storage, ingestion, and what you trade for speed.

Speed is not free. Every architecture makes trade-offs between storage, ingestion rate, and query latency.

Storage Footprint

Database | Size    | Notes
B        | 0.17 GB | Best compression (36x vs A)
C        | 1.24 GB | Columnar, embedded
D        | 1.33 GB | Row-based storage
A        | 6.03 GB | In-memory, fastest queries

Ingestion Speed (10.4M trades)

Database | Time  | Rate
B        | 43 s  | 243K rows/s
A        | 145 s | 72K rows/s
D        | 243 s | 43K rows/s

"B compresses 36x better than A and ingests 6x faster. A trades storage for speed — all data in RAM means sub-10ms queries."

Independent benchmark for educational purposes. 85 million records, 22 trading days, p50 latency over 5 runs, all databases containerized on the same machine. Results reflect our test environment and may vary.

From Data to Intelligence

214 alerts. 1 trading day. Real data.

This is what the pipeline produces — real-time surveillance scoring across all listed securities. One anomalous stock, six signals, one verdict.

Chart panels: Price & Volume — 25-Minute Window · Buy-Side Aggressor Ratio · Average Trade Size
Top alert: +28.6% return, 20.4× volume surge, 97.7% buy aggressor

This is a flagged anomaly detected by the scoring engine — not investment advice. Alert data is from real market activity.

Six Signals. One Verdict.

Abnormal Return

+28.6% in a single session — far exceeds normal volatility

Volume Surge

20.4x rolling average — concentrated buying pressure

Aggressor Dominance

97.7% buy-side aggressor — far above 50% neutral

Shrinking Trade Size

Average size drops — classic retail participation pattern

Spread Stability

Tight spread during run-up — artificial liquidity holding price

Price Reversal

Sharp reversal after peak — dump confirmed as buying collapses

Query Speed — Surveillance Scoring: Our Platform 65 ms vs Traditional DB 365 ms
# Pseudocode — multi-signal surveillance scoring
for each instrument in universe:
  window = trades.filter(instrument, time_range)
  ret = (last(price) / first(price) - 1) * 100
  vol_ratio = sum(qty) / avg(daily_vol)
  buy_pct = mean(aggressor == "buy")
  score = weighted_sum(ret, vol_ratio, buy_pct)
  if score > threshold: alert(instrument)
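The same scoring loop can be made concrete in Python. Field names, weights, and the toy trades below are illustrative, not the production configuration:

```python
def score_instrument(trades, avg_daily_vol, weights=(1.0, 1.0, 100.0)):
    """Toy multi-signal score mirroring the pseudocode above.

    `trades` is a chronological list of dicts; all field names and
    weights here are illustrative assumptions.
    """
    ret = (trades[-1]["price"] / trades[0]["price"] - 1) * 100   # session return, %
    vol_ratio = sum(t["qty"] for t in trades) / avg_daily_vol    # volume surge
    buy_pct = sum(t["aggressor"] == "buy" for t in trades) / len(trades)
    w_ret, w_vol, w_buy = weights
    return w_ret * ret + w_vol * vol_ratio + w_buy * buy_pct

trades = [
    {"price": 10.0, "qty": 1000, "aggressor": "buy"},
    {"price": 11.5, "qty": 4000, "aggressor": "buy"},
    {"price": 12.9, "qty": 5000, "aggressor": "sell"},
]
score = score_instrument(trades, avg_daily_vol=2000)
# ret = 29.0, vol_ratio = 5.0, buy_pct ≈ 0.667
```

Weighting the signals rather than hard-coding thresholds per signal lets one score rank every instrument in the universe, with a single cutoff deciding what becomes an alert.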
Applications

Any Industry. Any Event Stream.

Capital markets is just the demo. The same platform analyzes any high-frequency event data — wherever milliseconds matter.

Capital Markets

  • Real-time market surveillance across all listed securities
  • Manipulation detection — pump & dump, spoofing, wash trading
  • Best execution & transaction cost analysis
  • Regulatory compliance reporting on tick-level data

Energy & Utilities

  • Millions of smart meter readings per hour
  • Grid sensor streams at sub-second intervals
  • Load anomaly detection & demand spike forecasting
  • Outage cascade identification before propagation

Telecommunications

  • Billions of CDRs and packet events per day
  • Real-time DDoS pattern detection
  • Service degradation identification & root cause
  • Capacity allocation optimization

Manufacturing & IoT

  • Thousands of sensors streaming at microsecond intervals
  • Predictive maintenance — detect failure before it happens
  • Multi-sensor signal correlation for root cause analysis
  • Real-time quality control on production lines

Transportation & Logistics

  • GPS ticks from thousands of vehicles, every second
  • Real-time route optimization & ETA prediction
  • Delivery anomaly & geofence violation detection
  • Driver behavior pattern analysis & fleet utilization

Healthcare & Life Sciences

  • ICU waveform data at millisecond intervals
  • Real-time cardiac anomaly & sepsis onset detection
  • Clinical trial event stream analytics
  • Pharma manufacturing quality control

The pattern is always the same: massive volume, nanosecond precision, real-time decisions.

Our Approach

How We Work

Platform-agnostic means we benchmark objectively, recommend honestly, and deliver the right solution — not the most expensive one.

This study is the output of steps 1–4. The page you're reading is step 5.

1

Assess

"What data do you have?"

Audit your data landscape, volumes, latency requirements, and existing infrastructure.

2

Evaluate

"What architecture fits?"

Benchmark candidate databases on your actual data. No assumptions — measured results.

3

Build

"From pipeline to dashboard"

Build the end-to-end pipeline — ingestion, storage, analytics, visualization. Fully containerized.

4

Deploy

"One command to production"

Deploy on-premise, cloud, or hybrid. Docker-based, reproducible, and documented.

5

Optimize

"It works — now make it faster"

Monitor performance, tune queries, scale infrastructure. Continuous improvement.

This study uses licensed market data obtained through commercial agreement. Infozense is not affiliated with the Stock Exchange of Thailand. No market data is distributed through this website. This content is for educational and analytical purposes only and does not constitute investment advice.

Continue Reading

Next: The Dataset

16 message types. 85M records. What your exchange feed contains — and what you're not using.

Read Chapter 1 All Insights

contact@infozense.com  |  +66-82-242-4008  |  Bangkok, Thailand