Real-World Data Engineering Study

From Raw Data to
Real Intelligence.

85 million market events. 4 database architectures. 1 question — what does it take to turn high-volume data into actionable analytics?

End-to-end data engineering — from raw protocol parsing to production analytics. Not theory. Not slides. Working code, real data, measured results.

85M
Records
~4,500
Instruments
16
Message Types
<7ms
Query Latency
The Data

Real production data. Not synthetic. Not simulated.

SET ITCH tick data from the Stock Exchange of Thailand. Every order, every trade, every cancellation — captured with nanosecond precision.

85M
Total Records
22
Trading Days
16
Message Types
~4,500
Instruments
3.6M messages per trading day
55.4M order book updates
10.4M trade transactions
Nanosecond precision timestamps

"This study uses real production market data from the Stock Exchange of Thailand — not synthetic or simulated data. All benchmark results are from actual measurements on actual code."

The Architecture

End-to-end pipeline — from raw binary to production analytics.

Every component built, tested, and measured. From raw exchange feed to interactive dashboard — fully containerized, deployable with one command.

Raw Feed

Binary protocol
SET ITCH format

Protocol Parser

16 message types
Nanosecond decode

Columnar Storage

Parquet format
Compressed output

4x Database Loading

A / B / C / D
Same schema

Analytics Engine

Benchmarked queries
Surveillance scoring

Dashboard

Interactive UI
Real-time alerts

Protocol Parser · Columnar Storage · Docker · Python · Interactive Dashboard
# Pseudocode — analytics pipeline
for each database in [A, B, C, D]:
  load(trades, orders) into database
  for each query in benchmark_suite:
    latencies = [time(database.execute(query)) for run in 1..5]
    record(database, query, p50(latencies))
The Results — Trade Analytics

10.4 million trades. How fast can each architecture answer?

Six queries. Four databases. Same data, same machine, same schema. p50 latency over 5 runs.
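The measurement loop behind these numbers can be sketched in a few lines. This is an illustrative harness, not the study's benchmark code — `p50_latency_ms` and the dummy query are made-up names:

```python
import statistics
import time

def p50_latency_ms(run_query, runs=5):
    """Median (p50) wall-clock latency of a query, in milliseconds.

    `run_query` is any zero-argument callable that executes one query;
    here it is a stand-in for a real database call.
    """
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(latencies)

# Example with a dummy "query" that sleeps ~2 ms
latency = p50_latency_ms(lambda: time.sleep(0.002))
```

Taking the median rather than the mean keeps a single cold-cache or GC-pause outlier from distorting the result.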

A — Columnar In-Memory · B — Columnar On-Disk · C — Embedded Analytical · D — Traditional RDBMS
Trade Analytics — Query Latency (ms), lower is better
Query                | A     | B   | C   | D
OHLCV 1-min          | 7     | 65  | 33  | 122
Range scan           | 21    | 104 | 47  | 196
Market-wide agg      | 47    | 70  | 23  | 145
Top-N by volume      | 38    | 69  | 23  | 137
Surveillance scoring | 65    | 82  | 34  | 365
VWAP calculation     | 7     | 67  | 22  | 76

"A dominates single-instrument time-series queries — 7ms OHLCV means your traders see candles before competitors refresh their screen. C wins market-wide aggregations. Both under 100ms. D is 3-10x slower."
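The VWAP row is the easiest query to reason about: volume-weighted average price is sum(price × qty) / sum(qty) over the window. A minimal sketch — the `vwap` helper and its tuple input are illustrative, not the benchmarked SQL:

```python
def vwap(trades):
    """Volume-weighted average price: sum(price*qty) / sum(qty).

    `trades` is a list of (price, qty) tuples standing in for one
    instrument's trades in the query window.
    """
    notional = sum(price * qty for price, qty in trades)
    volume = sum(qty for _, qty in trades)
    return notional / volume if volume else None

# Three trades: 100 @ 34.50, 300 @ 34.75, 100 @ 35.00
print(vwap([(34.50, 100), (34.75, 300), (35.00, 100)]))  # 34.75
```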

The Results — Order Book

55.4 million updates. Same question, 5x more data.

Order book data is harder — more rows, more joins, more state. Here is where architectures reveal their limits.

Order Book Analytics — Query Latency (ms), lower is better
Query              | A     | B   | C   | D
Spread analysis    | 13    | 172 | 52  | 106
Best bid/ask range | 30    | 141 | 57  | 210
Market-wide L1     | 149   | 62  | 35  | 6,257
Top-N activity     | 181   | 80  | 45  | 690
Bid/Ask imbalance  | 1,449 | 516 | 170 | 11,283
Depth analysis     | 333   | 186 | 115 | 9,982

"Pattern holds at scale. A wins targeted queries — 13ms spread analysis means real-time order book monitoring. C wins broad scans. D collapses on 55M rows — 6 to 11 seconds per query, unusable for production surveillance."

The Hard Problem

Getting the order book right — where shortcuts fail.

Limit Order Book (LOB) reconstruction is not a query — it is a state machine. Every New, Change, and Delete message alters the book. New inserts and shifts levels down. Change updates price or quantity. Delete removes and shifts levels up. Getting it wrong means your analytics are wrong.

What LOB reconstruction requires

  • Process each message incrementally — New inserts, Change updates, Delete removes
  • Maintain price-level aggregation — quantities shift as orders arrive and leave
  • Handle level displacement — when a level empties, deeper levels move up
  • Track state across the full session — no shortcuts, no approximations
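The rules above can be sketched as a small state machine. The class below is a deliberately simplified one-sided (bid) book keyed by price level — real ITCH reconstruction tracks individual order IDs, so every name here is illustrative:

```python
class PriceLevelBook:
    """Minimal one-sided (bid) price-level book — a sketch of the state
    machine described above, not the study's implementation."""

    def __init__(self):
        self.levels = {}  # price -> aggregate quantity at that level

    def apply(self, msg_type, price, qty=0):
        if msg_type == "New":        # insert a level, or add quantity to it
            self.levels[price] = self.levels.get(price, 0) + qty
        elif msg_type == "Change":   # replace the level's aggregate quantity
            self.levels[price] = qty
        elif msg_type == "Delete":   # remove the level; deeper levels shift up
            self.levels.pop(price, None)

    def top(self, n=10):
        """Best n bid levels, highest price first — re-sorting makes level
        displacement automatic when a level empties."""
        return sorted(self.levels.items(), reverse=True)[:n]

book = PriceLevelBook()
book.apply("New", 34.75, 100)
book.apply("New", 34.50, 200)
book.apply("Change", 34.75, 150)  # in-place update at the best bid
book.apply("Delete", 34.75)       # 34.50 is displaced upward to best bid
```

Note that the final book depends on the full message sequence, not just the last message per level — which is exactly why the SQL shortcut below fails.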

Why SQL shortcuts fail

The tempting approach — SELECT last(qty) GROUP BY level — runs fast but returns wrong results. It captures the last state of each level without processing the incremental updates that define the correct book. Fast-and-wrong is worse than slow-and-right.

Batch Reconstruction — Correctness & Speed

Method                | Time    | Correct?
C fetch + processing  | 267 ms  | Yes
B fetch + processing  | 485 ms  | Yes
A native processing   | 845 ms  | Yes
D fetch + processing  | 881 ms  | Yes
SQL shortcut (any DB) | ~165 ms | WRONG

Real-Time Reconstruction

4.1 µs per update
244K updates/sec
~4,000 concurrent instruments
Reconstructed Order Book — 10 Bid/Ask Levels

"The database matters less than correctness. Fast-and-wrong is worse than slow-and-right. Results validated against independent reference implementation — 100% match on all comparable timestamps."

The Infrastructure

Storage, ingestion, and what you trade for speed.

Speed is not free. Every architecture makes trade-offs between storage, ingestion rate, and query latency.

Storage Footprint

Database | Size    | Notes
B        | 0.17 GB | Best compression (36x vs A)
C        | 1.24 GB | Columnar, embedded
D        | 1.33 GB | Row-based storage
A        | 6.03 GB | In-memory, fastest queries

Ingestion Speed (10.4M trades)

Database | Time  | Rate
B        | 43 s  | 243K rows/s
A        | 145 s | 72K rows/s
D        | 243 s | 43K rows/s

"B compresses 36x better than A and ingests 6x faster. A trades storage for speed — all data in RAM means sub-10ms queries."

Independent benchmark for educational purposes. 85 million records, 22 trading days, p50 latency over 5 runs, all databases containerized on the same machine. Results reflect our test environment and may vary.

From Data to Intelligence

214 alerts. 1 trading day. Real data.

This is what the pipeline produces — real-time surveillance scoring across all listed securities. One anomalous stock, six signals, one verdict.

Chart panels: Price & Volume — 25-Minute Window · Buy-Side Aggressor Ratio · Average Trade Size
Top alert: +28.6% return, 20.4× volume surge, 97.7% buy aggressor

This is a flagged anomaly detected by the scoring engine — not investment advice. Alert data is from real market activity.

Six Signals. One Verdict.

Abnormal Return

+28.6% in a single session — far exceeds normal volatility

Volume Surge

20.4x rolling average — concentrated buying pressure

Aggressor Dominance

97.7% buy-side aggressor — far above 50% neutral

Shrinking Trade Size

Average size drops — classic retail participation pattern

Spread Stability

Tight spread during run-up — artificial liquidity holding price

Price Reversal

Sharp reversal after peak — dump confirmed as buying collapses

Query Speed — Surveillance Scoring: Our Platform 65 ms vs Traditional DB 365 ms
# Pseudocode — multi-signal surveillance scoring
for each instrument in universe:
  window = trades.filter(instrument, time_range)
  ret = (last(price) / first(price) - 1) * 100
  vol_ratio = sum(qty) / avg(daily_vol)
  buy_pct = mean(aggressor == "buy")
  score = weighted_sum(ret, vol_ratio, buy_pct)
  if score > threshold: alert(instrument)
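The same scoring loop can be made concrete in Python. Field names, weights, and the toy trades below are illustrative, not the production configuration:

```python
def score_instrument(trades, avg_daily_vol, weights=(1.0, 1.0, 100.0)):
    """Toy multi-signal score mirroring the pseudocode above.

    `trades` is a chronological list of dicts; all field names and
    weights here are illustrative assumptions.
    """
    ret = (trades[-1]["price"] / trades[0]["price"] - 1) * 100   # session return, %
    vol_ratio = sum(t["qty"] for t in trades) / avg_daily_vol    # volume surge
    buy_pct = sum(t["aggressor"] == "buy" for t in trades) / len(trades)
    w_ret, w_vol, w_buy = weights
    return w_ret * ret + w_vol * vol_ratio + w_buy * buy_pct

trades = [
    {"price": 10.0, "qty": 1000, "aggressor": "buy"},
    {"price": 11.5, "qty": 4000, "aggressor": "buy"},
    {"price": 12.9, "qty": 5000, "aggressor": "sell"},
]
score = score_instrument(trades, avg_daily_vol=2000)
# ret = 29.0, vol_ratio = 5.0, buy_pct ≈ 0.667
```

Weighting the signals rather than hard-coding thresholds per signal lets one score rank every instrument in the universe, with a single cutoff deciding what becomes an alert.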
Applications

Any Industry. Any Event Stream.

Capital markets is just the demo. The same platform analyzes any high-frequency event data — wherever milliseconds matter.

Capital Markets

  • Real-time market surveillance across all listed securities
  • Manipulation detection — pump & dump, spoofing, wash trading
  • Best execution & transaction cost analysis
  • Regulatory compliance reporting on tick-level data

Energy & Utilities

  • Millions of smart meter readings per hour
  • Grid sensor streams at sub-second intervals
  • Load anomaly detection & demand spike forecasting
  • Outage cascade identification before propagation

Telecommunications

  • Billions of CDRs and packet events per day
  • Real-time DDoS pattern detection
  • Service degradation identification & root cause
  • Capacity allocation optimization

Manufacturing & IoT

  • Thousands of sensors streaming at microsecond intervals
  • Predictive maintenance — detect failure before it happens
  • Multi-sensor signal correlation for root cause analysis
  • Real-time quality control on production lines

Transportation & Logistics

  • GPS ticks from thousands of vehicles, every second
  • Real-time route optimization & ETA prediction
  • Delivery anomaly & geofence violation detection
  • Driver behavior pattern analysis & fleet utilization

Healthcare & Life Sciences

  • ICU waveform data at millisecond intervals
  • Real-time cardiac anomaly & sepsis onset detection
  • Clinical trial event stream analytics
  • Pharma manufacturing quality control

The pattern is always the same: massive volume, nanosecond precision, real-time decisions.

Our Approach

How We Work

Platform-agnostic means we benchmark objectively, recommend honestly, and deliver the right solution — not the most expensive one.

This study is the output of steps 1–4. The page you're reading is step 5.

1

Assess

"What data do you have?"

Audit your data landscape, volumes, latency requirements, and existing infrastructure.

2

Evaluate

"What architecture fits?"

Benchmark candidate databases on your actual data. No assumptions — measured results.

3

Build

"From pipeline to dashboard"

Build the end-to-end pipeline — ingestion, storage, analytics, visualization. Fully containerized.

4

Deploy

"One command to production"

Deploy on-premise, cloud, or hybrid. Docker-based, reproducible, and documented.

5

Optimize

"It works — now make it faster"

Monitor performance, tune queries, scale infrastructure. Continuous improvement.

This study uses licensed market data obtained through commercial agreement. Infozense is not affiliated with the Stock Exchange of Thailand. No market data is distributed through this website. This content is for educational and analytical purposes only and does not constitute investment advice.

Continue Reading

Next: The Dataset

16 message types. 85M records. What your exchange feed contains — and what you're not using.

Read Chapter 1 All Insights

contact@infozense.com  |  +66-82-242-4008  |  Bangkok, Thailand