An organization needed 500+ analysts to query 100TB+ of sensitive records. The rule: no one gets a copy. We built a platform where they can ask any question — and the data never moves.
A data owner holds over 100TB of sensitive records — transactions, demographics, service utilization. Hundreds of analysts need access for research and decision-making. But the data contains personally identifiable information. It cannot be copied, downloaded, or moved.
100TB replication = months of transfer, double storage cost, sync nightmares, and a governance violation on day one.
Giving users SQL access to production systems = security risk, performance impact, no query governance.
Email-based request → weeks of waiting → static CSV → outdated by arrival. Doesn't scale to 500 users.
Instead of copying 100TB, we built a platform that reads the data in place. The analytics engine connects to the data owner's storage via an external catalog — fetching only the columns and partitions needed for each query. The data never moves.
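The pruning step can be sketched in a few lines of pure Python. Everything here is a hypothetical illustration of what the engine does internally: the `Partition` metadata, the bucket paths, and the `plan_scan` helper are invented for this sketch, not the engine's real API.

```python
from dataclasses import dataclass

# Hypothetical partition metadata, as an external catalog would expose it.
@dataclass
class Partition:
    path: str
    region: str   # partition key
    year: int     # partition key

CATALOG = [
    Partition("s3://owner-bucket/tx/region=north/year=2023/", "north", 2023),
    Partition("s3://owner-bucket/tx/region=north/year=2024/", "north", 2024),
    Partition("s3://owner-bucket/tx/region=south/year=2024/", "south", 2024),
]

def plan_scan(catalog, *, region=None, year=None, columns=()):
    """Return only the partitions (and columns) a query actually needs.

    This mimics partition pruning plus column projection: partition keys
    come from the catalog, so filtered-out partitions are never fetched
    from the data owner's storage at all.
    """
    selected = [
        p.path for p in catalog
        if (region is None or p.region == region)
        and (year is None or p.year == year)
    ]
    return {"paths": selected, "columns": list(columns)}

plan = plan_scan(CATALOG, region="north", year=2024, columns=["amount", "date"])
# Only one of the three partitions is touched, and only two columns are read.
```

The key point: the filter is evaluated against catalog metadata, before any bytes leave the owner's storage.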
"The users get answers. The data owner keeps control. No bytes are copied. Every access is logged."
The architecture is platform-agnostic — deployable on any major cloud or on-premise infrastructure. The analytics engine performs federated queries over 100TB+ of external data without replication. And every approved user gets a secure data science workspace with notebook access.
[Architecture diagram: 500+ analysts reach a web portal behind a CDN, WAF, and load balancer; authenticate via MFA, digital ID, and SSO federation; land on an auto-scaling container platform; and query through a federated engine whose external catalog points at 100TB+ of owner-controlled columnar storage.]
Approved users get a secure, isolated notebook environment — no setup required. They write Python, query the analytics engine directly, and export results — all within the governed platform.
Interactive Python notebook per user. Pre-installed with pandas, numpy, scipy, scikit-learn, matplotlib. Runs in an isolated container — no internet access, no data exfiltration.
Each notebook connects to the analytics engine with the user's specific role. Only approved tables and columns are queryable. All queries logged to the audit trail.
Every export passes through k-anonymity validation. Aggregated results only — individual records cannot be extracted. Admin review for large exports.
Each user gets private storage for saved notebooks, query results, and exports. Quota-managed. Files accessible via the web portal.
The entire platform — including notebooks — deploys identically on any major cloud provider or on-premise Kubernetes cluster. Fully containerized.
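The role scoping and audit logging described above can be sketched as a thin guard in front of the analytics engine. All names here (`ROLE_COLUMNS`, `AUDIT_LOG`, `guarded_query`) are hypothetical, and real allowlists would come from the governance database rather than a dict; this is a sketch of the pattern, not the platform's implementation.

```python
import datetime

# Hypothetical per-role column allowlists.
ROLE_COLUMNS = {
    "analyst": {"transactions": {"amount", "date", "region"}},
    "approved_individual": {"transactions": {"amount", "date", "region", "person_id"}},
}

AUDIT_LOG = []

class AccessDenied(Exception):
    pass

def guarded_query(role, table, columns):
    """Allow a query only if every requested column is approved for the role,
    and record the attempt in the audit trail either way."""
    allowed = ROLE_COLUMNS.get(role, {}).get(table, set())
    ok = set(columns) <= allowed
    AUDIT_LOG.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "role": role, "table": table, "columns": list(columns), "allowed": ok,
    })
    if not ok:
        raise AccessDenied(f"{role} may not read {set(columns) - allowed} from {table}")
    return f"SELECT {', '.join(columns)} FROM {table}"

guarded_query("analyst", "transactions", ["amount", "date"])  # permitted
```

Note that denied attempts are logged too: the audit trail records what users tried to see, not just what they saw.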
The platform implements six architectural layers, each addressing a specific concern: how data enters, how it's cataloged, processed, stored, consumed, and protected.
Ingestion: On-demand via external catalog. No bulk copy.
Catalog: Schema registry, partition index, data dictionary.
Processing: Serverless ETL, pre-aggregation, daily refresh.
Storage: Source stays remote. Only ~50GB local cache.
Consumption: Web portal, query builder, notebook, API.
Security: 6-layer protection, RBAC, audit, encryption.
95% of queries never touch the source data. A 3-layer caching strategy serves most requests from memory or pre-computed views — only novel queries scan the remote storage.
Layer 1 (memory): In-memory key-value store. API responses, session data, recent results. <1ms latency. ~60% hit rate.
Layer 2 (views): Pre-computed aggregations by region, category, segment, time period. Auto-refreshed hourly. <50ms latency. ~30% hit rate.
Layer 3 (disk): Recently accessed column chunks cached on local disk. LRU eviction. <200ms latency. ~5% hit rate.
"Only ~5% of queries reach the source storage. And even those are optimized with partition pruning and predicate pushdown — scanning only the exact columns and date ranges needed."
Personal identifiers are protected at every level — from encryption at rest to application-level anonymization. No single point of failure in the security model.
1. Encryption at rest: AES-256 on all storage volumes, databases, and object stores via managed key service.
2. Encryption in transit: TLS 1.2+ on every connection — edge to app to database to storage. No exceptions.
3. Column exclusion: Personal ID columns excluded from all standard roles. Only approved users see individual-level data.
4. K-anonymity: Results with fewer than 11 individuals are suppressed or generalized. No re-identification risk.
5. Generalization: Exact age → age groups. Dates → quarters. Addresses → provinces. Precision reduced by design.
6. Access control: Role-based access. Multi-factor auth. Committee approval required for individual-level data.
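The k-anonymity and generalization layers are the most mechanical of the six, so they are easy to sketch. The function names below are hypothetical illustrations of the technique, not the platform's code; the k=11 threshold and the age/quarter buckets follow the rules stated above.

```python
import datetime

def generalize_age(age):
    """Bucket exact ages into 10-year groups (precision reduced by design)."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

def generalize_date(d):
    """Collapse an exact date to its quarter."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

def k_anonymize(groups, k=11):
    """Suppress any aggregate cell with fewer than k individuals.

    `groups` is a list of (group_key, count) aggregates; cells below the
    threshold are dropped so no small cell can re-identify a person.
    """
    return [(g, n) for g, n in groups if n >= k]

released = k_anonymize([("20-29", 340), ("30-39", 7), ("40-49", 95)])
# The 7-person cell is suppressed before export.
```

Generalization runs first (so counts accumulate into coarse buckets), and the k-threshold check runs last, on the aggregates that would actually leave the platform.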
Users who need individual-level data go through a governed approval workflow. Access is scoped, time-limited, and automatically revoked.
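A scoped, time-limited grant can be modeled as a small value object that expires on its own, so revocation needs no manual step. The `Grant` class and its fields are hypothetical; a real implementation would live in the governance database and be checked by the query guard.

```python
import datetime

class Grant:
    """A scoped, time-limited access grant that expires automatically."""

    def __init__(self, user, tables, days):
        self.user = user
        self.tables = frozenset(tables)   # scope: only these tables
        self.expires = (datetime.datetime.now(datetime.timezone.utc)
                        + datetime.timedelta(days=days))

    def permits(self, table, now=None):
        """True only while the grant is unexpired and the table is in scope."""
        now = now or datetime.datetime.now(datetime.timezone.utc)
        return table in self.tables and now < self.expires
```

Because expiry is evaluated at check time, an expired grant denies access even if no revocation job ever runs.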
Any organization that holds sensitive data and needs to share access without sharing copies can use this approach.
The question is always the same: how do you share access without sharing the data?
This isn't a database project. It's a data governance platform that happens to need a fast database underneath.
End-to-end Data, AI, and Automation consultancy — not a cloud reseller.
We evaluated managed and self-managed options. Chose the one that actually works at this scale.
Infrastructure, backend, frontend, security, governance, training — one team, end to end.