An organization needed 500+ analysts to query 100TB+ of sensitive records. The rule: no one gets a copy. We built a platform where they can ask any question — and the data never moves.
A data owner holds over 100TB of sensitive records — transactions, demographics, service utilization. Hundreds of analysts need access for research and decision-making. But the data contains personally identifiable information. It cannot be copied, downloaded, or moved.
100TB replication = months of transfer, double storage cost, sync nightmares, and a governance violation on day one.
Giving users SQL access to production systems = security risk, performance impact, no query governance.
Email-based request → weeks of waiting → static CSV → outdated by arrival. Doesn't scale to 500 users.
Instead of copying 100TB, we built a platform that reads the data in place. The analytics engine connects to the data owner's storage via an external catalog — fetching only the columns and partitions needed for each query. The data never moves.
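The pruning step can be sketched in a few lines of pure Python. Everything here is a hypothetical illustration of what the engine does internally: the `Partition` metadata, the bucket paths, and the `plan_scan` helper are invented for this sketch, not the engine's real API.

```python
from dataclasses import dataclass

# Hypothetical partition metadata, as an external catalog would expose it.
@dataclass
class Partition:
    path: str
    region: str   # partition key
    year: int     # partition key

CATALOG = [
    Partition("s3://owner-bucket/tx/region=north/year=2023/", "north", 2023),
    Partition("s3://owner-bucket/tx/region=north/year=2024/", "north", 2024),
    Partition("s3://owner-bucket/tx/region=south/year=2024/", "south", 2024),
]

def plan_scan(catalog, *, region=None, year=None, columns=()):
    """Return only the partitions (and columns) a query actually needs.

    This mimics partition pruning plus column projection: partition keys
    come from the catalog, so filtered-out partitions are never fetched
    from the data owner's storage at all.
    """
    selected = [
        p.path for p in catalog
        if (region is None or p.region == region)
        and (year is None or p.year == year)
    ]
    return {"paths": selected, "columns": list(columns)}

plan = plan_scan(CATALOG, region="north", year=2024, columns=["amount", "date"])
# Only one of the three partitions is touched, and only two columns are read.
```

The key point: the filter is evaluated against catalog metadata, before any bytes leave the owner's storage.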
"The users get answers. The data owner keeps control. No bytes are copied. Every access is logged."
The architecture is platform-agnostic — deployable on any major cloud or on-premise infrastructure. The analytics engine performs federated queries over 100TB+ of external data without replication. And every approved user gets a secure data science workspace with notebook access.
[Architecture diagram: 500+ analysts reach a web portal behind a CDN, WAF, and load balancer; authenticate via MFA, digital ID, and SSO federation; land on an auto-scaling container platform; and query through a federated engine whose external catalog points at 100TB+ of owner-controlled columnar storage.]
Approved users get a secure, isolated notebook environment — no setup required. They write Python, query the analytics engine directly, and export results — all within the governed platform.
Interactive Python notebook per user. Pre-installed with pandas, numpy, scipy, scikit-learn, matplotlib. Runs in an isolated container — no internet access, no data exfiltration.
Each notebook connects to the analytics engine with the user's specific role. Only approved tables and columns are queryable. All queries logged to the audit trail.
Every export passes through k-anonymity validation. Aggregated results only — individual records cannot be extracted. Admin review for large exports.
Each user gets private storage for saved notebooks, query results, and exports. Quota-managed. Files accessible via the web portal.
The entire platform — including notebooks — deploys identically on any major cloud provider or on-premise Kubernetes cluster. Fully containerized.
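The role scoping and audit logging described above can be sketched as a thin guard in front of the analytics engine. All names here (`ROLE_COLUMNS`, `AUDIT_LOG`, `guarded_query`) are hypothetical, and real allowlists would come from the governance database rather than a dict; this is a sketch of the pattern, not the platform's implementation.

```python
import datetime

# Hypothetical per-role column allowlists.
ROLE_COLUMNS = {
    "analyst": {"transactions": {"amount", "date", "region"}},
    "approved_individual": {"transactions": {"amount", "date", "region", "person_id"}},
}

AUDIT_LOG = []

class AccessDenied(Exception):
    pass

def guarded_query(role, table, columns):
    """Allow a query only if every requested column is approved for the role,
    and record the attempt in the audit trail either way."""
    allowed = ROLE_COLUMNS.get(role, {}).get(table, set())
    ok = set(columns) <= allowed
    AUDIT_LOG.append({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "role": role, "table": table, "columns": list(columns), "allowed": ok,
    })
    if not ok:
        raise AccessDenied(f"{role} may not read {set(columns) - allowed} from {table}")
    return f"SELECT {', '.join(columns)} FROM {table}"

guarded_query("analyst", "transactions", ["amount", "date"])  # permitted
```

Note that denied attempts are logged too: the audit trail records what users tried to see, not just what they saw.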
The platform implements six architectural layers, each addressing a specific concern: how data enters, how it's cataloged, processed, stored, consumed, and protected.
Ingestion: On-demand via external catalog. No bulk copy.
Catalog: Schema registry, partition index, data dictionary.
Processing: Serverless ETL, pre-aggregation, daily refresh.
Storage: Source stays remote. Only ~50GB local cache.
Consumption: Web portal, query builder, notebook, API.
Security: 6-layer protection, RBAC, audit, encryption.
95% of queries never touch the source data. A 3-layer caching strategy serves most requests from memory or pre-computed views — only novel queries scan the remote storage.
Layer 1 (memory): In-memory key-value store. API responses, session data, recent results. <1ms latency. ~60% hit rate.
Layer 2 (views): Pre-computed aggregations by region, category, segment, time period. Auto-refreshed hourly. <50ms latency. ~30% hit rate.
Layer 3 (disk): Recently accessed column chunks cached on local disk. LRU eviction. <200ms latency. ~5% hit rate.
"Only ~5% of queries reach the source storage. And even those are optimized with partition pruning and predicate pushdown — scanning only the exact columns and date ranges needed."
Personal identifiers are protected at every level — from encryption at rest to application-level anonymization. No single point of failure in the security model.
1. Encryption at rest: AES-256 on all storage volumes, databases, and object stores via managed key service.
2. Encryption in transit: TLS 1.2+ on every connection — edge to app to database to storage. No exceptions.
3. Column exclusion: Personal ID columns excluded from all standard roles. Only approved users see individual-level data.
4. K-anonymity: Results with fewer than 11 individuals are suppressed or generalized. No re-identification risk.
5. Generalization: Exact age → age groups. Dates → quarters. Addresses → provinces. Precision reduced by design.
6. Access control: Role-based access. Multi-factor auth. Committee approval required for individual-level data.
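The k-anonymity and generalization layers are the most mechanical of the six, so they are easy to sketch. The function names below are hypothetical illustrations of the technique, not the platform's code; the k=11 threshold and the age/quarter buckets follow the rules stated above.

```python
import datetime

def generalize_age(age):
    """Bucket exact ages into 10-year groups (precision reduced by design)."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

def generalize_date(d):
    """Collapse an exact date to its quarter."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

def k_anonymize(groups, k=11):
    """Suppress any aggregate cell with fewer than k individuals.

    `groups` is a list of (group_key, count) aggregates; cells below the
    threshold are dropped so no small cell can re-identify a person.
    """
    return [(g, n) for g, n in groups if n >= k]

released = k_anonymize([("20-29", 340), ("30-39", 7), ("40-49", 95)])
# The 7-person cell is suppressed before export.
```

Generalization runs first (so counts accumulate into coarse buckets), and the k-threshold check runs last, on the aggregates that would actually leave the platform.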
Users who need individual-level data go through a governed approval workflow. Access is scoped, time-limited, and automatically revoked.
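A scoped, time-limited grant can be modeled as a small value object that expires on its own, so revocation needs no manual step. The `Grant` class and its fields are hypothetical; a real implementation would live in the governance database and be checked by the query guard.

```python
import datetime

class Grant:
    """A scoped, time-limited access grant that expires automatically."""

    def __init__(self, user, tables, days):
        self.user = user
        self.tables = frozenset(tables)   # scope: only these tables
        self.expires = (datetime.datetime.now(datetime.timezone.utc)
                        + datetime.timedelta(days=days))

    def permits(self, table, now=None):
        """True only while the grant is unexpired and the table is in scope."""
        now = now or datetime.datetime.now(datetime.timezone.utc)
        return table in self.tables and now < self.expires
```

Because expiry is evaluated at check time, an expired grant denies access even if no revocation job ever runs.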
Any organization that holds sensitive data and needs to share access without sharing copies can use this approach.
The question is always the same: how do you share access without sharing the data?
This isn't a database project. It's a data governance platform that happens to need a fast database underneath.
End-to-end Data, AI, and Automation consultancy — not a cloud reseller.
We evaluated managed and self-managed options. Chose the one that actually works at this scale.
Infrastructure, backend, frontend, security, governance, training — one team, end to end.