ADR-001: Use DuckDB as the Analytical Warehouse¶
Date: 2026-03-14 Status: Accepted Deciders: Engineering
Context¶
SaaSGuard requires an analytical warehouse to persist ~5,500 customers, millions of usage events, and the dbt-built feature marts that feed ML training and inference. The warehouse must satisfy three constraints:
- Auditability: Feature engineering lives in dbt SQL. The warehouse must support dbt-core natively so that any risk tier surfaced in a dashboard is reproducible from SQL alone, without a Python runtime.
- Versionability: Model artifacts and training data must be co-versioned. Reproducing a historical model run requires restoring both the artifact and the data state.
- Operational simplicity: The system must run in a single-command Docker environment with no external services, cloud credentials, or connection management overhead.
Options evaluated: PostgreSQL, SQLite, Snowflake, DuckDB.
Decision¶
DuckDB — file-based columnar warehouse, .duckdb file mounted as a Docker volume,
tracked by DVC alongside model artifacts.
Rationale¶
| Criterion | DuckDB | PostgreSQL | Snowflake |
|---|---|---|---|
| Zero-ops setup | ✅ file-based | ❌ server process | ❌ cloud account required |
| OLAP performance | ✅ columnar, vectorised | ⚠️ row-store | ✅ columnar |
| dbt-core adapter | ✅ dbt-duckdb | ✅ dbt-postgres | ✅ dbt-snowflake |
| Superset support | ✅ duckdb-engine | ✅ | ✅ |
| Cost | ✅ no licensing | ✅ no licensing | ⚠️ consumption-based |
| DVC versionability | ✅ single file — dvc add |
❌ requires dump + restore | ❌ external state |
DuckDB's columnar execution engine handles the analytical query patterns (feature aggregation
CTEs, window functions, GROUP BY across millions of events) with sub-second latency on the
5,500-customer dataset. The single-file model means dvc repro retrains deterministically
from the exact data state that produced any historical model version.
Migration path: The dbt-duckdb → dbt-snowflake swap is a profiles.yml change.
No dbt model SQL requires modification. This is the documented upgrade path if write
concurrency or dataset scale requirements change.
Consequences¶
Positive:
- Entire warehouse is a single versioned file. dvc pull restores both data and model
artifacts to a known state without cloud credentials or network access.
- dbt run and feature extraction query the same physical file — no sync lag between
the feature mart and the inference path.
- dbt data contracts (schema.yml tests) run against the live warehouse on every CI run.
Negative:
- DuckDB does not support high-concurrency writes. This is acceptable because SaaSGuard
is an analytical system — reads dominate, and the one write path (dbt build) is a
scheduled batch process, not a transactional workload.
- File-based storage is not appropriate for multi-node deployments. At the point where
concurrent dbt + API read load requires horizontal scaling, the profiles.yml migration
to Snowflake or Redshift is the planned path.