Skip to content

Data Dictionary

All tables are generated synthetically via Faker with realistic inter-variable correlations. Source: data/raw/*.csv → loaded into DuckDB via src/infrastructure/db/build_warehouse.py.

customers

Primary entity. One row per customer. churn_date is NULL for active customers (right-censored for survival analysis).

Column Type Description
customer_id VARCHAR UUID primary key
industry VARCHAR e.g. FinTech, HealthTech, LegalTech, HR Tech
plan_tier VARCHAR starter | growth | enterprise
signup_date DATE Date of first contract
mrr DECIMAL Monthly Recurring Revenue (USD)
churn_date DATE / NULL NULL = still active (censored)
upgrade_date DATE / NULL Date customer upgraded to next plan tier. NULL = not yet upgraded. Added v0.9.0 for expansion propensity model.

Correlations baked in: enterprise tier churns less; starter tier with low feature_adoption_score churns within 90 days at elevated rate. Customers with upgrade_date IS NOT NULL have statistically higher premium_feature_trials_30d (Mann-Whitney U p < 0.05).


usage_events

One row per product interaction. High cardinality (~10M rows for 5k customers over 2 years).

Column Type Description
event_id VARCHAR UUID
customer_id VARCHAR FK → customers
timestamp TIMESTAMP UTC
event_type VARCHAR evidence_upload | monitoring_run | report_view | user_invite | integration_connect | api_call | premium_feature_trial
feature_adoption_score FLOAT 0–1 composite score at time of event

premium_feature_trial (added v0.9.0): customer accessed a feature available only in their next plan tier. This is the single strongest expansion signal (mean_abs_shap=3.94). Destiny-weighted — customers with upgrade_date IS NOT NULL are 4–6× more likely to generate this event type.

Correlations baked in: Declining event frequency in the 30 days before churn; integration_connect events are strong retention signals; premium_feature_trial frequency positively correlates with upgrade propensity.


gtm_opportunities

CRM-style table. One row per sales/expansion opportunity.

Column Type Description
opp_id VARCHAR UUID
customer_id VARCHAR FK → customers
stage VARCHAR prospecting | qualification | proposal | closed_won | closed_lost
close_date DATE Actual or expected close date
amount DECIMAL USD opportunity value
sales_owner VARCHAR Anonymised rep name
opportunity_type VARCHAR expansion | new_business. Added v0.9.0. expansion = existing customer upgrade; new_business = net-new logo.

support_tickets

Customer support interactions.

Column Type Description
ticket_id VARCHAR UUID
customer_id VARCHAR FK → customers
created_date DATE
priority VARCHAR low | medium | high | critical
resolution_time INTEGER Hours to resolution
topic VARCHAR compliance | integration | billing | onboarding | feature_request

Correlations baked in: Spike in high/critical tickets in 60 days before churn; integration and onboarding topics are leading churn indicators.


Derived Features — EDA Analysis

Computed in EDA notebooks and pre-modelling signal tests. These features are computed for all customers (churned + active) using direct queries against the raw schema. They extend mart_customer_churn_features (which covers active customers only) to support survival analysis and EDA.

Column Type Description
duration_days INTEGER Survival time: DATEDIFF('day', signup_date, churn_date) for churned; DATEDIFF('day', signup_date, DATE '2026-03-14') for active (right-censored)
event INTEGER Survival event indicator: 1 = customer churned, 0 = right-censored (still active at reference date)
integration_connects_first_30d INTEGER Count of integration_connect usage events in the first 30 days of the customer's tenure (timestamp <= signup_date + 30 days). Activation gate feature.
retention_signal_count INTEGER Count of evidence_upload, monitoring_run, and report_view events across the customer's full tenure. Proxy for deep product adoption.
events_last_30d INTEGER Usage events in the 30-day window before churn_date (churned) or reference date (active). Captures the decay signal.
integration_bucket VARCHAR Derived category: "0" / "1–2" / "3–5" / "6+" based on integration_connects_first_30d. Used in integration gate visualisations.

Correlation with churn label (point-biserial r, all p < 0.001):

Feature r Direction
events_last_30d −0.38 Negative (usage decay → churn)
avg_adoption_score −0.34 Negative (low adoption → churn)
retention_signal_count −0.32 Negative (deep adoption → retention)
high_priority_tickets +0.27 Positive (ticket spike → churn risk)
integration_connects_first_30d −0.24 Negative (integration → activation)

mart_customer_churn_features (dbt mart)

Pre-aggregated feature mart built by dbt. One row per active customer (WHERE is_active = TRUE). Queried at inference time by ChurnFeatureExtractor (~1ms per request). All 15 model features sourced here.

Column Type Description
customer_id VARCHAR FK → customers
mrr DECIMAL Monthly Recurring Revenue (USD)
tenure_days INTEGER Days from signup_date to reference date
plan_tier VARCHAR starter | growth | enterprise
industry VARCHAR fintech | healthtech | legaltech | proptech | saas
total_events INTEGER Lifetime usage event count
events_last_30d INTEGER Events in 30-day window before reference date
events_last_7d INTEGER Events in 7-day window before reference date
avg_adoption_score FLOAT Mean feature_adoption_score across all events
days_since_last_event INTEGER Days since most recent usage event
retention_signal_count INTEGER Count of high-value events: evidence_upload, monitoring_run, report_view
integration_connects_first_30d INTEGER integration_connect events in first 30 days (activation gate)
tickets_last_30d INTEGER Support tickets created in 30-day window
high_priority_tickets INTEGER Count of high or critical priority tickets
avg_resolution_hours FLOAT Mean hours to ticket resolution
is_early_stage BOOLEAN TRUE if tenure_days ≤ 90

Derived at runtime (not in mart):

Computed value Where Description
usage_decay_score DuckDBRiskSignalsRepository max(0, 1 - events_last_30d / events_prev_30d) — 0 = no decay, 1 = complete drop-off

risk_signals

Computed/external risk flags per customer. Updated periodically.

Column Type Description
customer_id VARCHAR FK → customers
compliance_gap_score FLOAT 0–1; higher = more compliance gaps detected
vendor_risk_flags INTEGER Count of third-party vendor risk alerts

mart_customer_expansion_features (dbt mart — v0.9.0)

Pre-aggregated expansion feature mart. One row per active non-upgraded customer (WHERE is_active = TRUE AND is_upgraded = FALSE). This is the inference population for the expansion propensity model. Queried by ExpansionFeatureExtractor (~1ms).

Scope note: Upgraded customers are excluded from this mart because they are no longer expansion candidates. Training uses a wider query that includes upgraded customers as positive labels.

Column Type Description
customer_id VARCHAR FK → customers
15 churn features (see above) Reused from mart_customer_churn_features via JOIN — no duplication
premium_feature_trials_30d INTEGER Count of premium_feature_trial events in last 30 days. Strongest expansion signal (mean
feature_request_tickets_90d INTEGER Count of support tickets with topic = 'feature_request' in last 90 days. Customer explicitly asking for above-tier capabilities.
has_open_expansion_opp BOOLEAN TRUE if an open GTM opportunity with opportunity_type = 'expansion' exists for this customer. Sales intent signal — use with caution (leakage guard applied).
expansion_opp_amount DECIMAL USD value of the open expansion opportunity (0 if none).
mrr_tier_ceiling_pct FLOAT (mrr − tier_floor) / (tier_ceiling − tier_floor), clamped [0, 1]. Measures how close the customer's MRR is to the top of their tier. High value (~1.0) = outgrown tier → ripe for upgrade conversation.