ADR-004: Model Drift Detection — Custom PSI + KS Test¶
Status: Accepted Date: 2026-03-16 Deciders: Joseph Wam Context: MLOps hardening — closing the "no drift detection" gap
Context¶
A model trained on historical data degrades silently when incoming customer feature distributions shift. Common causes in B2B SaaS: - A sales campaign acquires a new customer segment (MRR distribution shifts) - Product changes alter usage patterns (events_last_30d decays to near-zero) - Support workflow changes affect ticket volume and resolution time
Without drift detection, model performance degradation is only discovered after CS teams report bad recommendations — typically weeks after the issue started.
Decision¶
Custom PSI + KS test with a JSON baseline sidecar co-versioned with the model artifact.
Alternatives Considered¶
| Library | Pros | Cons | Decision |
|---|---|---|---|
| Custom PSI + KS | Zero extra dependencies, native Prometheus export, full control | More code to write | ✅ Selected |
| Evidently.ai | Full-featured, HTML reports, data drift + model drift | Adds ~200MB to image, requires a sidecar process, overkill for 12 features | ❌ Rejected |
| WhyLogs | Pandas-native, lightweight | Less interpretable for business stakeholders, no PSI threshold documentation | ❌ Rejected |
| NannyML | Strong statistical guarantees | Active development, API churn risk, not widely known | ❌ Rejected |
Why Custom PSI + KS (Dual Metric Rationale)¶
PSI — "Risk Manager" interpretability¶
Population Stability Index is the standard metric used by credit risk teams for over two decades. Its thresholds are internationally recognized:
| PSI | Interpretation | Action |
|---|---|---|
| < 0.10 | No significant drift | Monitor as usual |
| 0.10 – 0.20 | Moderate drift — investigate | Check data pipeline, segment analysis |
| > 0.20 | Significant drift — retrain candidate | Open GitHub Issue, schedule retrain |
PSI is explainable to a business audience: "The MRR distribution of incoming customers has shifted 27% from the training population." KS p-values alone are not.
KS test — Statistical rigour for model rollback decisions¶
The two-sample KS test provides a p-value that answers: "Is this distribution shift statistically significant or just sampling noise?" Combined with PSI:
- PSI > 0.20 AND KS p < 0.05 → High confidence: retrain or rollback
- PSI > 0.20 BUT KS p > 0.05 → Likely sampling artefact from small production batch
- PSI < 0.10 BUT KS p < 0.05 → Subtle shift; investigate data quality
Baseline Design¶
Storage: models/churn_training_baseline.json¶
{
"mrr": {
"min": 500.0, "max": 50000.0, "mean": 4821.3, "std": 8234.1,
"bins": [500.0, 5450.0, ...], // 11 edges → 10 buckets
"hist": [0.32, 0.18, ...], // normalized probabilities
"sample": [1200.0, 8400.0, ...] // 500 random training values for KS test
},
...
}
Versioning¶
The baseline JSON is regenerated at the end of every train_churn_model.py run (step 6
of the data-gen Docker stage and step 5 of the weekly data-pipeline.yml cron). It
is co-versioned with churn_model.pkl in DVC so rollback of model = rollback of baseline.
Prometheus Integration¶
Four module-level Gauges registered in the default prometheus_client registry,
exposed via the existing /metrics endpoint:
| Gauge | Description |
|---|---|
saasguard_drift_psi_max |
Worst-case PSI across all monitored features |
saasguard_drift_psi_by_feature{feature="..."} |
Per-feature PSI |
saasguard_drift_ks_max_statistic |
Max KS statistic |
saasguard_drift_ks_pvalue_min |
Most significant KS p-value |
No Grafana sidecar required — metrics are already on /metrics.
Actionability¶
Automated response (drift-monitor.yml, weekly Sunday 00:00 UTC)¶
- Generates a fresh data snapshot
- Runs
python -m src.infrastructure.monitoring.drift_detector --check - Writes
drift_report.jsonartifact - If exit code 1 → opens a GitHub Issue labelled
model-health, priority-high
On-call runbook¶
- Triage: Inspect
drift_report.json— which features drifted? - Data check:
dbt testto rule out upstream data quality issues - Segment check: Has a new customer segment been onboarded recently?
- Decision:
- Temporary shift (promo campaign) → document, set a retrain reminder
- Structural shift → trigger weekly data pipeline manually, rebuild image
- Escalation: If AUC drops below 0.75 on a holdout validation set, escalate to model retrain with the new data distribution included in training
Monitored Features (12 numerical)¶
mrr, tenure_days, total_events, events_last_30d, events_last_7d,
avg_adoption_score, days_since_last_event, retention_signal_count,
integration_connects_first_30d, tickets_last_30d, high_priority_tickets,
avg_resolution_hours
Binary/derived features (is_early_stage, categorical plan_tier, industry) are
excluded from drift detection — their distributions are structurally bounded and
unlikely to drift in ways that degrade model performance.
References¶
src/infrastructure/monitoring/drift_detector.py— implementation.github/workflows/drift-monitor.yml— weekly cronapp/main.py— startup initialization and/health/modelendpoint- Siddiqi, N. (2006). Credit Risk Scorecards — PSI methodology