Experiment Design: CS Intervention Effectiveness (SGD-009)
- Status: Approved
- Version: 1.0
- Author: SaaSGuard Platform Team
- Date: 2026-03-14
- Related tickets: SGD-009 (A/B Test Simulation), SGD-008 (Survival Analysis)
1. Hypothesis
Business Hypothesis
A structured CS outreach programme targeting at-risk customers in the 30–90 day window reduces 90-day churn rate relative to standard (reactive) CS support.
Statistical Hypotheses
| Hypothesis | Statement |
|---|---|
| H₀ (null) | P(churn \| CS intervention) = P(churn \| standard support): the intervention has no effect on 90-day churn rate |
| H₁ (alternative) | P(churn \| CS intervention) < P(churn \| standard support): the intervention reduces 90-day churn rate |
Direction: One-tailed (we only care if the intervention reduces churn; an increase would trigger immediate programme review regardless of statistical significance).
2. Unit of Randomisation
Unit: Individual customer account (customer_id)
Randomisation method: Stratified random assignment within risk tier (starter / growth / enterprise), executed at the point of risk score update. Stratification prevents accidental imbalances — e.g., assigning all starter accounts to control by chance.
Assignment ratio: 50/50 (equal treatment/control allocation maximises power for a given total sample size).
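The stratified 50/50 assignment described above can be sketched as follows. This is a minimal illustration, assuming a batch of eligible customers is assigned at once (the design itself assigns at each risk score update, which would need a streaming scheme such as permuted blocks). The function name `stratified_assign` and the `(customer_id, tier)` input shape are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assign(customers, seed=0):
    """Assign customers 50/50 to treatment/control within each tier.

    `customers` is a list of (customer_id, tier) pairs (hypothetical shape).
    Returns a dict mapping customer_id to "treatment" or "control".
    """
    rng = random.Random(seed)

    # Group customer ids by stratum (tier).
    by_tier = defaultdict(list)
    for customer_id, tier in customers:
        by_tier[tier].append(customer_id)

    assignment = {}
    for tier in sorted(by_tier):  # sorted for reproducibility
        ids = by_tier[tier]
        rng.shuffle(ids)
        half = len(ids) // 2
        # With an odd stratum size, a coin flip decides which arm
        # receives the extra account so neither arm is favoured.
        if len(ids) % 2 == 1 and rng.random() < 0.5:
            half += 1
        for i, customer_id in enumerate(ids):
            assignment[customer_id] = "treatment" if i < half else "control"
    return assignment
```

Shuffling within each tier and splitting at the midpoint guarantees that arm sizes differ by at most one account per tier, which is the imbalance protection the stratification is meant to provide.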
Exclusion criteria:
- Customers in their first 14 days (too early for intervention signal)
- Customers already past day 90 of tenure (outside the intervention window)
- Customers with an open escalation ticket (CSM already engaged, can't randomise away support)
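The exclusion criteria above amount to a simple eligibility filter. The sketch below is illustrative only: the `Customer` fields are hypothetical names, and the boundary handling (day 14 excluded, day 90 included) is an assumption the text does not settle.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical field names, not taken from the production schema.
    customer_id: str
    tenure_days: int
    has_open_escalation: bool


def is_eligible(c: Customer) -> bool:
    """Apply the three exclusion criteria before randomisation."""
    if c.tenure_days <= 14:      # first 14 days: too early for a signal
        return False
    if c.tenure_days > 90:       # already past day 90: outside the window
        return False
    if c.has_open_escalation:    # CSM already engaged
        return False
    return True
```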
3. Intervention Description
Treatment arm
Proactive CS outreach: a structured 3-touch sequence over 14 days.
| Touch | Channel | Content |
|---|---|---|
| Day 0 | Email | Personalised health score report + 2 adoption recommendations |
| Day 7 | In-app | Feature activation nudge (integration_connect prompt if score < 3) |
| Day 14 | CS call | 15-minute check-in with prepared risk briefing |
Control arm
Standard (reactive) CS support: no proactive outreach. Customers receive standard in-app help, documentation access, and reactive ticket support as usual.
Ethical note: Control arm customers are not denied support — they receive the current standard of care. The intervention is additive, not substitutive.
4. Metrics
Primary metric
90-day churn rate — proportion of customers who churn within 90 days of enrolment into the experiment.
- Measured at the customer level (binary outcome: churned = 1, retained = 0)
- Observation period: 90 days from randomisation date
Secondary metrics
| Metric | Purpose |
|---|---|
| Integration connect rate (30-day) | Measures whether the intervention drives activation behaviour |
| Feature adoption score (60-day) | Captures broader product engagement uplift |
| CS outreach conversion rate | Touch 1 email open + call acceptance rate — measures intervention delivery |
| Support ticket volume (30-day) | Negative outcome check — intervention should not increase support load |
Guardrail metrics (stop if violated)
- P(harm) > 0.10: If posterior probability of treatment increasing churn exceeds 10%, stop experiment and review intervention design.
- CS capacity breach: If treatment arm CS call acceptance > 80%, throttle assignment rate.
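The P(harm) guardrail can be estimated by sampling from each arm's Beta posterior. The sketch below assumes a Beta(2, 8) prior on each arm's churn probability and takes "harm" to mean treatment churn exceeding control churn. `prob_harm` and its parameters are hypothetical names, and the standard-library `random.betavariate` is used only to keep the example dependency-free.

```python
import random


def prob_harm(churn_t, n_t, churn_c, n_c, prior=(2.0, 8.0), draws=20000, seed=0):
    """Monte Carlo estimate of P(treatment churn rate > control churn rate).

    churn_* are observed churn counts, n_* are arm sizes.
    The prior is on the churn probability (assumption).
    """
    rng = random.Random(seed)
    a0, b0 = prior
    worse = 0
    for _ in range(draws):
        # Posterior for a churn rate: Beta(a0 + churns, b0 + retentions).
        theta_t = rng.betavariate(a0 + churn_t, b0 + n_t - churn_t)
        theta_c = rng.betavariate(a0 + churn_c, b0 + n_c - churn_c)
        if theta_t > theta_c:
            worse += 1
    return worse / draws


# Example guardrail check with made-up counts.
if prob_harm(churn_t=9, n_t=30, churn_c=8, n_c=30) > 0.10:
    print("Guardrail breached: stop and review intervention design")
```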
5. Minimum Detectable Effect (MDE) and Sample Size
Prior belief
From survival analysis:
- Starter tier 90-day churn rate (baseline): ~33% (KM estimate at day 90)
- Growth tier 90-day churn rate (baseline): ~15%
Target MDE
5 percentage-point absolute reduction (e.g., 33% → 28% for starter tier).
Business context: A 5pp reduction on 500 starter accounts retains 25 additional accounts. At $800 average MRR that is $20K MRR, i.e. $240K ARR saved per quarter, well above the programme cost threshold.
Frequentist sample size (for reference)
Using scipy.stats.norm power analysis (alpha=0.05, power=0.80, one-tailed):
At a typical B2B CS programme scale of 40–60 at-risk customers per quarter, this means a frequentist test would take 5–8 quarters to reach significance. This is why the Bayesian approach is used instead.
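The exact calculation behind this comparison is not shown here. As a rough reference, the standard normal-approximation formula for a one-tailed two-proportion test can be sketched as below. The function name `n_per_arm` is hypothetical, the unpooled-variance formula is an assumption about the method, and `statistics.NormalDist().inv_cdf` is used as a standard-library stand-in for `scipy.stats.norm.ppf`.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Sample size per arm for a one-tailed two-proportion z-test.

    Uses n = (z_alpha + z_beta)^2 * (p1*q1 + p2*q2) / (p1 - p2)^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha)  # one-tailed critical value
    z_beta = z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


print(n_per_arm(0.33, 0.28))  # starter tier, 5pp MDE
print(n_per_arm(0.15, 0.10))  # growth tier, 5pp MDE
```

Under these assumptions the starter-tier case (33% → 28%) comes out at roughly a thousand accounts per arm, far more than a single quarter's at-risk cohort.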
Bayesian sample size
Using a Beta-Bernoulli conjugate model with an informative prior Beta(2, 8) (encoding a prior belief that baseline churn ≈ 20%):
| n per arm | P(treatment > control) for 5pp effect |
|---|---|
| 20 | ~0.71 |
| 40 | ~0.82 |
| 60 | ~0.88 |
| 80 | ~0.92 |
| 100 | ~0.95 |
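Recommended minimum: n = 60 per arm, at which a real 5pp effect yields P(treatment > control) ≈ 0.88. This is achievable in 1–2 quarters for starter tier. Because 0.88 sits just below the 0.90 threshold in Section 6, a run at exactly this size may land in the inconclusive band and need the one-quarter extension.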
See notebooks/bayesian_ab_test_simulation.ipynb for the full simulation.
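For orientation, one way such a simulation could be structured is sketched below. It is not the notebook's code and is not expected to reproduce the table's figures exactly: it assumes the Beta(2, 8) prior is placed on each arm's churn probability, simulates churn counts at the starter-tier rates (33% vs. 28%), and averages the posterior probability that treatment churn is lower. All function and parameter names are hypothetical.

```python
import random


def posterior_prob_benefit(churn_t, n_t, churn_c, n_c,
                           prior=(2.0, 8.0), draws=2000, rng=None):
    """Monte Carlo estimate of P(treatment churn rate < control churn rate)."""
    rng = rng or random.Random(0)
    a0, b0 = prior
    wins = sum(
        rng.betavariate(a0 + churn_t, b0 + n_t - churn_t)
        < rng.betavariate(a0 + churn_c, b0 + n_c - churn_c)
        for _ in range(draws)
    )
    return wins / draws


def expected_prob_benefit(n, p_control=0.33, p_treatment=0.28,
                          trials=200, seed=0):
    """Average posterior P(benefit) over simulated experiments of size n per arm."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Simulate churn counts for one hypothetical experiment.
        churn_c = sum(rng.random() < p_control for _ in range(n))
        churn_t = sum(rng.random() < p_treatment for _ in range(n))
        total += posterior_prob_benefit(churn_t, n, churn_c, n, rng=rng)
    return total / trials


for n in (20, 40, 60, 80, 100):
    print(n, round(expected_prob_benefit(n), 2))
```

The result depends on how the simulated data are generated and summarised (mean versus median across trials, for example), so the notebook remains the reference for the published numbers.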
6. Bayesian Decision Framework
Prior specification
Beta(α=2, β=8) — informative prior encoding the belief that baseline churn ≈ 20%.
Defensibility: The prior is based on the synthetic data (enterprise + growth blended rate) and is intentionally conservative (below the ~33% starter-specific baseline) to avoid over-claiming intervention benefit.
Posterior update
After observing s successes (retentions) and f failures (churns) in each arm:
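The update formula itself is not reproduced here. Under standard Beta-Bernoulli conjugacy, and assuming the Beta(2, 8) prior is placed on the churn probability θ of each arm (so that its mean, 2/10 = 0.20, matches the stated belief about baseline churn), churns add to α and retentions add to β:

$$
\theta_{\text{arm}} \mid s, f \;\sim\; \mathrm{Beta}(\alpha + f,\; \beta + s) = \mathrm{Beta}(2 + f,\; 8 + s),
\qquad
\mathbb{E}[\theta_{\text{arm}} \mid s, f] = \frac{2 + f}{10 + s + f}
$$

If the model were instead parameterised on the retention probability, the roles of s and f would swap and the prior would need to be Beta(8, 2) to encode the same belief.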
Decision criteria
| Outcome | Decision |
|---|---|
| P(treatment > control) ≥ 0.90 | Declare intervention effective; expand to full CS team |
| P(treatment > control) ∈ [0.70, 0.90) | Inconclusive; extend for one more quarter |
| P(treatment > control) < 0.70 | Intervention not effective at this scale; redesign |
| P(harm) > 0.10 | Stop immediately; review intervention design |
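The table above can be read as a small decision function. The sketch below is an illustration with a hypothetical name (`recommend`) and assumes the harm guardrail takes precedence over the other rows. Note that if P(harm) is computed as the plain complement of P(treatment > control), the guardrail would fire whenever that probability is below 0.90, so the two inputs are kept separate here and the exact definition of harm is left to the analysis plan.

```python
def recommend(p_benefit: float, p_harm: float) -> str:
    """Map posterior probabilities to the decision table's outcomes.

    p_benefit: P(treatment > control), i.e. treatment reduces churn.
    p_harm:    posterior probability that treatment increases churn.
    The result is a recommendation only; sign-off remains with humans.
    """
    if p_harm > 0.10:          # guardrail checked first (assumption)
        return "stop"
    if p_benefit >= 0.90:
        return "expand"
    if p_benefit >= 0.70:
        return "extend"
    return "redesign"


print(recommend(p_benefit=0.93, p_harm=0.04))  # expand
```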
Credible interval requirement
Report 95% credible interval for the absolute churn rate difference alongside P(treatment > control). A wide CI even with high P(treatment > control) should prompt caution — it means high confidence in direction but uncertainty about magnitude.
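A credible interval for the absolute difference can be obtained from posterior samples. The sketch below is one possible approach, assuming a Beta(2, 8) prior on each arm's churn probability and an equal-tailed interval; the function name and signature are hypothetical.

```python
import random


def effect_credible_interval(churn_t, n_t, churn_c, n_c, prior=(2.0, 8.0),
                             level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval for (control churn - treatment churn).

    Positive values mean the treatment arm churns less.
    """
    rng = random.Random(seed)
    a0, b0 = prior
    diffs = sorted(
        rng.betavariate(a0 + churn_c, b0 + n_c - churn_c)
        - rng.betavariate(a0 + churn_t, b0 + n_t - churn_t)
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    lower = diffs[int(tail * draws)]
    upper = diffs[int((1 - tail) * draws) - 1]
    return lower, upper


lower, upper = effect_credible_interval(churn_t=10, n_t=60, churn_c=20, n_c=60)
print(f"95% credible interval for churn reduction: [{lower:.3f}, {upper:.3f}]")
```

Reporting the interval on the difference, rather than per arm, keeps the focus on the quantity the MDE is defined on.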
7. Experiment Governance
Approval gate
The experiment design is reviewed and approved by:
- VP of Customer Success: accountable for CS resource allocation
- Head of Data: responsible for statistical methodology
- Legal/Compliance: confirms control arm customers receive standard support SLA
Blinding
- CS reps executing outreach are not blinded (they must know which customers to contact)
- CS reps reviewing outcome data are blinded to treatment assignment during analysis
- A dedicated analyst (not involved in CS delivery) runs the posterior update
Cadence
- Week 2: Interim safety check — review guardrail metrics (P(harm), CS capacity)
- Week 8: Mid-point posterior update — share with CS leadership for early readouts
- Week 13: Final posterior report — primary decision point
Data capture
Required data fields to be logged to DuckDB per customer:
| Field | Source |
|---|---|
| experiment_id | Assignment system |
| customer_id | CRM |
| arm | treatment \| control |
| assignment_date | Assignment system |
| churned_90d | Warehouse (computed at day 90) |
| integration_connects_30d | Usage events table |
| feature_adoption_score_60d | Usage events table |
| cs_touches_delivered | CS outreach log |
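As an illustration of the record shape, the fields above could be mirrored in a typed structure before being written to storage. The field names come from the table; the Python types, the defaults, and the class name `ExperimentRecord` are assumptions, and the mapping onto an actual DuckDB table is left open.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ExperimentRecord:
    experiment_id: str
    customer_id: str
    arm: str                                   # "treatment" or "control"
    assignment_date: date
    # Outcome fields stay None until their observation window closes.
    churned_90d: Optional[bool] = None
    integration_connects_30d: Optional[int] = None
    feature_adoption_score_60d: Optional[float] = None
    cs_touches_delivered: int = 0

    def __post_init__(self):
        # Reject anything other than the two arms the design defines.
        if self.arm not in ("treatment", "control"):
            raise ValueError(f"unknown arm: {self.arm!r}")


record = ExperimentRecord("SGD-009", "cust_001", "treatment", date(2026, 4, 1))
```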
Human-in-the-loop gate
Before any posterior-driven decision changes the CS SOP, results are reviewed by a human analyst and VP of CS. No automated decision rule triggers SOP changes without sign-off.
8. Limitations and Risks
| Risk | Mitigation |
|---|---|
| SUTVA violation (control customers receive treatment info from treated peers) | Randomise at account level, not user level; monitor cross-contamination |
| Novelty effect (CS team more attentive during experiment) | Measure CS activity rates in control arm; flag if elevated |
| Selection bias in risk scoring (model assigns wrong customers to experiment) | Validate risk score distribution is balanced between arms at assignment |
| Regression to the mean | Ensure assignment is based on a forward-looking risk score, not a recent spike |
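The balance validation mentioned in the table could be done with a standardized mean difference (SMD) on the risk score or any other numeric covariate. The sketch below is one common approach rather than a prescribed method; the function name is hypothetical, and the 0.1 cut-off is a widely used rule of thumb, not a threshold set by this design.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treatment, control):
    """SMD between two samples, using the average of the two sample variances."""
    pooled_sd = sqrt((variance(treatment) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0  # both arms are constant; treat as no measurable spread
    return (mean(treatment) - mean(control)) / pooled_sd


# Made-up risk scores for illustration only.
smd = standardized_mean_difference([0.61, 0.72, 0.55, 0.68], [0.63, 0.70, 0.58, 0.66])
if abs(smd) > 0.1:  # rule-of-thumb threshold (assumption)
    print(f"Possible imbalance between arms: SMD = {smd:.2f}")
```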
9. Reporting Template
At the end of the experiment, the report must include:
- Assignment summary: n per arm, balance check on key covariates (tier, MRR, tenure)
- Primary outcome: Posterior distribution plot (control vs. treatment), P(treatment > control), 95% CI for absolute effect
- Secondary outcomes: Integration rate, adoption score delta, ticket volume
- Guardrail check: Was P(harm) ever > 0.10?
- Business interpretation: ARR impact at 95% CI lower bound (conservative case)
- Recommendation: Expand / extend / redesign + rationale
For the simulation validating this design, see:
notebooks/bayesian_ab_test_simulation.ipynb