Experiment Design: CS Intervention Effectiveness (SGD-009)

Status: Approved
Version: 1.0
Author: SaaSGuard Platform Team
Date: 2026-03-14
Related tickets: SGD-009 (A/B Test Simulation), SGD-008 (Survival Analysis)

1. Hypothesis

Business Hypothesis

A structured CS outreach programme targeting at-risk customers in the 30–90 day window reduces 90-day churn rate relative to standard (reactive) CS support.

Statistical Hypotheses

Hypothesis	Statement
H₀ (null)	P(churn | CS intervention) = P(churn | standard support): the intervention has no effect on 90-day churn rate
H₁ (alternative)	P(churn | CS intervention) < P(churn | standard support): the intervention reduces 90-day churn rate

Direction: One-tailed (we only care if the intervention reduces churn; an increase would trigger immediate programme review regardless of statistical significance).

2. Unit of Randomisation

Unit: Individual customer account (customer_id)

Randomisation method: Stratified random assignment within risk tier (starter / growth / enterprise), executed at the point of risk score update. Stratification prevents accidental imbalances — e.g., assigning all starter accounts to control by chance.

Assignment ratio: 50/50 (equal treatment/control allocation maximises power for a given total sample size).
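
As a minimal sketch of the stratified 50/50 assignment described above, the following assumes a batch of eligible accounts is available at once; assigning at each risk score update would instead need something like permuted blocks per tier. The function name `assign_arms` and the `(customer_id, tier)` input shape are illustrative and not part of the production assignment system.

```python
import random
from collections import defaultdict


def assign_arms(customers, seed=0):
    """Stratified 50/50 assignment: shuffle within each tier, then alternate arms.

    `customers` is a list of (customer_id, tier) pairs (hypothetical input shape).
    Returns a dict mapping customer_id to "treatment" or "control".
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    by_tier = defaultdict(list)
    for customer_id, tier in customers:
        by_tier[tier].append(customer_id)

    assignment = {}
    for tier in sorted(by_tier):
        ids = sorted(by_tier[tier])  # sort first so the result depends only on the seed
        rng.shuffle(ids)
        # When a tier has an odd number of accounts, pick at random which arm
        # receives the extra one, so neither arm is systematically larger.
        offset = rng.randrange(2)
        for i, customer_id in enumerate(ids):
            arm = "treatment" if (i + offset) % 2 == 0 else "control"
            assignment[customer_id] = arm
    return assignment
```

Within each tier the arm sizes differ by at most one account, which is the imbalance protection that stratification is meant to provide.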

Exclusion criteria:

Customers in their first 14 days (too early for intervention signal)
Customers already past day 90 of tenure (outside the intervention window)
Customers with an open escalation ticket (CSM already engaged, can't randomise away support)
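
The exclusion criteria above can be expressed as a single eligibility check. This is a sketch only: the `Account` fields are hypothetical names, and the boundary handling (day 14 excluded, day 90 still eligible) is an assumption about how "first 14 days" and "past day 90" are meant to be read.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # Hypothetical field names; the real CRM schema may differ.
    customer_id: str
    tenure_days: int
    has_open_escalation: bool


def is_eligible(account: Account) -> bool:
    """Return True if the account passes all three exclusion criteria."""
    if account.tenure_days <= 14:
        return False  # still in the first 14 days: too early for an intervention signal
    if account.tenure_days > 90:
        return False  # already past day 90: outside the intervention window
    if account.has_open_escalation:
        return False  # CSM already engaged; support cannot be randomised away
    return True
```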

3. Intervention Description

Treatment arm

Proactive CS outreach: a structured 3-touch sequence over 14 days.

Touch	Channel	Content
Day 0	Email	Personalised health score report + 2 adoption recommendations
Day 7	In-app	Feature activation nudge (integration_connect prompt if score < 3)
Day 14	CS call	15-minute check-in with prepared risk briefing

Control arm

Standard (reactive) CS support: no proactive outreach. Customers receive standard in-app help, documentation access, and reactive ticket support as usual.

Ethical note: Control arm customers are not denied support — they receive the current standard of care. The intervention is additive, not substitutive.

4. Metrics

Primary metric

90-day churn rate — proportion of customers who churn within 90 days of enrolment into the experiment.

Measured at the customer level (binary outcome: churned = 1, retained = 0)
Observation period: 90 days from randomisation date

Secondary metrics

Metric	Purpose
Integration connect rate (30-day)	Measures whether the intervention drives activation behaviour
Feature adoption score (60-day)	Captures broader product engagement uplift
CS outreach conversion rate	Touch 1 email open + call acceptance rate — measures intervention delivery
Support ticket volume (30-day)	Negative outcome check — intervention should not increase support load

Guardrail metrics (stop if violated)

P(harm) > 0.10: If posterior probability of treatment increasing churn exceeds 10%, stop experiment and review intervention design.
CS capacity breach: If treatment arm CS call acceptance > 80%, throttle assignment rate.

5. Minimum Detectable Effect (MDE) and Sample Size

Prior belief

From the survival analysis (SGD-008):

Starter tier 90-day churn rate (baseline): ~33% (KM estimate at day 90)
Growth tier 90-day churn rate (baseline): ~15%

Target MDE

5 percentage-point absolute reduction (e.g., 33% → 28% for starter tier).

Business context: A 5pp reduction on 500 starter accounts retains 25 additional accounts; at $800 average MRR that is 25 × $800 × 12 = $240K ARR saved per quarter, well above the programme cost threshold.

Frequentist sample size (for reference)

Using scipy.stats.norm power analysis (alpha=0.05, power=0.80, one-tailed):

Baseline = 0.33, MDE = 0.05 (absolute), alpha = 0.05, power = 0.80
Required n per arm ≈ 1,045 (normal-approximation test of two proportions, unpooled variance)

At a typical B2B CS programme scale of 40–60 at-risk customers per quarter, enrolling roughly 2,090 customers across both arms would take about 35–52 quarters. This is why the Bayesian approach is used instead.
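
The calculation can be sketched with the standard normal-approximation formula for a one-tailed comparison of two proportions (unpooled variance). The standard library's `statistics.NormalDist` stands in for `scipy.stats.norm` here so the block has no dependencies; the function name `n_per_arm` is illustrative, and the notebook may use a different variant of the formula.

```python
import math
from statistics import NormalDist


def n_per_arm(baseline, mde, alpha=0.05, power=0.80):
    """Sample size per arm to detect an absolute reduction of `mde` from `baseline`.

    One-tailed test of two proportions using the normal approximation.
    """
    p_control = baseline
    p_treatment = baseline - mde
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde ** 2)


print(n_per_arm(baseline=0.33, mde=0.05))
```

Because the required n scales with 1 / MDE², halving the detectable effect roughly quadruples the sample size, which is what makes small effects impractical to test at this enrolment rate.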

Bayesian sample size

Using a Beta-Bernoulli conjugate model with an informative prior Beta(2, 8) (encoding a prior belief that baseline churn ≈ 20%):

n per arm	P(treatment > control) for 5pp effect
20	~0.71
40	~0.82
60	~0.88
80	~0.92
100	~0.95

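Recommended minimum: n = 60 per arm, at which P(treatment > control) reaches ~0.88 for a true 5pp effect. Here P(treatment > control) denotes the posterior probability that the treatment arm's churn rate is lower than the control arm's. At 40–60 at-risk customers per quarter, the 120 customers needed across both arms can be enrolled in 2–3 quarters for starter tier.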

See notebooks/bayesian_ab_test_simulation.ipynb for the full simulation.

6. Bayesian Decision Framework

Prior specification

Beta(α=2, β=8) — informative prior encoding the belief that baseline churn ≈ 20%.

Defensibility: The prior is based on the synthetic data (enterprise + growth blended rate) and is intentionally conservative: its mean of 20% sits below the ~33% starter-specific baseline, to avoid over-claiming intervention benefit.

Posterior update

Beta(α, β) is a prior on the churn rate, so churns add to α and retentions add to β. After observing c churns and r retentions in each arm:

Posterior = Beta(α + c, β + r)
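
A minimal sketch of the conjugate update and the resulting comparison between arms, assuming the Beta(2, 8) prior is placed on the churn rate (so churn counts add to α and retention counts add to β). The function names are illustrative, and the Monte Carlo estimate is one simple way to approximate the probability; the notebook may compute it differently.

```python
import random

PRIOR_ALPHA, PRIOR_BETA = 2, 8  # Beta(2, 8): prior mean churn rate of 0.20


def posterior_params(churned, retained, alpha=PRIOR_ALPHA, beta=PRIOR_BETA):
    """Conjugate Beta-Bernoulli update for one arm's churn rate."""
    return alpha + churned, beta + retained


def prob_treatment_better(treatment, control, draws=20_000, seed=0):
    """Monte Carlo estimate of P(treatment churn rate < control churn rate).

    `treatment` and `control` are (churned, retained) count pairs.
    """
    rng = random.Random(seed)
    a_t, b_t = posterior_params(*treatment)
    a_c, b_c = posterior_params(*control)
    wins = sum(
        rng.betavariate(a_t, b_t) < rng.betavariate(a_c, b_c)
        for _ in range(draws)
    )
    return wins / draws
```

For example, `prob_treatment_better((17, 43), (20, 40))` compares a treatment arm with 17 churns out of 60 against a control arm with 20 churns out of 60.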

Decision criteria

Outcome	Decision
P(treatment > control) ≥ 0.90	Declare intervention effective; expand to full CS team
P(treatment > control) ∈ [0.70, 0.90)	Inconclusive; extend for one more quarter
P(treatment > control) < 0.70	Intervention not effective at this scale; redesign
P(harm) > 0.10	Stop immediately; review intervention design
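
The table above can be read as a simple mapping from posterior summaries to a recommendation. The sketch below is illustrative: the function name and return labels are hypothetical, how P(harm) is estimated is left to the caller, and checking the harm guardrail before the other rows is an assumption based on "stop immediately". Under the human-in-the-loop gate in Section 7, the output is a recommendation for review and not an automated trigger.

```python
def decide(p_better: float, p_harm: float) -> str:
    """Map posterior probabilities to a recommendation from the decision table."""
    if p_harm > 0.10:
        return "stop"      # guardrail: review intervention design
    if p_better >= 0.90:
        return "expand"    # declare effective; expand to full CS team
    if p_better >= 0.70:
        return "extend"    # inconclusive; run for one more quarter
    return "redesign"      # not effective at this scale
```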

Credible interval requirement

Report 95% credible interval for the absolute churn rate difference alongside P(treatment > control). A wide CI even with high P(treatment > control) should prompt caution — it means high confidence in direction but uncertainty about magnitude.
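
One way to obtain the interval is to sample from both posteriors and take quantiles of the difference. The sketch below assumes the Beta(2, 8) prior on churn rate, defines the effect as control churn minus treatment churn (so positive values mean the intervention reduced churn), and uses an equal-tailed interval; the function name and these conventions are assumptions, not taken from the notebook.

```python
import random


def diff_credible_interval(treatment, control, level=0.95,
                           draws=20_000, seed=0, prior=(2, 8)):
    """Equal-tailed credible interval for (control churn - treatment churn).

    `treatment` and `control` are (churned, retained) count pairs.
    """
    rng = random.Random(seed)
    a0, b0 = prior
    diffs = sorted(
        rng.betavariate(a0 + control[0], b0 + control[1])
        - rng.betavariate(a0 + treatment[0], b0 + treatment[1])
        for _ in range(draws)
    )
    tail = (1 - level) / 2
    lower = diffs[int(tail * draws)]
    upper = diffs[int((1 - tail) * draws) - 1]
    return lower, upper
```

An interval that spans zero at small n is expected even when the posterior probability of a benefit is fairly high, which is the situation the caution above describes.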

7. Experiment Governance

Approval gate

The experiment design is reviewed and approved by:

VP of Customer Success: accountable for CS resource allocation
Head of Data: responsible for statistical methodology
Legal/Compliance: confirms control arm customers receive standard support SLA

Blinding

CS reps executing outreach are not blinded (they must know which customers to contact)
CS reps reviewing outcome data are blinded to treatment assignment during analysis
A dedicated analyst (not involved in CS delivery) runs the posterior update

Cadence

Week 2: Interim safety check — review guardrail metrics (P(harm), CS capacity)
Week 8: Mid-point posterior update — share with CS leadership for early readouts
Week 13: Final posterior report — primary decision point

Data capture

Required data fields to be logged to DuckDB per customer:

Field	Source
`experiment_id`	Assignment system
`customer_id`	CRM
`arm`	Assignment system (treatment | control)
`assignment_date`	Assignment system
`churned_90d`	Warehouse (computed at day 90)
`integration_connects_30d`	Usage events table
`feature_adoption_score_60d`	Usage events table
`cs_touches_delivered`	CS outreach log
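
A sketch of how one logged row might be typed before it is written to DuckDB. The field names come from the table above; the Python types, the defaults, and the use of `None` for outcomes not yet observed are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ExperimentRecord:
    experiment_id: str
    customer_id: str
    arm: str                                   # "treatment" or "control"
    assignment_date: date
    churned_90d: Optional[bool] = None         # unknown until day 90
    integration_connects_30d: Optional[int] = None
    feature_adoption_score_60d: Optional[float] = None
    cs_touches_delivered: int = 0

    def __post_init__(self):
        # Reject anything other than the two arms defined in this design.
        if self.arm not in ("treatment", "control"):
            raise ValueError(f"unknown arm: {self.arm!r}")
```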

Human-in-the-loop gate

Before any posterior-driven decision changes the CS SOP, results are reviewed by a human analyst and VP of CS. No automated decision rule triggers SOP changes without sign-off.

8. Limitations and Risks

Risk	Mitigation
SUTVA violation (control customers receive treatment info from treated peers)	Randomise at account level, not user level; monitor cross-contamination
Novelty effect (CS team more attentive during experiment)	Measure CS activity rates in control arm; flag if elevated
Selection bias in risk scoring (model assigns wrong customers to experiment)	Validate risk score distribution is balanced between arms at assignment
Regression to the mean	Ensure assignment is based on a forward-looking risk score, not a recent spike
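
The balance validation named in the selection-bias row could be sketched with a standardised mean difference (SMD) on the risk score. This is one common balance diagnostic and not necessarily the check the team uses; the 0.1 threshold is a widely used rule of thumb and is an assumption here, as are the function names.

```python
import math
from statistics import mean, variance


def standardised_mean_difference(treatment, control):
    """SMD between two samples, using the average of the two sample variances.

    Each sample needs at least two values.
    """
    pooled_sd = math.sqrt((variance(treatment) + variance(control)) / 2)
    if pooled_sd == 0:
        # Both samples are constant: balanced only if the constants match.
        return 0.0 if mean(treatment) == mean(control) else math.inf
    return (mean(treatment) - mean(control)) / pooled_sd


def is_balanced(treatment, control, threshold=0.1):
    """True if the absolute SMD is below the (assumed) threshold."""
    return abs(standardised_mean_difference(treatment, control)) < threshold
```

The same check can be repeated for the other covariates listed in the reporting template (tier, MRR, tenure), with categorical covariates such as tier compared by proportion.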

9. Reporting Template

At the end of the experiment, the report must include:

Assignment summary: n per arm, balance check on key covariates (tier, MRR, tenure)
Primary outcome: Posterior distribution plot (control vs. treatment), P(treatment > control), 95% CI for absolute effect
Secondary outcomes: Integration rate, adoption score delta, ticket volume
Guardrail check: Was P(harm) ever > 0.10?
Business interpretation: ARR impact at 95% CI lower bound (conservative case)
Recommendation: Expand / extend / redesign + rationale

For the simulation validating this design, see: notebooks/bayesian_ab_test_simulation.ipynb