
ADR-003: Cloud Deployment Platform — Railway

Status: Accepted
Date: 2026-03-16
Deciders: Engineering


Context

SaaSGuard requires a shared live endpoint accessible to business stakeholders (CS managers, VP Sales, Compliance reviewers) without local environment setup. The operational constraints at current traffic levels are:

  • Time-to-value: Stakeholder access must require zero local tooling — a URL, not a docker compose up instruction.
  • TCO target: Pre-PMF infrastructure cost ceiling is $0–$50/month. At this stage, operational simplicity has a higher priority than raw scalability.
  • CI/CD closure: The deployment pipeline must complete end-to-end on git push to main — lint → test → dbt build → Docker build → image push → live deploy. No manual deploy steps.
  • HTTPS without certificate management: TLS termination must be platform-provided.
  • No cold-start on the critical path: CS teams querying the prediction API during an account review cannot absorb 30-second cold-start latencies.
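
The CI/CD closure constraint amounts to a strict ordering in which any failed stage blocks everything after it, including the deploy. A minimal sketch of that gating logic follows; the stage names are labels taken from the list above, and `run_stage` is a hypothetical callback standing in for whatever each CI job actually executes.

```python
from typing import Callable, Sequence

# Stage order as listed in the CI/CD closure constraint. These are labels,
# not real commands; the actual commands live in the CI workflow.
STAGES: Sequence[str] = (
    "lint",
    "test",
    "dbt build",
    "docker build",
    "image push",
    "deploy",
)


def run_pipeline(run_stage: Callable[[str], bool]) -> list[str]:
    """Run stages in order and stop at the first failure.

    `run_stage` is a hypothetical callback that executes one stage and
    returns True on success. The return value lists the stages that
    completed successfully.
    """
    completed: list[str] = []
    for stage in STAGES:
        if not run_stage(stage):
            break  # a failed stage blocks every later stage, including deploy
        completed.append(stage)
    return completed
```

With this shape, a failing test stage leaves only `lint` in the completed list, so nothing reaches the live environment without passing every earlier gate.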

Decision

Deploy SaaSGuard on Railway as a Docker-based web service (Starter plan, $5/month).


Alternatives Considered

| Platform | TCO (pre-PMF) | Time-to-Value | Cold-start | Decision |
|---|---|---|---|---|
| Railway | $0 free / $5 Starter | Immediate (Docker-native, GitHub auto-deploy) | None on Starter | ✅ Selected |
| Render.com | $0 free / $7 Starter | Immediate | ~30s on free tier | ❌ Cold-start violates latency SLA |
| AWS ECS + ECR | >$20/month + IAM overhead | High (requires VPC, IAM, ECR, ECS task config) | None | ❌ Operational overhead disproportionate to current traffic envelope |
| Fly.io | $0 free tier | Medium (requires flyctl CLI setup) | Minimal | ❌ Additional toolchain dependency without meaningful benefit over Railway |
| Heroku | $7+/month | Immediate | None (paid) | ❌ No Docker-native web dyno on current plans |

TCO rationale: AWS ECS would provide autoscaling and enterprise SLAs at a cost of $20–$80/month plus non-trivial IAM and networking configuration time. At <200 DAU and ~50 concurrent users at peak, that operational overhead is not justified. The documented upgrade path (Railway → AWS ECS) is a railway.toml → ECS task definition conversion — no application code changes required.


Consequences

Positive

  • Zero infrastructure state outside the repo: railway.toml is the complete infrastructure-as-code definition. Reprovisioning the environment requires no out-of-band configuration.
  • CI/CD closes the loop: push → lint → test → build → push image → deploy is a single pipeline. The deploy job in ci.yml fires the Railway webhook after smoke tests pass. No manual deploy steps exist.
  • Benchmarked capacity: P99 latency ~140ms at 50 concurrent users on Railway US-West. Documented in docs/benchmarks.md. Sufficient for current CS team usage.
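
For readers unfamiliar with how a P99 figure like the one above is derived, the following is a minimal sketch using the nearest-rank method on a list of per-request latencies. It is an illustration only: the source does not state which load-testing tool or percentile method produced the numbers in docs/benchmarks.md, and the function names here are hypothetical.

```python
import math


def percentile(samples_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no latency samples provided")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Return the P50/P95/P99 summary for a list of latencies in milliseconds."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

Other percentile definitions interpolate between neighbouring samples, so results from a different tool can differ slightly on small sample sets.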

Negative / Trade-offs

  • Resource ceiling at free tier: 512 MB RAM limits concurrent users to ~50 before OOM risk. The Starter plan ($5/month) removes this constraint and increases RAM to 2 GB. All published benchmarks are measured on Starter.
  • Single-region: Railway US-West. Latency from EMEA is ~180ms P99. A multi-region deployment requires moving to a managed Kubernetes platform.

Data Strategy

Demo and staging data (DuckDB + model artifacts) is baked into the Docker image at build time by the data-gen stage of the multi-stage build. This eliminates cold-start generation time and makes the image self-contained for the current traffic profile. On retrain, a new image push triggers a fresh deploy via the Railway webhook.

  • Image size target: ~350–450 MB (Python runtime + data + model artifacts)
  • No external storage dependency at current scale
  • Production migration path: mount DuckDB from a persistent volume (Railway Volumes) or load it from object storage (S3), decoupling data from the image lifecycle
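
One way the migration path could be prepared without touching the rest of the application is to resolve the database location at startup, preferring an externally supplied path and falling back to the file baked into the image. The sketch below assumes a hypothetical environment variable `DUCKDB_PATH` and a hypothetical baked-in location `/app/data/saasguard.duckdb`; neither name comes from this ADR.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Both values are hypothetical placeholders, not taken from the project.
BAKED_DB_PATH = Path("/app/data/saasguard.duckdb")
DB_PATH_ENV_VAR = "DUCKDB_PATH"


def resolve_db_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the DuckDB file to open.

    Prefers a path supplied through the environment (for example a mounted
    volume) and falls back to the file baked into the image.
    """
    env = os.environ if env is None else env
    override = env.get(DB_PATH_ENV_VAR, "").strip()
    return Path(override) if override else BAKED_DB_PATH
```

Under this assumption the current image-baked setup keeps working unchanged, and moving to a volume later becomes a configuration change rather than a code change.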

Secret Management

GROQ_API_KEY is injected via the Railway dashboard environment variables (never committed). All other configuration is in railway.toml and safe to commit.
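
Because the key exists only in the platform environment, the service can fail fast at startup when it is missing instead of failing on the first request that needs it. A minimal sketch, assuming the application reads the variable directly from the process environment; `MissingSecretError` and `load_groq_api_key` are hypothetical names introduced for illustration.

```python
import os
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is absent."""


def load_groq_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Read GROQ_API_KEY from the environment, failing fast if it is unset."""
    env = os.environ if env is None else env
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        # The message names the variable but never echoes a secret value.
        raise MissingSecretError(
            "GROQ_API_KEY is not set; add it to the service's environment variables"
        )
    return key
```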

Upgrade Path

| Traffic level | Platform | Monthly TCO |
|---|---|---|
| <200 DAU (current) | Railway Starter | $5 |
| 200–2,000 DAU | Railway Pro or Fly.io | $20–$50 |
| >2,000 DAU | AWS ECS + RDS or Snowflake | $80–$200 |
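
The thresholds above can be expressed as a small lookup, which may be useful if a capacity review or alert needs to flag when the current tier has been outgrown. This is a sketch that encodes the table only; `Tier` and `tier_for_dau` are hypothetical names, and the boundary handling (200 and 2,000 DAU both fall in the middle tier) follows the ranges as written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    platform: str
    monthly_tco_usd: tuple[int, int]  # (low, high) estimate in USD


def tier_for_dau(dau: int) -> Tier:
    """Map daily active users to the platform tier from the upgrade path table."""
    if dau < 0:
        raise ValueError("dau must be non-negative")
    if dau < 200:
        return Tier("Railway Starter", (5, 5))
    if dau <= 2000:
        return Tier("Railway Pro or Fly.io", (20, 50))
    return Tier("AWS ECS + RDS or Snowflake", (80, 200))
```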

References

  • railway.toml: service definition
  • .github/workflows/ci.yml: deploy job
  • docs/benchmarks.md: P50/P95/P99 latency table