Production checklist

Before you turn on live traffic, run through this list. Most items apply to every product; product-specific items are tagged.

Auth & secrets

Production API keys are issued and stored in your secrets manager (not in code, not in env files committed to git).
API keys are scoped to least privilege (e.g. payments-service has evaluate:write only, not a wildcard write scope).
Key rotation schedule is documented (quarterly recommended).
Webhook secrets are stored separately from API keys (different blast radius).
No sandbox keys appear in production code paths.

Idempotency

Every mutating call passes externalId (or product-specific equivalent).
Retries on 5xx and network errors use exponential backoff with jitter, capped at 5 retries.
Your retry logic never retries 4xx errors (other than 429).
Webhook handlers are idempotent: processing the same event ID twice is a no-op.
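
The retry rules above can be sketched roughly as follows. This is a minimal illustration, not part of any Quantum Elixir SDK: the names isRetryable, backoffDelayMs, and withRetries are hypothetical, and the 200 ms base delay and 10 s ceiling are assumed values. Only the 5-retry cap and the "retry 5xx, network errors, and 429 only" rule come from the checklist. The jitter scheme shown is "full jitter" (a random delay between zero and the exponential ceiling), which is one common choice among several.

```typescript
// Sketch only: names and delay constants are assumptions, not SDK values.
const MAX_RETRIES = 5;        // cap from the checklist
const BASE_DELAY_MS = 200;    // assumed starting delay
const MAX_DELAY_MS = 10_000;  // assumed ceiling on any single delay

// An attempt with no status represents a network error (no HTTP response).
type Attempt<T> = { status?: number; value?: T };

/** Retry network errors, 429, and 5xx. Never retry any other 4xx. */
function isRetryable(status: number | undefined): boolean {
  if (status === undefined) return true; // network error
  if (status === 429) return true;       // rate limited
  return status >= 500 && status <= 599; // server error
}

/** Full jitter: a random delay in [0, min(MAX_DELAY_MS, BASE_DELAY_MS * 2^attempt)). */
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

/** Makes one initial call plus at most MAX_RETRIES retries. */
async function withRetries<T>(
  call: () => Promise<Attempt<T>>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<Attempt<T>> {
  let result = await call();
  for (let attempt = 0; attempt < MAX_RETRIES && isRetryable(result.status); attempt++) {
    await sleep(backoffDelayMs(attempt));
    result = await call();
  }
  return result;
}
```

Because each retry re-sends the same request, the call wrapped here should carry the same externalId on every attempt. For 429 responses, a server-provided Retry-After value should take precedence over the computed delay.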

Webhooks

HMAC signatures are verified using crypto.timingSafeEqual (not a plain === comparison, which can leak timing information).
Receiver responds 2xx within 10 seconds (don't do heavy work synchronously — queue and respond).
Receiver is reachable from our IP range (check firewall rules).
Subscriptions are scoped per concern (don't subscribe one URL to all events; separate by team).
You're handling the unified X-Quantum-* headers AND the legacy product-prefixed ones during the migration window.
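
As an illustration of the signature and idempotency items, here is a hedged sketch of a receiver's core checks. It assumes the signature is a hex-encoded HMAC-SHA256 of the raw request body; the actual algorithm, encoding, and header names are not stated on this page, so check the webhook reference before relying on it. The function names verifySignature and handleOnce are hypothetical, and the in-memory Set stands in for a durable store.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Assumption: signature = hex(HMAC-SHA256(secret, rawBody)). Verify against
// the actual webhook documentation before using this in production.
function verifySignature(rawBody: string | Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws if the buffers differ in length, so check first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// In-memory stand-in for a durable dedupe store (e.g. a database unique key).
const seenEventIds = new Set<string>();

/**
 * Enqueues the event for background processing the first time its ID is seen.
 * Returns false for a duplicate, so the caller can still respond 2xx quickly
 * without doing the work twice.
 */
function handleOnce(eventId: string, enqueue: () => void): boolean {
  if (seenEventIds.has(eventId)) return false;
  seenEventIds.add(eventId);
  enqueue();
  return true;
}
```

Verify against the raw bytes of the request body, before any JSON parsing, since re-serializing the payload can change the bytes and invalidate the signature.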

Rate limits

You've estimated your peak RPS and confirmed it fits your tier (talk to sales if not).
Client-side rate limiting matches the server-side limit (no compounding retries).
429 responses include a Retry-After header, and your client honors it.
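
Honoring Retry-After mostly comes down to parsing it correctly. Per the HTTP specification, the header may carry either a number of seconds or an HTTP date; which form this API sends is not stated here, so the sketch below handles both. The function name retryAfterMs is hypothetical.

```typescript
/**
 * Converts a Retry-After header value into a wait time in milliseconds.
 * Returns null when the header is missing or unparseable, so the caller
 * can fall back to its own backoff delay.
 */
function retryAfterMs(headerValue: string | null | undefined, now: number = Date.now()): number | null {
  if (headerValue == null) return null;
  const trimmed = headerValue.trim();
  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const retryAt = Date.parse(trimmed);
  if (Number.isNaN(retryAt)) return null;
  return Math.max(0, retryAt - now); // a date in the past means "retry now"
}
```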

Data & PII

You're not logging API response bodies or webhook payloads (they contain PII).
Sensitive fields (NIK, full email, phone) are masked in your own logs and observability tooling.
Data retention policy is documented and aligned with UU PDP (Indonesia's Personal Data Protection Law).
PDP "right to erasure" flow tested for at least one customer.
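
One way to approach the masking item is to pass records through a masking step before they reach your logger. The sketch below is illustrative only: the field names nik, email, and phone are assumed keys in your own log records, not names from any API schema, and the masking rules (keep the last 4 characters, keep the first letter and domain of an email) are examples rather than a compliance recommendation. It only handles top-level string fields.

```typescript
/** Replaces all but the last `visible` characters with "*". */
function maskTail(value: string, visible: number = 4): string {
  if (value.length <= visible) return "*".repeat(value.length);
  return "*".repeat(value.length - visible) + value.slice(-visible);
}

/** Keeps the first character and the domain, e.g. "b***@example.com". */
function maskEmail(email: string): string {
  const at = email.lastIndexOf("@");
  if (at <= 0) return "*".repeat(email.length); // not a recognizable address: mask it all
  return email[0] + "***" + email.slice(at);
}

// Hypothetical field names; adjust to match the keys your logs actually use.
const MASKERS: Record<string, (value: string) => string> = {
  nik: (value) => maskTail(value),
  phone: (value) => maskTail(value),
  email: maskEmail,
};

/** Returns a copy of the record with known sensitive string fields masked. */
function maskForLog(record: Record<string, unknown>): Record<string, unknown> {
  const masked: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    const mask = MASKERS[key];
    masked[key] = mask !== undefined && typeof value === "string" ? mask(value) : value;
  }
  return masked;
}
```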

Audit & compliance

You have a documented incident response runbook for "Quantum Elixir API failure during sensitive action."
You're exporting (or know how to export) audit logs per product for examiner review.
Decision-policy thresholds are documented with their rationale (AML, Anti-Fraud).
Four-eyes workflow tested for SAR submissions and rule changes.

Identity-specific

KYC tier downgrade behavior is tested (customers whose KYC is expiring are notified before the downgrade).
Face-match grey-zone (verdict: grey_zone) routes to manual review, not silent retry.
Dukcapil failure mode has a documented fallback (KTP-only path).
Step-up auth thresholds per riskLevel are configured and reviewed.
No fields surfaced via Identity Platform mention non-KTP doc types.

AML-specific

PPATK reporting entity ID is configured in your org settings.
Decision-policy thresholds txEscalateScore and txSarScore are calibrated against historical alerts.
The four-eyes split (drafter ≠ submitter) is tested in the SAR submission workflow.
Watchlist refresh frequency is reviewed (hourly OFAC refresh is on by default; confirm this suits your needs).
Adverse-media corpus enabled if your org policy requires it.

Anti-Fraud-specific

Mobile SDK integrated and tested on at least one real (non-emulator) iOS and one Android device.
AppAttest / Play Integrity verification active and working.
Hard-block rules (bypassMl: true) reviewed by your compliance officer.
ML threshold per lane configured (default 0.4 might be too aggressive for your traffic).
List sync connectors (allow/block) tested with a non-prod backup before going live.

Document Intelligence-specific

requiresReviewThreshold calibrated against your acceptable false-positive rate.
Templates for vendor-specific document variants tested with sample data.
Erase flow tested for PDP "right to erasure" requests.

Bank Statement-specific

Native parsers tested with at least 100 real (sanitized) statements per bank you'll process.
AI fallback timeout (default 30s) is acceptable for your UX.
Multi-account auto-split behavior tested for the banks you use.
Authenticity score threshold for "auto-trust" reviewed.

Orchestration-specific

All sibling service connections active and health-checked.
Human approval timeouts (timeoutHours) reviewed per workflow.
Approval onTimeout is route_to: (recoverable) not fail (terminal) for critical flows.
Workflow specs versioned with clear naming (-v1, -v2).

AI Automation-specific

LLM step outputSchema constraint tested with adversarial inputs.
Cron schedules reviewed for timezone correctness.
Inference budget alarms configured.
Webhook trigger HMAC secrets rotated from initial provisioning value.

Monitoring

You're emitting metrics on:
- API call success rate per endpoint
- p50 / p95 latency per endpoint
- 4xx vs 5xx breakdown
- Webhook delivery success rate
- Retry counts
Alerting wired for:
- 5xx rate exceeds 1%
- p95 latency exceeds 2× normal
- Webhook delivery failure rate exceeds 0.5%
- Inference budget 80% consumed (AI Automation)
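
The alert conditions above can be expressed as a small evaluation function over one window of metrics. This is a sketch under assumptions: the WindowStats shape and alert names are hypothetical, "normal" p95 latency is taken to be a baseline you supply, and in practice these rules would more likely live in your monitoring system's own rule language. Rates are compared with integer cross-multiplication to avoid division by zero on empty windows.

```typescript
// Hypothetical shape for one evaluation window of your own metrics.
interface WindowStats {
  apiCalls: number;
  api5xx: number;
  p95LatencyMs: number;
  baselineP95LatencyMs: number; // your measured "normal" p95
  webhookDeliveries: number;
  webhookFailures: number;
  inferenceBudgetUsed: number;
  inferenceBudgetTotal: number;
}

/** Returns the names of the alert conditions that are currently met. */
function firingAlerts(s: WindowStats): string[] {
  const alerts: string[] = [];
  // 5xx rate exceeds 1%
  if (s.api5xx * 100 > s.apiCalls) alerts.push("5xx_rate");
  // p95 latency above 2x the baseline
  if (s.baselineP95LatencyMs > 0 && s.p95LatencyMs > 2 * s.baselineP95LatencyMs) {
    alerts.push("p95_latency");
  }
  // webhook delivery failure rate exceeds 0.5%
  if (s.webhookFailures * 1000 > s.webhookDeliveries * 5) alerts.push("webhook_failure_rate");
  // inference budget at least 80% consumed (AI Automation)
  if (s.inferenceBudgetTotal > 0 && s.inferenceBudgetUsed * 100 >= s.inferenceBudgetTotal * 80) {
    alerts.push("inference_budget");
  }
  return alerts;
}
```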

Disaster recovery

You can fail over to manual processing if our API is down (degraded mode designed).
You're not relying on synchronous Quantum Elixir calls in any path that must continue working during our outage (with the explicit exception of evaluate/screening, which are the contract).
PPATK SAR filing path doesn't depend on us being up (we generate the XML; transmission is yours).

Sign-off

Compliance officer signed off on AML threshold + rule config.
Information security officer signed off on key management + storage.
Product owner signed off on customer-facing failure UX for each product.
CTO / Head of Engineering signed off on disaster-recovery posture.

We'll help you walk this list

Email integrations@quantumelixir.tech and we'll schedule a pre-production review call with our integration engineering team. We recommend doing this about 2 weeks before your planned launch date.