Production checklist
Before you turn on live traffic, run through this list. Most items apply to every product; product-specific items are tagged.
Auth & secrets
- Production API keys are issued and stored in your secrets manager (not in code, not in env files committed to git).
- API keys are scoped to least privilege (e.g. payments-service has `evaluate:write` only, not the `write` wildcard).
- Key rotation schedule is documented (quarterly recommended).
- Webhook secrets are stored separately from API keys (different blast radius).
- No sandbox keys appear in production code paths.
Idempotency
- Every mutating call passes `externalId` (or product-specific equivalent).
- Retries on 5xx + network errors use exponential backoff with jitter, capped at 5 retries.
- Your retry logic never retries 4xx errors (other than 429).
- Webhook handlers are idempotent — same event ID processed twice is a no-op.
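
The retry rules above can be sketched as follows. This is a minimal illustration, not the official client behaviour: the helper names, the 200 ms base delay, and the 10 s delay cap are assumptions, and `send` stands in for whatever HTTP call you make. Only the retryable classes (5xx, 429, network errors) and the cap of 5 retries come from this checklist.

```typescript
const MAX_RETRIES = 5;        // from the checklist
const BASE_DELAY_MS = 200;    // assumption: tune for your traffic
const MAX_DELAY_MS = 10_000;  // assumption: upper bound on a single wait

/** `status === null` represents a network error (no HTTP response at all). */
function isRetryable(status: number | null): boolean {
  if (status === null) return true;       // network error
  if (status === 429) return true;        // the only retryable 4xx
  return status >= 500 && status <= 599;  // server errors
}

/** "Full jitter": a uniform random delay in [0, min(cap, base * 2^attempt)). */
function backoffDelayMs(attempt: number, random: () => number = Math.random): number {
  const ceiling = Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

/**
 * Calls `send` once, then retries up to MAX_RETRIES times.
 * `send` must reuse the same externalId on every attempt so that a retry
 * of a request that actually succeeded is deduplicated server-side.
 */
async function callWithRetry<T>(
  send: () => Promise<{ status: number; body: T }>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<{ status: number; body: T }> {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await send();
      // Return on success, on a non-retryable error, or when retries run out.
      if (!isRetryable(res.status) || attempt >= MAX_RETRIES) return res;
    } catch (err) {
      // A thrown error is treated as a network failure.
      if (attempt >= MAX_RETRIES) throw err;
    }
    await sleep(backoffDelayMs(attempt));
  }
}
```

Injecting `random` and `sleep` keeps the policy testable without real waits.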
Webhooks
- HMAC signatures are verified using `crypto.timingSafeEqual` (not naive `===`).
- Receiver responds 2xx within 10 seconds (don't do heavy work synchronously; queue and respond).
- Receiver is reachable from our IP range (check firewall rules).
- Subscriptions are scoped per concern (don't subscribe one URL to all events; separate by team).
- You're handling the unified `X-Quantum-*` headers AND the legacy product-prefixed ones during the migration window.
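
A receiver that follows the points above might look roughly like this. Treat it as a sketch under stated assumptions: the signature is assumed to be a hex-encoded HMAC-SHA256 of the raw request body, and the event is assumed to carry its ID in an `id` field. Neither is confirmed by this checklist, so check the webhook reference for the actual scheme and header names. The in-memory `Set` and array stand in for a durable dedupe store and a real job queue.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

/** Assumption: hex-encoded HMAC-SHA256 over the raw (unparsed) request body. */
function verifySignature(rawBody: Buffer, signatureHex: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws if the lengths differ, so check that first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}

// Stand-ins for illustration only: use a durable store and a real queue in production.
const seenEventIds = new Set<string>();
const workQueue: string[] = [];

/** Returns the HTTP status code the receiver should respond with. */
function handleWebhook(rawBody: Buffer, signatureHex: string, secret: string): number {
  if (!verifySignature(rawBody, signatureHex, secret)) return 401;

  // Assumption: the event ID lives in a top-level `id` field.
  const event = JSON.parse(rawBody.toString("utf8")) as { id: string };

  // Idempotency: an event ID seen before is acknowledged and otherwise ignored.
  if (seenEventIds.has(event.id)) return 200;
  seenEventIds.add(event.id);

  // Defer heavy work so the 2xx goes out well inside the response deadline.
  workQueue.push(event.id);
  return 200;
}
```

Verify against the raw bytes you received, not a re-serialized JSON object, since re-serialization can change the byte sequence and break the signature.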
Rate limits
- You've estimated your peak RPS and confirmed it fits your tier (talk to sales if not).
- Client-side rate limiting matches the server-side limit (no compounding retries).
- 429 responses surface a `Retry-After` header and you honor it.
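
Honoring `Retry-After` means waiting at least as long as the header says before the next attempt. The HTTP specification allows the value to be either a number of seconds or an HTTP date. Which form this API sends is not stated here, so the sketch below handles both. The function name is hypothetical.

```typescript
/**
 * Converts a Retry-After header value into a wait time in milliseconds.
 * Returns null when the header is missing or unparseable, in which case
 * the caller should fall back to its normal backoff delay.
 */
function retryAfterMs(headerValue: string | null, now: Date = new Date()): number | null {
  if (headerValue === null) return null;
  const trimmed = headerValue.trim();

  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;

  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;

  // A date in the past means "retry now", never a negative wait.
  return Math.max(0, dateMs - now.getTime());
}
```

When both a backoff delay and a `Retry-After` value are available, waiting for the larger of the two keeps the client within the server's stated limit.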
Data & PII
- You're not logging API response bodies or webhook payloads (they contain PII).
- Sensitive fields (NIK, full email, phone) are masked in your own logs + observability.
- Data retention policy documented and aligned with UU PDP (Indonesian personal data protection).
- PDP "right to erasure" flow tested for at least one customer.
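
One way to keep NIK, email, and phone values out of your logs is to pass every log record through a redaction step before it reaches the logger. The sketch below is illustrative only: the field names (`nik`, `email`, `phone`) and the masking rules (keep the last four characters, keep the first letter of the email local part) are assumptions, so adapt them to your own schema and policy.

```typescript
/** Masks everything except the last four characters, e.g. for NIK or phone. */
function maskTail(value: string): string {
  if (value.length <= 4) return "*".repeat(value.length);
  return "*".repeat(value.length - 4) + value.slice(-4);
}

/** Keeps the first character of the local part and the domain. */
function maskEmail(email: string): string {
  const at = email.indexOf("@");
  if (at <= 0) return "***"; // not a recognizable address: hide it entirely
  return email.slice(0, 1) + "***" + email.slice(at);
}

// Assumption: these are the keys your log records use for sensitive fields.
const MASKERS = new Map<string, (value: string) => string>([
  ["nik", maskTail],
  ["phone", maskTail],
  ["email", maskEmail],
]);

/** Returns a shallow copy of the record with sensitive string fields masked. */
function redact(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(record)) {
    const mask = MASKERS.get(key);
    out[key] = mask !== undefined && typeof value === "string" ? mask(value) : value;
  }
  return out;
}
```

This version only inspects top-level keys. Nested payloads would need a recursive walk, which is one more reason not to log response bodies at all.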
Audit & compliance
- You have a documented incident response runbook for "Quantum Elixir API failure during sensitive action."
- You're exporting (or know how to export) audit logs per product for examiner review.
- Decision-policy thresholds are documented with their rationale (AML, Anti-Fraud).
- Four-eyes workflow tested for SAR submissions and rule changes.
Identity-specific
- KYC tier downgrade behaviour is tested (expiring customers get notification before downgrade).
- Face-match grey-zone (`verdict: grey_zone`) routes to manual review, not silent retry.
- Dukcapil failure mode has a documented fallback (KTP-only path).
- Step-up auth thresholds per `riskLevel` are configured and reviewed.
- No fields surfaced via Identity Platform mention non-KTP doc types.
AML-specific
- PPATK reporting entity ID is configured in your org settings.
- Decision policy `txEscalateScore` and `txSarScore` are calibrated against historical alerts.
- SAR submission workflow has tested the four-eyes split (drafter ≠ submitter).
- Watchlist refresh frequency reviewed (hourly OFAC refresh is on by default; confirm that suits your needs).
- Adverse-media corpus enabled if your org policy requires it.
Anti-Fraud-specific
- Mobile SDK integrated and tested on at least one real (non-emulator) iOS and one Android device.
- AppAttest / Play Integrity verification active and working.
- Hard-block rules (`bypassMl: true`) reviewed by your compliance officer.
- ML threshold per lane configured (default 0.4 might be too aggressive for your traffic).
- List sync connectors (allow/block) tested with a non-prod backup before going live.
Document Intelligence-specific
- `requiresReviewThreshold` calibrated against your acceptable false-positive rate.
- Templates for vendor-specific document variants tested with sample data.
- Erase flow tested for PDP "right to erasure" requests.
Bank Statement-specific
- Native parsers tested with at least 100 real (sanitized) statements per bank you'll process.
- AI fallback timeout (default 30s) is acceptable for your UX.
- Multi-account auto-split behavior tested for the banks you use.
- Authenticity score threshold for "auto-trust" reviewed.
Orchestration-specific
- All sibling service connections active and health-checked.
- Human approval timeouts (`timeoutHours`) reviewed per workflow.
- Approval `onTimeout` is `route_to:` (recoverable) not `fail` (terminal) for critical flows.
- Workflow specs versioned with clear naming (`-v1`, `-v2`).
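
The approval and naming checks above lend themselves to a small lint step in CI. The sketch below assumes a simplified spec shape: the `WorkflowSpec` and `ApprovalStep` types and the `critical` flag are hypothetical and are not the Orchestration schema. Only `timeoutHours`, `onTimeout`, the `route_to:` prefix, `fail`, and the `-vN` suffix come from this checklist.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface ApprovalStep {
  timeoutHours: number;
  onTimeout: string; // e.g. "route_to:<target>" or "fail"
}

interface WorkflowSpec {
  name: string;
  critical: boolean; // assumption: your own marker for critical flows
  approvals: ApprovalStep[];
}

/** Returns a list of human-readable problems; an empty list means the spec passes. */
function lintWorkflow(spec: WorkflowSpec): string[] {
  const problems: string[] = [];

  if (!/-v\d+$/.test(spec.name)) {
    problems.push(`${spec.name}: name lacks a -vN version suffix`);
  }

  spec.approvals.forEach((step, index) => {
    if (!(step.timeoutHours > 0)) {
      problems.push(`${spec.name}: approval ${index} has no positive timeoutHours`);
    }
    // Critical flows should time out to a recoverable route, not a terminal fail.
    if (spec.critical && !step.onTimeout.startsWith("route_to:")) {
      problems.push(`${spec.name}: approval ${index} onTimeout is not route_to: on a critical flow`);
    }
  });

  return problems;
}
```

A lint like this catches mechanical mistakes. Whether a given timeout is appropriate still needs a human review per workflow.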
AI Automation-specific
- LLM step `outputSchema` constraint tested with adversarial inputs.
- Cron schedules reviewed for timezone correctness.
- Inference budget alarms configured.
- Webhook trigger HMAC secrets rotated from initial provisioning value.
Monitoring
- You're emitting metrics on:
  - API call success rate per endpoint
  - p50 / p95 latency per endpoint
  - 4xx vs 5xx breakdown
  - Webhook delivery success rate
  - Retry counts
- Alerting wired for:
  - 5xx rate exceeds 1%
  - p95 latency 2× normal
  - Webhook delivery failure rate exceeds 0.5%
  - Inference budget 80% consumed (AI Automation)
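
The alert conditions above can be expressed as a single evaluation over a metrics snapshot. In practice you would configure these rules in your monitoring system, so the sketch below is only meant to make the thresholds concrete. The `MetricsSnapshot` shape and alert names are hypothetical, and it reads "2× normal" as "at or above twice your own baseline p95".

```typescript
// Hypothetical snapshot shape: counts over one evaluation window.
interface MetricsSnapshot {
  requests: number;
  serverErrors: number;          // 5xx responses
  p95LatencyMs: number;
  baselineP95LatencyMs: number;  // your own measured "normal" p95
  webhookDeliveries: number;
  webhookFailures: number;
  inferenceBudgetUsed: number;   // same unit as inferenceBudgetTotal
  inferenceBudgetTotal: number;
}

/** Safe ratio: an empty window counts as 0 rather than dividing by zero. */
function ratio(part: number, whole: number): number {
  return whole > 0 ? part / whole : 0;
}

/** Returns the names of all alerts that should fire for this snapshot. */
function firingAlerts(m: MetricsSnapshot): string[] {
  const alerts: string[] = [];
  if (ratio(m.serverErrors, m.requests) > 0.01) alerts.push("5xx_rate");
  if (m.p95LatencyMs >= 2 * m.baselineP95LatencyMs) alerts.push("p95_latency");
  if (ratio(m.webhookFailures, m.webhookDeliveries) > 0.005) alerts.push("webhook_failures");
  if (ratio(m.inferenceBudgetUsed, m.inferenceBudgetTotal) >= 0.8) alerts.push("inference_budget");
  return alerts;
}
```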
Disaster recovery
- You can fail over to manual processing if our API is down (degraded mode designed).
- You're not relying on synchronous Quantum Elixir calls in any path that must continue working during our outage (with the explicit exception of evaluate/screening, which are the contract).
- PPATK SAR filing path doesn't depend on us being up (we generate the XML; transmission is yours).
Sign-off
- Compliance officer signed off on AML threshold + rule config.
- Information security officer signed off on key management + storage.
- Product owner signed off on customer-facing failure UX for each product.
- CTO / Head of Engineering signed off on disaster-recovery posture.
We'll help you walk this list
Email integrations@quantumelixir.tech and we'll schedule a pre-production review call with our integration engineering team. We recommend holding it about 2 weeks before your planned launch date.