📘 Public beta · Endpoints are stable; OpenAPI specs and SDKs ship monthly. See changelog →

Evals

An eval run executes a generation template (or a built-in generation kind) against a list of input cases and records, for each case, a pass/fail result, an output preview, and the inference latency. Use this to:

  • Validate a new prompt against a labeled test set before shipping.
  • Regression-check after model upgrades — re-run yesterday's eval against the new model and compare pass rate.
  • Compare two prompts head-to-head by running each as a separate eval and diffing the metrics.

This is an offline batch tool, not a guardrail on production traffic. For production guardrails, use outputSchema validation on llm_step nodes — see Flows →.

Eval-run object

{
  "id": "evl_01HXY...",
  "name": "Borrower-summary prompt v3",
  "templateId": "gtp_01HXY...",
  "inputSet": [
    { "borrower_name": "Budi", "monthly_income": 12000000, "loan_amount": 50000000 },
    { "borrower_name": "Siti", "monthly_income": 8500000,  "loan_amount": 30000000 }
  ],
  "status": "completed",
  "totalCases": 2,
  "passedCases": 2,
  "metrics": {
    "averageInferenceMs": 1340,
    "results": [
      { "passed": true, "outputPreview": "Budi qualifies for the requested amount with a DTI of ...", "inferenceMs": 1280 },
      { "passed": true, "outputPreview": "Siti shows stable income but DTI exceeds policy. Recommend ...", "inferenceMs": 1400 }
    ]
  },
  "startedAt": "2026-05-25T08:14:22Z",
  "completedAt": "2026-05-25T08:14:31Z"
}
Field                      Notes
templateId                 Optional. When present, each case is rendered through that template's prompt + model tier.
inputSet                   Array of objects. Each is one test case. Required keys are template-dependent.
status                     One of pending, running, completed, failed.
totalCases / passedCases   Pass rate = passedCases / totalCases.
metrics.results[]          Per-case detail. outputPreview truncates at 200 chars.
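
The summary fields are enough to compute a pass rate and to diff two runs head-to-head. The following is a minimal sketch that assumes only the eval-run shape shown above; `pass_rate` and `compare_runs` are illustrative helper names, not part of any SDK, and the sample runs are made-up values.

```python
from typing import Any


def pass_rate(run: dict[str, Any]) -> float:
    """Return passedCases / totalCases, or 0.0 for a run with no cases."""
    total = run.get("totalCases", 0)
    if total == 0:
        return 0.0
    return run.get("passedCases", 0) / total


def compare_runs(baseline: dict[str, Any], candidate: dict[str, Any]) -> dict[str, float]:
    """Diff two eval runs on pass rate and average inference latency."""
    return {
        "passRateDelta": pass_rate(candidate) - pass_rate(baseline),
        "latencyDeltaMs": (
            candidate["metrics"]["averageInferenceMs"]
            - baseline["metrics"]["averageInferenceMs"]
        ),
    }


# Made-up sample runs, trimmed to the fields the helpers read.
run_v2 = {"totalCases": 2, "passedCases": 1, "metrics": {"averageInferenceMs": 1500}}
run_v3 = {"totalCases": 2, "passedCases": 2, "metrics": {"averageInferenceMs": 1340}}
print(compare_runs(run_v2, run_v3))
```

A positive passRateDelta means the candidate passed more cases; a negative latencyDeltaMs means it was faster on average.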

Pass criterion

The current pass criterion is intentionally loose: a case passes if the LLM returned any non-empty output and didn't throw. The intent is to detect regressions (output blank, model rejected the prompt, schema validation failed) rather than to judge subjective quality.

Per-template assertion rules (regex / schema / golden-output match) are on the roadmap — until then, layer your own assertions by post-processing the metrics.results[] array in your CI.
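
One way to layer your own assertions in CI is to run a list of rule functions over metrics.results[]. This is a sketch, not a built-in feature: `Rule`, `apply_rules`, and `mentions_dti` are hypothetical names, and the sample previews are invented. Because only outputPreview is available, a rule can only match text within the first 200 characters of the output.

```python
import re
from typing import Any, Callable

# Hypothetical rule type: takes one entry of metrics.results[], returns True on success.
Rule = Callable[[dict[str, Any]], bool]


def apply_rules(run: dict[str, Any], rules: list[Rule]) -> list[int]:
    """Return indexes of cases that fail the built-in check or any custom rule."""
    failed = []
    for index, result in enumerate(run["metrics"]["results"]):
        if not result["passed"] or not all(rule(result) for rule in rules):
            failed.append(index)
    return failed


def mentions_dti(result: dict[str, Any]) -> bool:
    """Example rule: the preview must mention DTI as a whole word."""
    return re.search(r"\bDTI\b", result["outputPreview"]) is not None


# Invented sample data in the documented results[] shape.
run = {
    "metrics": {
        "results": [
            {"passed": True, "outputPreview": "Budi qualifies, DTI is within policy.", "inferenceMs": 1280},
            {"passed": True, "outputPreview": "Siti shows stable income.", "inferenceMs": 1400},
        ]
    }
}
print(apply_rules(run, [mentions_dti]))  # indexes of failing cases
```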

Inputs without a template

If you omit templateId, each input is expected to include a _kind key pointing to a built-in generation kind (e.g. borrower_summary, complaint_classify). The runner uses the standard tier for that kind.
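
A request body for a template-less run can be built by tagging every case with _kind. The sketch below assumes the built-in borrower_summary kind accepts the same keys as the template example above, which this page does not confirm; `build_kind_payload` is an illustrative helper name.

```python
import json
from typing import Any


def build_kind_payload(name: str, kind: str, cases: list[dict[str, Any]]) -> dict[str, Any]:
    """Build a POST /api/evals body with no templateId; every case carries _kind."""
    return {
        "name": name,
        "inputSet": [{**case, "_kind": kind} for case in cases],
    }


payload = build_kind_payload(
    "Borrower summary, built-in kind",
    "borrower_summary",
    # Input keys are assumed to match the template example; check them for your kind.
    [{"borrower_name": "Budi", "monthly_income": 12000000, "loan_amount": 50000000}],
)
print(json.dumps(payload, indent=2))
```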

Endpoints

POST /api/evals
Auth · API key   Scope · evals:write
curl -X POST .../api/evals \
  -H "Authorization: Bearer $QE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Borrower-summary prompt v3",
    "templateId": "gtp_01HXY...",
    "inputSet": [
      { "borrower_name": "Budi", "monthly_income": 12000000, "loan_amount": 50000000 },
      { "borrower_name": "Siti", "monthly_income": 8500000,  "loan_amount": 30000000 }
    ]
  }'

The response is 202 Accepted because eval runs are queued. Poll GET /api/evals/{id} until status is completed or failed, or subscribe to the eval.completed webhook.
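
A polling loop should stop on either terminal status, since a run that ends in failed never reaches completed. The sketch below takes the HTTP call as a parameter (`fetch_run`, a hypothetical callable wrapping GET /api/evals/{id}) so it makes no assumption about the base URL or client library; the interval and timeout defaults are arbitrary.

```python
import time
from typing import Any, Callable

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_eval(
    fetch_run: Callable[[str], dict[str, Any]],
    eval_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until the run reaches a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run(eval_id)
        if run["status"] in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"eval {eval_id} still {run['status']} after {timeout_s}s")
        sleep(interval_s)


# Stand-in for the real HTTP call: returns canned statuses in order.
statuses = iter(["pending", "running", "completed"])
final = wait_for_eval(
    lambda eval_id: {"id": eval_id, "status": next(statuses)},
    "evl_example",
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(final["status"])
```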

GET /api/evals
Auth · API key   Scope · evals:read

Returns the 50 most recent eval runs for the org.

GET /api/evals/{id}
Auth · API key   Scope · evals:read

Returns the full eval-run row, including metrics.results[].

Webhook

Event            Fires when
eval.completed   Eval run transitions to completed or failed. Payload: evalId, passRate, summary.
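
A receiver can validate the three documented payload fields before acting on them. This sketch assumes the payload is a flat JSON object, which this page does not specify; it also does not specify the scale of passRate or the type of summary, so the sample body is invented. Because the event fires for both completed and failed, fetch the row with GET /api/evals/{id} if you need the final status. `parse_eval_completed` is an illustrative name.

```python
import json
from typing import Any

DOCUMENTED_FIELDS = ("evalId", "passRate", "summary")


def parse_eval_completed(body: bytes) -> dict[str, Any]:
    """Extract the documented fields from an eval.completed payload."""
    payload = json.loads(body)
    missing = [key for key in DOCUMENTED_FIELDS if key not in payload]
    if missing:
        raise ValueError(f"eval.completed payload missing: {', '.join(missing)}")
    return {key: payload[key] for key in DOCUMENTED_FIELDS}


# Invented sample body; real field values and types may differ.
sample_body = b'{"evalId": "evl_example", "passRate": 1, "summary": "sample text"}'
event = parse_eval_completed(sample_body)
print(event["evalId"])
```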

Suggested CI pattern

Wire a CI job that, on prompt-template change:

  1. POST /api/evals with the new template + your labeled inputSet.
  2. Wait for eval.completed webhook (or poll the row).
  3. Compare passedCases / totalCases to the previous eval for the same template name.
  4. Fail the build if the pass rate dropped by more than an org-policy threshold (e.g. a drop of more than 5%).

This is how you make prompt changes safe to deploy.
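
The comparison in steps 3 and 4 can be sketched as a small gate function. It assumes the threshold is an absolute drop in pass rate (0.05, meaning 5 percentage points) and that a run ending in failed should always block the build; both are policy choices, and `gate` and `exit_code` are illustrative names.

```python
from typing import Any

MAX_DROP = 0.05  # Assumed policy: block on a drop of more than 5 percentage points.


def pass_rate(run: dict[str, Any]) -> float:
    """Return passedCases / totalCases, or 0.0 for a run with no cases."""
    total = run["totalCases"]
    return run["passedCases"] / total if total else 0.0


def gate(previous: dict[str, Any], current: dict[str, Any], max_drop: float = MAX_DROP) -> bool:
    """Return True when the build may proceed."""
    if current["status"] != "completed":
        return False  # a run that ended in failed never passes the gate
    return pass_rate(previous) - pass_rate(current) <= max_drop


def exit_code(previous: dict[str, Any], current: dict[str, Any]) -> int:
    """Map the gate result to a process exit code; a CI script would pass this to sys.exit."""
    return 0 if gate(previous, current) else 1


# Made-up runs: the pass rate falls from 19/20 to 17/20.
previous = {"status": "completed", "totalCases": 20, "passedCases": 19}
current = {"status": "completed", "totalCases": 20, "passedCases": 17}
print(exit_code(previous, current))
```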

Evals consume inference budget

Each eval case is a full LLM call, so a 500-case eval is 500 generations, all of which count against your org's inference quota. Schedule expensive evals overnight or behind a feature flag.