Evals
An eval run executes a generation template (or a built-in generation kind) against a list of input cases and records, for each case, a pass/fail result, an output preview, and the inference latency. Use this to:
- Validate a new prompt against a labeled test set before shipping.
- Regression-check after model upgrades — re-run yesterday's eval against the new model and compare pass rate.
- Compare two prompts head-to-head by running each as a separate eval and diffing the metrics.
This is an offline batch tool, not a guardrail on production traffic. For production guardrails, use outputSchema validation on llm_step nodes; see Flows.
Eval-run object
{
  "id": "evl_01HXY...",
  "name": "Borrower-summary prompt v3",
  "templateId": "gtp_01HXY...",
  "inputSet": [
    { "borrower_name": "Budi", "monthly_income": 12000000, "loan_amount": 50000000 },
    { "borrower_name": "Siti", "monthly_income": 8500000, "loan_amount": 30000000 }
  ],
  "status": "completed",
  "totalCases": 2,
  "passedCases": 2,
  "metrics": {
    "averageInferenceMs": 1340,
    "results": [
      { "passed": true, "outputPreview": "Budi qualifies for the requested amount with a DTI of ...", "inferenceMs": 1280 },
      { "passed": true, "outputPreview": "Siti shows stable income but DTI exceeds policy. Recommend ...", "inferenceMs": 1400 }
    ]
  },
  "startedAt": "2026-05-25T08:14:22Z",
  "completedAt": "2026-05-25T08:14:31Z"
}

| Field | Notes |
|---|---|
| templateId | Optional. When present, each case is rendered through that template's prompt and model tier. |
| inputSet | Array of objects. Each is one test case. Required keys are template-dependent. |
| status | pending → running → completed or failed. |
| totalCases / passedCases | Pass rate = passedCases / totalCases. |
| metrics.results[] | Per-case detail. outputPreview truncates at 200 chars. |
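
As a rough illustration of reading these fields, the sketch below computes the pass rate and finds the slowest case from an eval-run row that has already been parsed into a Python dict. The helper names are hypothetical, and treating a run with zero cases as a pass rate of 0.0 is an assumption, not documented behavior.

```python
def pass_rate(eval_run: dict) -> float:
    """Return passedCases / totalCases (0.0 for an empty run, by assumption)."""
    total = eval_run["totalCases"]
    if total == 0:
        return 0.0
    return eval_run["passedCases"] / total


def slowest_case(eval_run: dict) -> tuple[int, int]:
    """Return (index, inferenceMs) of the slowest entry in metrics.results[]."""
    results = eval_run["metrics"]["results"]
    index = max(range(len(results)), key=lambda i: results[i]["inferenceMs"])
    return index, results[index]["inferenceMs"]


# Trimmed copy of the example row above.
run = {
    "totalCases": 2,
    "passedCases": 2,
    "metrics": {
        "averageInferenceMs": 1340,
        "results": [
            {"passed": True, "outputPreview": "Budi qualifies ...", "inferenceMs": 1280},
            {"passed": True, "outputPreview": "Siti shows ...", "inferenceMs": 1400},
        ],
    },
}

print(pass_rate(run))     # 1.0
print(slowest_case(run))  # (1, 1400)
```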
Pass criterion
The current pass criterion is intentionally loose: a case passes if the LLM returned any non-empty output and didn't throw. The intent is to detect regressions (output blank, model rejected the prompt, schema validation failed) rather than to judge subjective quality.
Per-template assertion rules (regex / schema / golden-output match) are on the roadmap — until then, layer your own assertions by post-processing the metrics.results[] array in your CI.
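
One way to layer your own assertions is sketched below: a regex rule per case, checked against outputPreview. This is only a sketch under assumptions. It assumes metrics.results[] is returned in the same order as inputSet, and the rule list and function name are hypothetical. Because outputPreview is truncated at 200 characters, rules can only match text near the start of the output.

```python
import re

# Hypothetical rules, one per input case, in inputSet order.
# Each regex must match somewhere in that case's outputPreview.
RULES = [
    re.compile(r"\bqualifies\b", re.IGNORECASE),
    re.compile(r"\bDTI\b"),
]


def check_results(results: list[dict], rules: list[re.Pattern]) -> list[int]:
    """Return indexes of cases that fail the built-in check or their own rule."""
    if len(results) != len(rules):
        raise ValueError("expected exactly one rule per result")
    failures = []
    for index, (result, rule) in enumerate(zip(results, rules)):
        if not result["passed"] or not rule.search(result["outputPreview"]):
            failures.append(index)
    return failures


results = [
    {"passed": True, "outputPreview": "Budi qualifies for the requested amount with a DTI of ..."},
    {"passed": True, "outputPreview": "Siti shows stable income but DTI exceeds policy. Recommend ..."},
]

print(check_results(results, RULES))  # []
```

A CI job could fail the build whenever the returned list is non-empty.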
Inputs without a template
If you omit templateId, each input is expected to include a _kind key pointing to a built-in generation kind (e.g. borrower_summary, complaint_classify). The runner uses the standard tier for that kind.
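
A request body for a template-less run might be assembled as in the sketch below. The helper name is hypothetical, and the input keys shown are borrowed from the template example above; the keys a given built-in kind actually requires are not specified here, so treat them as placeholders.

```python
import json


def build_kind_payload(name: str, kind: str, cases: list[dict]) -> dict:
    """Build an eval request body with no templateId, tagging each case with _kind."""
    return {
        "name": name,
        "inputSet": [{**case, "_kind": kind} for case in cases],
    }


payload = build_kind_payload(
    "Borrower summary, built-in kind",
    "borrower_summary",
    [{"borrower_name": "Budi", "monthly_income": 12000000, "loan_amount": 50000000}],
)
print(json.dumps(payload, indent=2))
```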
Endpoints
POST /api/evals
curl -X POST .../api/evals \
  -H "Authorization: Bearer $QE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Borrower-summary prompt v3",
    "templateId": "gtp_01HXY...",
    "inputSet": [
      { "borrower_name": "Budi", "monthly_income": 12000000, "loan_amount": 50000000 },
      { "borrower_name": "Siti", "monthly_income": 8500000, "loan_amount": 30000000 }
    ]
  }'
Response is 202 Accepted: eval runs are queued. Poll the eval-run row until status is completed or failed, or subscribe to the eval.completed webhook.
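
The polling option could look roughly like the sketch below. It is not an official client. The function names, the polling interval, and the timeout are hypothetical; it also assumes the row is read with a plain GET on /api/evals/{id} using the same bearer token as the POST, and that base_url is whatever host your deployment uses.

```python
import json
import time
import urllib.request
from typing import Callable

TERMINAL_STATUSES = {"completed", "failed"}


def make_fetcher(base_url: str, eval_id: str, api_key: str) -> Callable[[], dict]:
    """Return a function that reads the eval-run row once and parses the JSON.

    Assumption: GET /api/evals/{id} with the bearer token from the POST example.
    """
    def fetch_run() -> dict:
        request = urllib.request.Request(
            f"{base_url}/api/evals/{eval_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return fetch_run


def wait_for_eval(
    fetch_run: Callable[[], dict],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until status is completed or failed; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_run()
        if run["status"] in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"eval still {run['status']} after {timeout_s}s")
        sleep(interval_s)
```

Passing the fetch function in as a parameter keeps the loop easy to test with a fake, and a failed run is returned rather than raised so the caller can decide how to report it.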
GET /api/evals
Returns the 50 most recent eval runs for the org.
GET /api/evals/{id}
Returns the full eval-run row, including metrics.results[].
Webhook
| Event | Fires when |
|---|---|
| eval.completed | Eval run transitions to completed or failed. Payload: evalId, passRate, summary. |
Suggested CI pattern
Wire a CI job that, on prompt-template change:
- POST /api/evals with the new template and your labeled inputSet.
- Wait for the eval.completed webhook (or poll the row).
- Compare passedCases / totalCases to the previous eval for the same template name.
- Fail the build if the pass rate dropped by more than an org-policy threshold (e.g. 5%).
This is how you make prompt changes safe to deploy.
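
The comparison and gating steps above can be sketched as follows. Several things here are assumptions rather than documented behavior: runs are matched on the eval run's name field, the previous run is chosen by startedAt, the threshold is read as an absolute drop in pass rate (0.05 means five percentage points), and a first run with nothing to compare against is allowed through. All function names are hypothetical.

```python
from typing import Optional


def pass_rate(run: dict) -> float:
    """passedCases / totalCases, treating an empty run as 0.0."""
    total = run["totalCases"]
    return run["passedCases"] / total if total else 0.0


def latest_previous(runs: list[dict], name: str, exclude_id: str) -> Optional[dict]:
    """Pick the most recent completed run with the same name, excluding the current one."""
    candidates = [
        r for r in runs
        if r["name"] == name and r["id"] != exclude_id and r["status"] == "completed"
    ]
    if not candidates:
        return None
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    return max(candidates, key=lambda r: r["startedAt"])


def gate(previous: Optional[dict], current: dict, max_drop: float = 0.05) -> bool:
    """Return True if the pass rate has not dropped by more than max_drop."""
    if previous is None:
        return True  # nothing to compare against yet
    drop = pass_rate(previous) - pass_rate(current)
    return drop <= max_drop + 1e-9  # small tolerance for float rounding


runs = [
    {"id": "evl_a", "name": "Borrower-summary prompt v3", "status": "completed",
     "totalCases": 20, "passedCases": 20, "startedAt": "2026-05-24T08:00:00Z"},
    {"id": "evl_b", "name": "Borrower-summary prompt v3", "status": "completed",
     "totalCases": 20, "passedCases": 17, "startedAt": "2026-05-25T08:14:22Z"},
]
current = runs[1]
previous = latest_previous(runs, current["name"], current["id"])
ok = gate(previous, current)
print("pass" if ok else "fail")  # fail
```

In a CI job, the final step would typically be `sys.exit(0 if ok else 1)` so that a regression fails the build.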
Evals consume inference budget
Each eval case is a full LLM call, so a 500-case eval is 500 generations, and all of them count against your org's inference quota. Schedule expensive evals overnight or put them behind a feature flag.