# Auto Generated Evaluators

> Understand how Improve turns production behavior into evaluator coverage for prompt candidates

Auto Generated Evaluators are production-derived checks created from logs, Behaviors, feedback, and representative examples. They help an Improve cycle score candidates against the issue that actually appeared in production.

Use them to cover newly discovered failures quickly. Review them before treating them as durable release gates.

<img src="https://mintcdn.com/adaline/o8h3k4eQQbaIV193/images/platform-v2/improve/review-regression-runtime.png?fit=max&auto=format&n=o8h3k4eQQbaIV193&q=85&s=26c103009dc205f9955b35784e10790a" alt="Improve regression report showing authored and auto generated evaluators scored against the baseline and candidate" title="Human-authored and AI-derived coverage" style={{ width: "100%" }} width="1318" height="1024" data-path="images/platform-v2/improve/review-regression-runtime.png" />

## Human-authored vs auto generated

| Source                       | What it is for                                                             | Treat it as                                  |
| ---------------------------- | -------------------------------------------------------------------------- | -------------------------------------------- |
| **Human-authored evaluator** | Known product policy, safety, quality, format, or regression requirements. | Stable release coverage when calibrated.     |
| **Auto Generated Evaluator** | A newly observed production pattern that needs fast scoring coverage.      | Draft or supporting evidence until reviewed. |

Generated evaluators should still have readable criteria, representative examples, and pass/fail behavior that matches human judgment.

## Where they appear

Auto Generated Evaluators appear in the **Evals** stage of an Improve cycle.

| State               | Meaning                                                                        | Reviewer action                                                   |
| ------------------- | ------------------------------------------------------------------------------ | ----------------------------------------------------------------- |
| **Covered**         | Generated evaluators are available for candidate scoring.                      | Review criteria and examples before trusting the score.           |
| **Started**         | Generation is running from selected evidence.                                  | Wait for completion or continue with authored coverage if urgent. |
| **Awaiting review** | A generated evaluator needs human approval before it becomes trusted coverage. | Publish it only if it captures the failure correctly.             |
| **Unavailable**     | There is not enough suitable evidence.                                         | Add logs, Behavior evidence, or a manual evaluator.               |
| **Failed**          | The pipeline could not produce usable coverage.                                | Improve evidence quality or write the evaluator manually.         |

## Review checklist

Before relying on an auto generated evaluator, make sure it:

- Names the user-visible failure.
- Passes healthy examples.
- Catches the bad examples it was created for.
- Does not block unrelated behavior.
- Is readable and privacy-safe.
- Is grounded in enough examples to explain its decisions.
- Stays close to how a human reviewer would judge representative cases.

If the idea is useful but the criteria are too broad, tighten it manually before making it durable coverage.
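The measurable parts of this review (healthy examples pass, bad examples are caught, unrelated behavior is not blocked, verdicts match a human reviewer) can be checked offline against a small hand-labeled set. The sketch below is illustrative only and is not an Adaline API: `LabeledExample`, `review_evaluator`, and the toy `no_guaranteed_refund` evaluator are hypothetical names, and it assumes an evaluator can be treated as a function that returns `True` for pass. Readability and privacy still need a human read.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class LabeledExample:
    """Hypothetical record: one output plus a human reviewer's verdict."""
    output: str            # model output a reviewer has judged
    human_pass: bool       # the reviewer's pass/fail verdict
    related: bool = True   # False when the example is outside the target failure


def _rate(hits: int, total: int) -> Optional[float]:
    """Return hits / total, or None when there is nothing to measure."""
    return hits / total if total else None


def review_evaluator(
    evaluate: Callable[[str], bool],
    examples: Iterable[LabeledExample],
) -> dict:
    """Compare an evaluator's verdicts with human labels on a review set."""
    rows = [(ex, evaluate(ex.output)) for ex in examples]
    healthy = [v for ex, v in rows if ex.related and ex.human_pass]
    bad = [v for ex, v in rows if ex.related and not ex.human_pass]
    unrelated = [v for ex, v in rows if not ex.related]
    return {
        # Share of healthy examples the evaluator passes.
        "healthy_pass_rate": _rate(sum(healthy), len(healthy)),
        # Share of known-bad examples the evaluator fails.
        "bad_catch_rate": _rate(sum(not v for v in bad), len(bad)),
        # Share of unrelated examples the evaluator wrongly fails.
        "unrelated_block_rate": _rate(sum(not v for v in unrelated), len(unrelated)),
        # Share of all examples where evaluator and human agree.
        "human_agreement": _rate(sum(v == ex.human_pass for ex, v in rows), len(rows)),
    }


# Toy evaluator for a made-up failure: promising a guaranteed refund.
def no_guaranteed_refund(output: str) -> bool:
    return "refund guaranteed" not in output.lower()


examples = [
    LabeledExample("We can review your refund request.", human_pass=True),
    LabeledExample("Refund guaranteed within 24 hours.", human_pass=False),
    LabeledExample("Your order shipped today.", human_pass=True, related=False),
]

report = review_evaluator(no_guaranteed_refund, examples)
print(report)
```

A low `healthy_pass_rate` or a high `unrelated_block_rate` suggests the criteria are too broad. A low `bad_catch_rate` suggests the evaluator does not capture the failure it was generated for.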

## When to write one yourself

Use a human-authored evaluator when the requirement is explicit, high-risk, shared across prompts, or ambiguous enough that automation should not define "good" by itself.

Examples: compliance policy, medical or financial safety, structured output format, brand tone, tool correctness, or a high-volume regression the team already understands.

## After the cycle

Good generated evaluators should become part of the system's memory:

1. Keep the checks that explain why the approved candidate worked.
2. Rewrite noisy criteria before making them a release gate.
3. Link important evaluators to the prompt and relevant datasets.
4. Ignore or delete generated checks that overfit a single cluster.

The goal is compounding coverage: each repeated production issue should leave behind a better evaluator, dataset row, or review note.

When a cycle is approved, or approved with edits, the Auto Generated Evaluators from that cycle are saved in your project so they can be reused in evaluations, Playground runs, and continuous evaluations.

<CardGroup cols={2}>
  <Card title="Synthetic Datasets" icon="database" href="/improve/synthetic-datasets">
    Understand generated cases and production-derived validation data.
  </Card>

  <Card title="Evaluator overview" icon="flask-conical" href="/evaluators/overview">
    Maintain the evaluator library outside an Improve cycle.
  </Card>
</CardGroup>
