# Overview

> Define criteria for scoring prompts, logs, datasets, continuous evaluation runs, and Improve candidates.

Evaluators define what good output means for a prompt. They score model responses during prompt evaluations, production monitoring, log review, and Improve cycles.

Use Evaluators when a product rule, quality bar, safety requirement, format contract, cost budget, latency target, or production failure should become a repeatable check.

<img src="https://mintcdn.com/adaline/Um_T8BffW4hfcoYD/images/evaluate/evaluate-product-1.png?fit=max&auto=format&n=Um_T8BffW4hfcoYD&q=85&s=334c06e48b35b579115236925e30b279" alt="Evaluator setup in Adaline showing evaluator configuration for prompt quality checks" title="Evaluator setup" style={{ width: "100%" }} width="1254" height="907" data-path="images/evaluate/evaluate-product-1.png" />

## Evaluation anatomy

Every useful evaluation has five parts:

| Part               | Meaning                                                                | Question it answers                         |
| ------------------ | ---------------------------------------------------------------------- | ------------------------------------------- |
| **Data**           | Dataset rows, production spans, generated cases, or Improve examples.  | Which customer situations are being tested? |
| **Task or prompt** | Prompt version, model settings, variables, tools, and response schema. | What behavior would the user experience?    |
| **Evaluator**      | Criterion used to score the response.                                  | What does the team mean by good enough?     |
| **Result**         | Score, pass/fail, reason, cost, latency, or custom output.             | Did it pass for the right reason?           |
| **Action**         | Approve, reject, edit, add coverage, Improve, deploy, or roll back.    | What changes because of this result?        |

If an evaluator result does not lead to a decision, it is still evidence, but it is not yet an operational quality gate.
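
As a rough illustration of the last two rows of the table, the sketch below turns a set of evaluator results into an action. The `EvalResult` fields and the `decide` policy are hypothetical names chosen for this example; they are not Adaline's API or data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalResult:
    """One evaluator's verdict on one response (hypothetical shape)."""
    evaluator: str
    passed: bool
    score: Optional[float] = None
    reason: str = ""


def decide(results: List[EvalResult]) -> str:
    """Map results to an action so every result leads to a decision."""
    if not results:
        # Nothing was scored, so the honest action is to add coverage.
        return "add coverage"
    failed = [r.evaluator for r in results if not r.passed]
    if failed:
        return "reject: " + ", ".join(failed)
    return "approve"
```

A real policy would likely weigh scores and reasons as well; the point is only that the result feeds a decision rather than sitting in a report.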

## Evaluator types

| Evaluator           | Use it for                                                                   |
| ------------------- | ---------------------------------------------------------------------------- |
| **LLM-as-a-Judge**  | Rubric-based quality, safety, policy, tone, and reasoning checks.            |
| **Custom Prompt**   | LLM-based evaluation with custom model configuration and prompt logic.       |
| **JavaScript**      | Deterministic checks, schema validation, custom scoring, and business rules. |
| **JSON**            | Structured JSON checks and schema-like assertions.                           |
| **API Call**        | External service checks through your own evaluator endpoint.                 |
| **Text Matcher**    | Required or forbidden strings, regexes, and formatting markers.              |
| **Cost**            | Budget thresholds based on provider cost.                                    |
| **Latency**         | SLA thresholds based on runtime.                                             |
| **Response Length** | Word, token, character, or brevity requirements.                             |

Prefer deterministic evaluators for exact rules. Use LLM-based evaluators when the criterion requires judgment, then calibrate them with known passing and failing examples.
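
To make "exact rules" concrete, here is a minimal sketch of two deterministic checks in the spirit of the Text Matcher and JSON evaluators. The function names and return shapes are assumptions made for illustration and do not reflect how Adaline implements or configures these evaluators.

```python
import json
import re


def text_matcher(output, required=(), forbidden=()):
    """Pass when every required regex matches and no forbidden regex does."""
    missing = [p for p in required if not re.search(p, output)]
    found = [p for p in forbidden if re.search(p, output)]
    return {
        "passed": not missing and not found,
        "missing": missing,
        "forbidden_found": found,
    }


def json_has_keys(output, keys):
    """Pass when the output parses as a JSON object containing every key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "not valid JSON"}
    if not isinstance(data, dict):
        return {"passed": False, "reason": "top-level value is not an object"}
    missing = [k for k in keys if k not in data]
    if missing:
        return {"passed": False, "reason": "missing keys: " + ", ".join(missing)}
    return {"passed": True, "reason": "ok"}
```

Checks like these give the same verdict on every run, which is why they suit format contracts better than a judge model does.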

## Where evaluators run

<img src="https://mintcdn.com/adaline/6qZ1-Sm8NeEttI_w/images/evaluate/analyze-results.png?fit=max&auto=format&n=6qZ1-Sm8NeEttI_w&q=85&s=d401f5737fbc9646373b8a097da69a52" alt="Evaluation report showing scored prompt outputs and detailed results" title="Evaluation results" style={{ width: "100%" }} width="1290" height="769" data-path="images/evaluate/analyze-results.png" />

| Workflow               | What evaluators do                                                    |
| ---------------------- | --------------------------------------------------------------------- |
| **Prompt evaluations** | Run against datasets before release or during prompt development.     |
| **Monitor and Logs**   | Score sampled production traffic for continuous quality signals.      |
| **Improve**            | Reject candidates that improve one behavior while regressing another. |

Draft evaluators created during an Improve cycle should be reviewed before they become trusted release gates.
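
The Improve row describes rejecting candidates that trade one behavior for another. One way to picture that rule, assuming each evaluator reports a pass rate between 0 and 1, is a per-evaluator comparison against the baseline. The `accept_candidate` function and its `tolerance` parameter are hypothetical and only sketch the idea.

```python
def accept_candidate(baseline, candidate, tolerance=0.0):
    """Compare pass rates per evaluator; any regression rejects the candidate.

    baseline and candidate map evaluator name to pass rate in [0, 1].
    An evaluator missing from the candidate is treated as a pass rate of 0.
    """
    regressions = {}
    for name, base_rate in baseline.items():
        cand_rate = candidate.get(name, 0.0)
        if cand_rate < base_rate - tolerance:
            regressions[name] = (base_rate, cand_rate)
    return (not regressions, regressions)
```

For example, a candidate that lifts a tone score but drops JSON validity from 1.0 to 0.8 is rejected, and the returned dictionary names the regressed evaluator.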

## Online and offline evaluation

| Mode                   | Use it for                                                                                            | Source                                                                             |
| ---------------------- | ----------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
| **Offline evaluation** | Pre-release checks, prompt comparison, regression testing, Improve candidate review, and CI/CD gates. | Curated datasets, golden examples, and production failures promoted into datasets. |
| **Online evaluation**  | Continuous monitoring, silent failure detection, release watch, and drift investigation.              | Production logs and spans with useful metadata.                                    |

The strongest loop is: online failure -> log evidence -> dataset row -> evaluator -> offline release gate -> deployment -> online watch.
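
The "log evidence -> dataset row" step of that loop can be sketched as follows. The log and row field names (`input`, `output`, `passed`, `source`) are assumptions for this example, not Adaline's log or dataset schema.

```python
def promote_failures(logs, dataset):
    """Append failing production logs to a dataset, skipping duplicate inputs.

    Returns the number of rows added. The dataset list is modified in place.
    """
    seen = {row["input"] for row in dataset}
    added = 0
    for log in logs:
        if log["passed"] or log["input"] in seen:
            continue
        dataset.append({
            "input": log["input"],
            "observed_output": log["output"],
            "source": "production",
        })
        seen.add(log["input"])
        added += 1
    return added
```

Once a failure is a dataset row, the same evaluator that flagged it online can gate the next release offline.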

## Create useful evaluators

<Steps>
  <Step title="Start from a failure mode">
    Use a Behavior, failing log, customer report, or product requirement to define what should pass or fail.
  </Step>

  <Step title="Choose the evaluator type">
    Use deterministic evaluators for exact rules and LLM-based evaluators, such as LLM-as-a-Judge, for qualitative criteria.
  </Step>

  <Step title="Attach it to the prompt">
    Link the evaluator to the prompt so that prompt evaluations and Improve cycles can run it.
  </Step>

  <Step title="Validate against examples">
    Run it against known passing and failing cases before relying on it for approval decisions.
  </Step>
</Steps>
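
The final step, validating against examples, can be pictured as a small calibration loop. This is a sketch under stated assumptions: `evaluator` is any callable that returns `True` for a pass, and `labeled_cases` pairs each output with the verdict a human expects. Neither name comes from Adaline.

```python
def validate_evaluator(evaluator, labeled_cases):
    """Run an evaluator over labeled cases and report where it disagrees.

    labeled_cases is a list of (output, should_pass) pairs.
    """
    false_pass, false_fail = [], []
    for output, should_pass in labeled_cases:
        verdict = bool(evaluator(output))
        if verdict and not should_pass:
            false_pass.append(output)   # evaluator is too lenient here
        elif not verdict and should_pass:
            false_fail.append(output)   # evaluator is too strict here
    total = len(labeled_cases)
    agreed = total - len(false_pass) - len(false_fail)
    return {
        "agreement": agreed / total if total else 0.0,
        "false_pass": false_pass,
        "false_fail": false_fail,
    }
```

False passes matter most for a release gate, since they are the failures the evaluator would let through.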

## Coverage checklist

For important prompts, cover the risks customers would notice:

| Risk                   | Coverage example                                                   |
| ---------------------- | ------------------------------------------------------------------ |
| **Correctness**        | LLM-as-a-Judge rubric, JavaScript business rule, or API evaluator. |
| **Safety and policy**  | LLM-as-a-Judge rubric with explicit passing and failing examples.  |
| **Structure**          | JSON, JavaScript, or Text Matcher evaluator.                       |
| **Tool behavior**      | Dataset rows requiring tool use plus output checks.                |
| **Latency**            | Latency evaluator for response-time budgets.                       |
| **Cost and verbosity** | Cost and Response Length evaluators.                               |
| **Known regressions**  | Dataset rows created from production logs or Behaviors.            |

Coverage does not need to be large to be useful. A small dataset with clear evaluators beats a large dataset with vague scoring.
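
For the latency, cost, and verbosity rows, threshold checks are usually enough. The sketch below assumes a hypothetical `run` record with `cost_usd`, `latency_ms`, and `output` fields; the real evaluators are configured in Adaline rather than written by hand like this.

```python
def check_budgets(run, max_cost_usd, max_latency_ms, max_words):
    """Return a pass/fail flag per budget for a single scored run."""
    word_count = len(run["output"].split())
    return {
        "cost": run["cost_usd"] <= max_cost_usd,
        "latency": run["latency_ms"] <= max_latency_ms,
        "response_length": word_count <= max_words,
    }
```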

<CardGroup cols={2}>
  <Card title="Evaluator types" icon="list-checks" href="/evaluators/overview#evaluator-types">
    Choose the right evaluator for quality, schema, cost, latency, and custom rules.
  </Card>

  <Card title="Online and offline evaluation" icon="repeat" href="/evaluators/overview#online-and-offline-evaluation">
    Connect curated datasets, production scoring, release gates, and Improve.
  </Card>

  <Card title="Create useful evaluators" icon="plus" href="/evaluators/overview#create-useful-evaluators">
    Turn product requirements and production failures into repeatable checks.
  </Card>

  <Card title="Evaluate prompts" icon="play" href="/evaluate/evaluate-prompts">
    Run prompt evaluations against datasets and review results.
  </Card>

  <Card title="Datasets" icon="database" href="/datasets/overview">
    Store the cases evaluators should score.
  </Card>
</CardGroup>
