flightline eval

The eval command compares actual AI behavior against expected outcomes. It is the primary tool for functional regression testing: verifying that your AI still produces correct results after a prompt or model change.

Usage

flightline eval scenarios [OPTIONS]

How it Works

Evaluation involves three main components:
  1. Scenarios: The inputs and “ground truth” expected outputs (usually created with flightline generate).
  2. Traces: The actual outputs captured from your system (recorded with fltrace).
  3. Matchers: The logic used to compare actual vs. expected results.
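As a hypothetical illustration, a scenario pairs an input with the output you expect back. The field names below are assumptions made for this example, not the official schema produced by flightline generate:

# Hypothetical scenario entry -- field names are illustrative, not the official schema
- id: sc_012
  name: urgent_leak_detection
  input: "Water is pouring through the ceiling of unit 4B."
  expected:
    priority: "Priority: High"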

Key Options

Option         Description
--scenarios    Path to the scenario file to use as a benchmark.
--traces       Path to the captured traces to evaluate.
--config, -c   Path to the evaluation spec (default: flightline.eval.yaml).
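
For example, to benchmark a fresh trace capture against a specific scenario file (the file paths below are placeholders):

$ flightline eval scenarios \
    --scenarios path/to/scenarios.yaml \
    --traces path/to/traces \
    --config flightline.eval.yaml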

The Evaluation Spec

You can define how evaluations behave in your flightline.yaml or in a dedicated evaluation spec (flightline.eval.yaml by default). This includes:
  • Deterministic Matchers: Exact string matches, regex patterns, or numeric range checks.
  • LLM-Judge Matchers: Semantic similarity, tone checks, or rubric-based scoring for qualitative outputs.
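
As a rough sketch only, a spec might combine deterministic matchers with an LLM judge for qualitative fields. Every key below is an assumption made for illustration; consult your flightline version for the actual schema:

# Illustrative sketch -- keys are assumptions, not the official flightline schema
matchers:
  - field: priority
    type: exact              # deterministic: exact string match
    expected: "Priority: High"
  - field: response
    type: llm_judge          # qualitative: rubric-based scoring by an LLM judge
    rubric: "Acknowledges urgency and proposes a concrete next step."
    threshold: 0.8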

Example

$ flightline eval scenarios

     STATUS: [RDY] LOADING SCENARIOS
     ◉── LOADED 20 SCENARIOS

WP01 ─╼ RUNNING EVALUATION: ACTUAL vs EXPECTED
 18 PASS
 2 FAIL

FAIL: sc_012 (urgent_leak_detection)
  EXPECTED: "Priority: High"
  ACTUAL: "Priority: Normal"
  REASON: Failed to identify severity keywords in input.

 Evaluation complete.

Next Steps

While eval focuses on specific functional benchmarks, the check command provides a more holistic view of system health and risk.

Ship-Readiness Gate

Answer the 7 ship-blocking questions for your entire system.