flightline eval

The eval command compares actual AI behavior against expected outcomes. It is the primary tool for functional regression testing: verifying that your AI still produces correct results after a prompt or model change.

Usage

flightline eval scenarios [OPTIONS]

How it Works

Evaluation involves three main components:
  1. Scenarios: The inputs and “ground truth” expected outputs (usually created with flightline generate).
  2. Traces: The actual outputs captured from your system (recorded with fltrace).
  3. Matchers: The logic used to compare actual vs. expected results.
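As a hypothetical illustration, a scenario pairs an input with the output you expect back. The field names below are assumptions made for this example, not the official schema produced by flightline generate:

# Hypothetical scenario entry -- field names are illustrative, not the official schema
- id: sc_012
  name: urgent_leak_detection
  input: "Water is pouring through the ceiling of unit 4B."
  expected:
    priority: "Priority: High"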

Key Options

Option         Description
--scenarios    Path to the scenario file to use as a benchmark.
--traces       Path to the captured traces to evaluate.
--config, -c   Path to the evaluation spec (default: flightline.eval.yaml).
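
For example, to benchmark a fresh trace capture against a specific scenario file (the file paths below are placeholders):

$ flightline eval scenarios \
    --scenarios path/to/scenarios.yaml \
    --traces path/to/traces \
    --config flightline.eval.yaml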

The Evaluation Spec

You can define how evaluations behave in your flightline.yaml or in a dedicated evaluation spec (flightline.eval.yaml by default). This includes:
  • Deterministic Matchers: Exact string matches, regex patterns, or numeric range checks.
  • LLM-Judge Matchers: Semantic similarity, tone checks, or rubric-based scoring for qualitative outputs.
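
As a rough sketch only, a spec might combine deterministic matchers with an LLM judge for qualitative fields. Every key below is an assumption made for illustration; consult your flightline version for the actual schema:

# Illustrative sketch -- keys are assumptions, not the official flightline schema
matchers:
  - field: priority
    type: exact              # deterministic: exact string match
    expected: "Priority: High"
  - field: response
    type: llm_judge          # qualitative: rubric-based scoring by an LLM judge
    rubric: "Acknowledges urgency and proposes a concrete next step."
    threshold: 0.8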

Example

$ flightline eval scenarios

     STATUS: [RDY] LOADING SCENARIOS
     ◉── LOADED 20 SCENARIOS

WP01 ─╼ RUNNING EVALUATION: ACTUAL vs EXPECTED
 18 PASS
 2 FAIL

FAIL: sc_012 (urgent_leak_detection)
  EXPECTED: "Priority: High"
  ACTUAL: "Priority: Normal"
  REASON: Failed to identify severity keywords in input.

 Evaluation complete.

Next Steps

While eval focuses on specific functional benchmarks, the check command provides a more holistic view of system health and risk.

Ship-Readiness Gate

Answer the 7 ship-blocking questions for your entire system.