# flightline eval
The `eval` command compares actual AI behavior against expected outcomes. It is the primary tool for functional regression testing, verifying that your AI still produces correct results after a prompt or model change.
## Usage
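A typical invocation points `eval` at a scenario benchmark and a set of captured traces. The flags are the ones described under Key Options below; the file paths are illustrative placeholders:

```bash
flightline eval \
  --scenarios scenarios.yaml \
  --traces traces.jsonl
```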
## How it Works
Evaluation involves three main components:

- Scenarios: The inputs and “ground truth” expected outputs (usually created with `flightline generate`).
- Traces: The actual outputs captured from your system (recorded with `fltrace`).
- Matchers: The logic used to compare actual vs. expected results.
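The on-disk formats are owned by `flightline generate` and `fltrace`; purely as a conceptual sketch (the field names here are hypothetical, not flightline's actual schema), a scenario pairs an input with its expected output, and a trace records what the system actually returned for the same input:

```yaml
# Hypothetical shapes, for illustration only; not flightline's real schema.
# A scenario: an input plus its "ground truth" expected output.
- id: refund-policy-question
  input: "Can I return an opened item?"
  expected: "Opened items can be returned within 30 days for store credit."

# The corresponding trace would carry the same id alongside an
# "actual" field holding what the system produced at runtime,
# so matchers can compare the two records pairwise.
```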
## Key Options
| Option | Description |
|---|---|
| `--scenarios` | Path to the scenario file to use as a benchmark. |
| `--traces` | Path to the captured traces to evaluate. |
| `--config`, `-c` | Path to the evaluation spec (default: `flightline.eval.yaml`). |
## The Evaluation Spec
You can define how evaluations should behave in your `flightline.yaml` or a dedicated `eval.yaml`. This includes:
- Deterministic Matchers: Exact string matches, regex patterns, or numeric range checks.
- LLM-Judge Matchers: Semantic similarity, tone checks, or rubric-based scoring for qualitative outputs.
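As a hedged sketch of what such a spec might contain, the keys below are hypothetical and meant only to show the two matcher families side by side (consult flightline's own reference for the real schema):

```yaml
# Hypothetical eval spec: key names are illustrative, not flightline's schema.
matchers:
  # Deterministic: cheap, exact checks for structured outputs.
  - field: order_id
    type: regex
    pattern: "^ORD-[0-9]{6}$"
  - field: total
    type: numeric_range
    min: 0
    max: 10000
  # LLM-judge: qualitative, rubric-based scoring for free-form text.
  - field: response
    type: llm_judge
    rubric: "Answer must be polite and cite the refund policy."
```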
## Example
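As an illustrative end-to-end run (the paths are placeholders; `flightline generate` and `fltrace` take their own options, covered in their respective docs):

```bash
# 1. Create a scenario benchmark with ground-truth outputs (flightline generate).
# 2. Capture actual outputs from your system as traces (fltrace).
# 3. Compare the traces against the benchmark using the eval spec:
flightline eval \
  --scenarios ./scenarios/support-bot.yaml \
  --traces ./traces/support-bot.jsonl \
  -c flightline.eval.yaml
```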
## Next Steps
While `eval` focuses on specific functional benchmarks, the `check` command provides a more holistic view of system health and risk.
- Ship-Readiness Gate (`flightline check`): Answer the 7 ship-blocking questions for your entire system.
