
The Fact-Checker

The Fact-Checker is Flightline’s validation engine. It catches hallucinations, numerical errors, and regressions before they reach production.

The Problem It Solves

LLMs are powerful but unreliable. They can:
  • Hallucinate numbers: Output “$540,000” when the input was $450,000
  • Miss safety triggers: Approve a loan that should be rejected
  • Regress silently: A prompt change breaks edge cases you weren’t testing
The Fact-Checker provides deterministic verification of LLM outputs.

How It Works

Input:  Synthetic test data (from Fabricator)
        + LLM output (from your prompt)

Process: Apply validation checks
         Compare against ground truth
         Detect regressions

Output: Pass/Fail report
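
The flow above can be sketched as a small check runner. This is an illustrative sketch, not Flightline's actual API: the names `CheckResult`, `run_checks`, and `non_empty_output` are hypothetical, and the report shape is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    """Outcome of one validation check (hypothetical shape)."""
    name: str
    passed: bool
    detail: str = ""


# A check takes the test input and the LLM output and returns a result.
Check = Callable[[dict, str], CheckResult]


def run_checks(test_input: dict, llm_output: str, checks: List[Check]) -> Dict:
    """Apply every check and fold the results into one pass/fail report."""
    results = [check(test_input, llm_output) for check in checks]
    return {"passed": all(r.passed for r in results), "results": results}


def non_empty_output(test_input: dict, llm_output: str) -> CheckResult:
    """Trivial example check: the model must return some text."""
    ok = bool(llm_output.strip())
    return CheckResult("NonEmptyOutput", ok, "" if ok else "output was empty")
```

Each check is a plain function of the input and output, so the same data always produces the same report.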

Validation Checks

Numerical Consistency

This is the most critical check for financial applications. The Fact-Checker extracts every number from the LLM output and verifies that each one appears in the source data.
❌ NumericalConsistency FAILED

Input JSON:
  {"loan_amount": 450000}

LLM Output:
  "Your loan of $540,000..."

Discrepancy:
  Expected: 450000
  Found:    540000
This check catches numerical hallucinations that could cause real financial harm, such as a misquoted loan amount.
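
A minimal sketch of how such a check could work, assuming the source data is parsed JSON and that numbers in the output use commas as thousands separators. The helper names (`collect_numbers`, `extract_numbers`, `unsupported_numbers`) are hypothetical and are not taken from Flightline.

```python
import re

# Matches integers and decimals, with optional thousands separators.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def collect_numbers(value):
    """Recursively gather every numeric value found in parsed JSON."""
    found = set()
    if isinstance(value, bool):
        return found  # bool is a subclass of int, so skip it explicitly
    if isinstance(value, (int, float)):
        found.add(float(value))
    elif isinstance(value, dict):
        for item in value.values():
            found |= collect_numbers(item)
    elif isinstance(value, list):
        for item in value:
            found |= collect_numbers(item)
    return found


def extract_numbers(text):
    """Pull every number out of free text, dropping thousands separators."""
    return [float(m.replace(",", "")) for m in NUMBER_RE.findall(text)]


def unsupported_numbers(source, llm_output):
    """Return the numbers in the output that do not appear in the source."""
    allowed = collect_numbers(source)
    return [n for n in extract_numbers(llm_output) if n not in allowed]
```

With the example above, `unsupported_numbers({"loan_amount": 450000}, "Your loan of $540,000...")` returns `[540000.0]`, and the check fails. A real check would also need a policy for numbers the model may legitimately derive, such as totals or percentages, which this sketch does not handle.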

Safety Guardrails

Verify that safety-critical responses trigger correctly:
❌ SafetyGuardrail FAILED

Scenario: credit_score=420, loan_type="jumbo"

Expected: Rejection response
Received: "Congratulations! Your Jumbo loan is approved..."
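
One way to picture this check is a keyword classifier compared against the expected outcome for the scenario. This is a simplified sketch under stated assumptions: the marker phrases, the `expected` field, and the function names are all hypothetical, and a production guardrail would likely rely on a structured decision field rather than phrase matching.

```python
# Hypothetical marker phrases; a real deployment would tune these.
REJECTION_MARKERS = ("declined", "rejected", "not approved",
                     "unable to approve", "cannot approve")
APPROVAL_MARKERS = ("approved", "congratulations")


def classify_response(text):
    """Label a response as 'rejection', 'approval', or 'unknown'."""
    lowered = text.lower()
    # Check rejections first so "not approved" is not read as an approval.
    if any(marker in lowered for marker in REJECTION_MARKERS):
        return "rejection"
    if any(marker in lowered for marker in APPROVAL_MARKERS):
        return "approval"
    return "unknown"


def check_guardrail(scenario, response):
    """Compare the classified response with the scenario's expected outcome."""
    received = classify_response(response)
    return {
        "passed": received == scenario["expected"],
        "expected": scenario["expected"],
        "received": received,
    }
```

For the scenario above, a response beginning "Congratulations! Your Jumbo loan is approved..." classifies as `approval`, which does not match the expected `rejection`, so the check fails.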

The Golden Ledger

Over time, the Fact-Checker builds a Golden Ledger: a permanent regression suite of scenarios that have caused failures. When a scenario fails, it is added to the Ledger, and every future run includes it. Each bug you catch therefore becomes a permanent regression test, so known issues cannot silently return.
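
The mechanism can be sketched as an append-only file of failing scenarios that is merged into every run. The JSON file format and the function names below are assumptions for illustration, not Flightline's storage format.

```python
import json
from pathlib import Path


def scenario_key(scenario):
    """Stable identity for a scenario, independent of key order."""
    return json.dumps(scenario, sort_keys=True)


def load_ledger(path):
    """Read the ledger, treating a missing file as an empty ledger."""
    ledger_file = Path(path)
    if not ledger_file.exists():
        return []
    return json.loads(ledger_file.read_text())


def record_failure(path, scenario):
    """Add a failing scenario to the ledger unless it is already there."""
    ledger = load_ledger(path)
    known = {scenario_key(s) for s in ledger}
    if scenario_key(scenario) not in known:
        ledger.append(scenario)
        Path(path).write_text(json.dumps(ledger, indent=2))
    return ledger


def scenarios_for_run(path, fresh_scenarios):
    """Ledger scenarios always run first, followed by any new ones."""
    merged, seen = [], set()
    for scenario in load_ledger(path) + list(fresh_scenarios):
        key = scenario_key(scenario)
        if key not in seen:
            seen.add(key)
            merged.append(scenario)
    return merged
```

Deduplicating on a sorted JSON key keeps the ledger from growing when the same scenario fails repeatedly.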

CI/CD Integration

The Fact-Checker is designed for CI pipelines. When evaluations fail, the pipeline fails, blocking bad prompts from reaching production.
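
Most CI systems treat a non-zero process exit code as a failed step, so a gate can be as simple as the sketch below. The result shape and the names `exit_code` and `gate` are hypothetical; how Flightline itself reports results to a pipeline is not specified here.

```python
import sys


def exit_code(results):
    """Return 1 if any evaluation failed, otherwise 0."""
    return 1 if any(not r["passed"] for r in results) else 0


def gate(results):
    """Print a summary line per check, then exit with the gate's status."""
    for r in results:
        status = "PASSED" if r["passed"] else "FAILED"
        print(f"{r['name']} {status}")
    # A non-zero exit status fails the CI step and blocks the deploy.
    sys.exit(exit_code(results))
```

Calling `gate(results)` as the last step of an evaluation script stops the pipeline whenever a single check fails.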
