The 10 Readiness Questions

Status: Core Product Philosophy
Last Updated: January 3, 2026
Author: Derived from first-principles analysis

The Core Truth

You cannot deterministically evaluate semantic correctness, intent alignment, or harm. Therefore LLM-as-judge is not optional. It is the primary evaluation primitive. Everything else is an optimization layer.
This is the inversion most tools miss.

Why This Matters

  • Tracing tools show what happened
  • Logging tools show when it broke
  • Eval tools show whether it’s acceptable
Only Flightline answers acceptability.

The 10 Readiness Questions

Teams don’t think in “metrics.” They think in ship blockers. Before deploying an AI feature, you need confident answers to:
| # | Question | What We’re Really Asking | Primary Eval Type |
|---|----------|--------------------------|-------------------|
| 1 | Intent - Does it do the right thing? | Did it complete the task as intended? | LLM-judge (intent alignment) |
| 2 | Grounding - Is it truthful & grounded? | Are claims grounded in provided context? | LLM-judge + numeric checks |
| 3 | Hallucination - Did it hallucinate? | Did it invent facts, features, or entities? | LLM-judge (source faithfulness) |
| 4 | Rules - Did it follow our rules? | Business logic, constraints, policies | Hybrid (formulas + judge) |
| 5 | Safety - Did it avoid harm? | No PII leaks, toxicity, dangerous advice | Hybrid (substring + judge) |
| 6 | Consistency - Is it consistent? | Same input → similar quality output | Statistical + judge |
| 7 | Quality - Is it good enough? | Meets minimum bar for UX/trust | LLM-judge (calibrated) |
| 8 | Robustness - Is it robust to manipulation? | Does it resist adversarial inputs? | LLM-judge (adversarial) |
| 9 | Brand Safety - Is it brand-safe? | On-brand tone, no reputation risk | LLM-judge (brand alignment) |
| 10 | Schema - Is the output structurally valid? | Correct format, parseable output | Deterministic (schema validation) |

The Reality

Only two of these categories are fully deterministic: format (Schema) and performance (Reliability). Everything else is judgment.

Detailed Failure Modes by Category

GROUNDING (Can it be trusted?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Numeric hallucination | Says “540,000” when source says 450,000 | Deterministic: extract & compare numbers |
| Factual invention | Cites a study that doesn’t exist | LLM-judge: “Does this claim appear in the source?” |
| Entity confusion | Attributes quote to wrong person | LLM-judge: cross-reference entities |
| Date/time errors | Says “submitted March 5” when it was March 15 | Deterministic if structured; LLM if prose |
| Feature hallucination | Describes a product feature that doesn’t exist | LLM-judge with product knowledge |
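A minimal sketch of the deterministic numeric-grounding check described above: extract every number from the output and the source, then flag any output number that never appears in the source. The function names, regex, and example figures are illustrative, not part of Flightline’s API.

```python
import re

def extract_numbers(text: str) -> set[float]:
    """Pull every numeric literal out of a string, normalizing thousands separators."""
    return {float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*\.?\d*", text)}

def check_numeric_grounding(output: str, source: str) -> list[float]:
    """Return output numbers that do not appear anywhere in the source."""
    return sorted(extract_numbers(output) - extract_numbers(source))

# Example: 540,000 is not grounded in a source that only mentions 450,000.
ungrounded = check_numeric_grounding(
    output="Revenue was 540,000 this quarter.",
    source="Q3 revenue came in at 450,000.",
)
print(ungrounded)  # [540000.0]
```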

BEHAVIOR (Does it do the job?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Task incompletion | Asked to summarize 5 points, only covers 3 | LLM-judge: “Were all requested elements addressed?” |
| Instruction violation | Told “be brief”, writes 2000 words | Deterministic (word count) + LLM (spirit of brevity) |
| Wrong task entirely | Asked for analysis, gives a story | LLM-judge: task classification |
| Off-topic drift | Starts answering, veers into unrelated territory | LLM-judge: relevance scoring |
| Missing the point | Technically answers but misses what user really wanted | LLM-judge: intent alignment |
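A sketch of the deterministic half of the “be brief” hybrid check above: a hard word-count bound catches gross violations cheaply before any judge call. The limit and function name are illustrative.

```python
def violates_brevity(output: str, max_words: int = 150) -> bool:
    """Hard check: flag outputs that grossly exceed the requested length.
    The 'spirit of brevity' still needs an LLM-judge pass afterwards."""
    return len(output.split()) > max_words

print(violates_brevity("word " * 2000))          # True: 2000 words is not "brief"
print(violates_brevity("A short, direct answer."))  # False
```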

CONSISTENCY (Can it be relied upon?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Self-contradiction | Says “approved” in one sentence, “rejected” in another | LLM-judge: internal consistency check |
| Context contradiction | Output contradicts the document it was given | LLM-judge: source comparison |
| Logical incoherence | Conclusion doesn’t follow from reasoning | LLM-judge: logical validity |
| Tone inconsistency | Professional brief suddenly becomes casual | LLM-judge: style consistency |
| Cross-run variance | Same prompt gives wildly different quality | Statistical + LLM quality scoring |
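One way to make the cross-run variance check concrete: run the same prompt several times, score each run, and flag the prompt when the spread exceeds a bound. The scores below are stand-ins for judge output, and the 0.15 threshold is an arbitrary illustration.

```python
from statistics import mean, stdev

def consistency_check(scores: list[float], max_stdev: float = 0.15) -> dict:
    """Flag a prompt whose quality scores vary too much across runs.
    `scores` would come from the Tier 2 judge; the values here are placeholders."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": mean(scores), "stdev": spread, "consistent": spread <= max_stdev}

# Same prompt, five runs, judge-assigned quality scores (illustrative numbers).
print(consistency_check([0.9, 0.4, 0.85, 0.3, 0.88]))
# {'mean': 0.666, 'stdev': ~0.29, 'consistent': False}
```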

SAFETY (Will it cause harm?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| PII leakage | Outputs real customer email from training | Deterministic: substring/regex for known PII |
| Prompt injection success | User tricks it into ignoring instructions | LLM-judge: did it follow system prompt? |
| Toxic content | Generates offensive language | Classifier + LLM-judge |
| Dangerous advice | Medical/legal/financial misguidance | Domain-specific LLM-judge |
| Jailbreak | Bypasses safety guardrails | LLM-judge: refusal detection |
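A minimal sketch of the Tier 1 substring/regex check for PII from the table above: scan the output for email-shaped strings and known customer identifiers before it ever reaches a judge. The pattern and blocklist are illustrative.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
KNOWN_PII = {"jane.doe@example.com"}  # e.g. seeded from a customer-data blocklist

def leaked_pii(output: str) -> list[str]:
    """Deterministic check: any email address or blocklisted string in the output."""
    hits = set(EMAIL_RE.findall(output))
    hits |= {item for item in KNOWN_PII if item in output}
    return sorted(hits)

print(leaked_pii("Contact jane.doe@example.com for details."))
# ['jane.doe@example.com']  -> hard fail before any LLM-judge runs
```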

QUALITY (Is it good enough?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Unhelpful | Technically correct but useless | LLM-judge: helpfulness rating |
| Unclear | Confusing or ambiguous phrasing | LLM-judge: clarity scoring |
| Wrong detail level | Too verbose or too sparse | LLM-judge: appropriateness for context |
| Poor structure | Wall of text when bullets expected | Format check + LLM-judge |
| Wrong tone | Formal when casual needed, or vice versa | LLM-judge: tone classification |

FORMAT (Is it machine-parseable?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Invalid JSON/XML | Missing brackets, syntax errors | Deterministic: parse it |
| Schema mismatch | Wrong field names, missing required fields | Deterministic: schema validation |
| Type errors | String where number expected | Deterministic: type checking |
| Encoding issues | Broken unicode, escape sequences | Deterministic |
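A sketch of the format checks above: parse the JSON, then verify required fields and types. A real setup might use jsonschema or Pydantic; plain-stdlib checks keep the example self-contained, and the expected fields are made up.

```python
import json

REQUIRED_FIELDS = {"summary": str, "confidence": float}  # illustrative schema

def validate_output(raw: str) -> list[str]:
    """Return a list of format errors; an empty list means parseable and well-shaped."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing required field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

print(validate_output('{"summary": "ok"}'))                     # ['missing required field: confidence']
print(validate_output('{"summary": "ok", "confidence": 0.9}'))  # []
```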

RELIABILITY (Will it work in production?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Latency spike | Response takes 30s instead of 2s | Deterministic: timing |
| Token explosion | Uses 10x expected tokens | Deterministic: count |
| Rate limit hit | Too many requests | Deterministic: monitoring |
| Empty response | Returns nothing | Deterministic |
| Timeout | Never completes | Deterministic |
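The reliability checks are simple bounds comparisons. A sketch, with the budgets and the `call_model` callable invented for illustration:

```python
import time

MAX_LATENCY_S = 5.0   # illustrative budgets, not product defaults
MAX_TOKENS = 1_000

def reliability_check(call_model, prompt: str) -> dict:
    """Time a model call and apply hard latency/token/emptiness bounds.
    `call_model` is any callable returning (text, tokens_used)."""
    start = time.monotonic()
    text, tokens_used = call_model(prompt)
    elapsed = time.monotonic() - start
    return {
        "latency_ok": elapsed <= MAX_LATENCY_S,
        "tokens_ok": tokens_used <= MAX_TOKENS,
        "non_empty": bool(text.strip()),
    }

# Stubbed model call so the sketch runs without an API key.
print(reliability_check(lambda p: ("Fine.", 12), "Summarize the doc."))
# {'latency_ok': True, 'tokens_ok': True, 'non_empty': True}
```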

The Three-Tier Evaluation Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     TIER 1: HARD CHECKS                         │
│              (Deterministic, fast-fail, always run)             │
├─────────────────────────────────────────────────────────────────┤
│  • JSON/schema validation                                       │
│  • Required field presence                                      │
│  • Type checking                                                │
│  • Numeric extraction & comparison                              │
│  • PII substring detection                                      │
│  • Latency/token bounds                                         │
│  • Regex pattern matching                                       │
└─────────────────────────────────────────────────────────────────┘

                              ▼ (if pass)
┌─────────────────────────────────────────────────────────────────┐
│                     TIER 2: SOFT CHECKS                         │
│              (LLM-judge with structured rubrics)                │
├─────────────────────────────────────────────────────────────────┤
│  • Intent alignment / task completion                           │
│  • Factual grounding / source faithfulness                      │
│  • Hallucination detection                                      │
│  • Internal consistency                                         │
│  • Safety & harm assessment                                     │
│  • Quality / helpfulness scoring                                │
│  • Tone & style appropriateness                                 │
└─────────────────────────────────────────────────────────────────┘

                              ▼ (aggregate over time)
┌─────────────────────────────────────────────────────────────────┐
│                     TIER 3: META CHECKS                         │
│              (Statistical, drift detection)                     │
├─────────────────────────────────────────────────────────────────┤
│  • Cross-run variance                                           │
│  • Quality degradation over time                                │
│  • Score distribution shifts                                    │
│  • Baseline comparison                                          │
└─────────────────────────────────────────────────────────────────┘
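A sketch of how the three tiers could be chained: hard checks fast-fail, soft checks run only if they pass, and the results feed the meta layer aggregated over time. The function names and result shape are assumptions, not Flightline’s actual API.

```python
from typing import Callable

CheckFn = Callable[[str], tuple[bool, str]]  # output -> (passed, evidence)

def run_pipeline(output: str, hard_checks: list[CheckFn], soft_checks: list[CheckFn]) -> dict:
    """Tier 1 deterministic checks fast-fail; Tier 2 judge checks run only if Tier 1 passes.
    Tier 3 (variance, drift) aggregates many of these results over time and is omitted here."""
    results = {}
    for check in hard_checks:
        passed, evidence = check(output)
        results[check.__name__] = (passed, evidence)
        if not passed:
            return {"ship_blocked_by": check.__name__, "results": results}
    for check in soft_checks:
        results[check.__name__] = check(output)
    return {"ship_blocked_by": None, "results": results}

# Illustrative checks; a real soft check would call an LLM judge with a rubric.
def valid_json(output): return (output.strip().startswith("{"), "shallow JSON sniff")
def judge_helpfulness(output): return (len(output) > 10, "stub for an LLM-judge call")

print(run_pipeline('{"summary": "All clear."}', [valid_json], [judge_helpfulness]))
```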

LLM-Judge as a First-Class Primitive

Most tools treat LLM-as-judge as:
  • A last resort
  • A hack
  • A research toy
Flightline treats it as:
  • The core evaluation engine
  • Structured and auditable
  • Calibrated against known examples
  • The only way to answer semantic questions

Guardrails for Trustworthy LLM-Judgment

  1. Rubric-based judging - Explicit criteria, not “is this good?”
  2. Dimension-isolated scoring - Score each aspect independently
  3. Binary ship-blocking thresholds - Clear pass/fail cutoffs
  4. Known-good / known-bad calibration - Ground the judge in examples
  5. Judge disagreement detection - Flag uncertain evaluations
  6. Confidence intervals - Report uncertainty, not just scores
You’re not “asking the model if it’s good.”
You’re running semantic unit tests.
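To make “semantic unit tests” concrete, here is a sketch of a rubric-based judge call: explicit per-dimension criteria, a structured response, and a binary ship-blocking threshold. The prompt wording, the `call_llm` helper, and the threshold are all placeholders, not Flightline internals.

```python
import json

RUBRIC = {
    "intent_alignment": "Does the output complete the task the user actually asked for?",
    "source_faithfulness": "Is every factual claim supported by the provided source?",
    "clarity": "Would the intended reader understand this without follow-up questions?",
}
SHIP_THRESHOLD = 0.8  # illustrative cutoff

JUDGE_PROMPT = """You are an evaluator. Score the OUTPUT against each criterion from 0.0 to 1.0.
Respond with JSON only: {{"scores": {{criterion: score}}, "notes": str}}.

CRITERIA:
{criteria}

SOURCE:
{source}

OUTPUT:
{output}
"""

def judge(output: str, source: str, call_llm) -> dict:
    """Run one semantic unit test. `call_llm` is any callable that takes a prompt
    and returns the judge model's raw text response."""
    prompt = JUDGE_PROMPT.format(
        criteria="\n".join(f"- {k}: {v}" for k, v in RUBRIC.items()),
        source=source,
        output=output,
    )
    scores = json.loads(call_llm(prompt))["scores"]
    return {"scores": scores, "passed": all(s >= SHIP_THRESHOLD for s in scores.values())}

# Stubbed judge response so the sketch runs offline.
fake_llm = lambda _: '{"scores": {"intent_alignment": 0.9, "source_faithfulness": 0.6, "clarity": 0.95}, "notes": "One unsupported claim."}'
print(judge("<output>", "<source>", fake_llm))
# {'scores': {...}, 'passed': False}  -> blocked on source_faithfulness
```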

The Eval Unlock Model

Customers don’t want to design evals.
They want to unlock the evals that matter for their product.
| Evaluation Capability | What Flightline Needs | Why |
|-----------------------|-----------------------|-----|
| Task correctness | Prompt + user intent | Judge needs to know what “right” means |
| Grounding / faithfulness | Source documents / context | Detect hallucinations against truth |
| Rule compliance | Business rules / policies | Enforce domain constraints |
| Tone & style | Example outputs or tone spec | Evaluate UX trust |
| Safety & harm | Domain + risk tolerance | Medical ≠ SaaS ≠ Legal |
| Consistency | Multiple runs or logs | Measure variance |
| Quality bar | “Ship-ready vs reject” examples | Calibrate subjective threshold |
| Format correctness | Schema | Deterministic validation |
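Read as configuration, the table above amounts to: each artifact the customer supplies unlocks one row of evals. A sketch of what that declaration could look like, with hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class EvalInputs:
    """What the customer provides; each non-empty field unlocks a capability."""
    prompt: str = ""
    source_documents: list[str] = field(default_factory=list)
    business_rules: list[str] = field(default_factory=list)
    tone_examples: list[str] = field(default_factory=list)
    output_schema: dict | None = None

def unlocked_evals(inputs: EvalInputs) -> list[str]:
    capabilities = []
    if inputs.prompt:
        capabilities.append("task correctness")
    if inputs.source_documents:
        capabilities.append("grounding / faithfulness")
    if inputs.business_rules:
        capabilities.append("rule compliance")
    if inputs.tone_examples:
        capabilities.append("tone & style")
    if inputs.output_schema:
        capabilities.append("format correctness")
    return capabilities

print(unlocked_evals(EvalInputs(prompt="Summarize the claim.", output_schema={"type": "object"})))
# ['task correctness', 'format correctness']
```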

The Ship Readiness View

The question teams actually ask:
“Can I ship this?”
| Question | Status | Confidence | Evidence |
|----------|--------|------------|----------|
| Does it do the right thing? | ⚠️ | 0.72 | Intent mismatch in 2/20 |
| Is it grounded? | ❌ | 0.41 | Hallucinated entity |
| Rule compliant? | ✅ | 0.94 | All checks passed |
| Safe? | ⚠️ | 0.81 | Edge-case advice flagged |
| Consistent? | ❌ | 0.58 | High variance across runs |
| Good enough to ship? | ❌ | 0.66 | Below quality bar |
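A sketch of how the readiness view above could be rolled up into a single ship/no-ship answer: every question must clear its own confidence bar, and any miss blocks. The thresholds and data structure are illustrative.

```python
THRESHOLDS = {  # illustrative per-question confidence bars
    "Does it do the right thing?": 0.8,
    "Is it grounded?": 0.9,
    "Rule compliant?": 0.9,
    "Safe?": 0.95,
    "Consistent?": 0.7,
    "Good enough to ship?": 0.8,
}

def ship_decision(confidences: dict[str, float]) -> dict:
    """Return the blockers (questions below their bar) and an overall verdict."""
    blockers = {q: c for q, c in confidences.items() if c < THRESHOLDS[q]}
    return {"can_ship": not blockers, "blockers": blockers}

# Confidence values from the readiness table above.
print(ship_decision({
    "Does it do the right thing?": 0.72,
    "Is it grounded?": 0.41,
    "Rule compliant?": 0.94,
    "Safe?": 0.81,
    "Consistent?": 0.58,
    "Good enough to ship?": 0.66,
}))
# {'can_ship': False, 'blockers': {...five questions below their bars...}}
```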

Strategic Positioning

The Thesis

“Flightline answers the questions that block AI from shipping.”
Or:
“Before you ship AI, Flightline tells you if it’s safe, correct, and good enough.”

Why This Is Defensible

  • LangSmith exposes primitives
  • Langfuse exposes traces
  • OpenAI Evals exposes building blocks
Flightline exposes decisions.

The Key Insight

LLM-judge is not a weakness.
Pretending deterministic evals are enough is the weakness.

Appendix: What Each Eval Type Proves

| If This Passes… | You Can Say… |
|-----------------|--------------|
| Format validation | “The output is machine-parseable” |
| Schema validation | “The output has the right shape” |
| Numeric grounding | “The numbers are accurate” |
| PII detection | “No known PII was leaked” |
| Intent alignment (judge) | “It did what was asked” |
| Source faithfulness (judge) | “It didn’t make things up” |
| Safety assessment (judge) | “It’s unlikely to cause harm” |
| Quality scoring (judge) | “It meets our bar for good” |
| Consistency (statistical) | “It behaves predictably” |
No single check proves “safe to ship.”
All checks together build confidence.