The 10 Readiness Questions

Status: Core Product Philosophy
Last Updated: January 3, 2026
Author: Derived from first-principles analysis

The Core Truth

You cannot deterministically evaluate semantic correctness, intent alignment, or harm. Therefore LLM-as-judge is not optional. It is the primary evaluation primitive. Everything else is an optimization layer.
This is the inversion most tools miss.

Why This Matters

  • Tracing tools show what happened
  • Logging tools show when it broke
  • Eval tools show whether it’s acceptable
Only Flightline answers acceptability.

The 10 Readiness Questions

Teams don’t think in “metrics.” They think in ship blockers. Before deploying an AI feature, you need confident answers to:
| # | Question | What We’re Really Asking | Primary Eval Type |
|---|----------|--------------------------|-------------------|
| 1 | Intent - Does it do the right thing? | Did it complete the task as intended? | LLM-judge (intent alignment) |
| 2 | Grounding - Is it truthful & grounded? | Are claims grounded in provided context? | LLM-judge + numeric checks |
| 3 | Hallucination - Did it hallucinate? | Did it invent facts, features, or entities? | LLM-judge (source faithfulness) |
| 4 | Rules - Did it follow our rules? | Business logic, constraints, policies | Hybrid (formulas + judge) |
| 5 | Safety - Did it avoid harm? | No PII leaks, toxicity, dangerous advice | Hybrid (substring + judge) |
| 6 | Consistency - Is it consistent? | Same input → similar quality output | Statistical + judge |
| 7 | Quality - Is it good enough? | Meets minimum bar for UX/trust | LLM-judge (calibrated) |
| 8 | Robustness - Is it robust to manipulation? | Does it resist adversarial inputs? | LLM-judge (adversarial) |
| 9 | Brand Safety - Is it brand-safe? | On-brand tone, no reputation risk | LLM-judge (brand alignment) |
| 10 | Schema - Is the output structurally valid? | Correct format, parseable output | Deterministic (schema validation) |

The Reality

Only two of these categories are fully deterministic: format (Schema) and performance (Reliability). Everything else is judgment.

Detailed Failure Modes by Category

GROUNDING (Can it be trusted?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Numeric hallucination | Says “540,000” when source says 450,000 | Deterministic: extract & compare numbers |
| Factual invention | Cites a study that doesn’t exist | LLM-judge: “Does this claim appear in the source?” |
| Entity confusion | Attributes quote to wrong person | LLM-judge: cross-reference entities |
| Date/time errors | Says “submitted March 5” when it was March 15 | Deterministic if structured; LLM if prose |
| Feature hallucination | Describes a product feature that doesn’t exist | LLM-judge with product knowledge |
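A minimal sketch of the deterministic numeric-grounding check described above: extract every number from the output and the source, then flag any output number that never appears in the source. The function names, regex, and example figures are illustrative, not part of Flightline’s API.

```python
import re

def extract_numbers(text: str) -> set[float]:
    """Pull every numeric literal out of a string, normalizing thousands separators."""
    return {float(m.replace(",", "")) for m in re.findall(r"\d[\d,]*\.?\d*", text)}

def check_numeric_grounding(output: str, source: str) -> list[float]:
    """Return output numbers that do not appear anywhere in the source."""
    return sorted(extract_numbers(output) - extract_numbers(source))

# Example: 540,000 is not grounded in a source that only mentions 450,000.
ungrounded = check_numeric_grounding(
    output="Revenue was 540,000 this quarter.",
    source="Q3 revenue came in at 450,000.",
)
print(ungrounded)  # [540000.0]
```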

BEHAVIOR (Does it do the job?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Task incompletion | Asked to summarize 5 points, only covers 3 | LLM-judge: “Were all requested elements addressed?” |
| Instruction violation | Told “be brief”, writes 2000 words | Deterministic (word count) + LLM (spirit of brevity) |
| Wrong task entirely | Asked for analysis, gives a story | LLM-judge: task classification |
| Off-topic drift | Starts answering, veers into unrelated territory | LLM-judge: relevance scoring |
| Missing the point | Technically answers but misses what user really wanted | LLM-judge: intent alignment |
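A sketch of the deterministic half of the “be brief” hybrid check above: a hard word-count bound catches gross violations cheaply before any judge call. The limit and function name are illustrative.

```python
def violates_brevity(output: str, max_words: int = 150) -> bool:
    """Hard check: flag outputs that grossly exceed the requested length.
    The 'spirit of brevity' still needs an LLM-judge pass afterwards."""
    return len(output.split()) > max_words

print(violates_brevity("word " * 2000))          # True: 2000 words is not "brief"
print(violates_brevity("A short, direct answer."))  # False
```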

CONSISTENCY (Can it be relied upon?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Self-contradiction | Says “approved” in one sentence, “rejected” in another | LLM-judge: internal consistency check |
| Context contradiction | Output contradicts the document it was given | LLM-judge: source comparison |
| Logical incoherence | Conclusion doesn’t follow from reasoning | LLM-judge: logical validity |
| Tone inconsistency | Professional brief suddenly becomes casual | LLM-judge: style consistency |
| Cross-run variance | Same prompt gives wildly different quality | Statistical + LLM quality scoring |
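One way to make the cross-run variance check concrete: run the same prompt several times, score each run, and flag the prompt when the spread exceeds a bound. The scores below are stand-ins for judge output, and the 0.15 threshold is an arbitrary illustration.

```python
from statistics import mean, stdev

def consistency_check(scores: list[float], max_stdev: float = 0.15) -> dict:
    """Flag a prompt whose quality scores vary too much across runs.
    `scores` would come from the Tier 2 judge; the values here are placeholders."""
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"mean": mean(scores), "stdev": spread, "consistent": spread <= max_stdev}

# Same prompt, five runs, judge-assigned quality scores (illustrative numbers).
print(consistency_check([0.9, 0.4, 0.85, 0.3, 0.88]))
# {'mean': 0.666, 'stdev': ~0.29, 'consistent': False}
```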

SAFETY (Will it cause harm?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| PII leakage | Outputs real customer email from training | Deterministic: substring/regex for known PII |
| Prompt injection success | User tricks it into ignoring instructions | LLM-judge: did it follow system prompt? |
| Toxic content | Generates offensive language | Classifier + LLM-judge |
| Dangerous advice | Medical/legal/financial misguidance | Domain-specific LLM-judge |
| Jailbreak | Bypasses safety guardrails | LLM-judge: refusal detection |
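A minimal sketch of the Tier 1 substring/regex check for PII from the table above: scan the output for email-shaped strings and known customer identifiers before it ever reaches a judge. The pattern and blocklist are illustrative.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
KNOWN_PII = {"jane.doe@example.com"}  # e.g. seeded from a customer-data blocklist

def leaked_pii(output: str) -> list[str]:
    """Deterministic check: any email address or blocklisted string in the output."""
    hits = set(EMAIL_RE.findall(output))
    hits |= {item for item in KNOWN_PII if item in output}
    return sorted(hits)

print(leaked_pii("Contact jane.doe@example.com for details."))
# ['jane.doe@example.com']  -> hard fail before any LLM-judge runs
```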

QUALITY (Is it good enough?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Unhelpful | Technically correct but useless | LLM-judge: helpfulness rating |
| Unclear | Confusing or ambiguous phrasing | LLM-judge: clarity scoring |
| Wrong detail level | Too verbose or too sparse | LLM-judge: appropriateness for context |
| Poor structure | Wall of text when bullets expected | Format check + LLM-judge |
| Wrong tone | Formal when casual needed, or vice versa | LLM-judge: tone classification |

FORMAT (Is it machine-parseable?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Invalid JSON/XML | Missing brackets, syntax errors | Deterministic: parse it |
| Schema mismatch | Wrong field names, missing required fields | Deterministic: schema validation |
| Type errors | String where number expected | Deterministic: type checking |
| Encoding issues | Broken unicode, escape sequences | Deterministic |
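A sketch of the format checks above: parse the JSON, then verify required fields and types. A real setup might use jsonschema or Pydantic; plain-stdlib checks keep the example self-contained, and the expected fields are made up.

```python
import json

REQUIRED_FIELDS = {"summary": str, "confidence": float}  # illustrative schema

def validate_output(raw: str) -> list[str]:
    """Return a list of format errors; an empty list means parseable and well-shaped."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing required field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

print(validate_output('{"summary": "ok"}'))                     # ['missing required field: confidence']
print(validate_output('{"summary": "ok", "confidence": 0.9}'))  # []
```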

RELIABILITY (Will it work in production?)

| Failure Mode | Example | Evaluation Method |
|--------------|---------|-------------------|
| Latency spike | Response takes 30s instead of 2s | Deterministic: timing |
| Token explosion | Uses 10x expected tokens | Deterministic: count |
| Rate limit hit | Too many requests | Deterministic: monitoring |
| Empty response | Returns nothing | Deterministic |
| Timeout | Never completes | Deterministic |
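The reliability checks are simple bounds comparisons. A sketch, with the budgets and the `call_model` callable invented for illustration:

```python
import time

MAX_LATENCY_S = 5.0   # illustrative budgets, not product defaults
MAX_TOKENS = 1_000

def reliability_check(call_model, prompt: str) -> dict:
    """Time a model call and apply hard latency/token/emptiness bounds.
    `call_model` is any callable returning (text, tokens_used)."""
    start = time.monotonic()
    text, tokens_used = call_model(prompt)
    elapsed = time.monotonic() - start
    return {
        "latency_ok": elapsed <= MAX_LATENCY_S,
        "tokens_ok": tokens_used <= MAX_TOKENS,
        "non_empty": bool(text.strip()),
    }

# Stubbed model call so the sketch runs without an API key.
print(reliability_check(lambda p: ("Fine.", 12), "Summarize the doc."))
# {'latency_ok': True, 'tokens_ok': True, 'non_empty': True}
```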

The Three-Tier Evaluation Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     TIER 1: HARD CHECKS                         │
│              (Deterministic, fast-fail, always run)             │
├─────────────────────────────────────────────────────────────────┤
│  • JSON/schema validation                                       │
│  • Required field presence                                      │
│  • Type checking                                                │
│  • Numeric extraction & comparison                              │
│  • PII substring detection                                      │
│  • Latency/token bounds                                         │
│  • Regex pattern matching                                       │
└─────────────────────────────────────────────────────────────────┘

                              ▼ (if pass)
┌─────────────────────────────────────────────────────────────────┐
│                     TIER 2: SOFT CHECKS                         │
│              (LLM-judge with structured rubrics)                │
├─────────────────────────────────────────────────────────────────┤
│  • Intent alignment / task completion                           │
│  • Factual grounding / source faithfulness                      │
│  • Hallucination detection                                      │
│  • Internal consistency                                         │
│  • Safety & harm assessment                                     │
│  • Quality / helpfulness scoring                                │
│  • Tone & style appropriateness                                 │
└─────────────────────────────────────────────────────────────────┘

                              ▼ (aggregate over time)
┌─────────────────────────────────────────────────────────────────┐
│                     TIER 3: META CHECKS                         │
│              (Statistical, drift detection)                     │
├─────────────────────────────────────────────────────────────────┤
│  • Cross-run variance                                           │
│  • Quality degradation over time                                │
│  • Score distribution shifts                                    │
│  • Baseline comparison                                          │
└─────────────────────────────────────────────────────────────────┘
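A sketch of how the three tiers could be chained: hard checks fast-fail, soft checks run only if they pass, and the results feed the meta layer aggregated over time. The function names and result shape are assumptions, not Flightline’s actual API.

```python
from typing import Callable

CheckFn = Callable[[str], tuple[bool, str]]  # output -> (passed, evidence)

def run_pipeline(output: str, hard_checks: list[CheckFn], soft_checks: list[CheckFn]) -> dict:
    """Tier 1 deterministic checks fast-fail; Tier 2 judge checks run only if Tier 1 passes.
    Tier 3 (variance, drift) aggregates many of these results over time and is omitted here."""
    results = {}
    for check in hard_checks:
        passed, evidence = check(output)
        results[check.__name__] = (passed, evidence)
        if not passed:
            return {"ship_blocked_by": check.__name__, "results": results}
    for check in soft_checks:
        results[check.__name__] = check(output)
    return {"ship_blocked_by": None, "results": results}

# Illustrative checks; a real soft check would call an LLM judge with a rubric.
def valid_json(output): return (output.strip().startswith("{"), "shallow JSON sniff")
def judge_helpfulness(output): return (len(output) > 10, "stub for an LLM-judge call")

print(run_pipeline('{"summary": "All clear."}', [valid_json], [judge_helpfulness]))
```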

LLM-Judge as a First-Class Primitive

Most tools treat LLM-as-judge as:
  • A last resort
  • A hack
  • A research toy
Flightline treats it as:
  • The core evaluation engine
  • Structured and auditable
  • Calibrated against known examples
  • The only way to answer semantic questions

Guardrails for Trustworthy LLM-Judgment

  1. Rubric-based judging - Explicit criteria, not “is this good?”
  2. Dimension-isolated scoring - Score each aspect independently
  3. Binary ship-blocking thresholds - Clear pass/fail cutoffs
  4. Known-good / known-bad calibration - Ground the judge in examples
  5. Judge disagreement detection - Flag uncertain evaluations
  6. Confidence intervals - Report uncertainty, not just scores
You’re not “asking the model if it’s good.”
You’re running semantic unit tests.
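To make “semantic unit tests” concrete, here is a sketch of a rubric-based judge call: explicit per-dimension criteria, a structured response, and a binary ship-blocking threshold. The prompt wording, the `call_llm` helper, and the threshold are all placeholders, not Flightline internals.

```python
import json

RUBRIC = {
    "intent_alignment": "Does the output complete the task the user actually asked for?",
    "source_faithfulness": "Is every factual claim supported by the provided source?",
    "clarity": "Would the intended reader understand this without follow-up questions?",
}
SHIP_THRESHOLD = 0.8  # illustrative cutoff

JUDGE_PROMPT = """You are an evaluator. Score the OUTPUT against each criterion from 0.0 to 1.0.
Respond with JSON only: {{"scores": {{criterion: score}}, "notes": str}}.

CRITERIA:
{criteria}

SOURCE:
{source}

OUTPUT:
{output}
"""

def judge(output: str, source: str, call_llm) -> dict:
    """Run one semantic unit test. `call_llm` is any callable that takes a prompt
    and returns the judge model's raw text response."""
    prompt = JUDGE_PROMPT.format(
        criteria="\n".join(f"- {k}: {v}" for k, v in RUBRIC.items()),
        source=source,
        output=output,
    )
    scores = json.loads(call_llm(prompt))["scores"]
    return {"scores": scores, "passed": all(s >= SHIP_THRESHOLD for s in scores.values())}

# Stubbed judge response so the sketch runs offline.
fake_llm = lambda _: '{"scores": {"intent_alignment": 0.9, "source_faithfulness": 0.6, "clarity": 0.95}, "notes": "One unsupported claim."}'
print(judge("<output>", "<source>", fake_llm))
# {'scores': {...}, 'passed': False}  -> blocked on source_faithfulness
```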

The Eval Unlock Model

Customers don’t want to design evals.
They want to unlock the evals that matter for their product.
| Evaluation Capability | What Flightline Needs | Why |
|-----------------------|-----------------------|-----|
| Task correctness | Prompt + user intent | Judge needs to know what “right” means |
| Grounding / faithfulness | Source documents / context | Detect hallucinations against truth |
| Rule compliance | Business rules / policies | Enforce domain constraints |
| Tone & style | Example outputs or tone spec | Evaluate UX trust |
| Safety & harm | Domain + risk tolerance | Medical ≠ SaaS ≠ Legal |
| Consistency | Multiple runs or logs | Measure variance |
| Quality bar | “Ship-ready vs reject” examples | Calibrate subjective threshold |
| Format correctness | Schema | Deterministic validation |
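Read as configuration, the table above amounts to: each artifact the customer supplies unlocks one row of evals. A sketch of what that declaration could look like, with hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class EvalInputs:
    """What the customer provides; each non-empty field unlocks a capability."""
    prompt: str = ""
    source_documents: list[str] = field(default_factory=list)
    business_rules: list[str] = field(default_factory=list)
    tone_examples: list[str] = field(default_factory=list)
    output_schema: dict | None = None

def unlocked_evals(inputs: EvalInputs) -> list[str]:
    capabilities = []
    if inputs.prompt:
        capabilities.append("task correctness")
    if inputs.source_documents:
        capabilities.append("grounding / faithfulness")
    if inputs.business_rules:
        capabilities.append("rule compliance")
    if inputs.tone_examples:
        capabilities.append("tone & style")
    if inputs.output_schema:
        capabilities.append("format correctness")
    return capabilities

print(unlocked_evals(EvalInputs(prompt="Summarize the claim.", output_schema={"type": "object"})))
# ['task correctness', 'format correctness']
```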

The Ship Readiness View

The question teams actually ask:
“Can I ship this?”
| Question | Status | Confidence | Evidence |
|----------|--------|------------|----------|
| Does it do the right thing? | ⚠️ | 0.72 | Intent mismatch in 2/20 |
| Is it grounded? | ❌ | 0.41 | Hallucinated entity |
| Rule compliant? | ✅ | 0.94 | All checks passed |
| Safe? | ⚠️ | 0.81 | Edge-case advice flagged |
| Consistent? | ❌ | 0.58 | High variance across runs |
| Good enough to ship? | ❌ | 0.66 | Below quality bar |
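A sketch of how the readiness view above could be rolled up into a single ship/no-ship answer: every question must clear its own confidence bar, and any miss blocks. The thresholds and data structure are illustrative.

```python
THRESHOLDS = {  # illustrative per-question confidence bars
    "Does it do the right thing?": 0.8,
    "Is it grounded?": 0.9,
    "Rule compliant?": 0.9,
    "Safe?": 0.95,
    "Consistent?": 0.7,
    "Good enough to ship?": 0.8,
}

def ship_decision(confidences: dict[str, float]) -> dict:
    """Return the blockers (questions below their bar) and an overall verdict."""
    blockers = {q: c for q, c in confidences.items() if c < THRESHOLDS[q]}
    return {"can_ship": not blockers, "blockers": blockers}

# Confidence values from the readiness table above.
print(ship_decision({
    "Does it do the right thing?": 0.72,
    "Is it grounded?": 0.41,
    "Rule compliant?": 0.94,
    "Safe?": 0.81,
    "Consistent?": 0.58,
    "Good enough to ship?": 0.66,
}))
# {'can_ship': False, 'blockers': {...five questions below their bars...}}
```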

Strategic Positioning

The Thesis

“Flightline answers the questions that block AI from shipping.”
Or:
“Before you ship AI, Flightline tells you if it’s safe, correct, and good enough.”

Why This Is Defensible

  • LangSmith exposes primitives
  • Langfuse exposes traces
  • OpenAI Evals exposes building blocks
Flightline exposes decisions.

The Key Insight

LLM-judge is not a weakness.
Pretending deterministic evals are enough is the weakness.

Appendix: What Each Eval Type Proves

| If This Passes… | You Can Say… |
|-----------------|--------------|
| Format validation | “The output is machine-parseable” |
| Schema validation | “The output has the right shape” |
| Numeric grounding | “The numbers are accurate” |
| PII detection | “No known PII was leaked” |
| Intent alignment (judge) | “It did what was asked” |
| Source faithfulness (judge) | “It didn’t make things up” |
| Safety assessment (judge) | “It’s unlikely to cause harm” |
| Quality scoring (judge) | “It meets our bar for good” |
| Consistency (statistical) | “It behaves predictably” |
No single check proves “safe to ship.”
All checks together build confidence.