The 10 Readiness Questions
Status: Core Product Philosophy
Last Updated: January 3, 2026
Author: Derived from first-principles analysis
The Core Truth
You cannot deterministically evaluate semantic correctness, intent alignment, or harm. Therefore LLM-as-judge is not optional. It is the primary evaluation primitive. Everything else is an optimization layer.
This is the inversion most tools miss.
Why This Matters
- Tracing tools show what happened
- Logging tools show when it broke
- Eval tools show whether it’s acceptable
Only the last answers the question that actually blocks shipping.
The 10 Readiness Questions
Teams don’t think in “metrics.” They think in ship blockers. Before deploying an AI feature, you need confident answers to:
| # | Question | What We’re Really Asking | Primary Eval Type |
|---|---|---|---|
| 1 | Intent - Does it do the right thing? | Did it complete the task as intended? | LLM-judge (intent alignment) |
| 2 | Grounding - Is it truthful & grounded? | Are claims grounded in provided context? | LLM-judge + numeric checks |
| 3 | Hallucination - Did it hallucinate? | Did it invent facts, features, or entities? | LLM-judge (source faithfulness) |
| 4 | Rules - Did it follow our rules? | Business logic, constraints, policies | Hybrid (formulas + judge) |
| 5 | Safety - Did it avoid harm? | No PII leaks, toxicity, dangerous advice | Hybrid (substring + judge) |
| 6 | Consistency - Is it consistent? | Same input → similar quality output | Statistical + judge |
| 7 | Quality - Is it good enough? | Meets minimum bar for UX/trust | LLM-judge (calibrated) |
| 8 | Robustness - Is it robust to manipulation? | Does it resist adversarial inputs? | LLM-judge (adversarial) |
| 9 | Brand Safety - Is it brand-safe? | On-brand tone, no reputation risk | LLM-judge (brand alignment) |
| 10 | Schema - Is the output structurally valid? | Correct format, parseable output | Deterministic (schema validation) |
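A minimal sketch of this checklist as data; the `EvalKind` and `ReadinessQuestion` names are illustrative, not Flightline’s actual API:

```typescript
// Illustrative sketch: the readiness checklist carried as data.
// EvalKind and ReadinessQuestion are assumed names for this sketch.

type EvalKind = "llm-judge" | "hybrid" | "statistical" | "deterministic";

interface ReadinessQuestion {
  id: number;
  label: string;         // e.g. "Intent"
  asks: string;          // what we're really asking
  primaryEval: EvalKind; // the primitive that answers it
}

const READINESS_QUESTIONS: ReadinessQuestion[] = [
  { id: 1, label: "Intent", asks: "Did it complete the task as intended?", primaryEval: "llm-judge" },
  { id: 4, label: "Rules", asks: "Business logic, constraints, policies", primaryEval: "hybrid" },
  { id: 6, label: "Consistency", asks: "Same input → similar quality output", primaryEval: "statistical" },
  { id: 10, label: "Schema", asks: "Correct format, parseable output", primaryEval: "deterministic" },
  // ...the remaining questions follow the table above
];
```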
The Reality
Only 2 categories are fully deterministic: format (Schema) + performance. Everything else is judgment.
Detailed Failure Modes by Category
GROUNDING (Can it be trusted?)
| Failure Mode | Example | Evaluation Method |
|---|---|---|
| Numeric hallucination | Says “450,000” when the source gives a different figure | Deterministic: extract & compare numbers |
| Factual invention | Cites a study that doesn’t exist | LLM-judge: “Does this claim appear in the source?” |
| Entity confusion | Attributes quote to wrong person | LLM-judge: cross-reference entities |
| Date/time errors | Says “submitted March 5” when it was March 15 | Deterministic if structured; LLM if prose |
| Feature hallucination | Describes a product feature that doesn’t exist | LLM-judge with product knowledge |
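The “extract & compare numbers” check in the first row is fully deterministic. A minimal sketch, assuming a deliberately simple normalize-and-compare approach:

```typescript
// Illustrative "extract & compare numbers" check: every number in the output
// must appear in the source context. Normalization here is deliberately simple.

function extractNumbers(text: string): Set<string> {
  // Match integers and decimals, tolerating thousands separators like "450,000".
  const matches = text.match(/\d[\d,]*(?:\.\d+)?/g) ?? [];
  return new Set(matches.map((n) => n.replace(/,/g, "")));
}

function numbersAreGrounded(output: string, source: string): { pass: boolean; ungrounded: string[] } {
  const sourceNumbers = extractNumbers(source);
  const ungrounded = [...extractNumbers(output)].filter((n) => !sourceNumbers.has(n));
  return { pass: ungrounded.length === 0, ungrounded };
}

// Flags "450,000" when the source only ever says "45,000":
console.log(numbersAreGrounded("Revenue was 450,000 last quarter.", "Q3 revenue: 45,000"));
// { pass: false, ungrounded: ["450000"] }
```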
BEHAVIOR (Does it do the job?)
| Failure Mode | Example | Evaluation Method |
|---|---|---|
| Task incompletion | Asked to summarize 5 points, only covers 3 | LLM-judge: “Were all requested elements addressed?” |
| Instruction violation | Told “be brief”, writes 2000 words | Deterministic (word count) + LLM (spirit of brevity) |
| Wrong task entirely | Asked for analysis, gives a story | LLM-judge: task classification |
| Off-topic drift | Starts answering, veers into unrelated territory | LLM-judge: relevance scoring |
| Missing the point | Technically answers but misses what user really wanted | LLM-judge: intent alignment |
CONSISTENCY (Can it be relied upon?)
| Failure Mode | Example | Evaluation Method |
|---|---|---|
| Self-contradiction | Says “approved” in one sentence, “rejected” in another | LLM-judge: internal consistency check |
| Context contradiction | Output contradicts the document it was given | LLM-judge: source comparison |
| Logical incoherence | Conclusion doesn’t follow from reasoning | LLM-judge: logical validity |
| Tone inconsistency | Professional brief suddenly becomes casual | LLM-judge: style consistency |
| Cross-run variance | Same prompt gives wildly different quality | Statistical + LLM quality scoring |
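Cross-run variance is the one row here that is mostly statistical. A minimal sketch, assuming quality scores in [0, 1] and an illustrative variance threshold:

```typescript
// Illustrative consistency check: score the same prompt N times with any quality
// scorer (typically an LLM judge), then flag high variance. Threshold is assumed.

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function stdDev(xs: number[]): number {
  const m = mean(xs);
  return Math.sqrt(mean(xs.map((x) => (x - m) ** 2)));
}

function crossRunConsistency(scores: number[], maxStdDev = 0.1): { pass: boolean; mean: number; stdDev: number } {
  const sd = stdDev(scores);
  return { pass: sd <= maxStdDev, mean: mean(scores), stdDev: sd };
}

// Same prompt, five runs, quality scores in [0, 1]:
console.log(crossRunConsistency([0.91, 0.88, 0.55, 0.93, 0.62]));
// pass: false — variance this high is its own ship blocker
```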
SAFETY (Will it cause harm?)
| Failure Mode | Example | Evaluation Method |
|---|---|---|
| PII leakage | Outputs real customer email from training | Deterministic: substring/regex for known PII |
| Prompt injection success | User tricks it into ignoring instructions | LLM-judge: did it follow system prompt? |
| Toxic content | Generates offensive language | Classifier + LLM-judge |
| Dangerous advice | Medical/legal/financial misguidance | Domain-specific LLM-judge |
| Jailbreak | Bypasses safety guardrails | LLM-judge: refusal detection |
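The deterministic half of the PII check is plain regex scanning; the judge covers the shapes regexes miss. A minimal sketch with illustrative patterns:

```typescript
// Illustrative deterministic PII scan. Patterns are assumptions that get tuned
// per domain in practice; a non-empty result is a hard ship blocker.

const PII_PATTERNS: Record<string, RegExp> = {
  email: /[\w.+-]+@[\w-]+\.[\w.]+/,
  usPhone: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/,
};

function detectPii(output: string): string[] {
  return Object.entries(PII_PATTERNS)
    .filter(([, pattern]) => pattern.test(output))
    .map(([name]) => name);
}

console.log(detectPii("Reach Jane at jane.doe@example.com or 555-867-5309"));
// ["email", "usPhone"]
```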
QUALITY (Is it good enough?)
| Failure Mode | Example | Evaluation Method |
|---|---|---|
| Unhelpful | Technically correct but useless | LLM-judge: helpfulness rating |
| Unclear | Confusing or ambiguous phrasing | LLM-judge: clarity scoring |
| Wrong detail level | Too verbose or too sparse | LLM-judge: appropriateness for context |
| Poor structure | Wall of text when bullets expected | Format check + LLM-judge |
| Wrong tone | Formal when casual needed, or vice versa | LLM-judge: tone classification |
FORMAT (Is it machine-parseable?)
| Failure Mode | Example | Evaluation Method |
|---|---|---|
| Invalid JSON/XML | Missing brackets, syntax errors | Deterministic: parse it |
| Schema mismatch | Wrong field names, missing required fields | Deterministic: schema validation |
| Type errors | String where number expected | Deterministic: type checking |
| Encoding issues | Broken unicode, escape sequences | Deterministic |
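All four rows reduce to parse-then-validate. A minimal sketch, hand-rolled for self-containment; in practice a schema library (zod, ajv, or similar) does this:

```typescript
// Illustrative format check: parse the output, then validate shape and types.
// The TicketSummary schema is an assumed example, not a real product schema.

interface TicketSummary {
  title: string;
  priority: "low" | "medium" | "high";
}

function validateTicketSummary(raw: string): { pass: boolean; errors: string[] } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw); // invalid JSON fails the whole check
  } catch {
    return { pass: false, errors: ["invalid JSON"] };
  }
  const obj = parsed as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof obj.title !== "string") errors.push("title must be a string");
  if (!["low", "medium", "high"].includes(obj.priority as string)) errors.push("priority must be low | medium | high");
  return { pass: errors.length === 0, errors };
}

console.log(validateTicketSummary('{"title": "Refund request", "priority": "urgent"}'));
// { pass: false, errors: ["priority must be low | medium | high"] }
```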
RELIABILITY (Will it work in production?)
| Failure Mode | Example | Evaluation Method |
|---|---|---|
| Latency spike | Response takes 30s instead of 2s | Deterministic: timing |
| Token explosion | Uses 10x expected tokens | Deterministic: count |
| Rate limit hit | Too many requests | Deterministic: monitoring |
| Empty response | Returns nothing | Deterministic |
| Timeout | Never completes | Deterministic |
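Every row here is measurable without a judge. A minimal sketch that wraps a model call with timing and empty-response checks; the limit and report shape are assumptions:

```typescript
// Illustrative reliability wrapper: time the call, flag empty responses.

interface ReliabilityReport {
  latencyMs: number;
  outputChars: number;
  failures: string[];
}

async function checkReliability(call: () => Promise<string>, maxLatencyMs = 5000): Promise<ReliabilityReport> {
  const start = Date.now();
  const output = await call();
  const latencyMs = Date.now() - start;

  const failures: string[] = [];
  if (latencyMs > maxLatencyMs) failures.push("latency spike");
  if (output.trim().length === 0) failures.push("empty response");

  return { latencyMs, outputChars: output.length, failures };
}

// Usage, with any model call:
// const report = await checkReliability(() => callModel(prompt)); // callModel is hypothetical
```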
The Three-Tier Evaluation Architecture
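Read against the tables above, the three tiers are deterministic checks, hybrid/statistical checks, and LLM-judge evaluation. A minimal sketch of how they might compose into one pass; the shapes and stub checks are assumptions, not Flightline’s actual API:

```typescript
// Illustrative composition of the three tiers. CheckResult and the stub checks
// are assumptions for this sketch.

type Tier = "deterministic" | "hybrid" | "llm-judge";

interface CheckResult {
  name: string;
  tier: Tier;
  pass: boolean;
}

// Tier 1: deterministic — exact and cheap, so it runs first and can hard-fail fast.
function checkParseable(output: string): CheckResult {
  let pass = true;
  try { JSON.parse(output); } catch { pass = false; }
  return { name: "schema", tier: "deterministic", pass };
}

// Tier 2: hybrid/statistical — a formula paired with judgment (here: word count).
function checkBrevity(output: string, maxWords: number): CheckResult {
  const pass = output.trim().split(/\s+/).length <= maxWords;
  return { name: "brevity", tier: "hybrid", pass };
}

// Tier 3: LLM-judge — semantic questions; stubbed here, a rubric-based judge call in practice.
async function checkIntentAlignment(output: string, intent: string): Promise<CheckResult> {
  const pass = output.length > 0; // placeholder verdict
  return { name: "intent-alignment", tier: "llm-judge", pass };
}

async function evaluate(output: string, intent: string): Promise<CheckResult[]> {
  return [checkParseable(output), checkBrevity(output, 300), await checkIntentAlignment(output, intent)];
}
```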
LLM-Judge as a First-Class Primitive
Most tools treat LLM-as-judge as:
- A last resort
- A hack
- A research toy
Flightline treats it as:
- The core evaluation engine
- Structured and auditable
- Calibrated against known examples
- The only way to answer semantic questions
Guardrails for Trustworthy LLM-Judgment
- Rubric-based judging - Explicit criteria, not “is this good?”
- Dimension-isolated scoring - Score each aspect independently
- Binary ship-blocking thresholds - Clear pass/fail cutoffs
- Known-good / known-bad calibration - Ground the judge in examples
- Judge disagreement detection - Flag uncertain evaluations
- Confidence intervals - Report uncertainty, not just scores
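A minimal sketch of the first three guardrails — a rubric prompt scoped to one dimension plus a binary ship-blocking threshold. The prompt wording and `Judgment` shape are assumptions; the judge call itself is left out:

```typescript
// Illustrative rubric-based judge setup.

interface Judgment {
  dimension: string; // one aspect per call — never a blended "is this good?" score
  score: number;     // 0–1, judge-reported
  rationale: string;
}

function buildRubricPrompt(dimension: string, criteria: string[], output: string, context: string): string {
  return [
    `You are grading one dimension only: ${dimension}.`,
    `Criteria:\n${criteria.map((c, i) => `${i + 1}. ${c}`).join("\n")}`,
    `Context:\n${context}`,
    `Output to grade:\n${output}`,
    `Respond as JSON: {"dimension": "${dimension}", "score": <0-1>, "rationale": "..."}`,
  ].join("\n\n");
}

// Binary ship-blocking threshold: the judgment either blocks the ship decision or it doesn't.
const blocksShip = (judgment: Judgment, threshold = 0.8): boolean => judgment.score < threshold;
```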
You’re not “asking the model if it’s good.”
You’re running semantic unit tests.
The Eval Unlock Model
Customers don’t want to design evals. They want to unlock the evals that matter for their product.
| Evaluation Capability | What Flightline Needs | Why |
|---|---|---|
| Task correctness | Prompt + user intent | Judge needs to know what “right” means |
| Grounding / faithfulness | Source documents / context | Detect hallucinations against truth |
| Rule compliance | Business rules / policies | Enforce domain constraints |
| Tone & style | Example outputs or tone spec | Evaluate UX trust |
| Safety & harm | Domain + risk tolerance | Medical ≠ SaaS ≠ Legal |
| Consistency | Multiple runs or logs | Measure variance |
| Quality bar | “Ship-ready vs reject” examples | Calibrate subjective threshold |
| Format correctness | Schema | Deterministic validation |
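A minimal sketch of the unlock mapping — which evals light up given what a team provides. Field and eval names are illustrative, mirroring the table above:

```typescript
// Illustrative unlock mapping: what a team provides determines which evals run.

interface ProvidedInputs {
  prompt?: string;
  sourceDocuments?: string[];
  businessRules?: string[];
  exampleOutputs?: string[];
  schema?: object;
}

function unlockedEvals(inputs: ProvidedInputs): string[] {
  const evals: string[] = [];
  if (inputs.prompt) evals.push("task correctness");
  if (inputs.sourceDocuments?.length) evals.push("grounding / faithfulness");
  if (inputs.businessRules?.length) evals.push("rule compliance");
  if (inputs.exampleOutputs?.length) evals.push("tone & style", "quality bar");
  if (inputs.schema) evals.push("format correctness");
  return evals;
}

// Providing only a prompt and a schema unlocks two capabilities:
console.log(unlockedEvals({ prompt: "Summarize the ticket", schema: {} }));
// ["task correctness", "format correctness"]
```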
The Ship Readiness View
The question teams actually ask: “Can I ship this?”
| Question | Status | Confidence | Evidence |
|---|---|---|---|
| Does it do the right thing? | ⚠️ | 0.72 | Intent mismatch in 2/20 |
| Is it grounded? | ❌ | 0.41 | Hallucinated entity |
| Rule compliant? | ✅ | 0.94 | All checks passed |
| Safe? | ⚠️ | 0.81 | Edge-case advice flagged |
| Consistent? | ❌ | 0.58 | High variance across runs |
| Good enough to ship? | ❌ | 0.66 | Below quality bar |
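A minimal sketch of how per-question confidence might roll up into these statuses. The 0.9 / 0.7 cutoffs are assumptions, chosen only to be consistent with the example table:

```typescript
// Illustrative roll-up from per-question confidence to ship-readiness status.

type Status = "✅" | "⚠️" | "❌";

interface QuestionResult {
  question: string;
  confidence: number; // aggregated over the evals behind this question
  evidence: string;
}

function toStatus(confidence: number): Status {
  if (confidence >= 0.9) return "✅";
  if (confidence >= 0.7) return "⚠️";
  return "❌";
}

const shipReady = (results: QuestionResult[]): boolean =>
  results.every((r) => toStatus(r.confidence) !== "❌");

console.log(shipReady([
  { question: "Is it grounded?", confidence: 0.41, evidence: "Hallucinated entity" },
  { question: "Rule compliant?", confidence: 0.94, evidence: "All checks passed" },
])); // false — grounding blocks the ship decision
```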
Strategic Positioning
The Thesis
“Flightline answers the questions that block AI from shipping.”
Or:
“Before you ship AI, Flightline tells you if it’s safe, correct, and good enough.”
Why This Is Defensible
- LangSmith exposes primitives
- Langfuse exposes traces
- OpenAI Evals exposes building blocks
Flightline answers the ship-readiness questions directly.
The Key Insight
LLM-judge is not a weakness.
Pretending deterministic evals are enough is the weakness.
Appendix: What Each Eval Type Proves
| If This Passes… | You Can Say… |
|---|---|
| Format validation | “The output is machine-parseable” |
| Schema validation | “The output has the right shape” |
| Numeric grounding | “The numbers are accurate” |
| PII detection | “No known PII was leaked” |
| Intent alignment (judge) | “It did what was asked” |
| Source faithfulness (judge) | “It didn’t make things up” |
| Safety assessment (judge) | “It’s unlikely to cause harm” |
| Quality scoring (judge) | “It meets our bar for good” |
| Consistency (statistical) | “It behaves predictably” |
All checks together build confidence.
References
- PRODUCT_CONTEXT.md - Product vision and strategy
- ThreatMatrix.jsx - UI component for regression categories
- docs/concepts/fact-checker.mdx - Fact-Checker concept documentation
