
The 7 Ship-Blocking Questions

To move beyond “vibe checks,” engineering teams need a standardized framework for evaluating AI. Flightline uses the 7 Ship-Blocking Questions to determine whether an AI feature is ready for production. These questions form the core of the flightline check command.

1. Does it do the right thing? (Task Completion)

This is the most fundamental question. Did the model understand the user’s intent and perform the primary task requested? If the goal was to summarize a document, did it actually produce a summary?

2. Is it truthful and grounded? (Grounding)

In RAG (Retrieval-Augmented Generation) and data-extraction workflows, truthfulness is paramount. This check verifies that every claim in the AI’s output can be traced back to a specific piece of evidence in the provided input context.
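To make the idea of tracing claims back to evidence concrete, here is a minimal sketch that flags output sentences whose words barely appear in the input context. It is an illustration only, not Flightline's implementation: the function name unsupported_sentences, the min_overlap threshold, and the word-overlap heuristic are all assumptions, and a real grounding check would likely need something stronger than lexical overlap.

```python
import re


def unsupported_sentences(output: str, context: str, min_overlap: float = 0.6) -> list[str]:
    """Return output sentences whose words are poorly covered by the context.

    Hypothetical heuristic: a sentence counts as "grounded" when at least
    min_overlap of its words also occur somewhere in the context.
    """
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    flagged = []
    # Split on whitespace that follows sentence-ending punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", output.strip()):
        words = re.findall(r"[a-z0-9]+", sentence.lower())
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


context = "The invoice total is 420 USD and it is due on 1 March."
output = "The invoice total is 420 USD. Payment was already received by the bank."
print(unsupported_sentences(output, context))
# → ['Payment was already received by the bank.']
```

The first sentence is fully covered by the context, while the second introduces a claim with no supporting evidence and is flagged.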

3. Did it hallucinate? (Hallucination)

Hallucinations occur when a model ignores the task’s constraints or invents information that appears neither in the request nor in the provided data. This check looks for such “creative” additions that fall outside the boundaries of the task or the data provided.

4. Did it follow our rules? (Rule Compliance)

Every business has specific rules for its AI (e.g., “never mention competitors,” “always include a disclaimer,” “output must be valid JSON”). This check enforces these deterministic and qualitative constraints.
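The deterministic rules in the examples above can be sketched as plain string and parser checks. This is a hedged illustration, not Flightline's rule engine: check_rules, its parameters, and the violation messages are hypothetical names chosen for this example. Qualitative rules would need a different mechanism and are not covered here.

```python
import json


def check_rules(output: str, banned_terms: list[str], required_disclaimer: str) -> list[str]:
    """Return a list of rule violations found in the output (empty means compliant)."""
    violations = []

    # Rule: output must be valid JSON.
    try:
        json.loads(output)
    except json.JSONDecodeError:
        violations.append("output must be valid JSON")

    lowered = output.lower()

    # Rule: never mention certain terms (e.g. competitor names).
    for term in banned_terms:
        if term.lower() in lowered:
            violations.append(f"mentions banned term: {term}")

    # Rule: always include a disclaimer.
    if required_disclaimer.lower() not in lowered:
        violations.append("missing required disclaimer")

    return violations


response = '{"answer": "Try AcmeCorp instead.", "disclaimer": "This is not financial advice."}'
print(check_rules(response, ["AcmeCorp"], "This is not financial advice."))
# → ['mentions banned term: AcmeCorp']
```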

5. Did it avoid harm? (Safety)

AI models can occasionally produce biased, offensive, or unsafe content. This check monitors for safety violations and ensures the output adheres to your organization’s ethical and security policies.

6. Is it consistent? (Consistency)

AI is probabilistic, but its behavior should be stable. If given the same input (or very similar inputs) multiple times, does the model produce consistent results, or does its reasoning drift?
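One simple way to quantify stability is to collect several outputs for the same input and average their pairwise similarity. The sketch below uses Python's standard-library difflib for this. It is an assumption about how such a score could be computed, not a description of how Flightline measures consistency, and consistency_score is a hypothetical name.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consistency_score(outputs: list[str]) -> float:
    """Mean pairwise similarity across outputs, from 0.0 to 1.0 (1.0 = identical)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        # Zero or one output: nothing to compare, so treat it as consistent.
        return 1.0
    total = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs)
    return total / len(pairs)


# Outputs collected from repeated runs on the same input (static here for illustration).
runs = [
    "Refund approved for order 1182.",
    "Refund approved for order 1182.",
    "Refund denied for order 1182.",
]
print(round(consistency_score(runs), 2))
```

Character-level similarity is a crude proxy: two answers can be worded differently yet agree, or differ by one word yet disagree, as the third run shows. A stricter check would compare the extracted decision rather than the raw text.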

7. Is it good enough? (Quality)

Quality covers the “last mile” of AI performance: tone, formatting, brevity, and helpfulness. This check uses your custom rubrics to ensure the output matches your brand’s voice and standards.

Severity and Impact

Each of these questions can be assigned a severity level in your flightline.yaml config:
  • CRITICAL: A failure in this category blocks shipping immediately.
  • HIGH: A failure requires manual review before shipping.
  • MEDIUM/LOW: A failure is reported as an insight for continuous improvement but does not block shipping.

By answering these seven questions for every PR, Flightline gives you evidence for each ship decision, so you can release AI features with confidence.
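The severity levels above imply a simple gating rule, sketched below. The schema of flightline.yaml is not shown here, so this example models only the decision logic. The Severity enum, the ship_decision function, and the returned labels are hypothetical and are not part of Flightline's API.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


def ship_decision(failures: dict[str, Severity]) -> str:
    """Map failed checks (check name -> assigned severity) to a ship decision."""
    severities = set(failures.values())
    if Severity.CRITICAL in severities:
        return "blocked"        # any CRITICAL failure blocks shipping
    if Severity.HIGH in severities:
        return "needs_review"   # HIGH failures require manual review
    return "ship"               # MEDIUM/LOW failures are informational only


print(ship_decision({"grounding": Severity.CRITICAL, "quality": Severity.LOW}))  # → blocked
print(ship_decision({"consistency": Severity.HIGH}))                             # → needs_review
print(ship_decision({"quality": Severity.MEDIUM}))                               # → ship
```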
