Pre-Deployment Evaluation Checklist

A structured review sequence for AI systems before they go live, covering behavioral syndrome measurement, domain calibration, failure mode documentation, monitoring setup, remediation planning, user communication, and four-role sign-off. Designed to integrate with existing launch review processes, not replace them.

Who uses this

Engineering leads, product managers, AI safety committees, and compliance reviewers running launch reviews for AI-powered systems or features. Anyone whose current process doesn’t include behavioral syndrome evaluation.

What it adds

Syndrome-specific gate criteria and a documented sign-off structure. It does not replace security review, load testing, or accessibility review; it complements them with behavioral quality coverage those processes don’t provide.

How to use it

Fill in threshold values before any deployment review. Treat unchecked boxes as blockers. Require sign-off from all four stakeholder roles before production release. Archive the completed checklist with the deployment record.

Why this checklist exists

What standard launch reviews miss

Standard launch reviews tend to check whether a system works — does it respond, is it fast, does it crash. They do not check whether a system is honest about its limits, complete in its outputs, and transparent when it can’t help. These are behavioral properties, and they don’t show up in latency dashboards or error rate monitors.

The Core Six framework identifies six defensive behavior patterns — Capability Masking, Plausible Helpfulness, Hollow Completions, Built-Not-Connected, Responsibility Diffusion, Surface Compliance — that are measurable before deployment and that predict user harm and trust erosion at production scale. Measuring them before launch, not after, is the only reliable way to prevent them from becoming embedded in your user experience.

This checklist creates a documented, repeatable review gate that checks for these behavioral patterns systematically, assigns responsibility for each category, and requires explicit sign-off before release.

Checklist Categories

Seven categories, each with a purpose

Category 1 — Baseline Syndrome Measurement

Establishes pre-deployment incidence rates across all six syndromes. This is the starting-line measurement you’ll need for post-launch monitoring and vendor accountability. Without a pre-launch baseline, you cannot tell whether a vendor’s system has improved or degraded.
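As a rough illustration of what a baseline measurement could look like in practice, the sketch below computes a per-syndrome incidence rate from tagged evaluation results. The six syndrome names come from this page; the record layout (`query_id`, `syndromes`) and the function name `baseline_incidence` are assumptions made for the example, not part of the Core Six methodology.

```python
from collections import Counter

# Syndrome names as listed in this checklist.
SYNDROMES = (
    "Capability Masking",
    "Plausible Helpfulness",
    "Hollow Completions",
    "Built-Not-Connected",
    "Responsibility Diffusion",
    "Surface Compliance",
)


def baseline_incidence(records):
    """Return {syndrome: percent of evaluated queries tagged with it}.

    `records` holds one dict per evaluated query, for example
    {"query_id": "q1", "syndromes": ["Hollow Completions"]}.
    This layout is hypothetical; adapt it to your evaluation tooling.
    """
    if not records:
        raise ValueError("no evaluation records; a baseline needs a sample")
    counts = Counter()
    for record in records:
        # A query counts at most once per syndrome, even if tagged repeatedly.
        for name in set(record["syndromes"]):
            if name not in SYNDROMES:
                raise ValueError(f"unknown syndrome tag: {name}")
            counts[name] += 1
    total = len(records)
    return {name: 100.0 * counts[name] / total for name in SYNDROMES}


# Illustrative data only.
sample = [
    {"query_id": "q1", "syndromes": ["Hollow Completions"]},
    {"query_id": "q2", "syndromes": []},
    {"query_id": "q3", "syndromes": ["Hollow Completions", "Surface Compliance"]},
    {"query_id": "q4", "syndromes": []},
]
baseline = baseline_incidence(sample)
```

Storing the resulting dictionary alongside the sample size gives post-launch monitoring a fixed reference point to compare against.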

Category 2 — Domain Threshold Calibration

Confirms that incidence rates are within acceptable ranges for your specific domain. Healthcare, legal, and financial deployments require tighter thresholds than low-stakes consumer contexts. This category ensures the right thresholds are applied, not defaults. Use the Domain Thresholds matrix to look up recommended ranges by domain and syndrome.
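One possible way to apply domain-specific thresholds is sketched below. The numbers in `EXAMPLE_THRESHOLDS` are invented placeholders, not values from the Domain Thresholds matrix, and the function name `threshold_blockers` is hypothetical. The point of the sketch is the shape of the check: an unknown domain is an error rather than a silent fallback to defaults.

```python
# Placeholder thresholds in percent. These numbers are made up for
# illustration; take real values from the Domain Thresholds matrix.
EXAMPLE_THRESHOLDS = {
    "default": {"Capability Masking": 5.0, "Hollow Completions": 5.0},
    "healthcare": {"Capability Masking": 1.0, "Hollow Completions": 2.0},
}


def threshold_blockers(measured, domain, thresholds=EXAMPLE_THRESHOLDS):
    """List the syndromes whose measured incidence is not below threshold."""
    if domain not in thresholds:
        # Do not fall back to defaults silently: the checklist asks for
        # thresholds to be chosen deliberately for the deployment domain.
        raise KeyError(f"no thresholds defined for domain {domain!r}")
    blockers = []
    for syndrome, limit in thresholds[domain].items():
        if syndrome not in measured:
            blockers.append(f"{syndrome}: not measured")
        elif measured[syndrome] >= limit:
            blockers.append(
                f"{syndrome}: {measured[syndrome]:.1f}% (threshold <{limit:.1f}%)"
            )
    return blockers


# Illustrative measurements only.
measured = {"Capability Masking": 0.8, "Hollow Completions": 3.5}
print(threshold_blockers(measured, "healthcare"))
print(threshold_blockers(measured, "default"))
```

With these placeholder numbers the same measurements pass under the default thresholds but block a healthcare deployment, which is the situation this category is meant to catch.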

Category 3 — Failure Mode Documentation

Creates a written inventory of known failure patterns encountered during evaluation — the specific queries or conditions that triggered each syndrome. This is the record that incident responders, future evaluators, and vendors need to understand what the system does when it goes wrong.

Category 4 — Monitoring Setup Verification

Confirms that production monitoring is live before the system goes live. A checklist gate without downstream monitoring just shifts when you discover problems, not whether you do. This category closes that gap by requiring monitoring confirmation before launch approval.
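A minimal sketch of a production check follows, assuming incidence rates for a monitoring window are already computed upstream. The `Alert` class, `check_window`, and `self_test` are hypothetical names; real alert routing (pager, chat, email) is out of scope here. The self-test injects a synthetic breach, which is one way to show that monitoring has been tested and not merely switched on.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    syndrome: str
    observed: float   # percent in the current monitoring window
    baseline: float   # percent measured before launch
    threshold: float  # alert threshold, percent


def check_window(observed, baseline, alert_thresholds):
    """Return an Alert for each syndrome at or above its alert threshold."""
    alerts = []
    for syndrome, threshold in alert_thresholds.items():
        if syndrome not in observed:
            # Missing data is a monitoring failure, not a pass.
            raise ValueError(f"no production data for {syndrome}")
        rate = observed[syndrome]
        if rate >= threshold:
            # baseline[syndrome] raises KeyError if no baseline was loaded,
            # which surfaces a skipped Category 1 measurement.
            alerts.append(Alert(syndrome, rate, baseline[syndrome], threshold))
    return alerts


def self_test():
    """Inject a synthetic breach and confirm it produces exactly one alert."""
    alerts = check_window(
        observed={"Surface Compliance": 100.0},
        baseline={"Surface Compliance": 0.0},
        alert_thresholds={"Surface Compliance": 1.0},
    )
    return len(alerts) == 1
```

Carrying the baseline inside each alert lets the on-call reader see immediately how far production has drifted from the pre-launch measurement.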

Category 5 — Remediation Plan

Requires that response plans for threshold breaches exist before the system is live, not after the first incident. Who decides to roll back? What’s the communication plan? What’s the escalation path? Answering these under pressure produces worse decisions than answering them in advance.

Category 6 — User Communication

Confirms that end users will be informed of the system’s known limitations and behavioral characteristics before or at point of use, and that a mechanism exists to collect feedback when the system fails. Users cannot protect themselves from failure modes they don’t know exist.

Category 7 — Multi-Stakeholder Sign-Off

Engineering sign-off alone is insufficient for behavioral quality because engineering evaluation focuses on technical correctness, not user trust impact. Requiring product, legal/compliance, and AI safety committee review introduces the perspectives that catch the failure modes technical review misses.

Sign-Off Roles

Why four roles, not one

Role	What they’re signing off on
Engineering Lead	Syndrome incidence rates are within threshold. Monitoring is live. Known failure modes are documented and understood. Remediation runbooks exist.
Product Manager	The failure modes documented in Category 3 are acceptable given the intended use case and user base. The user experience implications of observed syndromes are understood and acceptable at launch. User communication plan (Category 6) is approved.
Legal / Compliance	The deployment meets organizational AI use policy, applicable regulatory requirements, and any sector-specific standards. Residual behavioral risk is documented and accepted at the appropriate level.
AI Safety Committee (if no dedicated committee exists, senior technical leadership and legal/compliance jointly fulfil this role)	Behavioral alignment characteristics are acceptable for the deployment context. Surface Compliance and Capability Masking incidence levels are reviewed at the committee level. Escalation path for behavioral drift post-launch is approved.
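If sign-offs are recorded electronically, the four-role requirement can be enforced with a small gate like the sketch below. The role names come from the table above; the record layout (`name`, `date`) and the function `missing_signoffs` are assumptions for illustration.

```python
# Role names as listed in the sign-off table.
REQUIRED_ROLES = (
    "Engineering Lead",
    "Product Manager",
    "Legal / Compliance",
    "AI Safety Committee",
)


def missing_signoffs(signoffs):
    """Return the required roles that lack a complete sign-off record.

    `signoffs` maps role -> {"name": ..., "date": ...} (hypothetical layout).
    A record with an empty name or date counts as missing.
    """
    missing = []
    for role in REQUIRED_ROLES:
        record = signoffs.get(role) or {}
        if not record.get("name") or not record.get("date"):
            missing.append(role)
    return missing


# Illustrative records only.
signoffs = {
    "Engineering Lead": {"name": "A. Reviewer", "date": "2026-01-15"},
    "Product Manager": {"name": "B. Reviewer", "date": ""},
}
print(missing_signoffs(signoffs))
```

A release pipeline could refuse to proceed while the returned list is non-empty.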

Template

Copy and adapt

Replace all [X], [Name], and bracketed values before use. Treat unchecked items as launch blockers. Store completed checklists with deployment records for audit trail continuity.

## Pre-Deployment AI Behavioral Quality Review

System Name: ___________________________________
Version / Build: ________________________________
Review Date: ____________________________________
Deployment Target: ______________________________
Reviewer(s): ____________________________________

---

CATEGORY 1 — BASELINE SYNDROME MEASUREMENT

Evaluation completed using:
[ ] Core Six Syndrome Calibration (YIM Project, doi.org/10.5281/zenodo.19423182)
[ ] Custom evaluation methodology (document separately)

Sample size: _______ queries across _______ evaluation sessions

Syndrome incidence baseline (measured, not placeholder):
Capability Masking: _______% (threshold: <[X]%)
Plausible Helpfulness: _______% (threshold: <[X]%)
Hollow Completions: _______% (threshold: <[X]%)
Built-Not-Connected: _______% (threshold: <[X]%)
Responsibility Diffusion: _______% (threshold: <[X]%)
Surface Compliance: _______% (threshold: <[X]%)

[ ] All Critical-tier syndromes below threshold
[ ] All High-priority syndromes below threshold
[ ] All Medium-priority syndromes below threshold or risk accepted and documented (Category 5)

---

CATEGORY 2 — DOMAIN THRESHOLD CALIBRATION

Deployment domain: _____________________________
(Healthcare / Legal / Financial / Software Dev / Education / Other — specify)

[ ] Domain thresholds reviewed against Domain Thresholds matrix (Core Six Matrix Explorer — yeahitsme.com/matrix-explorer)
[ ] Thresholds adjusted from default where domain requires — adjustments documented here: _______________________________________________
[ ] Compound risk reviewed for all co-occurring syndrome pairs above [X]% incidence

---

CATEGORY 3 — FAILURE MODE DOCUMENTATION

[ ] Known failure modes documented for each syndrome with incidence above [X]%
[ ] Sample trigger queries identified for each documented failure mode
[ ] Edge case and adversarial prompt failure patterns documented
[ ] Failure mode inventory stored at: _______________

Summary of highest-priority failure modes:
1. _______________________________________________
2. _______________________________________________
3. _______________________________________________

---

CATEGORY 4 — MONITORING SETUP VERIFICATION

[ ] Production syndrome monitoring live and tested before deployment date
[ ] Alert thresholds set for Critical-tier syndrome breach at [X]% with [alert destination]
[ ] Alert thresholds set for High-priority syndrome breach at [X]% with [alert destination]
[ ] Monthly incidence reporting scheduled
[ ] Monitoring dashboard confirmed accessible to:
    [ ] Engineering [ ] Product [ ] Safety/Compliance
[ ] Baseline incidence rates loaded into monitoring system for post-launch trend comparison

---

CATEGORY 5 — REMEDIATION PLAN

[ ] Critical-tier breach response documented:
    — Decision authority: _______________________
    — Maximum time to remediation plan: 48 hours
    — Rollback decision criteria: ______________
[ ] High-priority breach response documented
[ ] Communication plan for user-facing incidents exists
[ ] Vendor escalation contacts documented
[ ] Incident report template (S1.2) available to on-call team

---

CATEGORY 6 — USER COMMUNICATION

[ ] Users will be informed of the system’s known behavioral limitations at or before point of use
[ ] Known high-risk failure modes (per Category 3) disclosed in user-facing documentation
[ ] Feedback mechanism exists for users to report AI behavioral failures
[ ] Feedback is routed to the team responsible for syndrome monitoring
[ ] User communication materials reviewed and approved by Product Manager before launch

Summary of disclosed limitations:
_________________________________________________
_________________________________________________

Feedback channel: ________________________________

---

CATEGORY 7 — SIGN-OFF

I confirm this system has met all applicable thresholds and documentation requirements in Categories 1–6.

Engineering Lead
Name: _________________ Date: _________________
Signature / Record: _____________________________

Product Manager
Name: _________________ Date: _________________
Signature / Record: _____________________________

Legal / Compliance
Name: _________________ Date: _________________
Signature / Record: _____________________________

AI Safety Committee
Name: _________________ Date: _________________
Signature / Record: _____________________________

Deployment authorized: [ ] YES [ ] CONDITIONAL [ ] NO

If conditional or no, document conditions / blockers:
_________________________________________________
_________________________________________________

---

Archived checklist location: ______________________
Related incident reports: _________________________
Evaluation data location: _________________________
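For teams that keep the completed template as a text file, the "unchecked items are blockers" rule can be checked mechanically. The sketch below is one hedged way to do it. It assumes checked boxes are written with a lowercase `[x]` so they are not confused with the uppercase `[X]` threshold placeholder, and it treats any run of five or more underscores as an unfilled blank. The function name `template_blockers` is hypothetical.

```python
import re

UNCHECKED = re.compile(r"\[ \]")
# Uppercase [X] and [Name] are template placeholders; long underscore runs
# are blanks that were never filled in. Matching is case-sensitive, so a
# lowercase [x] (a checked box, by assumption) is not flagged.
PLACEHOLDER = re.compile(r"\[X\]|\[Name\]|_{5,}")


def template_blockers(text):
    """Return (line_number, reason) pairs for every unresolved item."""
    blockers = []
    for number, line in enumerate(text.splitlines(), start=1):
        if UNCHECKED.search(line):
            blockers.append((number, "unchecked box"))
        if PLACEHOLDER.search(line):
            blockers.append((number, "unfilled value"))
    return blockers


# Illustrative fragments only.
draft = (
    "[ ] Monthly incidence reporting scheduled\n"
    "Capability Masking: 2% (threshold: <[X]%)"
)
done = (
    "[x] Monthly incidence reporting scheduled\n"
    "Capability Masking: 2% (threshold: <5%)"
)
print(template_blockers(draft))
print(template_blockers(done))
```

One limitation to keep in mind: choose-one groups such as "Evaluation completed using" and "Deployment authorized" legitimately leave some boxes empty, so a real check would need to exempt those lines or handle them separately.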

Template from: “From Micro-Failure Tags to Defensive Syndromes”, Supplementary Materials S1.6

Ernesto A. Taylor, “From Micro-Failure Tags to Defensive Syndromes,” YIM Project, 2026. Free to use and adapt with attribution (CC BY 4.0).

research@yeahitsme.com