S1.6 — Operational Template

Pre-Deployment Evaluation Checklist

A structured review sequence for AI systems before they go live — covering behavioral syndrome measurement, domain calibration, failure mode documentation, monitoring setup, user communication, and four-role sign-off. Designed to integrate with existing launch review processes, not replace them.

Who uses this
Engineering leads, product managers, AI safety committees, and compliance reviewers running launch reviews for AI-powered systems or features, along with anyone else whose current process doesn’t include behavioral syndrome evaluation.
What it adds
Syndrome-specific gate criteria and documented sign-off structure. Not a replacement for security review, load testing, or accessibility review — a complement that adds behavioral quality coverage those processes don’t catch.
How to use it
Fill in threshold values before any deployment review. Treat unchecked boxes as blockers. Require sign-off from all four stakeholder roles before production release. Archive the completed checklist with the deployment record.

What standard launch reviews miss

Standard launch reviews tend to check whether a system works — does it respond, is it fast, does it crash. They do not check whether a system is honest about its limits, complete in its outputs, and transparent when it can’t help. These are behavioral properties, and they don’t show up in latency dashboards or error rate monitors.

The Core Six framework identifies six defensive behavior patterns — Capability Masking, Plausible Helpfulness, Hollow Completions, Built-Not-Connected, Responsibility Diffusion, Surface Compliance — that are measurable before deployment and that predict user harm and trust erosion at production scale. Measuring them before launch, not after, is the only reliable way to prevent them from becoming embedded in your user experience.

This checklist creates a documented, repeatable review gate that asks those questions systematically, assigns responsibility for each category, and requires explicit sign-off before release.
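As a rough illustration of the measurement this gate relies on, the sketch below computes per-syndrome incidence from labeled evaluation results and lists every syndrome that is not strictly below its threshold. The record format, the function names, and the threshold value used in the example are hypothetical; the template leaves thresholds as [X] placeholders to be set per domain.

```python
from collections import Counter

# The six syndromes named by the Core Six framework.
SYNDROMES = [
    "Capability Masking",
    "Plausible Helpfulness",
    "Hollow Completions",
    "Built-Not-Connected",
    "Responsibility Diffusion",
    "Surface Compliance",
]


def incidence_rates(labeled_queries):
    """Return the percentage of evaluated queries flagged with each syndrome.

    labeled_queries: list of sets; each set holds the syndromes observed
    on one evaluated query (an empty set means none were observed).
    """
    total = len(labeled_queries)
    if total == 0:
        raise ValueError("no evaluated queries")
    counts = Counter(s for labels in labeled_queries for s in labels)
    return {s: 100.0 * counts[s] / total for s in SYNDROMES}


def threshold_blockers(rates, thresholds):
    """List syndromes whose incidence is not strictly below threshold.

    A syndrome with no threshold filled in is also treated as a blocker.
    """
    blockers = []
    for s in SYNDROMES:
        limit = thresholds.get(s)
        if limit is None or rates[s] >= limit:
            blockers.append(s)
    return blockers


# Hypothetical evaluation of four queries, with an illustrative 30% threshold.
sample = [set(), {"Hollow Completions"}, set(),
          {"Hollow Completions", "Surface Compliance"}]
rates = incidence_rates(sample)
blockers = threshold_blockers(rates, {s: 30.0 for s in SYNDROMES})
```

Treating a missing threshold as a blocker mirrors the instruction to fill in threshold values before any deployment review.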

Seven categories, each with a purpose

Category 1 — Baseline Syndrome Measurement
Establishes pre-deployment incidence rates across all six syndromes. This is the starting-line measurement you’ll need for post-launch monitoring and vendor accountability. Without a pre-launch baseline, you cannot tell whether a vendor’s system has improved or degraded.
Category 2 — Domain Threshold Calibration
Confirms that incidence rates are within acceptable ranges for your specific domain. Healthcare, legal, and financial deployments require tighter thresholds than low-stakes consumer contexts. This category ensures the right thresholds are applied, not defaults. Use the Domain Thresholds matrix to look up recommended ranges by domain and syndrome.
Category 3 — Failure Mode Documentation
Creates a written inventory of known failure patterns encountered during evaluation — the specific queries or conditions that triggered each syndrome. This is the record that incident responders, future evaluators, and vendors need to understand what the system does when it goes wrong.
Category 4 — Monitoring Setup Verification
Confirms that production monitoring is live before the system goes live. A checklist gate without downstream monitoring just shifts when you discover problems, not whether you do. This category closes that gap by requiring monitoring confirmation before launch approval.
Category 5 — Remediation Plan
Requires that response plans for threshold breaches exist before the system is live, not after the first incident. Who decides to roll back? What’s the communication plan? What’s the escalation path? Answering these under pressure produces worse decisions than answering them in advance.
Category 6 — User Communication
Confirms that end users will be informed of the system’s known limitations and behavioral characteristics before or at point of use, and that a mechanism exists to collect feedback when the system fails. Users cannot protect themselves from failure modes they don’t know exist.
Category 7 — Multi-Stakeholder Sign-Off
Engineering sign-off alone is insufficient for behavioral quality because engineering evaluation focuses on technical correctness, not user trust impact. Requiring product, legal/compliance, and AI safety committee review introduces the perspectives that catch the failure modes technical review misses.
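One way to use the Category 1 baseline in the Category 4 monitoring step is to compare each production incidence reading against both the alert threshold and the pre-launch baseline. The sketch below is illustrative only: the class, the function, and the drift margin are assumptions, and the source does not prescribe a specific alerting rule.

```python
from dataclasses import dataclass


@dataclass
class SyndromeReading:
    syndrome: str
    baseline_pct: float  # pre-launch incidence (Category 1)
    current_pct: float   # incidence observed in production
    alert_pct: float     # alert threshold configured in Category 4


def check_reading(reading, drift_margin_pct=1.0):
    """Classify a reading as 'breach', 'drift', or 'ok'.

    drift_margin_pct is an assumed tolerance, in percentage points above
    baseline; it is not taken from the source.
    """
    if reading.current_pct >= reading.alert_pct:
        return "breach"
    if reading.current_pct > reading.baseline_pct + drift_margin_pct:
        return "drift"
    return "ok"


# Hypothetical reading: 2.0% baseline, 3.5% in production, alert at 5.0%.
status = check_reading(SyndromeReading("Hollow Completions", 2.0, 3.5, 5.0))
```

Separating "drift" from "breach" is a design choice here: it surfaces degradation relative to the baseline before an alert threshold is crossed, which is the comparison the baseline measurement exists to support.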

Why four roles, not one

Role and what they’re signing off on:

Engineering Lead
Syndrome incidence rates are within threshold. Monitoring is live. Known failure modes are documented and understood. Remediation runbooks exist.
Product Manager
The failure modes documented in Category 3 are acceptable given the intended use case and user base. The user experience implications of observed syndromes are understood and acceptable at launch. User communication plan (Category 6) is approved.
Legal / Compliance
The deployment meets organizational AI use policy, applicable regulatory requirements, and any sector-specific standards. Residual behavioral risk is documented and accepted at the appropriate level.
AI Safety Committee (if no dedicated committee exists, senior technical leadership and legal/compliance jointly fulfil this role)
Behavioral alignment characteristics are acceptable for the deployment context. Surface Compliance and Capability Masking incidence levels are reviewed at the committee level. Escalation path for behavioral drift post-launch is approved.
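The four-role requirement can be enforced mechanically in a release pipeline. The following is a minimal sketch under assumed names and an assumed record format (a role mapped to a name and date); it is not part of the original template.

```python
# The four roles named in the sign-off structure.
REQUIRED_ROLES = (
    "Engineering Lead",
    "Product Manager",
    "Legal / Compliance",
    "AI Safety Committee",
)


def missing_signoffs(signoffs):
    """Return the required roles that lack a named, dated sign-off.

    signoffs: dict mapping role -> (name, date), both strings.
    """
    missing = []
    for role in REQUIRED_ROLES:
        name, date = signoffs.get(role, ("", ""))
        if not name.strip() or not date.strip():
            missing.append(role)
    return missing


def release_allowed(signoffs, open_blockers):
    """Allow release only with all four sign-offs and no open blockers."""
    return not missing_signoffs(signoffs) and not open_blockers


# Hypothetical record with one role still outstanding.
record = {
    "Engineering Lead": ("A. Example", "2026-01-05"),
    "Product Manager": ("B. Example", "2026-01-05"),
    "Legal / Compliance": ("C. Example", "2026-01-06"),
}
outstanding = missing_signoffs(record)
```

With this record, `release_allowed(record, [])` returns `False` until the AI Safety Committee entry is added.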

Copy and adapt

Replace all [X], [Name], and other bracketed values before use. Treat unchecked items as launch blockers. Store completed checklists with deployment records for audit trail continuity.

## Pre-Deployment AI Behavioral Quality Review

System Name: ___________________________________
Version / Build: ________________________________
Review Date: ____________________________________
Deployment Target: ______________________________
Reviewer(s): ____________________________________

---

CATEGORY 1: BASELINE SYNDROME MEASUREMENT

Evaluation completed using:
[ ] Core Six Syndrome Calibration (YIM Project, doi.org/10.5281/zenodo.19423182)
[ ] Custom evaluation methodology (document separately)

Sample size: _______ queries across _______ evaluation sessions

Syndrome incidence baseline (measured, not placeholder):
  Capability Masking: _______% (threshold: <[X]%)
  Plausible Helpfulness: _______% (threshold: <[X]%)
  Hollow Completions: _______% (threshold: <[X]%)
  Built-Not-Connected: _______% (threshold: <[X]%)
  Responsibility Diffusion: _______% (threshold: <[X]%)
  Surface Compliance: _______% (threshold: <[X]%)

[ ] All Critical-tier syndromes below threshold
[ ] All High-priority syndromes below threshold
[ ] All Medium-priority syndromes below threshold or risk accepted and documented (Category 5)

---

CATEGORY 2: DOMAIN THRESHOLD CALIBRATION

Deployment domain: _____________________________
(Healthcare / Legal / Financial / Software Dev / Education / Other: specify)

[ ] Domain thresholds reviewed against Domain Thresholds matrix (Core Six Matrix Explorer, yeahitsme.com/matrix-explorer)
[ ] Thresholds adjusted from default where domain requires; adjustments documented here:
    _______________________________________________
[ ] Compound risk reviewed for all co-occurring syndrome pairs above [X]% incidence

---

CATEGORY 3: FAILURE MODE DOCUMENTATION

[ ] Known failure modes documented for each syndrome with incidence above [X]%
[ ] Sample trigger queries identified for each documented failure mode
[ ] Edge case and adversarial prompt failure patterns documented
[ ] Failure mode inventory stored at: _______________

Summary of highest-priority failure modes:
1. _______________________________________________
2. _______________________________________________
3. _______________________________________________

---

CATEGORY 4: MONITORING SETUP VERIFICATION

[ ] Production syndrome monitoring live and tested before deployment date
[ ] Alert thresholds set for Critical-tier syndrome breach at [X]% with [alert destination]
[ ] Alert thresholds set for High-priority syndrome breach at [X]% with [alert destination]
[ ] Monthly incidence reporting scheduled
[ ] Monitoring dashboard confirmed accessible to:
    [ ] Engineering  [ ] Product  [ ] Safety/Compliance
[ ] Baseline incidence rates loaded into monitoring system for post-launch trend comparison

---

CATEGORY 5: REMEDIATION PLAN

[ ] Critical-tier breach response documented:
    - Decision authority: _______________________
    - Maximum time to remediation plan: 48 hours
    - Rollback decision criteria: ______________
[ ] High-priority breach response documented
[ ] Communication plan for user-facing incidents exists
[ ] Vendor escalation contacts documented
[ ] Incident report template (S1.2) available to on-call team

---

CATEGORY 6: USER COMMUNICATION

[ ] Users will be informed of the system’s known behavioral limitations at or before point of use
[ ] Known high-risk failure modes (per Category 3) disclosed in user-facing documentation
[ ] Feedback mechanism exists for users to report AI behavioral failures
[ ] Feedback is routed to the team responsible for syndrome monitoring
[ ] User communication materials reviewed and approved by Product Manager before launch

Summary of disclosed limitations:
_________________________________________________
_________________________________________________

Feedback channel: ________________________________

---

CATEGORY 7: SIGN-OFF

I confirm this system has met all applicable thresholds and documentation requirements in Categories 1–6.

Engineering Lead
Name: _________________ Date: _________________
Signature / Record: _____________________________

Product Manager
Name: _________________ Date: _________________
Signature / Record: _____________________________

Legal / Compliance
Name: _________________ Date: _________________
Signature / Record: _____________________________

AI Safety Committee
Name: _________________ Date: _________________
Signature / Record: _____________________________

Deployment authorized: [ ] YES  [ ] CONDITIONAL  [ ] NO

If conditional or no, document conditions / blockers:
_________________________________________________
_________________________________________________

---

Archived checklist location: ______________________
Related incident reports: _________________________
Evaluation data location: _________________________
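Because the template relies on bracketed placeholders such as [X] and [alert destination], a quick scan for leftovers can help before a completed checklist is archived. The sketch below is one possible approach, not part of the template. It assumes checked boxes are written with a lowercase x, so an uppercase [X] is always read as an unfilled threshold; the function and constant names are hypothetical.

```python
import re

# Any bracketed token without nested brackets, e.g. "[ ]", "[x]", "[X]".
BRACKETED = re.compile(r"\[([^\[\]]*)\]")

# Assumption: an empty box or a lowercase "x" is a checkbox, not a placeholder.
CHECKBOX_MARKS = {"", "x"}


def unfilled_placeholders(text):
    """Return bracketed tokens that are not checkboxes, in order of appearance."""
    found = []
    for match in BRACKETED.finditer(text):
        inner = match.group(1).strip()
        if inner not in CHECKBOX_MARKS:
            found.append(match.group(0))
    return found


# Hypothetical excerpt from a partly completed checklist.
excerpt = (
    "[x] Alert thresholds set at [X]% with [alert destination]\n"
    "[ ] Monthly incidence reporting scheduled"
)
leftovers = unfilled_placeholders(excerpt)
```

This scan only finds unreplaced placeholders. It does not decide which empty boxes are blockers, since some boxes in the template are either/or choices (for example the evaluation methodology and the YES / CONDITIONAL / NO authorization).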
Template from: “From Micro-Failure Tags to Defensive Syndromes,” Supplementary Materials S1.6
Ernesto A. Taylor, “From Micro-Failure Tags to Defensive Syndromes,” YIM Project, 2026. Free to use and adapt with attribution (CC BY 4.0).