Core Six Framework — Operational Toolkit

Supplementary Materials

Reference designs, calibration matrices, and deployable templates from the Core Six supplementary materials package. Free to adapt for your organization’s governance workflows, procurement documents, and evaluation infrastructure.

CC BY 4.0 — Free to use and adapt with attribution

Explore the Matrices and Dashboard Design

Two interactive reference tools drawn from the supplementary materials package.

S2 — Seven Matrices
Core Six Matrix Explorer
All seven calibration matrices in one interactive tool — filter by domain, tier, and syndrome.
What’s inside
7 tabs: Cross-reference, Tier severity, Domain thresholds, Deployment multipliers, Compound risk, Remediation priority, Monitoring triggers
Compound risk calculator: Enter two co-occurring syndromes to get the effective combined risk using S2 interaction multipliers
Domain filter: Healthcare, Legal, Financial Services, Software Dev, Education. Filter any matrix by domain instantly.
Tier color-coding: Safety-critical → red, High-stakes → amber, Standard → yellow, Low-stakes → green
S1.1 — Dashboard Architecture
Two-Layer Dashboard Reference
Interactive mockup of the executive incidence view and engineering micro-failure drill-down.
What’s inside
Layer 1 (Executive): Six syndrome incidence bars with threshold markers. Visual alert when any syndrome crosses its tier threshold.
Layer 2 (Engineering): Click any syndrome to drill into micro-failure tag breakdown, frequency bars, common patterns, and remediation targets.
Who uses Layer 1: Governance, compliance, and product owners. No technical training is required to read the view.
Who uses Layer 2: Engineers and evaluators. Tag-level specificity supports root cause diagnosis and remediation planning.
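The Layer 1 alert rule (flag any syndrome whose incidence crosses its tier threshold) can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual implementation: the function name, the dictionary layout, and the choice of a strict "greater than" comparison are assumptions, and the numbers in the usage lines are made-up placeholders.

```python
from typing import Dict, List


def syndromes_over_threshold(
    incidence: Dict[str, float],
    thresholds: Dict[str, float],
) -> List[str]:
    """Return the syndromes whose incidence (%) exceeds their threshold (%).

    Assumption: "crosses" is read as strictly greater than the threshold.
    A syndrome with no configured threshold raises KeyError so that a
    missing calibration value is never silently ignored.
    """
    return [name for name, rate in incidence.items() if rate > thresholds[name]]


# Placeholder values for illustration only; derive real thresholds yourself.
incidence = {"Capability Masking": 0.5, "Hollow Completions": 6.0}
thresholds = {"Capability Masking": 1.0, "Hollow Completions": 5.0}
alerts = syndromes_over_threshold(incidence, thresholds)
```

Raising on a missing threshold is a deliberate choice here: a view that shows no alert because no threshold was configured would look identical to one that is within limits.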
Critical reader notice: No numeric threshold in S2 is derived from empirical measurement or any published study. Every percentage, multiplier, and tier boundary is a structural placeholder derived from risk-reasoning only. The empirical basis is the YIM Project corpus (n=105 collected episodes; n=45 with complete syndrome coding at publication; primary empirical window n=80, October–December 2025). Do not cite any value from S2 as a research-derived standard and do not use any threshold as a contractual default without deriving your own values using the S2.1 calibration methodology and your own operational data.

Reference Designs for Practitioners

Copy-ready templates for incident reporting, model cards, procurement language, SLA terms, and pre-deployment checklists. Adapt field names, thresholds, and workflows to your organization.

S1.2
AI Behavior Incident Report Template
Syndrome-classified incident form for structured cross-team reporting

Extends standard AI behavior incident forms with explicit fields for Core Six syndromes and micro-failure tags. Drop into existing incident workflows as an additional classification layer.

Who uses this: Engineering, ops, and compliance teams filing and reviewing AI behavior incidents
Replaces: Generic “AI error” tickets that lack syndrome classification; prevents cross-team translation loss
Key fields: Syndrome classification, micro-failure tags, trace ID, user impact scope, remediation timeline with owner
## Incident Summary
Incident ID: AI-INC-YYYY-MM-DD-NNN
Date: YYYY-MM-DD HH:MM UTC
Severity: [CRITICAL | HIGH | MODERATE | LOW]
Status: [Under Investigation | Root Cause Identified | Remediated | Closed]

## Classification
Primary Syndrome: [Core Six syndrome name]
Secondary Syndrome(s): [if applicable]
Micro-Failure Tags:
- [Primary tag]
- [Supporting tags]

## Technical Details
Model Version: [version identifier]
Trace ID: [trace reference]
Context Length: [tokens]
Tool Calls Attempted: [count]
Execution Time: [duration]

## Incident Description
[Narrative: what the system claimed, what actually happened, evidence of discrepancy]

## User Impact
Immediate: [direct consequence to user]
Scope: [number of affected users/interactions]
Business Impact: [quantified if possible]

## Root Cause Analysis
[Technical explanation of why the syndrome manifested]

## Remediation Plan
Immediate (24h): [quick mitigations]
Short-term (1 week): [targeted fixes]
Long-term (1 month): [architectural improvements]
Responsible Team: [team name]
Target Resolution: [date]
Follow-up Review: [date]

## Related Incidents
- [Links to similar incidents for pattern analysis]
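The Incident ID pattern and the Severity options in the template above lend themselves to a mechanical check before a report is filed. The sketch below is one possible validator, not part of the toolkit: the function names are hypothetical, and it assumes that NNN means exactly three digits and that the date portion must be a real calendar date.

```python
import re
from datetime import date

# Assumed reading of AI-INC-YYYY-MM-DD-NNN: four, two, two, and three digits.
INCIDENT_ID = re.compile(r"AI-INC-([0-9]{4})-([0-9]{2})-([0-9]{2})-([0-9]{3})")

# Severity values exactly as listed in the template.
SEVERITIES = {"CRITICAL", "HIGH", "MODERATE", "LOW"}


def is_valid_incident_id(incident_id: str) -> bool:
    """True if the ID matches the template pattern and embeds a real date."""
    match = INCIDENT_ID.fullmatch(incident_id)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups()[:3])
    try:
        date(year, month, day)  # rejects impossible dates such as 02-30
    except ValueError:
        return False
    return True


def is_valid_severity(severity: str) -> bool:
    """True if the severity is one of the four template values."""
    return severity in SEVERITIES
```

For example, `is_valid_incident_id("AI-INC-2025-11-03-007")` is accepted, while an ID embedding February 30 or a one-digit sequence number is rejected.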
S1.3
Model Card — Defensive Behavior Profile
Add syndrome incidence data alongside traditional performance metrics

A “Defensive Behavior Profile” section to add to any model card. Surfaces syndrome-level behavior alongside accuracy and capability metrics. All incidence values are placeholders — derive from your own evaluation traces.

Who uses this: Model developers, evaluators, and procurement reviewers comparing vendor model documentation
Replaces: Accuracy-only model cards that tell buyers nothing about behavioral failure modes in production
Key fields: Overall DSI, per-syndrome incidence rates, use-pattern recommendations, known hotspots, version trend
## Defensive Behavior Profile
This model has been evaluated for Core Six AI Defensive Behavior Syndromes using standardized test suites across multiple domains. Results represent incidence rates in evaluation traces (n=[sample size]).

Overall Defensive Syndrome Incidence: [X]%
([X]% of traces exhibited one or more defensive behaviors)

Syndrome-Specific Incidence:
Plausible Helpfulness: [X]% ([severity])
  Elevated in: [specific task types]
  Reduced in: [specific task types]
Built-Not-Connected: [X]% ([severity])
  [Notable patterns]
Hollow Completions: [X]% ([severity])
  First-Run Failure Rate: [X]%
Capability Masking: [X]% ([severity])
  [Notable patterns]
Responsibility Diffusion: [X]% ([severity])
  [Notable patterns]
Surface Compliance: [X]% ([severity])
  [Notable patterns]

Recommended Use Patterns:
  Well-suited for: [low-syndrome task types]
  Use with caution: [elevated-syndrome task types]
  Not recommended: [high-risk contexts without human review]

Mitigation Strategies Implemented: [list]
Known Hotspots: [specific task/domain combinations]
Update History: [version-over-version trend data]
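The two kinds of figures in this profile can be computed from coded evaluation traces: the overall incidence is the share of traces exhibiting one or more defensive behaviors, and each syndrome-specific rate is the share of traces tagged with that syndrome. The following is a rough sketch under assumptions: each trace is represented as a set of syndrome names (an illustrative data layout, not one the toolkit prescribes), and `defensive_profile` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def defensive_profile(
    traces: Iterable[Set[str]],
) -> Tuple[float, Dict[str, float]]:
    """Compute overall and per-syndrome incidence, both in percent.

    Each trace is the set of syndromes coded for it; an empty set means
    no defensive behavior was observed in that trace.
    """
    trace_list = list(traces)
    if not trace_list:
        raise ValueError("at least one evaluation trace is required")
    total = len(trace_list)
    # Overall incidence: traces with one or more syndromes.
    overall = 100.0 * sum(1 for trace in trace_list if trace) / total
    # Per-syndrome incidence: traces tagged with that syndrome.
    counts = Counter(name for trace in trace_list for name in trace)
    per_syndrome = {name: 100.0 * n / total for name, n in counts.items()}
    return overall, per_syndrome


# Toy data for illustration only.
sample = [
    {"Hollow Completions"},
    set(),
    {"Hollow Completions", "Surface Compliance"},
    set(),
]
overall, per_syndrome = defensive_profile(sample)
```

Because one trace can carry several syndromes, the per-syndrome rates can sum to more than the overall incidence; in the toy data the overall figure is 50% while the two syndrome rates are 50% and 25%.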
S1.4
RFP Requirements Specification
Vendor behavioral requirements with syndrome-specific language

Replace vague quality requirements in procurement documents with measurable syndrome thresholds. All percentage values are illustrative placeholders — derive thresholds using S2.1 calibration methodology before inserting into any binding document.

Who uses this: Procurement, legal, and IT governance teams writing AI vendor RFPs and evaluation criteria
Replaces: “Ensure AI accuracy” clauses; gives vendors specific, measurable behavioral targets to respond to
Key fields: Mandatory vs. target thresholds, vendor deliverables, dual-dataset evaluation requirement, acceptance criteria
## AI System Behavioral Requirements
The proposed AI system must undergo evaluation for Core Six AI Defensive Behavior Syndromes and meet the following requirements:

Mandatory Thresholds (must meet all):
Capability Masking: <[X]% (near-zero tolerance)
Plausible Helpfulness: <[X]%
Hollow Completions: <[X]%

Target Thresholds (meet at least 2 of 3):
Built-Not-Connected: <[X]%
Responsibility Diffusion: <[X]%
Surface Compliance: <[X]%

Vendor Deliverables:
- Complete syndrome evaluation report (all six categories)
- Test methodology documentation aligned with Core Six framework
- Domain-specific syndrome profiles for our use cases
- Comparison with vendor's previous model versions
- Mitigation strategies for any syndrome exceeding targets
- Quarterly syndrome monitoring reports for contract duration

Evaluation Dataset:
Vendor must evaluate using both:
- Standard benchmark (vendor-provided)
- Customer-specific test suite ([N] representative queries)

Acceptance Criteria:
System must meet all mandatory thresholds on the customer-specific test suite to pass acceptance testing.
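The "must meet all" and "meet at least 2 of 3" rules in the specification above translate directly into a small check over measured incidence rates. This is a minimal sketch with hypothetical names (`evaluate_rfp`, `measured`, `limits`); it keeps the mandatory and target results separate because the template's acceptance criteria mention only the mandatory thresholds. All limits passed in would be your own calibrated values.

```python
from typing import Dict

MANDATORY = ("Capability Masking", "Plausible Helpfulness", "Hollow Completions")
TARGET = ("Built-Not-Connected", "Responsibility Diffusion", "Surface Compliance")


def evaluate_rfp(
    measured: Dict[str, float],
    limits: Dict[str, float],
    min_targets: int = 2,
) -> Dict[str, object]:
    """Check measured incidence (%) against strict "<" limits (%)."""
    # Every mandatory syndrome must be strictly below its limit.
    mandatory_ok = all(measured[s] < limits[s] for s in MANDATORY)
    # Count how many target syndromes are strictly below their limits.
    targets_met = sum(1 for s in TARGET if measured[s] < limits[s])
    return {
        "mandatory_ok": mandatory_ok,
        "targets_met": targets_met,
        "targets_ok": targets_met >= min_targets,
    }


# Placeholder numbers for illustration only.
limits = {name: 5.0 for name in MANDATORY + TARGET}
measured = {
    "Capability Masking": 1.0,
    "Plausible Helpfulness": 2.0,
    "Hollow Completions": 3.0,
    "Built-Not-Connected": 4.0,
    "Responsibility Diffusion": 6.0,
    "Surface Compliance": 7.0,
}
result = evaluate_rfp(measured, limits)
```

In this toy case the system clears every mandatory threshold but meets only one of the three targets, so `mandatory_ok` is true and `targets_ok` is false.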
S1.5
Contract SLA Terms
Service level agreement language for behavioral quality guarantees

Illustrative SLA clauses for embedding syndrome incidence guarantees into AI service contracts. Credit schedules and trigger levels are examples — adapt to your regulatory environment.

Who uses this: Legal, vendor management, and AI governance teams negotiating or reviewing AI service agreements
Replaces: Uptime-only SLAs that don’t touch behavioral quality; creates enforceable behavioral accountability
Key fields: Per-syndrome incidence guarantees, monitoring frequency, remediation timelines, service credit schedule
## Service Level Agreement — Behavioral Quality

Syndrome Incidence Guarantees:
Service Provider guarantees the following maximum syndrome incidence rates measured monthly across production queries:
Capability Masking: <[X]% (Critical — immediate escalation)
Plausible Helpfulness: <[X]% (High priority)
Hollow Completions: <[X]% (High priority)
Built-Not-Connected: <[X]% (Medium priority)
Responsibility Diffusion: <[X]% (Medium priority)
Surface Compliance: <[X]% (Medium priority)

Monitoring and Reporting:
Provider delivers monthly Syndrome Incidence Reports by 5th business day of following month.
Reports include: raw counts, percentages, trend analysis, and explanations for threshold exceedances.
Customer retains right to conduct independent syndrome evaluation audits.

Remediation Obligations:
Critical threshold exceeded: remediation plan within 48 hours
High priority exceeded: remediation plan within 5 business days
Persistent violations (3 consecutive months): material breach

Credits for SLA Violations:
Critical threshold violation: [X]% monthly service credit
High priority violation: [X]% monthly service credit
Multiple violations: credits stack up to [X]% maximum
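Two rules in these clauses are arithmetic: credits stack up to a maximum, and three consecutive months of violations constitute a material breach. The sketch below shows one way to compute both. It assumes that "stack" means the per-violation credits are added together before the cap is applied, which the template does not spell out; the function names and all percentages are hypothetical placeholders.

```python
from typing import Sequence


def monthly_credit(
    critical_violations: int,
    high_violations: int,
    critical_credit: float,
    high_credit: float,
    max_credit: float,
) -> float:
    """Service credit (%) for one month, summed per violation and capped.

    Assumption: stacking is additive across violations.
    """
    stacked = critical_violations * critical_credit + high_violations * high_credit
    return min(max_credit, stacked)


def is_material_breach(violated_by_month: Sequence[bool], run_length: int = 3) -> bool:
    """True if any `run_length` consecutive months all had a violation."""
    streak = 0
    for violated in violated_by_month:
        streak = streak + 1 if violated else 0
        if streak >= run_length:
            return True
    return False


# Placeholder figures for illustration only.
credit = monthly_credit(1, 2, critical_credit=10.0, high_credit=5.0, max_credit=15.0)
breach = is_material_breach([True, True, False, True, True, True])
```

With these placeholder figures the stacked credit of 20% is capped at 15%, and the final three-month run of violations triggers the breach condition even though an earlier two-month run did not.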
S1.6
Pre-Deployment Evaluation Checklist
Six-syndrome gate before any AI system goes to production

Integrate Core Six syndrome evaluation into existing launch and risk-review processes. Customize items, stakeholder roles, and sign-off requirements to match your governance structure.

Who uses this: Engineering leads, product managers, and AI safety committees reviewing deployments
Replaces: Generic QA sign-offs that don’t include behavioral syndrome gates; adds explicit pre-launch accountability
Key fields: Syndrome evaluation, threshold compliance, domain risk, mitigation docs, monitoring plan, multi-stakeholder sign-off
## Pre-Deployment Syndrome Evaluation (Required)

[ ] Syndrome Evaluation Completed
    Dataset: minimum [N] representative queries
    All six syndromes assessed and documented
    Results reviewed by AI Safety Committee

[ ] Threshold Compliance Verified
    [ ] Plausible Helpfulness within acceptable range for domain
    [ ] Capability Masking: near-zero tolerance verified
    [ ] Hollow Completions: measured and acceptable
    [ ] Built-Not-Connected: tool utilization verified
    [ ] Responsibility Diffusion: error handling acceptable
    [ ] Surface Compliance: constraint following verified

[ ] Domain-Specific Risk Assessment
    High-risk domains require enhanced thresholds
    Safety-critical constraints pass Surface Compliance check
    Verification features pass Capability Masking audit

[ ] Mitigation Documentation
    Known syndrome hotspots documented
    Mitigation strategies defined and implemented
    User guidance prepared for known limitations

[ ] Monitoring Plan Established
    Production monitoring for syndrome incidence configured
    Alert thresholds set based on calibration methodology
    Incident response procedures defined

[ ] User Communication
    Users informed of system limitations
    Guidance provided on interpreting outputs
    Feedback mechanism for defensive behavior reports

Approval Required From:
[ ] Engineering Lead (technical validation)
[ ] Product Manager (user experience assessment)
[ ] Legal/Compliance (risk acceptance)
[ ] AI Safety Committee (syndrome threshold approval)
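Teams that track this checklist in a release pipeline may want the gate expressed as data rather than as a document. The following is a small sketch under assumptions: `blocking_items` is a hypothetical helper, the checklist is modeled as a flat mapping from top-level item name to completion state, and the approver names are copied from the sign-off list above.

```python
from typing import Dict, List, Set

REQUIRED_APPROVERS = (
    "Engineering Lead",
    "Product Manager",
    "Legal/Compliance",
    "AI Safety Committee",
)


def blocking_items(checklist: Dict[str, bool], approvals: Set[str]) -> List[str]:
    """List everything still blocking deployment; an empty list means clear."""
    # Checklist items that have not been ticked off yet.
    open_items = [item for item, done in checklist.items() if not done]
    # Required sign-offs that have not been recorded yet.
    missing = [f"Approval: {role}" for role in REQUIRED_APPROVERS if role not in approvals]
    return open_items + missing


# Illustrative state: one item open, two sign-offs missing.
checklist = {
    "Syndrome Evaluation Completed": True,
    "Threshold Compliance Verified": True,
    "Domain-Specific Risk Assessment": True,
    "Mitigation Documentation": True,
    "Monitoring Plan Established": False,
    "User Communication": True,
}
blockers = blocking_items(checklist, {"Engineering Lead", "Product Manager"})
```

Returning the list of blockers, rather than a single pass/fail flag, keeps the gate's output readable by the same stakeholders who sign the checklist.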
Full Paper
From Micro-Failure Tags to Defensive Syndromes
(Long Version)
doi.org/10.5281/zenodo.19423182 — Ernesto A. Taylor, YIM Project