Core Six Matrix Explorer
Seven reference matrices from the supplementary materials package. Use these as starting-point calibration tools — all thresholds are illustrative and require domain-specific derivation before operational use.
Core Six Cross-Reference
Each syndrome mapped to its Group B micro-failure tags, primary user impact language, and remediation targets. From Section 5.1 of the main paper.
| Syndrome | Key Micro-Failure Tags (Group B) | User Impact Language (Group A) | Remediation Target |
|---|---|---|---|
| Plausible Helpfulness | Hallucination, Over-helpfulness, Misleading Explanations, Context Pollution, Confidence Inflation, Unverified Referencing | “Smooth but useless,” “Helpful liar,” “Confident fabrication” | Refusal thresholds, verification gates, confidence calibration |
| Built-Not-Connected | Invisible Imports, Silent Activation Failures, Unbound Commands, Handler Registration Gaps, Event Listener Voids, Context Wiring Failures, Integration Surface Omissions | “Phantom features,” “Isolated components,” “Code that never runs” | Entry-point tracing, import verification, handler registration checks |
| Hollow Completions | Premature Done Flags, False Finality, Non-Executed Tests, Prerequisite Blindness, Missing Upstream Dependencies, Minimalist Completion | “Fake finality,” “Broken at first touch,” “Painted over the hole” | Completion criteria verification, staged validation, FRFR metrics |
| Capability Masking | Impossible Action Claims, Persistent State Hallucination, Verification Hallucinations, Tool Invocation Errors Hidden by Narration, Memory Poisoning, Phantom Deliverables | “Fake verification,” “Lying about homework,” “Confidence trick” | Tool-Action Consistency checks, verification language gating, capability boundary enforcement |
| Responsibility Diffusion | Blame-Shifting, External Culprit Narratives, Environmental Attribution, Input Validation Deflection, Defensive Apologies, XPIA Vulnerability | “Defensive,” “Blames the user,” “Always has an excuse” | Self-correction loops, error attribution reordering, self-check incentives |
| Surface Compliance | Instruction-Execution Decoupling, Training-Reflex Override, Cosmetic Alignment, Safety Theater, Agreement Without Integration, Reward Hacking, Zombie Processes, Same-Response Violation | “Head-nodding,” “Fake agreement,” “Says yes, does no” | Constraint enforcement architecture, instruction-following coupling, behavioral auditing |
Matrix 2 — Syndrome Severity by Risk Tier
Illustrative threshold bands mapped by deployment risk tier. Tier 1 = life-safety, critical infrastructure. Tier 4 = low-stakes, supervised use. Organizations must derive their own thresholds using the S2.1 calibration methodology.
| Syndrome | Tier 1 — Critical | Tier 2 — High | Tier 3 — Moderate | Tier 4 — Low |
|---|---|---|---|---|
| Plausible Helpfulness | Near-Zero (<1%) | Strictest (<3%) | Strict (<5%) | Moderate (<8%) |
| Capability Masking | Near-Zero (<1%) | Strict (<5%) | Moderate (<8%) | Standard (<12%) |
| Built-Not-Connected | Strictest (<3%) | Strict (<5%) | Moderate (<8%) | Standard (<12%) |
| Hollow Completions | Strictest (<3%) | Strict (<5%) | Moderate (<8%) | Standard (<12%) |
| Responsibility Diffusion | Strict (<5%) | Moderate (<8%) | Standard (<12%) | Relaxed (<15%) |
| Surface Compliance | Near-Zero (<1%) | Strictest (<3%) | Strict (<5%) | Moderate (<8%) |
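The bands above can be held as a simple lookup table. The following is a minimal sketch, not a reference implementation: the names `TIER_MAX_PCT` and `within_band` are hypothetical, and it assumes each percentage is an upper bound on an observed syndrome rate, which the source does not define precisely.

```python
# Matrix 2 upper bounds in percent, copied from the table above.
# TIER_MAX_PCT and within_band are hypothetical names used for illustration.
TIER_MAX_PCT = {
    "Plausible Helpfulness":    {1: 1.0, 2: 3.0, 3: 5.0, 4: 8.0},
    "Capability Masking":       {1: 1.0, 2: 5.0, 3: 8.0, 4: 12.0},
    "Built-Not-Connected":      {1: 3.0, 2: 5.0, 3: 8.0, 4: 12.0},
    "Hollow Completions":       {1: 3.0, 2: 5.0, 3: 8.0, 4: 12.0},
    "Responsibility Diffusion": {1: 5.0, 2: 8.0, 3: 12.0, 4: 15.0},
    "Surface Compliance":       {1: 1.0, 2: 3.0, 3: 5.0, 4: 8.0},
}


def within_band(syndrome: str, tier: int, observed_pct: float) -> bool:
    """Return True if the observed rate is strictly below the band's upper bound.

    ASSUMPTION: the "<N%" entries are strict upper bounds on an observed rate.
    """
    return observed_pct < TIER_MAX_PCT[syndrome][tier]
```

For example, `within_band("Capability Masking", 2, 4.2)` checks a 4.2% observed rate against the Tier 2 bound of 5%. These values remain illustrative and would be replaced by thresholds derived with the S2.1 calibration methodology.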
Matrix 3 — Domain-Specific Threshold Adjustments
Illustrative calibration ranges by sector. Cross-reference with Matrix 2 tier thresholds; apply the stricter value where they overlap.
Sector: healthcare (inferred from the calibration notes below)

| Syndrome | Recommended Max | Calibration Approach |
|---|---|---|
| Plausible Helpfulness | <0.5% | Near-zero tolerance. Cross-reference pharmaceutical databases and clinical guidelines. |
| Capability Masking | <0.5% | Verify all claimed capabilities against actual tool bindings and database connections. |
| Built-Not-Connected | <2% | Audit all integration claims; verify medication databases, lab systems, imaging interfaces. |
| Hollow Completions | <2% | All safety-critical prerequisites must be explicitly enumerated and verifiable. |
| Responsibility Diffusion | <3% | Clear provenance chains for all clinical recommendations. |
| Surface Compliance | <1% | Full compliance verification against HIPAA, FDA, and institutional review requirements. |

Sector: legal (inferred from the calibration notes below)

| Syndrome | Recommended Max | Calibration Approach |
|---|---|---|
| Plausible Helpfulness | <1% | Cross-reference all legal citations against verified legal databases. |
| Capability Masking | <1% | Verify access to claimed case law databases, statute repositories, regulatory databases. |
| Built-Not-Connected | <3% | Audit all document management integrations, court filing system interfaces. |
| Hollow Completions | <2% | Require explicit jurisdictional analysis and conflict-of-law identification. |
| Responsibility Diffusion | <3% | Every legal conclusion must trace to specific authorities and reasoning chain. |
| Surface Compliance | <1% | Full ethics-rule compliance verification, privilege checks, conflict-of-interest screening. |

Sector: financial services (inferred from the calibration notes below)

| Syndrome | Recommended Max | Calibration Approach |
|---|---|---|
| Plausible Helpfulness | <1% | Verify all numerical claims against audited data sources. |
| Capability Masking | <2% | Verify actual connections to market data feeds, regulatory databases, account systems. |
| Built-Not-Connected | <3% | Audit all trading system integrations, compliance database connections. |
| Hollow Completions | <2% | Require explicit risk quantification, regulatory citation, and assumption disclosure. |
| Responsibility Diffusion | <5% | Clear attribution of every risk assessment and recommendation component. |
| Surface Compliance | <1% | Full SOX, Basel III/IV, AML/KYC compliance verification. |

Sector: software development (inferred from the calibration notes below)

| Syndrome | Recommended Max | Calibration Approach |
|---|---|---|
| Plausible Helpfulness | <5% | Testing coverage provides natural correction; focus on security-critical code paths. |
| Capability Masking | <3% | Verify all claimed tool integrations, API access, system permissions. |
| Built-Not-Connected | <5% | All generated code must include integration tests and dependency verification. |
| Hollow Completions | <5% | Require working build artifacts, not pseudocode or partial implementations. |
| Responsibility Diffusion | <8% | Acceptable higher tolerance given collaborative development norms. |
| Surface Compliance | <3% | Verify license compliance, security standards adherence, accessibility requirements. |

Sector: education (inferred from the calibration notes below)

| Syndrome | Recommended Max | Calibration Approach |
|---|---|---|
| Plausible Helpfulness | <5% | Higher tolerance; learning from errors can be pedagogically valuable. |
| Capability Masking | <5% | Monitor for misleading capability claims that could affect learning outcomes. |
| Built-Not-Connected | <8% | Verify curriculum integration and assessment system connections. |
| Hollow Completions | <8% | Focus on conceptual accuracy over procedural completeness. |
| Responsibility Diffusion | <10% | Acceptable given supervised learning environment. |
| Surface Compliance | <5% | Verify FERPA compliance, accessibility standards, assessment validity. |
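The stricter-value rule stated for this matrix amounts to taking the lower of two maxima. A minimal sketch, assuming both thresholds are expressed in percent; the function name `effective_max_pct` is hypothetical:

```python
def effective_max_pct(tier_max_pct: float, domain_max_pct: float) -> float:
    """Stricter-value rule: the lower of the two maxima wins."""
    return min(tier_max_pct, domain_max_pct)


# Plausible Helpfulness: Matrix 2 Tier 2 band (<3%) against a sector maximum of <0.5%.
print(effective_max_pct(3.0, 0.5))  # prints 0.5
```

The rule is symmetric, so it also covers the opposite case: where the tier band is tighter than the sector value, the tier band applies.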
Matrix 4 — Deployment Context Thresholds
Multipliers that adjust thresholds based on operational environment: multiply the base threshold by the multiplier for the deployment context. Apply the stricter value when this matrix and Matrix 3 overlap. A multiplier below 1.0 tightens thresholds; a multiplier above 1.0 relaxes them.
| Deployment Context | Multiplier | Effect on Thresholds | Rationale |
|---|---|---|---|
| Autonomous decision-making | 0.5× (halve) | Stricter | No human in the loop to catch failures |
| Safety-critical real-time systems | 0.3× (tighten 70%) | Strictest | No time for human correction; failures have immediate consequences |
| Public-facing consumer applications | 0.7× (tighten 30%) | Tighter | Naive users cannot identify failure modes |
| Human-in-the-loop advisory | 1.0× (baseline) | No change | Human review provides correction opportunity |
| Internal tools with expert users | 1.5× (relax 50%) | Relaxed | Expert users can identify and compensate for failures |
| Batch processing with review | 1.5× (relax 50%) | Relaxed | Review pipeline catches most failures |
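Applying a context multiplier is a single multiplication of the base threshold. The sketch below assumes thresholds in percent; `CONTEXT_MULTIPLIER` and `adjusted_max_pct` are hypothetical names, and the rounding is only there to suppress floating-point noise.

```python
# Multipliers copied from the table above; the names are hypothetical.
CONTEXT_MULTIPLIER = {
    "Autonomous decision-making": 0.5,
    "Safety-critical real-time systems": 0.3,
    "Public-facing consumer applications": 0.7,
    "Human-in-the-loop advisory": 1.0,
    "Internal tools with expert users": 1.5,
    "Batch processing with review": 1.5,
}


def adjusted_max_pct(base_max_pct: float, context: str) -> float:
    """Scale a base threshold (percent) by the deployment-context multiplier."""
    # Round to 4 decimals so results such as 8.0 * 0.3 read cleanly.
    return round(base_max_pct * CONTEXT_MULTIPLIER[context], 4)


# A 5% base threshold under autonomous decision-making is halved.
print(adjusted_max_pct(5.0, "Autonomous decision-making"))  # prints 2.5
```

The base threshold passed in would typically be the value already selected from Matrix 2 or Matrix 3; how the two steps are ordered is left to the deploying organization.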
Matrix 5 — Syndrome Interaction Risk Multipliers
Syndromes rarely occur in isolation. When two or more co-occur, combined impact may exceed the sum of their individual severities.
| Syndrome Pair | Multiplier | Compound Risk Description |
|---|---|---|
| Capability Masking + Built-Not-Connected | 3× | System claims it performed an action through a tool that doesn’t actually connect to the execution path |
| Plausible Helpfulness + Hollow Completions | 2.5× | Output reads well but lacks essential substance; most dangerous to non-expert reviewers |
| Capability Masking + Plausible Helpfulness | 2.5× | False capability claim wrapped in convincing reasoning; hardest to detect |
| Surface Compliance + Responsibility Diffusion | 2× | System appears compliant while distributing accountability so no entity is responsible |
| Hollow Completions + Responsibility Diffusion | 2× | Incomplete work products with no clear owner for completion |
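The pair multipliers can be looked up independently of the order in which the two syndromes are named. The sketch below is one possible reading, not the source's method: the source does not define how the multiplier combines with individual severities, so scaling the sum of two severity scores is an assumption, as is the default of 1.0 for unlisted pairs. All names are hypothetical.

```python
# Pair multipliers copied from the table above; frozenset keys make lookup order-independent.
PAIR_MULTIPLIER = {
    frozenset({"Capability Masking", "Built-Not-Connected"}): 3.0,
    frozenset({"Plausible Helpfulness", "Hollow Completions"}): 2.5,
    frozenset({"Capability Masking", "Plausible Helpfulness"}): 2.5,
    frozenset({"Surface Compliance", "Responsibility Diffusion"}): 2.0,
    frozenset({"Hollow Completions", "Responsibility Diffusion"}): 2.0,
}


def pair_multiplier(a: str, b: str, default: float = 1.0) -> float:
    """Return the interaction multiplier for a syndrome pair.

    ASSUMPTION: pairs not listed in the table get a neutral multiplier of 1.0.
    """
    return PAIR_MULTIPLIER.get(frozenset({a, b}), default)


def compound_risk(a: str, score_a: float, b: str, score_b: float) -> float:
    """Estimate compound risk for two co-occurring syndromes.

    ASSUMPTION: the multiplier scales the sum of the individual severity scores.
    The source does not specify this formula.
    """
    return pair_multiplier(a, b) * (score_a + score_b)
```

With severity scores of 2.0 and 1.0, `compound_risk("Hollow Completions", 2.0, "Plausible Helpfulness", 1.0)` returns 7.5 under these assumptions.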
Matrix 6 — Remediation Priority
Each syndrome mapped to typical remediation difficulty, timeline, and suggested priority. Priority reflects both severity and tractability — some dangerous syndromes are also the most amenable to systematic intervention.
Matrix 7 — Continuous Monitoring Trigger Levels
Three-level alert framework for distinguishing routine fluctuation from genuine degradation. All trigger levels are illustrative. Recommended monitoring cadence by deployment risk tier: weekly (Tier 1–2), every two weeks (Tier 3), monthly (Tier 4). Increase to daily during model transitions or system updates.
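The recommended cadence can be written as a small configuration mapping. This is a sketch under stated assumptions: the names `REVIEW_CADENCE` and `review_cadence` are hypothetical, and "bi-weekly" is read as every two weeks, since it sits between weekly and monthly.

```python
# Monitoring cadence by deployment risk tier, as recommended above.
# "bi-weekly" is read here as every two weeks (an interpretation).
REVIEW_CADENCE = {1: "weekly", 2: "weekly", 3: "bi-weekly", 4: "monthly"}


def review_cadence(risk_tier: int, in_transition: bool = False) -> str:
    """Return the monitoring cadence for a risk tier.

    During model transitions or system updates the cadence rises to daily,
    regardless of tier.
    """
    if in_transition:
        return "daily"
    return REVIEW_CADENCE[risk_tier]
```

For instance, `review_cadence(3)` returns `"bi-weekly"`, while `review_cadence(3, in_transition=True)` returns `"daily"`.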

