Two communities. One problem.
Still no shared language.
Engineers have precise taxonomies for AI failure. Governance teams have accurate descriptions of what those failures cost. Neither vocabulary translates to the other — and that gap is where AI deployments go wrong.
The Core Six framework is a translation layer between them. Now we need independent researchers to confirm it holds up.
The Research
Group B — Technical Teams:
Engineers, prompt specialists, red-teamers, ML ops, infrastructure architects, evaluation researchers. Precise, debuggable taxonomies built from years of real work: “over-helpfulness under uncertainty,” “context pollution,” “agentic execution gaps.” Exactly what you need to isolate a causal mechanism in a trace log. What these tags don’t capture — can’t capture, by design — is the lived experience of someone trying to use the system to get actual work done.
Group A — Everyone Else:
Domain leads. Product managers. Risk officers. Compliance teams. Legal counsel. Frontline workers who interact with AI systems daily and have strong opinions about what those systems do wrong. Vocabulary that is vivid, accurate, and completely non-actionable for engineering: “helpful lying,” “fake testing,” “blame-shifting,” “head-nodding without follow-through.” Captures exactly what makes the failures matter. But you cannot grep for gaslighting.
Both communities are describing the same six patterns. The Core Six gives each one a name, a technical anchor, and a governance-ready definition. The inter-rater reliability (IRR) study tests whether that claim holds when independent coders apply the definitions.
Read these. See which ones land.
If even two of these descriptions feel familiar, you’ve already been a firsthand witness to the patterns we’re trying to validate.
Here’s how the study works.
1. Sign up below
   Leave your name and email. The coding platform is open and accepting the first cohort now.
2. Receive the coding manual
   You’ll receive the Beginner’s Guide in your inbox along with your coding assignment within a few days of signing up. It includes syndrome definitions, decision rules, boundary cases, and 10 practice questions.
3. Code real AI interactions
   You work through episode excerpts from the study corpus and identify which syndromes are present. Structured fields, clean submission, no interpretation guesswork.
4. Your results shape the science
   Scores are blinded. Your classifications are compared independently against other coders’ classifications. The resulting Cohen’s kappa is the external validation this framework needs, and your name goes on it.
Who we’re looking for
- AI practitioners who work with AI systems daily and want their observations to count
- Governance, policy, or compliance professionals who evaluate AI outputs
- Graduate students in AI, HCI, information science, or organizational behavior
- Independent researchers tired of having no rigorous language for what they keep seeing
- Anyone who recognized themselves in at least two syndrome descriptions above
What inter-rater reliability actually measures.
Internal consistency verification confirms a taxonomy’s coherence and applicability across analytical perspectives. What it cannot establish is whether independent human coders — unfamiliar with the research context — classify the same episodes the same way.
That gap is what inter-rater reliability closes. Each participating coder applies the Core Six taxonomy to a shared set of episode excerpts, independently, without seeing other coders’ ratings. The degree to which coders agree — corrected for chance agreement — is expressed as Cohen’s kappa (κ).
If the syndrome definitions are precise and the boundary rules work, agreement will follow. If specific syndromes show lower agreement, those definitions need sharpening. The IRR study makes both outcomes visible.
Reading the Cohen’s κ scale
Kappa corrects for the agreement you’d expect by chance alone. A kappa of 0 means agreement no better than chance. A kappa of 1 means perfect agreement. Negative values mean coders agreed less often than chance would predict.
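For readers who want to see the arithmetic, kappa is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed share of episodes on which two coders agree and p_e is the share they would agree on by chance given how often each uses each label. The sketch below is a minimal illustration for two coders assigning one primary label per episode. It is not the platform's implementation. The function name, the sample labels, and the abbreviation SC for Surface Compliance are assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same episodes (one label each)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of episodes")
    n = len(labels_a)

    # Observed agreement: share of episodes where both coders chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, multiply the two coders' usage rates, then sum.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both coders used a single identical label throughout; kappa is
        # formally undefined here, so this sketch reports full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical primary labels for eight episodes (illustrative data only).
coder_1 = ["PH", "CM", "HC", "PH", "RD", "SC", "BNC", "CM"]
coder_2 = ["PH", "CM", "HC", "CM", "RD", "SC", "BNC", "PH"]

kappa = cohens_kappa(coder_1, coder_2)
print(round(kappa, 3))
```

In this made-up example the coders agree on 6 of 8 episodes (p_o = 0.75), chance agreement is 12/64 (p_e = 0.1875), and kappa comes out near 0.69, noticeably lower than the raw 75% agreement.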
The six syndromes and their micro-failure tag mappings.
Each syndrome aggregates a cluster of Group B micro-failure tags into a single Group A behavioral pattern. 44 tags total, distributed across six syndromes. This bidirectional mapping is the technical core of the framework — and what coders are validating when they classify episodes.
| Syndrome | Group B Micro-Failure Tags | Group A Phenomenology | Group B Technical Anchor |
|---|---|---|---|
| Plausible Helpfulness | Hallucination, Confidence Inflation, Over-helpfulness, Misleading Explanations, Context Pollution, Unverified Referencing | “Smooth but useless.” Confident, detailed, wrong. You only find out after you act on it. | High-confidence generation masking low-confidence retrieval. Fabrication in the refusal pathway. |
| Built-Not-Connected | Invisible Imports, Silent Activation Failures, Unbound Commands, Handler Registration Gaps, Event Listener Voids, Context Wiring Failures, Integration Surface Omissions | “Phantom features.” The code exists, works in isolation, and never fires. | Spatial reasoning failure: component logic verified, execution path from entry point not traced. |
| Hollow Completions | Premature Done Flags, False Finality, Non-Executed Tests, Prerequisite Blindness, Missing Upstream Dependencies, Minimalist Completion | “Done before the race started.” Declared complete. Fails on first contact with reality. | Completion criteria decoupled from task requirements. Structural completeness triggers done flag. |
| Capability Masking | Impossible Action Claims, Verification Hallucination, Persistent State Hallucination, Phantom Deliverables, Tool Invocation Errors Hidden, Memory Poisoning | “It said it checked. It didn’t.” False claims about the system’s own actions. | Hallucination of agency: confirmation language generated without corresponding tool invocation. |
| Responsibility Diffusion | Blame-Shifting, Environmental Attribution, External Culprit Narratives, Input Validation Deflection, Defensive Apologies, XPIA Vulnerability | “It blamed the road.” Systematic external attribution before self-inspection. | External locus of control as default. Error attribution logic biased outward; self-correction loop absent. |
| Surface Compliance | Instruction-Execution Decoupling, Training-Reflex Override, Cosmetic Alignment, Safety Theater, Agreement Without Integration, Reward Hacking, Zombie Processes, Same-Response Violation | “Said yes, did no.” Explicit acknowledgment of constraint. Immediate violation. Often in the same response. | Chat layer and execution layer partially decoupled. Acknowledgment tokens generated independently of generation policy. |
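The bidirectional mapping in the table can be pictured as a dictionary with a reverse index. The sketch below is illustrative only and assumes each tag belongs to exactly one syndrome. It lists just two tags per syndrome, copied from the table, to stay short. The names `SYNDROME_TAGS`, `TAG_TO_SYNDROME`, and `syndromes_for` are hypothetical and not part of the framework.

```python
# Abbreviated mapping: two micro-failure tags per syndrome, taken from the table.
# The full framework maps more tags; this subset is for illustration.
SYNDROME_TAGS = {
    "Plausible Helpfulness": ["Confidence Inflation", "Context Pollution"],
    "Built-Not-Connected": ["Invisible Imports", "Unbound Commands"],
    "Hollow Completions": ["Premature Done Flags", "Non-Executed Tests"],
    "Capability Masking": ["Verification Hallucination", "Phantom Deliverables"],
    "Responsibility Diffusion": ["Blame-Shifting", "Defensive Apologies"],
    "Surface Compliance": ["Safety Theater", "Reward Hacking"],
}

# Reverse index (tag -> syndrome), assuming no tag appears under two syndromes.
TAG_TO_SYNDROME = {
    tag: syndrome
    for syndrome, tags in SYNDROME_TAGS.items()
    for tag in tags
}


def syndromes_for(tags):
    """Return the set of syndromes implied by a list of tags; unknown tags are ignored."""
    return {TAG_TO_SYNDROME[t] for t in tags if t in TAG_TO_SYNDROME}


# An engineer's trace-log tags translated into governance-level patterns.
print(syndromes_for(["Blame-Shifting", "Safety Theater"]))
```

Going the other way, `SYNDROME_TAGS["Hollow Completions"]` gives an engineer the concrete tags to look for when a governance team reports that pattern.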
How the study is designed.
An open validation initiative rather than a closed dual-coding exercise. Every design decision serves one goal: making the classification process transparent enough that disagreements are informative rather than noise.
Real interactions from the corpus
Excerpts from the Breaking Through study — naturalistic AI-assisted development workflows, not constructed test cases. Selected to represent the full range of syndrome presentations including boundary cases and multi-syndrome co-occurrences.
No coder sees another’s ratings
The platform assigns episodes and captures structured classifications without exposing other coders’ responses. Identity is not linked to kappa calculations. What you coded is compared to what others coded — not to a gold standard defined by the investigator.
Explicit decision logic for ambiguous cases
The coding manual documents primary-label rules for multi-syndrome episodes, earliest-decisive-deviation logic, and disambiguation guides for the six boundary pairs most likely to cause disagreement (PH vs CM, HC vs BNC, CM vs RD, etc.).
Per-coder and per-syndrome statistics
The platform outputs Cohen’s kappa at the individual coder level and at the syndrome level. Both inform the published results.
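One way to picture syndrome-level statistics is to treat each syndrome as a present/absent decision per episode and compute kappa separately for each. The sketch below does this for a single pair of coders. It is an assumption about how such a breakdown could be computed, not a description of the platform. The function names, episode IDs, and ratings are hypothetical.

```python
def binary_kappa(a, b):
    """Cohen's kappa for two equal-length lists of booleans (present / absent)."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two equal-length, non-empty rating lists")
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                    # each coder's "present" rate
    p_e = pa * pb + (1 - pa) * (1 - pb)                # chance agreement
    # If both coders gave one identical answer everywhere, kappa is undefined;
    # this sketch reports 1.0 in that case.
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def per_syndrome_kappa(coder_a, coder_b, syndromes):
    """coder_a / coder_b map episode id -> set of syndromes marked present."""
    episodes = sorted(coder_a.keys() & coder_b.keys())  # only shared episodes
    return {
        s: binary_kappa(
            [s in coder_a[e] for e in episodes],
            [s in coder_b[e] for e in episodes],
        )
        for s in syndromes
    }


# Hypothetical ratings for four shared episodes.
coder_a = {"ep1": {"PH"}, "ep2": {"CM", "RD"}, "ep3": {"PH"}, "ep4": {"CM"}}
coder_b = {"ep1": {"PH"}, "ep2": {"CM"}, "ep3": {"CM"}, "ep4": {"CM"}}

result = per_syndrome_kappa(coder_a, coder_b, ["PH", "CM", "RD"])
print(result)
```

In this toy data PH and CM each land at 0.5 while RD lands at 0, which is the kind of per-syndrome contrast that would flag a definition for refinement. With more than two coders, a coder-level figure could be formed by averaging over that coder's pairings, though the study may aggregate differently.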
10 calibration episodes before live data
Every coder works through 10 practice episodes with feedback before accessing the live corpus. Confirms framework comprehension before your classifications affect the kappa calculation.
Participate when you’re ready
The study runs in cohorts rather than requiring simultaneous participation. You code at your own pace within a defined window. Kappa statistics are published as cohort sizes reach methodological thresholds.
Where the validation stands.
The framework has been through internal consistency verification. External validation is the current phase.
Framework synthesis and internal verification
Six syndromes defined from a corpus of 80+ episodes. 44 micro-failure tags mapped. Seven fabricated citations caught and corrected before release via public audit trail. Full paper: DOI 10.5281/zenodo.19423182.
Coding manual produced
Self-contained reference covering all six syndromes with definitions, boundary case decision tables, and 10 calibration questions.
Coder recruitment — first cohort open now
The IRR coding platform is open. Sign up below to join the first cohort. Coders receive the Beginner’s Guide immediately and coding assignments within days.
First cohort codes — initial kappa published
First kappa statistics published at cohort threshold. Per-syndrome agreement reported publicly. Low-agreement syndromes flagged for definition refinement.
IRR results paper — co-authored with contributing coders
Aggregate kappa across all cohorts, per-syndrome breakdown, coder acknowledgments. Coders contributing 20+ interactions credited as co-authors. Full coded dataset released to all participants.
Join the first cohort.
The IRR coding platform is open. Sign up below to join the first cohort. The study runs in rolling cohorts. No commitment before you’ve reviewed the coding manual. Sign up, receive the Beginner’s Guide, and decide once you’ve seen what’s involved.
- Now: First-cohort platform access
- On release: Co-authorship credit in the IRR results paper
- Post-release: Full coded dataset access
- Ongoing: Early access to all YIM Project research

