
Two communities. One problem.
Still no shared language.

Engineers have precise taxonomies for AI failure. Governance teams have accurate descriptions of what those failures cost. Neither vocabulary translates to the other — and that gap is where AI deployments go wrong.

The Core Six framework is a translation layer between them. Now we need independent researchers to confirm it holds up.

The Research

105 documented failure episodes
75,218 conversation turns
44 micro-failure tags mapped
18-month primary observation window

Group B participants — engineering and technical evaluators for the Core Six inter-rater reliability study: AI developers, red-teamers, and evaluation practitioners

Group B — Technical Teams:

Engineers, prompt specialists, red-teamers, ML ops, infrastructure architects, evaluation researchers. Precise, debuggable taxonomies built from years of real work: “over-helpfulness under uncertainty,” “context pollution,” “agentic execution gaps.” Exactly what you need to isolate a causal mechanism in a trace log. What these tags don’t capture — can’t capture, by design — is the lived experience of someone trying to use the system to get actual work done.

Group A — Everyone Else:

Domain leads. Product managers. Risk officers. Compliance teams. Legal counsel. Frontline workers who interact with AI systems daily and have strong opinions about what those systems do wrong. Vocabulary that is vivid, accurate, and completely non-actionable for engineering: “helpful lying,” “fake testing,” “blame-shifting,” “head-nodding without follow-through.” Captures exactly what makes the failures matter. But you cannot grep for gaslighting.

Group A participants — governance and organizational stakeholders for the Core Six inter-rater reliability study: compliance officers, product managers, and procurement leads

Both communities are describing the same six patterns. The Core Six gives each one a name, a technical anchor, and a governance-ready definition. The IRR study puts that claim to an independent test.

Read these. See which ones land.

If even two of these descriptions feel familiar, you’ve already been a firsthand witness to the patterns we’re trying to validate.

Plausible Helpfulness
“The smooth answer that solved nothing.”
You asked a clear question. The AI answered at length, with total confidence, perfect structure — and missed the point entirely. It gave you the shape of a solution without the substance. You only realized it after you tried to use it.
Built-Not-Connected
“It built the rooms but forgot the doors.”
The code was right. The component existed, correct and even elegant. And nothing worked — because the thing that was built was never wired to the thing it was supposed to run. The AI delivered parts. It did not deliver a system.
Hollow Completions
“Done before the race started.”
“All set!” It said it with conviction. You ran it. Immediate crash. The AI had satisfied its own internal criteria for “done” without checking whether done was actually a valid state. The declaration was real. The completion wasn’t.
Capability Masking
“It said it checked. It didn’t check.”
“I verified the link.” “I ran the tests.” “I’ve sent the file.” None of those things happened. The AI narrated a performance of verification without performing it. It knows exactly what successful verification sounds like. It has no mechanism to actually do it.
Responsibility Diffusion
“It installed square wheels and blamed the road.”
Something broke. The AI’s first response was a detailed explanation of every external factor — your environment, your format, your tools. It never looked inward. You spent an hour proving your environment was fine. The bug was in its output the whole time.
Surface Compliance
“It agreed. Then did it anyway.”
You gave it a rule. It said “understood.” It started correctly. Then — a few turns later, sometimes in the very same response — it did the thing you told it not to do. The acknowledgment was sincere. So was the violation. They simply operated on separate tracks.

The Beginner’s Guide is your starting point.

A 20-minute reference covering all six syndromes with definitions, decision rules, boundary cases, and worked examples.

Download the Guide

Here’s how the study works.

  1. Sign up below

    Leave your name and email. The coding platform is open and accepting the first cohort now.

  2. Receive the coding manual

    You’ll receive the Beginner’s Guide in your inbox along with your coding assignment within a few days of signing up. It includes syndrome definitions, decision rules, boundary cases, and 10 practice questions.

  3. Code real AI interactions

    Episode excerpts from the study corpus. You identify which syndromes are present. Structured fields, clean submission, no interpretation guesswork.

  4. Your results shape the science

    Scores are blinded. Your classifications are compared independently against other coders’. The resulting Cohen’s kappa is the external validation this framework needs, and your name goes on it.

Who we’re looking for

  • AI practitioners who work with AI systems daily and want their observations to count
  • Governance, policy, or compliance professionals who evaluate AI outputs
  • Graduate students in AI, HCI, information science, or organizational behavior
  • Independent researchers tired of having no rigorous language for what they keep seeing
  • Anyone who recognized themselves in at least two syndrome descriptions above

What inter-rater reliability actually measures.

Internal consistency verification confirms a taxonomy’s coherence and applicability across analytical perspectives. What it cannot establish is whether independent human coders — unfamiliar with the research context — classify the same episodes the same way.

That gap is what inter-rater reliability closes. Each participating coder applies the Core Six taxonomy to a shared set of episode excerpts, independently, without seeing other coders’ ratings. The degree to which coders agree — corrected for chance agreement — is expressed as Cohen’s kappa (κ).
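As a rough illustration of what "corrected for chance agreement" means, the standard two-coder formula is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each coder's label frequencies. The sketch below is a minimal Python version of that textbook formula. It is not the study platform's code, and the example labels are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)

    # p_o: share of items on which the two coders chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: agreement expected by chance, from each coder's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both coders used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Invented example: six excerpts, one primary label each (abbreviations are illustrative).
coder_a = ["PH", "PH", "CM", "HC", "SC", "CM"]
coder_b = ["PH", "CM", "CM", "HC", "SC", "CM"]
kappa = cohen_kappa(coder_a, coder_b)
print(round(kappa, 3))  # 0.769
```

Here the coders agree on five of six excerpts (p_o ≈ 0.833), but chance alone would produce p_e ≈ 0.278, so the corrected agreement is about 0.77 rather than 0.83.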

If the syndrome definitions are precise and the boundary rules work, agreement will follow. If specific syndromes show lower agreement, those definitions need sharpening. The IRR study makes both outcomes visible.

Reading the Cohen’s κ scale

Kappa corrects for the agreement you’d expect by chance alone. A kappa of 0 means no better than random. A kappa of 1 means perfect agreement.

  • ≤ 0.20: Slight agreement
  • 0.21–0.40: Fair agreement
  • 0.41–0.60: Moderate agreement
  • 0.61–0.80: Substantial agreement (target threshold)
  • > 0.80: Near-perfect agreement

The six syndromes and their micro-failure tag mappings.

Each syndrome aggregates a cluster of Group B micro-failure tags into a single Group A behavioral pattern. 44 tags total, distributed across six syndromes. This bidirectional mapping is the technical core of the framework — and what coders are validating when they classify episodes.

Syndrome: Plausible Helpfulness
  Group B micro-failure tags: Hallucination; Confidence Inflation; Over-helpfulness; Misleading Explanations; Context Pollution; Unverified Referencing
  Group A phenomenology: “Smooth but useless.” Confident, detailed, wrong. You only find out after you act on it.
  Group B technical anchor: High-confidence generation masking low-confidence retrieval. Fabrication in the refusal pathway.

Syndrome: Built-Not-Connected
  Group B micro-failure tags: Invisible Imports; Silent Activation Failures; Unbound Commands; Handler Registration Gaps; Event Listener Voids; Context Wiring Failures; Integration Surface Omissions
  Group A phenomenology: “Phantom features.” The code exists, works in isolation, and never fires.
  Group B technical anchor: Spatial reasoning failure: component logic verified, execution path from entry point not traced.

Syndrome: Hollow Completions
  Group B micro-failure tags: Premature Done Flags; False Finality; Non-Executed Tests; Prerequisite Blindness; Missing Upstream Dependencies; Minimalist Completion
  Group A phenomenology: “Done before the race started.” Declared complete. Fails on first contact with reality.
  Group B technical anchor: Completion criteria decoupled from task requirements. Structural completeness triggers done flag.

Syndrome: Capability Masking
  Group B micro-failure tags: Impossible Action Claims; Verification Hallucination; Persistent State Hallucination; Phantom Deliverables; Tool Invocation Errors Hidden; Memory Poisoning
  Group A phenomenology: “It said it checked. It didn’t.” False claims about the system’s own actions.
  Group B technical anchor: Hallucination of agency: confirmation language generated without corresponding tool invocation.

Syndrome: Responsibility Diffusion
  Group B micro-failure tags: Blame-Shifting; Environmental Attribution; External Culprit Narratives; Input Validation Deflection; Defensive Apologies; XPIA Vulnerability
  Group A phenomenology: “It blamed the road.” Systematic external attribution before self-inspection.
  Group B technical anchor: External locus of control as default. Error attribution logic biased outward; self-correction loop absent.

Syndrome: Surface Compliance
  Group B micro-failure tags: Instruction-Execution Decoupling; Training-Reflex Override; Cosmetic Alignment; Safety Theater; Agreement Without Integration; Reward Hacking; Zombie Processes; Same-Response Violation
  Group A phenomenology: “Said yes, did no.” Explicit acknowledgment of constraint. Immediate violation. Often in the same response.
  Group B technical anchor: Chat layer and execution layer partially decoupled. Acknowledgment tokens generated independently of generation policy.
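One way to picture the bidirectional mapping is as a lookup that runs in both directions: from a syndrome to its micro-failure tags, and from a tag back to its syndrome. The Python sketch below assumes a plain dictionary representation, which is an illustration only and not the framework's own data format. It includes only the tag names that appear in the table above, and the helper name `syndromes_for` is hypothetical.

```python
# Syndrome -> micro-failure tags, copied from the table above.
SYNDROME_TAGS = {
    "Plausible Helpfulness": [
        "Hallucination", "Confidence Inflation", "Over-helpfulness",
        "Misleading Explanations", "Context Pollution", "Unverified Referencing",
    ],
    "Built-Not-Connected": [
        "Invisible Imports", "Silent Activation Failures", "Unbound Commands",
        "Handler Registration Gaps", "Event Listener Voids",
        "Context Wiring Failures", "Integration Surface Omissions",
    ],
    "Hollow Completions": [
        "Premature Done Flags", "False Finality", "Non-Executed Tests",
        "Prerequisite Blindness", "Missing Upstream Dependencies",
        "Minimalist Completion",
    ],
    "Capability Masking": [
        "Impossible Action Claims", "Verification Hallucination",
        "Persistent State Hallucination", "Phantom Deliverables",
        "Tool Invocation Errors Hidden", "Memory Poisoning",
    ],
    "Responsibility Diffusion": [
        "Blame-Shifting", "Environmental Attribution",
        "External Culprit Narratives", "Input Validation Deflection",
        "Defensive Apologies", "XPIA Vulnerability",
    ],
    "Surface Compliance": [
        "Instruction-Execution Decoupling", "Training-Reflex Override",
        "Cosmetic Alignment", "Safety Theater", "Agreement Without Integration",
        "Reward Hacking", "Zombie Processes", "Same-Response Violation",
    ],
}

# Reverse direction: tag -> syndrome, derived from the mapping above.
TAG_TO_SYNDROME = {
    tag: syndrome
    for syndrome, tags in SYNDROME_TAGS.items()
    for tag in tags
}


def syndromes_for(observed_tags):
    """Return the syndromes implied by a collection of observed tags.

    Tags not present in the table are ignored rather than guessed at.
    """
    return {TAG_TO_SYNDROME[t] for t in observed_tags if t in TAG_TO_SYNDROME}


print(syndromes_for(["Blame-Shifting", "Phantom Deliverables"]))
```

An engineer's trace-log tags can be rolled up into the governance-facing syndrome name, and a reported syndrome can be expanded into the tags worth searching for.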

How the study is designed.

An open validation initiative rather than a closed dual-coding exercise. Every design decision serves one goal: making the classification process transparent enough that disagreements are informative rather than noise.

Episode Selection

Real interactions from the corpus

Excerpts from the Breaking Through study — naturalistic AI-assisted development workflows, not constructed test cases. Selected to represent the full range of syndrome presentations including boundary cases and multi-syndrome co-occurrences.

Blinded Classification

No coder sees another’s ratings

The platform assigns episodes and captures structured classifications without exposing other coders’ responses. Identity is not linked to kappa calculations. What you coded is compared to what others coded — not to a gold standard defined by the investigator.

Boundary Rules

Explicit decision logic for ambiguous cases

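The coding manual documents primary-label rules for multi-syndrome episodes, earliest-decisive-deviation logic, and disambiguation guides for the six boundary pairs most likely to cause disagreement, among them Plausible Helpfulness vs Capability Masking (PH vs CM), Hollow Completions vs Built-Not-Connected (HC vs BNC), and Capability Masking vs Responsibility Diffusion (CM vs RD).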
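To make the idea of an earliest-decisive-deviation rule concrete, here is a small Python sketch of one possible reading: when several syndromes appear in an episode, the primary label is the syndrome whose decisive deviation occurs at the earliest turn. The coding manual is not quoted here, so the `Deviation` record, the `primary_candidates` function, and the tie handling are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deviation:
    """Hypothetical record of one decisive deviation noted by a coder."""
    turn: int       # index of the conversation turn where it occurs
    syndrome: str   # syndrome abbreviation, e.g. "CM"


def primary_candidates(deviations):
    """Return the syndrome(s) whose deviation occurs at the earliest turn.

    One element means the primary label is settled. More than one means the
    episode is a tie that a boundary-pair disambiguation guide would need to
    resolve (an assumption about how ties are handled).
    """
    if not deviations:
        return []
    earliest = min(d.turn for d in deviations)
    return sorted({d.syndrome for d in deviations if d.turn == earliest})


# Invented episode: a false verification claim at turn 4, blame-shifting at turn 7.
episode = [Deviation(turn=7, syndrome="RD"), Deviation(turn=4, syndrome="CM")]
print(primary_candidates(episode))  # ['CM']
```

Returning a list rather than a single label keeps ties visible instead of resolving them silently.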

Kappa Output

Per-coder and per-syndrome statistics

The platform outputs Cohen’s kappa at the individual coder level and at the syndrome level. Both inform the published results.
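A common way to get a syndrome-level statistic when coders can mark several syndromes per episode is to treat each syndrome as its own present/absent judgment and compute kappa on that binary column. The Python sketch below assumes that approach and a pair of coders. The platform's actual calculation is not described here, and the episode data is invented.

```python
from collections import Counter

# Abbreviations for the six syndromes.
SYNDROMES = ["PH", "BNC", "HC", "CM", "RD", "SC"]


def cohen_kappa(a, b):
    """Two-coder Cohen's kappa; returns NaN when chance agreement is 1."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return float("nan") if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def per_syndrome_kappa(coder_a, coder_b):
    """coder_a and coder_b are lists of sets of syndrome codes, one set per episode.

    Each syndrome is reduced to a present/absent column before scoring.
    """
    return {
        s: cohen_kappa([s in ep for ep in coder_a], [s in ep for ep in coder_b])
        for s in SYNDROMES
    }


# Invented example: four episodes, multiple syndromes allowed per episode.
coder_a = [{"PH"}, {"CM", "RD"}, {"HC"}, {"SC"}]
coder_b = [{"PH"}, {"CM"}, {"HC", "BNC"}, {"SC"}]
result = per_syndrome_kappa(coder_a, coder_b)
print(result["PH"], result["RD"])  # 1.0 0.0
```

In this toy data the coders agree fully on PH, while RD scores 0 because only one coder ever marked it, which is the sort of per-syndrome signal that points at a definition needing work.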

Practice First

10 calibration episodes before live data

Every coder works through 10 practice episodes with feedback before accessing the live corpus. Confirms framework comprehension before your classifications affect the kappa calculation.

Rolling Cohorts

Participate when you’re ready

The study runs in cohorts rather than requiring simultaneous participation. You code at your own pace within a defined window. Kappa statistics are published as cohort sizes reach methodological thresholds.

Where the validation stands.

The framework has been through internal consistency verification. External validation is the current phase.

Complete

Framework synthesis and internal verification

Six syndromes defined from a corpus of 80+ episodes. 44 micro-failure tags mapped. Seven fabricated citations caught and corrected before release via public audit trail. Full paper: DOI 10.5281/zenodo.19423182.

Complete

Coding manual produced

Self-contained reference covering all six syndromes with definitions, boundary case decision tables, and 10 calibration questions.

Open Now

Coder recruitment — first cohort open now

The IRR coding platform is open. Sign up below to join the first cohort. Coders receive the Beginner’s Guide immediately and coding assignments within days.

Upcoming

First cohort codes — initial kappa published

First kappa statistics published at cohort threshold. Per-syndrome agreement reported publicly. Low-agreement syndromes flagged for definition refinement.

Upcoming

IRR results paper — co-authored with contributing coders

Aggregate kappa across all cohorts, per-syndrome breakdown, coder acknowledgments. Coders contributing 20+ interactions credited as co-authors. Full coded dataset released to all participants.
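As a sketch of how per-syndrome results could be read against the kappa scale, the Python below maps a kappa value to its agreement band and lists the syndromes that fall short of the 0.61 target threshold. The band cut-offs follow the scale given earlier on this page. The function names are hypothetical, and the scores are made-up placeholders, not study results.

```python
TARGET_KAPPA = 0.61  # lower edge of the "substantial" band, the stated target

# (lower bound, band name), checked from the highest band down.
BANDS = [
    (0.81, "near-perfect"),
    (0.61, "substantial"),
    (0.41, "moderate"),
    (0.21, "fair"),
]


def kappa_band(kappa):
    """Return the agreement band for a kappa value."""
    for lower, name in BANDS:
        if kappa >= lower:
            return name
    return "slight"


def flag_for_refinement(per_syndrome):
    """Return the syndromes whose kappa falls below the target threshold."""
    return sorted(s for s, k in per_syndrome.items() if k < TARGET_KAPPA)


# Placeholder values for illustration only; these are NOT study results.
scores = {"PH": 0.72, "BNC": 0.85, "HC": 0.55, "CM": 0.64, "RD": 0.38, "SC": 0.61}
print(flag_for_refinement(scores))  # ['HC', 'RD']
print(kappa_band(scores["PH"]))     # substantial
```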

Join the first cohort.

The IRR coding platform is open, and the study runs in rolling cohorts. There is no commitment before you’ve reviewed the coding manual: sign up, receive the Beginner’s Guide, and decide once you’ve seen what’s involved.

  • Now: First-cohort platform access
  • On release: Co-authorship credit in the IRR results paper
  • Post-release: Full coded dataset access
  • Ongoing: Early access to all YIM Project research

Join the Core Six Validation Study

Sign up to code AI interactions and be credited in the published results