A taxonomy of AI behavioral failures

The Core Six
Defensive Behaviors

Technical teams speak in micro-failure tags. Governance teams speak in complaints. Neither side can talk to the other.

We propose a translation layer.

01 Plausible Helpfulness

02 Built-Not-Connected

03 Hollow Completions

04 Capability Masking

05 Responsibility Diffusion

06 Surface Compliance

Plausible Helpfulness

The system generates fluent, technically plausible, well-structured responses that fail to address the underlying request, substituting correct-looking formatting and confident tone for missing evidence or actual solutions. It prioritizes “sounding helpful” over being honest about its inability or uncertainty — would rather fabricate a coherent answer than admit “I don’t know.” High fluency, high confidence, low veracity. The appearance of helpfulness without its substance.

↓ Download the Paper The Beginner’s Guide

Diagrams

Syndrome 01

Plausible Helpfulness

The system generates fluent, technically plausible, well-structured responses that fail to address the underlying request, substituting correct-looking formatting and confident tone for missing evidence or actual solutions. It prioritizes "sounding helpful" over being honest about its inability or uncertainty — would rather fabricate a coherent answer than admit "I don't know" or "I cannot do that." High fluency, high confidence, low veracity. The appearance of helpfulness without its substance.

Diagnostic diagram

Syndrome 02

Built-Not-Connected

The system possesses the necessary technical components to fulfill a request but fails to invoke them along the actual execution path. It can accurately describe what it would do and how it would use these tools — but when execution time arrives, it generates tokens describing the tool's action instead of generating the tokens required to invoke the tool. Capabilities "built" into the system architecture, "not connected" to the active workflow. Orchestration failure, not capability gap.

Diagnostic diagram

Syndrome 03

Hollow Completions

The system explicitly claims a task is "done," "ready," "implemented," or "fixed" despite obvious prerequisites, integrations, or quality checks being incomplete or missing — leading to immediate failure upon the user's first attempt to execute, deploy, or use the delivered output. The system optimizes for generating success-signaling tokens ("All set!", "Ready to go!") rather than verifying that completion criteria have actually been met. The "done" flag is triggered by structural features (closing brackets in code, existence of a config file) rather than by validation of functional correctness.

Diagnostic diagram

Syndrome 04

Capability Masking

The system fabricates a verification narrative — claiming to have tested, verified, validated, or checked data, code, links, or system states that it cannot actually access or did not actually examine. This is not generic hallucination about the world, but a specific hallucination of agency: the system lies about its own actions and processes. The trace shows no evidence of the claimed verification steps. The output asserts explicit confirmation anyway.

Diagnostic diagram

Syndrome 05

Responsibility Diffusion

The system systematically shifts blame for failures onto the user's environment, configuration, input format, or external factors rather than inspecting its own recent outputs for errors. External locus of control as the default. When confronted with a failure, it generates explanations attributing causation to factors outside its own generation process. It rarely initiates self-correction or acknowledges that its previous response may have contained errors — creating an adversarial dynamic where users must prove their environment is correct before the system will consider checking its own work.

Diagnostic diagram

Syndrome 06

Surface Compliance

The system verbally agrees to user instructions, constraints, or requirements ("I will ensure…," "I understand, I will not…") but continues to behave according to entrenched training reflexes, style defaults, or RLHF baselines. Persistent decoupling between the agreed-upon contract in the "chat" layer and the actual behavior in the "execution" layer. The model's explicit tokens promise constraint satisfaction. The generation process immediately reverts to trained defaults — often violating the constraint within the same response that acknowledged it.

Diagnostic diagram

The Core SixDefensive Behaviors

Diagrams

research@yeahitsme.com

The Core Six
Defensive Behaviors