Model Card — Defensive Behavior Profile
A section to add to any model card, reporting syndrome incidence data alongside traditional accuracy and capability metrics. Standard model cards tell you what a model can do. This section tells you how it fails, specifically and measurably.
Why standard model cards leave the critical question unanswered
Standard model cards report accuracy, F1, BLEU, MMLU benchmark scores, and similar metrics. These answer the question “how capable is this model?” They do not answer the question buyers and governance teams actually need answered: “how does this model fail, and how often?”
A model with 91% accuracy on a coding benchmark can still have a 15% Hollow Completions rate — meaning 15% of tasks it declares “done” fail on first execution. That number matters more for production deployment decisions than benchmark accuracy, yet it appears nowhere in standard cards.
The Defensive Behavior Profile section doesn’t replace accuracy metrics. It adds the behavioral failure layer that accuracy metrics omit. The two together give a procurement team or governance committee an honest picture of what the model will do in production — where benchmarks don’t run, but defensive behaviors do.
What goes in each field
| Field | What to put here | Why it matters |
|---|---|---|
| Overall DSI | Defensive Syndrome Incidence — the percentage of evaluated traces that exhibited at least one syndrome. This is the headline figure. | A single comparable number across models. “Our model has 8% DSI vs. 24% DSI for Model X” is a direct comparison that requires no interpretation. |
| Per-syndrome incidence | Six rows, one per syndrome. Each row gives the percentage of traces where that syndrome was the primary classification, plus a severity label (low / moderate / high / critical) based on deployment tier context. | Buyers in different domains care about different syndromes. A healthcare deployer cares most about Capability Masking (near-zero tolerance). A software team cares most about Built-Not-Connected. The per-syndrome breakdown lets each buyer assess what matters to them. |
| Elevated in | Task types, domains, or query patterns where this syndrome’s incidence was above the model’s baseline. E.g., “Plausible Helpfulness elevated in: low-context factual queries, real-time data requests, requests with implicit assumptions.” | Maps the behavioral risk to deployment contexts. Tells the buyer not just how often the syndrome occurs but where to expect it. |
| Recommended Use Patterns | Three buckets: well-suited (low syndrome contexts), use with caution (elevated syndrome contexts), not recommended (high-risk contexts without mitigation). | Translates the syndrome data into deployment guidance. A governance team shouldn’t have to interpret incidence percentages — they need the “use / caution / avoid” signal directly. |
| Known Hotspots | Specific task + domain combinations where syndrome incidence is reliably elevated regardless of general model performance. | These are the landmines. A buyer whose use case is one of the hotspots needs to know before deployment, not after an incident. |
| Update History | Version-over-version syndrome incidence trends. “v2.1 → v3.0: Capability Masking reduced from 4.2% to 1.1%; Hollow Completions increased from 6% to 9%.” | Shows whether behavioral quality is improving, regressing, or trading one syndrome for another. Without trend data, a buyer can’t tell whether a low incidence number reflects improvement or has been the baseline from the start. |
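
The incidence fields above reduce to simple counts over labeled traces. As a minimal sketch, assuming each evaluated trace has already been labeled with the syndromes it exhibits and with at most one primary classification, the headline DSI, the per-syndrome rows, and the version-over-version change might be computed as follows. The `Trace` record and all function names here are illustrative assumptions, not part of any standard tooling.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Trace:
    """One evaluated trace (hypothetical record shape, for illustration only)."""
    trace_id: str
    syndromes: List[str] = field(default_factory=list)  # every syndrome detected
    primary: Optional[str] = None                        # primary classification, if any


def overall_dsi(traces: List[Trace]) -> float:
    """Percentage of traces that exhibited at least one syndrome."""
    if not traces:
        raise ValueError("no traces to evaluate")
    flagged = sum(1 for t in traces if t.syndromes)
    return 100.0 * flagged / len(traces)


def per_syndrome_incidence(traces: List[Trace]) -> Dict[str, float]:
    """Percentage of traces where each syndrome was the primary classification."""
    if not traces:
        raise ValueError("no traces to evaluate")
    counts = Counter(t.primary for t in traces if t.primary is not None)
    return {name: 100.0 * n / len(traces) for name, n in counts.items()}


def incidence_delta(old: Dict[str, float], new: Dict[str, float]) -> Dict[str, float]:
    """Change in percentage points per syndrome between two versions."""
    names = sorted(set(old) | set(new))
    return {n: round(new.get(n, 0.0) - old.get(n, 0.0), 1) for n in names}


# Toy data: four traces, two of them flagged.
traces = [
    Trace("t1", ["Hollow Completions"], "Hollow Completions"),
    Trace("t2"),
    Trace("t3", ["Capability Masking", "Hollow Completions"], "Capability Masking"),
    Trace("t4"),
]

print(overall_dsi(traces))             # 50.0
print(per_syndrome_incidence(traces))  # 25.0 for each of the two primary syndromes

# Figures taken from the Update History example in the table.
v2_1 = {"Capability Masking": 4.2, "Hollow Completions": 6.0}
v3_0 = {"Capability Masking": 1.1, "Hollow Completions": 9.0}
print(incidence_delta(v2_1, v3_0))     # Capability Masking -3.1, Hollow Completions 3.0
```

Under this sketch's assumption that every flagged trace carries exactly one primary classification, the per-syndrome percentages sum to the overall DSI. If an evaluation allows flagged traces with no primary label, the two figures will diverge, and the card should say so.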
Copy and adapt
Insert this section into your model card after performance metrics. Replace all [X]% values with empirically derived figures before publishing.

