Operationalizing AI Safety: Closing the Intent-Utility Gap

The recent Center for Countering Digital Hate (CCDH) "Killer Apps" report demonstrated that nearly all leading consumer AI platforms will assist users seeking tactical information for carrying out mass-casualty attacks, even when the user prefaces dual-use queries with signals of violent intent. [1] The report also highlights two platforms that performed markedly better, demonstrating that the technical capacity to detect high-risk user engagements already exists.

While this is incredibly valuable data for improving safeguards and model policies, the report also points to a common misconception in AI Safety: what I’ll call the "Hammer and Roof" Paradox.

Just because a platform has the technical "hammer" (classifiers that can flag a prompt) doesn't mean it has "fixed the roof" (mitigated the realized harm). The leap from a flagged instance to a prevented tragedy is an immense operational challenge. It requires expanding beyond a purely detection-to-content-moderation approach and connecting detection to process interruption.

The Detection-to-Action Gap

It is true that the technical capability exists to detect high-risk user engagements, and that opportunities for detection increase when prompts seeking dual-use information follow signals of violent intent. However, detection does not equal harm mitigation.

The hurdle remains strategic and operational: how to engineer a signal set and intervention thresholds that reliably flag high-risk patterns for action without compromising user privacy, overwhelming manual review queues, or degrading model utility for legitimate users.

Part 1: Operational Constraints

Identifying high-risk users among millions of benign interactions is a minefield. Three challenges persist even where the technical capability to detect them exists:

Signal Weighting: A prompt regarding "long-range hunting rifles" or "school floor plans" is a low-fidelity signal in isolation. Critically, the CCDH report found that models often provided this data even after the user explicitly signaled harmful intent (e.g., mentioning revenge against bullies). Surfacing a high-risk user for manual review therefore depends on the sum of the signals, and the sticky challenge is how to appropriately weight a pattern of prompts seeking dual-use information in the context of preceding ones suggesting harmful intent. [2]
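As a toy illustration of the weighting problem, the sketch below scores a conversation by summing weighted signals and up-weighting dual-use requests that follow an intent signal. The signal names, weights, and review threshold are my own illustrative assumptions, not values from the report or any production system.

```python
# Illustrative sketch only: signal names, weights, and the review threshold
# are assumptions for demonstration, not values from the CCDH report.

# Base weights for individual signals (each low-fidelity in isolation).
SIGNAL_WEIGHTS = {
    "dual_use_query": 0.2,     # e.g., "long-range hunting rifles"
    "target_reference": 0.3,   # e.g., "school floor plans"
    "intent_signal": 0.5,      # e.g., revenge against bullies
}

REVIEW_THRESHOLD = 0.8  # assumed cutoff for surfacing to manual review


def conversation_risk(signals: list[str]) -> float:
    """Sum weighted signals, up-weighting dual-use requests that follow intent."""
    score = 0.0
    intent_seen = False
    for signal in signals:
        weight = SIGNAL_WEIGHTS.get(signal, 0.0)
        # A dual-use or target-related request is worth more in the context
        # of a preceding intent signal than it is in isolation.
        if intent_seen and signal in ("dual_use_query", "target_reference"):
            weight *= 2
        score += weight
        intent_seen = intent_seen or signal == "intent_signal"
    return score


# A dual-use query alone stays below the review threshold...
assert conversation_risk(["dual_use_query"]) < REVIEW_THRESHOLD
# ...but the same query preceded by an intent signal does not.
assert conversation_risk(["intent_signal", "dual_use_query"]) >= REVIEW_THRESHOLD
```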

Contextual Loss: Initial model safeguards have operated by evaluating prompts in isolation. The CCDH report exposes this as a critical vulnerability: an adversary can signal violent intent in prompt one, then request a tactical map in prompt two. Because the system "loses" the context of the first turn, it treats the second request as a benign utility query. Building a "longitudinal" safety architecture is partly a technical challenge, but the more nuanced challenge is doing so in a way where data storage and analysis sufficiently protect user privacy.
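One way to picture a longitudinal check that limits privacy exposure is sketched below: the session retains only coarse, classifier-derived risk flags (not the user's text) for a bounded window, and evaluates each new turn against that accumulated context. The class, flag names, and decisions are illustrative assumptions, not a description of any deployed system.

```python
from collections import deque
from dataclasses import dataclass, field

# Illustrative sketch of a "longitudinal" safety check. The idea: keep only
# coarse, short-lived risk flags per session (never raw prompt text), so a
# later turn can be evaluated in context without storing full conversations.

MAX_RETAINED_FLAGS = 20  # assumed cap; bounds what is stored per session


@dataclass
class SessionSafetyContext:
    # Only classifier-derived flags (e.g., "intent_signal") are retained,
    # not the user's underlying text, to limit privacy exposure.
    risk_flags: deque = field(
        default_factory=lambda: deque(maxlen=MAX_RETAINED_FLAGS)
    )

    def evaluate(self, turn_flags: list[str]) -> str:
        """Decide how to treat the current turn given accumulated context."""
        has_prior_intent = "intent_signal" in self.risk_flags
        requests_utility = "dual_use_query" in turn_flags
        self.risk_flags.extend(turn_flags)
        if has_prior_intent and requests_utility:
            return "terminate_and_flag_for_review"  # turn two is no longer benign
        if requests_utility:
            return "answer_with_care"               # same request, no prior intent
        return "answer"


session = SessionSafetyContext()
session.evaluate(["intent_signal"])          # turn 1: grievance, no request yet
print(session.evaluate(["dual_use_query"]))  # turn 2: "terminate_and_flag_for_review"
```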

Threshold Setting: Moving from detection to process interruption requires a threshold for automated action or manual review. Set it too low, and you compromise model utility for legitimate users and waste human reviewer bandwidth on false positives. Set it too high, and genuinely high-risk progressions pass through unflagged.

Part 2: Governance Pillars

Overcoming the above challenges and moving from detection to risk mitigation requires a weighted governance approach centered on three pillars:

Probability Assessment: Implement a rubric providing a total risk score that combines appropriately weighted user-signaled intent (e.g., grievances, retribution) with technical utility (e.g., tactical maps, chemical precursors). A minimal encoding of this rubric is sketched after the list below.

  • Low Intent + Low Utility: Log and monitor.

  • High Intent + Low Utility: Trigger "Educational Friction" (interstitials reminding users of safety policies and/or referring to outside resources, such as mental health support) and potentially terminate the session.

  • Any Intent + High Utility: Terminate session and flag for human review.
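A minimal sketch of how the rubric above could be encoded, assuming simple numeric scores for intent and utility; the cutoffs and action names are illustrative, not a production policy.

```python
# Illustrative encoding of the rubric above. Cutoffs and action names are
# assumptions for demonstration, not values from any deployed system.

def triage(intent_score: float, utility_score: float) -> str:
    """Map weighted intent and utility scores to a rubric action."""
    high_intent = intent_score >= 0.5    # assumed cutoff
    high_utility = utility_score >= 0.5  # assumed cutoff

    if high_utility:
        # Any intent + high utility: terminate and escalate to a human.
        return "terminate_and_flag_for_review"
    if high_intent:
        # High intent + low utility: educational friction, possibly terminate.
        return "educational_friction"
    # Low intent + low utility: log and monitor.
    return "log_and_monitor"


assert triage(0.1, 0.1) == "log_and_monitor"
assert triage(0.7, 0.2) == "educational_friction"
assert triage(0.2, 0.9) == "terminate_and_flag_for_review"
```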

Ethical Escalation: Define and apply an appropriate threshold for intervention. One option is repurposing evidentiary thresholds used in international human rights investigations that distinguish between prima facie indicators and verified intent. In an operational context, these standards map directly to signal confidence levels. Low-confidence (prima facie) signals trigger automated safeguards (e.g., educational interstitials), while high-confidence patterns of verified intent require escalation for human-in-the-loop review. This preserves model efficacy for legitimate users while ensuring that intervention is grounded in rigorous, field-tested standards of proof.

Operational Resilience Testing: Traditional red teaming focuses on individual prompts and model performance against input/output violations. Operational resilience requires testing the response pipeline itself, in particular against patterns of individually non-violative prompts and responses. Simulations must test how well automated systems recognize a threat actor’s progression from signaling intent to requesting tactical utility, and whether they successfully flag the highest-risk cases for human review. This ensures that safeguards and friction are field-tested tools that can be triggered the moment a user’s capacity for harm crosses a defined threshold. [3]
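As an illustration, a resilience test might replay a staged multi-turn scenario through whatever session-evaluation interface the pipeline exposes and fail if escalation never fires before the tactical request is served. The scenario, flag names, and `evaluate` interface below are assumptions carried over from the earlier sketches, not a real test suite.

```python
# Illustrative resilience test: replay a staged, multi-turn scenario and
# assert that escalation fires before tactical utility is delivered.

ADVERSARIAL_SCENARIO = [
    # (simulated prompt flags, action the pipeline must take by this turn)
    (["intent_signal"], None),                                # grievance, no request
    (["target_reference"], None),                             # probing, non-violative
    (["dual_use_query"], "terminate_and_flag_for_review"),    # tactical request
]


def test_progression_is_interrupted(session) -> None:
    """Fail unless the staged intent-to-utility progression is escalated in time."""
    for turn_flags, required_action in ADVERSARIAL_SCENARIO:
        action = session.evaluate(turn_flags)
        if required_action is not None and action != required_action:
            raise AssertionError(
                f"expected {required_action!r} by this turn, got {action!r}"
            )


# Example: run the scenario against the toy session logic sketched earlier.
# test_progression_is_interrupted(SessionSafetyContext())
```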

Operational safety is not a model feature; it is an organizational discipline. By shifting our focus from the "hammer" of the model to the "roof" of our governance systems, we can go a long way towards closing the gap between detection and risk mitigation.

Endnotes

  1. Center for Countering Digital Hate (CCDH), “Killer Apps: How Generative AI Chatbots are Being Used to Facilitate Real-World Violence,” March 2026, https://counterhate.com/wp-content/uploads/2026/03/Killer-Apps_FINAL_CCDH.pdf.

  2. Anthropic, “Anthropic’s Frontier Safety Roadmap,” February 19, 2026, https://www.anthropic.com/responsible-scaling-policy/roadmap.

  3. Subhabrata Majumdar, Brian Pendleton, and Abhishek Gupta, “Red Teaming AI Red Teaming,” October 30, 2025, https://arxiv.org/abs/2507.05538.


