AI Safety Brief - AI in Mental Health and Harm Reduction

Alex Shohet
Dec 24, 2025
3 min read

Red and white graphic with "AI" and "Harm Reduction" text. Bold red and white striped background sets a serious tone.

Engagement Preservation vs. Disengagement Risk in AI-Mediated Mental Health & Substance Use Contexts

Prepared by: Evergreen Fund Purpose: Inform AI safety and deployment decisions in high-risk, care-adjacent environments

⸻

1. Executive Summary

AI systems are increasingly used during relapse, ambivalence, family conflict, and early crisis, often when human support is limited, unavailable, or has been exhausted. In these contexts, disengagement is a primary driver of harm, frequently preceding overdose, hospitalization, family rupture, or withdrawal from care. We explore in this brief AI in Mental Health and Harm Reduction.

This brief identifies deployment-relevant safety risks that arise not from malicious use or explicit self-harm intent, but from well-intended AI responses that inadvertently moralize, over-escalate, withdraw, or blur boundaries.

The goal is not to replace clinical judgment, but to inform safer AI behavior in environments where humans disagree, risk is ambiguous, and premature withdrawal of support increases downstream harm.

⸻

2. Context and Scope

This brief focuses on non-suicidal, high-friction scenarios, including:

Relapse disclosure or continued substance use
Treatment refusal or ambivalence
Family or caregiver moral distress
Rule violations or boundary stress
Youth disclosure with unclear risk level

These scenarios are common in real-world use but are underrepresented in existing AI safety benchmarks, which tend to prioritize explicit suicidality or generic supportive tone.

⸻

3. Core Safety Observation

In care-adjacent contexts, AI responses that prematurely escalate, moralize, or withdraw support increase the risk of disengagement, even when technically “safe.” Safety failures often occur without policy violations, through tone, framing, or sequencing that undermines trust.

⸻

4. Key Failure Modes Observed

4.1 Over-Escalation Under Ambiguity

Treating ambiguous disclosures as immediate crisis
Defaulting to emergency framing without proportional justification
Risk: Conversation shutdown; avoidance of future disclosure.

4.2 Moralizing or Punitive Framing

Implicit judgment (“you need to stop,” “this is dangerous behavior”)
Abstinence-centric assumptions
Risk: Shame, resistance, disengagement.

4.3 Premature Withdrawal (The “Safe Refusal” Paradox)

Confusing the refusal of a request with the abandonment of the user.
Overly rigid refusal language that severs the connection.
Abrupt handoff to external resources without a warm transfer.
Risk: Perceived abandonment; loss of engagement at critical moments.

4.4 Boundary Blurring

Over-validation without limits
Emotional reliance cues
Risk: Dependency; inappropriate role substitution.

4.5 Youth-Specific Miscalibration

Adult-centric tone that fails to resonate with adolescents.
Missed visual or behavioral indicators (e.g., specific vocabulary related to body dysmorphia or eating disorder cues).
Poor explanation of confidentiality limits (creating a "snitch" dynamic).
Risk: Immediate disengagement by minors.

⸻

5. Engagement-Preserving Safety Dimensions

We propose the following evaluation dimensions for AI responses in these contexts:

Proportional Escalation: Matches response intensity to evidence, not fear.
Non-Withdrawal Under Stress: Maintains engagement even when refusing requests or setting boundaries.
Boundary Clarity Without Punishment: Explains limits without shame, threat, or abandonment.
Agency Preservation: Supports choice without coercion or false neutrality.
Tone Calibration & Cultural Competence:
- Avoids clinical, preachy, or patronizing language.
- Incorporates "Code-Switching" capabilities to distinguish between vernacular/slang (e.g., AAVE) and actual hostility.
- Prevents "clinical speak" from alienating marginalized communities.

⸻

6. Explicit Non-Goals (Safety Constraints)

For clarity, this framework does not support:

Diagnosis or treatment recommendations
Autonomous crisis intervention
Closed-loop incentives tied to abstinence
Replacement of clinician or family judgment
Model personalization or training on user data

Human oversight is assumed at all escalation points.

⸻

7. Implications for AI Safety & Deployment

Safety evaluation must extend beyond content moderation to interactional outcomes.
Engagement loss should be treated as a first-order safety risk.
"Refusal vs. Abandonment": A refusal to perform a task (e.g., "I cannot buy you drugs") must not result in abandoning the conversation (e.g., "I can no longer help you").
Youth and family contexts require distinct calibration.

Benchmarks that ignore these dynamics risk producing models that are technically compliant but practically harmful.

⸻

8. Conclusion - AI in Mental Health and Harm Reduction

AI systems operating in mental health and substance use contexts face a unique safety challenge: the greatest risk often lies not in what is said, but in whether the person stays engaged afterward.

Evaluating and constraining AI behavior around engagement preservation, proportional escalation, and boundary clarity is essential to reducing real-world harm in these environments.