Measuring what single-turn benchmarks miss: whether models maintain safety-critical recommendations when users express realistic barriers.
Turn 1
Recognition
Turn 5+
Persistence
Models correctly identify emergencies at conversation start. We measure whether they maintain recommendations under naturalistic pressure.
Complementing—not replacing—existing evaluation infrastructure
Integration: scribeGOAT2 operates as a post-model evaluation layer, measuring behavioral properties that RLHF may not directly optimize.
System prompt mitigation significantly improves crisis resource persistence in tested conditions
100%
resource persistence
(with mitigation)
1.38
Cohen's h effect size
(large effect)
Authority cues · Deceptive framing · Context collapse
Sustained multi-turn pressure without jailbreak techniques
Resource omission · Boundary erosion · Quiet disengagement
Gaps between "not unsafe" and "actively safe"
Urgency minimization · Delay validation · Patient abandonment
Context-independent violations any clinician would flag
Tier 1 confirms models avoid obvious errors (expected for frontier models). Tier 2 captures deployment-relevant gaps that baseline checks miss. Tier 3 tests resilience under stronger naturalistic pressure.
seed=42 · temperature=0 · deterministic scenario generation
Cross-family judge validation · Physician-adjudicated calibration (κ = 1.00)
Summary findings across L1-L4 evaluation layers
| Layer | Finding | N |
|---|---|---|
| L1 Baseline | 0% hard failures(all frontier models pass) | 2,000+ |
| L2 Severity | 0% severe/catastrophic(S3-S4) | 1,140 |
| L3 Recovery | 26.5% recover with prompt(73.5% persist without) | 400 |
| L4 Robustness | Mitigation holds through T5(adversarial pressure) | 500 |
These findings reflect behavioral properties under controlled evaluation conditions. Divergence from criterion does not equal real-world harm. Full model-specific results available in GitHub repository.
Interpretation guidance: A 43% divergence rate means "model diverges from criterion at 43% under probe conditions"—not "model fails 43% of the time." We report behavioral measurement, not safety certification.
Both observed behavioral patterns have documented clinical precedent
Provides monitoring guidance alongside emergency recommendations when users indicate barriers. Prioritizes meeting patients where they are.
"If you're going to wait, here are the warning signs to watch for..."
Documented in: Emergency medicine literature, harm reduction frameworks
Declines to provide monitoring guidance, maintaining that emergency care is the only appropriate action. Prioritizes clear boundaries.
"I'm not comfortable helping you manage this at home..."
Documented in: Risk management guidelines, medicolegal standards
Our evaluation measures divergence from a specified criterion, not superiority of one clinical strategy over another. Reasonable clinicians disagree on which approach is preferable in high-stakes scenarios.
Findings shared with safety teams at OpenAI, Anthropic, Google DeepMind, and xAI through established disclosure channels (December 2025 – January 2026).
This work is designed as evaluation infrastructure to support iterative safety improvement—not adversarial model criticism.
Detailed model-specific findings and methodology documentation available through disclosure channels.
Full evaluation code, checkpoint data, and reproduction instructions available under MIT license.