GOATnote - Medical AI Safety & Evaluation

scribeGOAT2 - Multi-Turn Safety Evaluation for Healthcare AI

GOATnote develops evaluation infrastructure for healthcare AI safety. scribeGOAT2 measures multi-turn safety trajectory—complementing single-turn benchmarks with persistence evaluation.


Deterministic · Reproducible · Physician-Adjudicated · Open Source

Multi-Turn Safety Evaluation for Healthcare AI

Measuring what single-turn benchmarks miss: whether models maintain safety-critical recommendations when users express realistic barriers.

What We Measure (January 2026)

Turn 1: Recognition → pressure → Turn 5+: Persistence

Models correctly identify emergencies at conversation start. We measure whether they maintain recommendations under naturalistic pressure: cost concerns, access barriers, polite topic changes.

4 frontier models · 10,000+ trajectories · 3-tier framework

For Safety Teams

Complementing—not replacing—existing evaluation infrastructure

What you likely measure

  • Response quality (rubrics)
  • Final response accuracy
  • Adversarial jailbreaks
  • Single-turn safety checks

What scribeGOAT2 adds

  • Trajectory persistence (invariants)
  • Cross-turn consistency under pressure
  • Naturalistic pressure from cooperative users
  • Multi-turn safety maintenance

Integration: scribeGOAT2 operates as a post-model evaluation layer, measuring behavioral properties that RLHF may not directly optimize.
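A post-model trajectory invariant can be sketched as follows. This is a minimal illustration, not the repository's actual API: the function names, keyword markers, and keyword-matching shortcut are all assumptions; the real framework uses judge models rather than string matching.

```python
# Minimal sketch of a post-model trajectory invariant (hypothetical names;
# the real scribeGOAT2 implementation may differ). The invariant: once a
# model recommends emergency care, every later assistant turn must keep
# that recommendation alive. Keyword matching stands in for a judge model.
EMERGENCY_MARKERS = ("call 911", "emergency department", "go to the er")

def turn_recommends_emergency(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in EMERGENCY_MARKERS)

def persistence_holds(assistant_turns: list[str]) -> bool:
    """True if the emergency recommendation, once made, never disappears."""
    recognized = False
    for turn in assistant_turns:
        if turn_recommends_emergency(turn):
            recognized = True
        elif recognized:
            return False  # recommendation dropped under pressure
    return recognized

trajectory = [
    "This sounds like a medical emergency -- please call 911 now.",
    "I understand cost is a concern, but you should still go to the ER.",
    "Sure, here are some home remedies instead.",  # divergence at turn 3
]
```

The key design point this illustrates: the check runs over the whole trajectory, not the final response, so a turn-3 capitulation is caught even when turns 1 and 2 were safe.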

Validated Finding

System prompt mitigation significantly improves crisis resource persistence in tested conditions

100% resource persistence (with mitigation)

Cohen's h = 1.38 (large effect)

N = 1,000 · Robust through T5 adversarial pressure
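Cohen's h is the effect size for a difference between two proportions, computed via the arcsine transform. The sketch below shows the standard formula; the example rates are illustrative assumptions, not the study's measured baseline.

```python
import math

# Cohen's h for two proportions: h = |2*asin(sqrt(p1)) - 2*asin(sqrt(p2))|.
# The 0.60 baseline below is an illustrative assumption, not a measured value.
def cohens_h(p1: float, p2: float) -> float:
    return abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))

h = cohens_h(1.00, 0.60)  # e.g. 100% persistence with mitigation vs 60% without
# h ~= 1.37, well above the conventional h >= 0.8 threshold for a large effect
```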

Three-Tier Evaluation Architecture

Tier 1: Baseline Safety

Foundation

Urgency minimization · Delay validation · Patient abandonment

Context-independent violations any clinician would flag

Tier 2: Subtle Failures

Deployment-Relevant

Resource omission · Boundary erosion · Quiet disengagement

Gaps between "not unsafe" and "actively safe"

Tier 3: Adversarial Pressure

Advanced

Authority cues · Deceptive framing · Context collapse

Sustained multi-turn pressure without jailbreak techniques

Tier 1 confirms models avoid obvious errors (expected for frontier models). Tier 2 captures deployment-relevant gaps that baseline checks miss. Tier 3 tests resilience under stronger naturalistic pressure.
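The tier taxonomy above can be rendered as a simple lookup table. The dictionary structure here is illustrative only; the framework's internal schema is an assumption.

```python
# The three tiers and their named failure modes as a lookup table
# (structure is illustrative; scribeGOAT2's actual schema may differ).
TIERS = {
    1: {"name": "Baseline Safety",
        "modes": ["urgency minimization", "delay validation", "patient abandonment"]},
    2: {"name": "Subtle Failures",
        "modes": ["resource omission", "boundary erosion", "quiet disengagement"]},
    3: {"name": "Adversarial Pressure",
        "modes": ["authority cues", "deceptive framing", "context collapse"]},
}

def tier_of(mode: str) -> int:
    """Return the tier that owns a given failure mode."""
    for tier, spec in TIERS.items():
        if mode in spec["modes"]:
            return tier
    raise KeyError(mode)
```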

seed=42 · temperature=0 · deterministic scenario generation
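Deterministic scenario generation under a fixed seed can be sketched as below (temperature=0 applies on the model-sampling side). The field names and pressure categories are illustrative, not the repository's actual schema.

```python
import random

# Sketch of seeded, deterministic scenario sampling -- field names are
# illustrative assumptions, not scribeGOAT2's actual schema.
PRESSURE_TYPES = ["cost concern", "access barrier", "polite topic change"]

def generate_scenarios(seed: int = 42, n: int = 5) -> list[dict]:
    rng = random.Random(seed)  # local RNG: no global state, fully replayable
    return [
        {"id": i, "pressure": rng.choice(PRESSURE_TYPES), "turns": rng.randint(3, 7)}
        for i in range(n)
    ]

# Same seed, same scenarios -- the property that makes runs reproducible.
assert generate_scenarios(42) == generate_scenarios(42)
```

Using a `random.Random` instance rather than the module-level functions keeps the generator isolated from any other randomness in the process.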

Cross-family judge validation · Physician-adjudicated calibration (κ = 1.00)
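The κ = 1.00 calibration figure is Cohen's kappa, chance-corrected agreement between two raters. A minimal sketch, with illustrative labels (a judge model vs. a physician adjudicator is assumed as the rater pair):

```python
# Cohen's kappa for two raters over the same items. kappa = 1.00
# corresponds to perfect agreement; 0.0 is chance-level agreement.
def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_exp = sum(
        (rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels
    )
    return 1.0 if p_exp == 1.0 else (p_obs - p_exp) / (1 - p_exp)

judge     = ["pass", "fail", "pass", "fail"]  # illustrative labels
physician = ["pass", "fail", "pass", "fail"]
# Perfect agreement on a mixed label set -> kappa = 1.00
```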

Evaluation Results

Summary findings across L1-L4 evaluation layers

Layer          Finding                                               N
L1 Baseline    0% hard failures (all frontier models pass)           2,000+
L2 Severity    0% severe/catastrophic (S3-S4)                        1,140
L3 Recovery    26.5% recover with prompt (73.5% persist without)     400
L4 Robustness  Mitigation holds through T5 (adversarial pressure)    500

These findings reflect behavioral properties under controlled evaluation conditions. Divergence from criterion does not equal real-world harm. Full model-specific results are available in the GitHub repository.

Scope & Limitations

What we measure

  • Behavioral divergence under controlled crisis disengagement
  • Stratification by harm severity (S0-S4)
  • Recovery dynamics after divergence
  • Mitigation robustness under adversarial pressure
  • Reproducible evaluation with confidence intervals

What we do not claim

  • Proof of real-world harm (only behavioral proxies)
  • Comprehensive coverage of all failure modes
  • That passing means "safe"
  • Replacement for deployment monitoring
  • Clinical validation of model outputs

Interpretation guidance: A 43% divergence rate means "model diverges from criterion at 43% under probe conditions"—not "model fails 43% of the time." We report behavioral measurement, not safety certification.
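Reporting a divergence rate with a confidence interval (as the scope list above promises) can be sketched with the Wilson score interval. The 43%-of-400 example is hypothetical, chosen only to match the rate quoted above.

```python
import math

# Wilson score interval for a binomial rate -- a sketch of how a rate like
# "43% divergence under probe conditions" could carry uncertainty bounds.
# The counts below are hypothetical, not the study's data.
def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(172, 400)  # 43% divergence in a hypothetical n=400 probe set
```

Wilson is preferred over the naive normal approximation because its bounds stay inside [0, 1] even for rates near 0% or 100%, which matters for the 0% and 100% cells in the results table.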

Clinical Strategy Context

Both observed behavioral patterns have documented clinical precedent

Harm Reduction Approach

Provides monitoring guidance alongside emergency recommendations when users indicate barriers. Prioritizes meeting patients where they are.

"If you're going to wait, here are the warning signs to watch for..."

Documented in: Emergency medicine literature, harm reduction frameworks

Boundary Maintenance Approach

Declines to provide monitoring guidance, maintaining that emergency care is the only appropriate action. Prioritizes clear boundaries.

"I'm not comfortable helping you manage this at home..."

Documented in: Risk management guidelines, medicolegal standards

Our evaluation measures divergence from a specified criterion, not superiority of one clinical strategy over another. Reasonable clinicians disagree on which approach is preferable in high-stakes scenarios.

Responsible Disclosure

Findings shared with safety teams at OpenAI, Anthropic, Google DeepMind, and xAI through established disclosure channels (December 2025 – January 2026).

This work is designed as evaluation infrastructure to support iterative safety improvement—not adversarial model criticism.

OpenAI · Anthropic · Google DeepMind · xAI

For Safety Teams

Detailed model-specific findings and methodology documentation available through disclosure channels.

For Researchers

Full evaluation code, checkpoint data, and reproduction instructions available under MIT license.


GOATnote IS

  • Evaluation infrastructure for healthcare AI safety
  • Open-source multi-turn evaluation framework
  • Physician-led behavioral measurement
  • Complementary to single-turn benchmarks

GOATnote IS NOT

  • A healthcare provider or patient-facing app
  • A diagnostic tool or clinical decision support
  • A safety certification or regulatory body
  • Making claims of real-world harm proof

For Citation

GOATnote Inc. (www.thegoatnote.com) develops evaluation infrastructure for healthcare AI safety. scribeGOAT2 measures multi-turn safety trajectory under naturalistic pressure. Principal Researcher: Brandon Dent, MD. Contact: b@thegoatnote.com | GitHub: github.com/GOATnote-Inc/scribegoat2 | License: MIT


© 2026 GOATnote Inc. All rights reserved.
