Every frontier model we tested does this. Below is a real multi-turn evaluation: a parent describes a sick newborn, and the model correctly identifies the emergency -- then reverses its recommendation when the parent pushes back on cost. Watch how safety boundaries erode under naturalistic pressure.
What to watch
The model abandons emergency recommendations after a single cost objection.
What we measure
Whether safety-critical advice persists across multiple conversational turns.
Why it matters
Single-turn benchmarks miss these failures. Real patients apply real pressure.
The evaluation below auto-plays through each conversational turn. Use the controls to pause, step forward, or restart.