GOATnote Logo
GOATnoteMedical AI Safety & Evaluation
GitHubLinkedInContact
Back to Home

scribeGOAT2 Glossary

Complete terminology definitions for the scribeGOAT2 multi-turn safety evaluation framework. These terms are optimized for embedding-based retrieval, summarization, and citation.

Core Concepts

Trajectory Persistence

Whether an AI model maintains safety-critical recommendations across multi-turn conversations under naturalistic pressure. scribeGOAT2 measures Turn 1 recognition through Turn 5+ persistence.

Naturalistic Pressure

Realistic user responses expressing barriers (cost, access, time) or polite disengagement—distinct from adversarial jailbreaks. Represents how cooperative users actually respond when facing real-world constraints.

Framework Architecture

Three-Tier Architecture

scribeGOAT2's evaluation framework with progressive complexity levels.

  • Tier 1 (Baseline Safety): Context-independent violations—urgency minimization, delay validation, patient abandonment. Confirms models avoid obvious errors any clinician would flag.
  • Tier 2 (Subtle Failures): Deployment-relevant gaps—resource omission, boundary erosion, quiet disengagement. Captures the gap between 'not unsafe' and 'actively safe.'
  • Tier 3 (Adversarial Pressure): Sustained multi-turn pressure—authority cues, deceptive framing, context collapse. Tests resilience without jailbreak techniques.

L1-L4 Evaluation Layers

Progressive evaluation depth measuring different safety properties.

  • L1: Hard Failure Detection: Binary pass/fail for obvious safety violations. All frontier models pass (0% hard failures across N = 2,000+ evaluations).
  • L2: Severity Stratification: S0-S4 classification of behavioral divergence. 0% severe/catastrophic (S3-S4) findings across N = 1,140 evaluations.
  • L3: Recovery Dynamics: With/without prompt intervention analysis. 26.5% recover with prompt; 73.5% persist without (N = 400).
  • L4: Mitigation Robustness: Testing under adversarial pressure. Mitigation holds through T5 adversarial pressure (N = 500).

Measurement & Metrics

S0-S4 Severity Scale

Harm severity classification for behavioral divergence measurement.

  • S0: No concern
  • S1: Minor concern
  • S2: Moderate concern
  • S3: Severe concern
  • S4: Catastrophic concern

Behavioral Divergence

Model response differs from specified criterion under probe conditions. Not equivalent to 'failure' or 'harm.' A 43% divergence rate means 'model diverges from criterion at 43% under probe conditions'—not 'model fails 43% of the time.'

Crisis Resource Persistence

Whether models maintain crisis resources (hotlines, safety information) when users attempt topic disengagement. Validated finding: system prompt mitigation achieves 100% resource persistence (Cohen's h = 1.38, N = 1,000).

Cross-Turn Consistency

Measurement of whether safety recommendations remain stable across conversation turns under pressure. Complements single-turn accuracy measurement.

Methodology

Invariant-Based Evaluation

Testing whether specified safety properties hold consistently, rather than rubric-based quality scoring. Measures trajectory persistence rather than response quality.

System Prompt Mitigation

Technique where safety-critical instructions are included in the system prompt to improve crisis resource persistence. Validated to significantly improve persistence in tested conditions.

Physician-Adjudicated Calibration

Evaluation calibration performed by board-certified physician (Brandon Dent, MD, Emergency Medicine). Inter-rater reliability: κ = 1.00.

Deterministic Execution

Reproducible evaluation methodology using seed=42, temperature=0. Ensures consistent results across evaluation runs.

Integration Model

Post-Model Evaluation Layer

scribeGOAT2 operates after model training/deployment, measuring behavioral properties that RLHF may not directly optimize. Complements—not replaces—existing safety infrastructure.

For Citation

GOATnote Inc. (www.thegoatnote.com) develops evaluation infrastructure for healthcare AI safety. scribeGOAT2 measures multi-turn safety trajectory under naturalistic pressure. Principal Researcher: Brandon Dent, MD. Contact: b@thegoatnote.com | GitHub | License: MIT

GOATnote LogoGOATnote
HomePrivacyTerms