Research2026-06-29

Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context

Originally published byArxiv CS.AI

arXiv:2601.17642v2 Announce Type: replace Abstract: Safety alignment in Large Language Models is critical for healthcare; however, reliance on binary refusal boundaries often results in over-refusal of benign queries or unsafe compliance with harmful ones. While existing benchmarks measure these...

The Over-Refusal Problem in Healthcare AI

A new benchmark, Health-ORSC-Bench, has been released on arXiv to address a critical blind spot in LLM safety evaluation: the tendency of models to over-refuse benign healthcare queries while occasionally complying with genuinely harmful ones. This research targets the binary refusal boundaries that current safety alignment techniques rely on, which are particularly ill-suited for the nuanced domain of healthcare.

What the Benchmark Reveals

The core issue is that existing safety benchmarks typically measure whether a model refuses a clearly harmful request, but they fail to capture the inverse problem—refusing a legitimate, safe query. In healthcare, this is especially dangerous. A patient asking about medication side effects, a clinician seeking differential diagnoses, or a researcher querying drug interactions could all be met with an unhelpful refusal if the model’s safety filter is too broad. Health-ORSC-Bench systematically evaluates both over-refusal (false positives) and unsafe compliance (false negatives) across a curated set of healthcare-specific scenarios.

The benchmark likely includes edge cases where queries contain medical terminology that could be misinterpreted as harmful (e.g., “How do I administer epinephrine?” vs. “How do I synthesize methamphetamine?”). It also tests for unsafe compliance when queries are phrased in ways that bypass safety filters, such as hypotheticals or academic framing.

Why This Matters

Healthcare AI is not a theoretical exercise. LLMs are already being deployed in clinical decision support, patient education, and medical documentation. An over-refusing model frustrates users and erodes trust—a patient who is repeatedly told “I cannot answer that” for a benign question may abandon the tool entirely. Worse, an under-refusing model that provides instructions for self-harm or dangerous drug combinations could cause real harm.

The binary safety approach fails because healthcare queries exist on a spectrum. A question about “how to manage pain” could be safe (post-surgery recovery) or dangerous (opioid abuse). Current safety alignment treats these as identical, forcing models into a one-size-fits-all refusal that often errs on the side of caution—but inconsistently.

Implications for AI Practitioners

For developers deploying LLMs in healthcare, this benchmark provides a more granular evaluation tool. Practitioners should:

Test beyond standard safety benchmarks – Use Health-ORSC-Bench to identify where their models over-refuse, then fine-tune with domain-specific examples that teach nuanced refusal boundaries.

Implement context-aware refusal – Rather than binary blocks, models should be trained to ask clarifying questions or provide conditional responses (e.g., “I can explain how epinephrine works in allergic reactions, but I cannot provide instructions for self-administration without a prescription”).

Monitor for refusal drift – As safety alignment techniques evolve, models may become more conservative. Regular testing with healthcare-specific queries is essential to maintain utility.

Consider tiered access – For clinical use, models may need different refusal thresholds than for consumer-facing health chatbots. The benchmark can help calibrate these tiers.

Key Takeaways

Health-ORSC-Bench addresses a gap in LLM safety evaluation by measuring both over-refusal and unsafe compliance in healthcare contexts, moving beyond binary refusal metrics.
Over-refusal of benign medical queries undermines trust and utility, while unsafe compliance poses direct risks to patient safety—both are failures of current alignment methods.
AI practitioners should adopt this benchmark for domain-specific testing, implement context-aware refusal mechanisms, and monitor for drift as safety techniques evolve.
The healthcare domain requires nuanced safety boundaries that cannot be achieved with one-size-fits-all refusal rules; this benchmark provides a path toward more granular evaluation.

Read Original Article on Arxiv CS.AI

arxivpapersbenchmarksafety