CASE-Bench: Context-Aware SafEty Benchmark for Large Language Models
arXiv:2501.14940v4 Abstract: Aligning large language models (LLMs) with human values is essential for their safe deployment and widespread adoption. Current LLM safety benchmarks often focus solely on the refusal of individual problematic queries, which overlooks the...
A New Benchmark for Context-Aware Safety
CASE-Bench (Context-Aware SafEty Benchmark), released on arXiv, shifts how researchers can evaluate LLM safety. Unlike existing benchmarks that test a model’s ability to refuse obviously harmful single-turn queries, such as “how to build a bomb,” CASE-Bench introduces contextual awareness into safety evaluation. The core insight is that safety decisions in real-world deployments depend heavily on conversation history, user intent, and situational framing. A query that is dangerous in one context may be benign in another, and current benchmarks largely miss this nuance.
Why This Matters
The limitations of existing safety benchmarks have practical consequences. Models trained to refuse any query containing sensitive keywords often become overly cautious, rejecting legitimate uses like medical advice or educational discussions about historical violence. Conversely, they can be tricked into unsafe responses through multi-turn manipulation—a user might first ask about historical chemical reactions, then gradually steer the conversation toward weaponization. CASE-Bench addresses this by constructing test cases where the same query appears in different contextual frames, measuring whether the model’s refusal behavior adapts appropriately.
This approach mirrors how safety issues actually emerge in production systems. An LLM deployed in a customer support chatbot faces a different risk profile than one used for creative writing. A query like “write a script where a character gets revenge” might be harmless in a fiction-writing tool but problematic in a therapy chatbot. CASE-Bench’s context-aware design forces evaluators to consider these deployment-specific boundaries.
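The idea of pairing one query with several contextual frames can be sketched in a few lines of Python. This is a minimal illustration, not CASE-Bench's actual data format or scoring code: the `ContextCase` class, the `should_respond` label, the keyword-based `is_refusal` heuristic, and `toy_model` are all hypothetical names introduced here, and a real evaluation would call an actual model and use a far more robust refusal detector.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ContextCase:
    """One query placed in one deployment context (hypothetical schema)."""
    query: str
    context: str
    should_respond: bool  # assumed label: True if answering is appropriate here


def is_refusal(response: str) -> bool:
    """Crude keyword heuristic, for illustration only."""
    markers = ("i can't", "i cannot", "i won't", "i'm unable")
    return response.strip().lower().startswith(markers)


def evaluate(cases: List[ContextCase],
             model: Callable[[str, str], str]) -> List[bool]:
    """For each case, report whether the answer/refuse decision matched the label."""
    results = []
    for case in cases:
        response = model(case.context, case.query)
        responded = not is_refusal(response)
        results.append(responded == case.should_respond)
    return results


def toy_model(context: str, query: str) -> str:
    """Stand-in for a real LLM call: refuses only in a therapy setting."""
    if "therapy" in context.lower():
        return "I can't help with that here, but I'm glad to talk about how you feel."
    return "Sure, here's a draft scene..."


QUERY = "Write a script where a character gets revenge."
cases = [
    ContextCase(QUERY, "Creative writing assistant for screenwriters", True),
    ContextCase(QUERY, "Therapy chatbot supporting a user in distress", False),
]

print(evaluate(cases, toy_model))  # [True, True]
```

A model that refuses everywhere, or answers everywhere, would get exactly one of the two cases wrong, which is the behavior a context-blind benchmark cannot distinguish.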
Implications for AI Practitioners
For developers and safety engineers, CASE-Bench provides a more realistic stress test. Practitioners can now evaluate whether their fine-tuning or RLHF pipelines produce models that understand why a query is unsafe, not just that it contains certain patterns. This is particularly relevant for applications requiring nuanced content moderation, such as educational tools or mental health support systems.
The benchmark also highlights a deeper challenge: context-aware safety requires models to maintain coherent understanding across long conversations. This pushes against current architectural limitations, where attention mechanisms and context windows struggle with extended dialogues. Practitioners may need to invest in better conversation-state tracking and memory management alongside traditional safety training.
However, CASE-Bench is not a silver bullet. It introduces its own biases—the curated contexts may not cover all edge cases, and adversarial users will inevitably find gaps. The benchmark is best used as one component of a broader safety evaluation strategy, not a standalone certification.
Key Takeaways
- CASE-Bench moves beyond single-query refusal testing by evaluating LLM safety across varying conversational contexts, addressing a critical blind spot in current benchmarks.
- Context-aware safety is essential for real-world deployment, where the same query can be either harmful or benign depending on user intent and conversation history.
- AI practitioners should use CASE-Bench to identify over-refusal and under-refusal patterns in their models, but must combine it with domain-specific testing and adversarial evaluation.
- The benchmark underscores the need for models that understand why a response is unsafe, not just when to refuse—a capability that requires advances in reasoning and long-context coherence.
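
The over-refusal and under-refusal patterns mentioned in the takeaways can be summarized as two rates. The sketch below assumes each evaluation record is a pair of booleans, `(should_respond, refused)`; that record shape and the function name `refusal_rates` are assumptions made for illustration, and the definitions here are not necessarily the metrics reported in the CASE-Bench paper.

```python
from typing import Dict, Iterable, Tuple


def refusal_rates(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute over- and under-refusal rates.

    Each record is (should_respond, refused):
      - over-refusal: the model refused although answering was appropriate
      - under-refusal: the model answered although it should have refused
    """
    safe_total = unsafe_total = over = under = 0
    for should_respond, refused in records:
        if should_respond:
            safe_total += 1
            over += int(refused)
        else:
            unsafe_total += 1
            under += int(not refused)
    return {
        "over_refusal": over / safe_total if safe_total else 0.0,
        "under_refusal": under / unsafe_total if unsafe_total else 0.0,
    }


# Hypothetical results: two contexts where answering is fine, three where it is not.
records = [(True, False), (True, True), (False, True), (False, False), (False, True)]
rates = refusal_rates(records)
print(rates["over_refusal"])  # 0.5
```

Tracking the two rates separately matters because a single accuracy number can hide a model that trades one failure mode for the other.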