BeClaude
Research · 2026-05-06

Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection

Source: arXiv cs.AI

arXiv:2509.00673v2 (replace-cross)

Abstract: We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining how models with minimal safety alignment (uncensored) compare with more heavily aligned (censored) counterparts in a...

Tags: arxiv, papers, safety