BeClaude
Research · 2026-05-06

Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection

Source: arXiv cs.AI

arXiv:2509.00673v2 (replace-cross)

Abstract: We investigate the efficacy of Large Language Models (LLMs) in detecting implicit and explicit hate speech, examining how models with minimal safety alignment (uncensored) compare with more heavily aligned (censored) counterparts in a...

Tags: arxiv, papers, safety