Research 2026-06-19

Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text

arXiv:2606.19637v1 (announce type: cross). Abstract: Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality...

The Unseen Hand of Dataset Construction in Clinical NLP

A new arXiv preprint (arXiv:2606.19637v1) challenges a foundational assumption in clinical natural language processing: that electronic health records (EHRs) provide a more reliable ground truth for detecting suicidality than social media data. The researchers argue that the very act of constructing these datasets (choosing which notes to include, how to define suicidal behaviors, and which clinician annotations to trust) introduces systematic biases that are often invisible to downstream model developers.

What the Research Reveals

The paper contends that EHR-based suicidality detection suffers from what might be called a "documentation paradox." Clinicians document suicidal ideation and behaviors inconsistently, influenced by factors such as patient demographics, institutional protocols, and even the time of day or shift changes. One clinician may record a patient's explicit mention of suicidal thoughts in a structured field, while another buries the same kind of disclosure in free-text narrative notes. The dataset construction process, in deciding which documents constitute "positive" cases, thus encodes these clinical workflows as ground truth rather than capturing the actual prevalence or nature of suicidality.

Furthermore, the authors highlight that EHR data is inherently retrospective and shaped by treatment decisions. A patient admitted after a suicide attempt will have rich documentation, while someone with passive suicidal ideation seen in an outpatient setting may have minimal records. This creates a dataset where severe, acute cases are overrepresented, while subtler or chronic presentations are systematically excluded.

Why This Matters for AI Practitioners

This critique strikes at a core tension in applied machine learning: the gap between what we want to predict (suicidality as a clinical phenomenon) and what we actually label (documented suicidality in a specific healthcare system). For AI practitioners building clinical NLP tools, the implications are threefold:

First, performance metrics can be misleading. A model achieving high accuracy on an EHR-derived test set may simply be learning to recognize documentation patterns—such as the presence of a psychiatry consult note or specific billing codes—rather than genuine clinical risk. When deployed in a different hospital system with different documentation practices, performance can collapse.

Second, bias amplification is a real risk. If certain patient groups (e.g., those with private insurance, or those seen in academic medical centers) are more likely to have detailed clinical notes, models will perform better for those populations while failing for underserved groups whose suicidality is systematically underdocumented.

Third, the framing of "ground truth" matters for regulatory approval. Clinical NLP tools intended for suicide risk screening may face scrutiny from regulators who assume EHR data represents objective clinical reality. This research suggests developers need to be transparent about how datasets were constructed and what assumptions about documentation practices are baked into their models.
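One rough way to probe the first point above (a model recognizing documentation patterns rather than clinical risk) is to ask how well the labels can be predicted from documentation metadata alone, with no note text at all. The sketch below is illustrative only: the records, the field names (has_psych_consult, shift, label), and both helper functions are hypothetical and do not come from the paper. The score is computed in-sample, so it is an optimistic diagnostic, not an evaluation.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: one row per note, metadata plus a binary label.
# Field names are illustrative assumptions, not taken from the preprint.
records = [
    {"has_psych_consult": True,  "shift": "day",   "label": 1},
    {"has_psych_consult": True,  "shift": "day",   "label": 1},
    {"has_psych_consult": True,  "shift": "night", "label": 1},
    {"has_psych_consult": True,  "shift": "day",   "label": 0},
    {"has_psych_consult": False, "shift": "day",   "label": 0},
    {"has_psych_consult": False, "shift": "night", "label": 0},
    {"has_psych_consult": False, "shift": "night", "label": 0},
    {"has_psych_consult": False, "shift": "night", "label": 1},
]


def majority_baseline_accuracy(rows, label_key="label"):
    """Accuracy of always predicting the most common label."""
    counts = Counter(row[label_key] for row in rows)
    return max(counts.values()) / len(rows)


def single_feature_accuracy(rows, feature, label_key="label"):
    """In-sample accuracy of predicting the label from one metadata field.

    For each value of the field, predict the most common label among rows
    with that value. No note text is used.
    """
    labels_by_value = defaultdict(Counter)
    for row in rows:
        labels_by_value[row[feature]][row[label_key]] += 1
    correct = sum(max(c.values()) for c in labels_by_value.values())
    return correct / len(rows)


baseline = majority_baseline_accuracy(records)
for field in ("has_psych_consult", "shift"):
    score = single_feature_accuracy(records, field)
    print(f"{field}: {score:.2f} (majority baseline {baseline:.2f})")
```

On this toy data the consult flag alone reaches 0.75 accuracy against a 0.50 majority baseline, while shift adds nothing. A gap like that on real data would suggest that the labels are partly recoverable from workflow artifacts, and that a text model's headline score deserves a closer look.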

Key Takeaways

EHR-based suicidality datasets encode clinical documentation practices as much as they capture patient risk, creating a hidden source of systematic bias.
Model performance on these datasets may reflect the ability to recognize documentation patterns rather than genuine clinical signals, leading to overconfidence in deployment.
AI practitioners should audit their training data for documentation artifacts (e.g., note length, specialty of author, temporal patterns) and consider stratified evaluation across patient subgroups.
Transparency about dataset construction methodology—including inclusion criteria, annotation guidelines, and known documentation gaps—is essential for building trustworthy clinical NLP systems.
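The stratified evaluation suggested in the takeaways can be sketched in a few lines. This is a minimal illustration under stated assumptions: the examples, the insurance field, and the recall_by_subgroup helper are hypothetical, and recall is measured against documented labels, so it cannot reveal cases that were never documented in the first place.

```python
from collections import defaultdict

# Hypothetical toy predictions: "label" is the documented label, "pred" is
# the model output, "insurance" is an illustrative subgroup field.
examples = [
    {"insurance": "private", "label": 1, "pred": 1},
    {"insurance": "private", "label": 1, "pred": 1},
    {"insurance": "private", "label": 1, "pred": 1},
    {"insurance": "private", "label": 1, "pred": 0},
    {"insurance": "private", "label": 0, "pred": 0},
    {"insurance": "public",  "label": 1, "pred": 1},
    {"insurance": "public",  "label": 1, "pred": 0},
    {"insurance": "public",  "label": 0, "pred": 0},
    {"insurance": "public",  "label": 0, "pred": 1},
]


def recall_by_subgroup(rows, group_key):
    """Recall (sensitivity) per subgroup.

    Subgroups with no positive labels are omitted, since recall is
    undefined for them.
    """
    positives = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        if row["label"] == 1:
            positives[row[group_key]] += 1
            if row["pred"] == 1:
                hits[row[group_key]] += 1
    return {group: hits[group] / n for group, n in positives.items()}


for group, recall in sorted(recall_by_subgroup(examples, "insurance").items()):
    print(f"{group}: recall {recall:.2f}")
```

Here the pooled recall (4 of 6) hides a gap between the two groups (0.75 versus 0.50), which is the kind of disparity an aggregate metric can mask.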

