Research · 2026-06-30

Em-ergence of the em-dash: a population-level rise in em-dash frequency in medRxiv preprints at the dawn of the large-language-model era

Originally published by Arxiv CS.AI

arXiv:2606.29540v1. Abstract: Large language models (LLMs) can leave subtle stylistic traces in assisted text; one of the most cited is the em-dash (Unicode U+2014). Yet no one has measured whether em-dash use has changed in the scientific literature. This study, pre-registered...

The Em-Dash as a Digital Fossil

A new preprint on arXiv (2606.29540v1) reports something surprisingly concrete: a statistically significant rise in em-dash (U+2014) usage across medRxiv preprints, coinciding with the widespread adoption of large language models. The researchers pre-registered their study and analyzed a large corpus of preprints, finding that em-dash frequency began climbing sharply around late 2022, when ChatGPT was released and LLMs entered mainstream use.

This is not a trivial stylistic curiosity. The em-dash is a specific typographic marker that LLMs tend to overuse compared to human writers, likely because training data from formal writing (books, academic papers) contains many em-dashes for parenthetical clauses, and the model learns to reproduce this pattern excessively. The study controlled for confounding factors like changes in author demographics or editorial policies, strengthening the case that LLM-assisted writing is the primary driver.
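To make the kind of measurement described above concrete, here is a minimal Python sketch. It is not the paper's method, which this summary does not spell out. The normalization (em-dashes per 1,000 words), the cutoff date (the public release of ChatGPT on 30 November 2022), and the function names `em_dash_rate` and `mean_rate_by_period` are all assumptions chosen for illustration.

```python
from datetime import date
from statistics import mean

EM_DASH = "\u2014"
# Assumed split point: the public release of ChatGPT. The paper may use another.
CUTOFF = date(2022, 11, 30)


def em_dash_rate(text: str) -> float:
    """Em-dashes per 1,000 whitespace-delimited words (0.0 for empty text)."""
    words = len(text.split())
    if words == 0:
        return 0.0
    return 1000.0 * text.count(EM_DASH) / words


def mean_rate_by_period(docs):
    """Average the per-document rate before and after CUTOFF.

    `docs` is an iterable of (posting_date, full_text) pairs.
    Returns (mean_before, mean_after); a side with no documents is None.
    """
    before, after = [], []
    for posted, text in docs:
        (before if posted < CUTOFF else after).append(em_dash_rate(text))
    return (
        mean(before) if before else None,
        mean(after) if after else None,
    )
```

A simple before/after comparison like this only shows that a difference exists. Attributing it to LLMs would still require the controls for confounders that the study is said to apply.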

Why This Matters

The finding has three important implications:

First, it provides a scalable, low-cost signal for detecting LLM-generated or LLM-assisted text in scientific publishing. Unlike complex watermarking schemes or AI-detection tools that are easily evaded, stylistic markers like em-dash frequency are embedded in the writing process itself. Journals and preprint servers could monitor such shifts as a red flag, though this is not a silver bullet: adversarial users could deliberately reduce em-dash usage.

Second, the study highlights how LLMs are homogenizing scientific writing style. If em-dash frequency is rising across the board, other stylistic quirks likely are too. This could erode the distinctiveness of individual author voices and potentially mask deeper problems like reduced novelty in AI-generated content.

Third, the research underscores the need for pre-registered, longitudinal studies of AI's impact on academic output. Most discussions of LLM-generated text rely on anecdotal evidence or small samples. This paper provides a replicable methodology that could be extended to other punctuation marks, sentence structures, or vocabulary choices.
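As an illustration of how the same counting approach might extend to other punctuation marks, the following Python sketch computes a small per-document profile. The set of marks in `MARKS`, the per-1,000-words normalization, and the function name `punctuation_profile` are hypothetical choices for this example, not something taken from the paper.

```python
from collections import Counter

# Hypothetical marker set for illustration; this summary only reports the em-dash.
MARKS = {
    "em_dash": "\u2014",
    "en_dash": "\u2013",
    "semicolon": ";",
    "colon": ":",
}


def punctuation_profile(text: str) -> dict[str, float]:
    """Occurrences of each mark per 1,000 whitespace-delimited words."""
    words = len(text.split())
    if words == 0:
        return {name: 0.0 for name in MARKS}
    counts = Counter(text)  # character-level counts over the whole text
    return {name: 1000.0 * counts[ch] / words for name, ch in MARKS.items()}
```

Tracking several marks together would make it easier to tell a general shift in punctuation habits from a change specific to one character.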

Implications for AI Practitioners

For developers and users of LLMs in scientific contexts, this is a practical caution:

Stylistic fingerprints are real and measurable. If you use an LLM to draft or polish a manuscript, you may be leaving detectable traces, even if you carefully fact-check the content. The em-dash is just one example; future studies will likely find more.
Prompt engineering can mitigate this. Explicitly instructing the model to avoid overusing em-dashes, or to mimic the stylistic patterns of a specific journal or author, may reduce the signal. However, this requires awareness and effort.
The scientific community needs norms. Should LLM-assisted writing be disclosed? Should journals set style guidelines that explicitly discourage LLM-typical patterns? These are open questions, but the data now exist to inform the debate.
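Alongside prompt instructions, an author could review a draft directly. The sketch below lists the sentences that contain an em-dash so they can be checked by hand. The function name `sentences_with_em_dash` and the naive sentence splitter are assumptions for illustration, and the splitter will mishandle abbreviations such as "et al.".

```python
import re

EM_DASH = "\u2014"


def sentences_with_em_dash(draft: str) -> list[str]:
    """Return the sentences in `draft` that contain at least one em-dash."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", draft.strip())
    return [s for s in sentences if EM_DASH in s]
```

Whether to rewrite a flagged sentence remains an editorial judgment; an em-dash is not evidence of LLM use in any single sentence.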

Key Takeaways

Em-dash frequency in medRxiv preprints has risen significantly since late 2022, correlating with LLM adoption.
This provides a measurable, low-cost proxy for detecting LLM-assisted writing in scientific literature.
AI practitioners should be aware that stylistic artifacts (like punctuation overuse) can persist even in carefully edited LLM output.
The study demonstrates the value of pre-registered, longitudinal analysis for understanding AI's real-world impact on academic writing.
