Research · 2026-06-24

Predicting Poets' Origins from Verse: A Computational Analysis of Regional Linguistic Fingerprints in the Complete Tang Poems

arXiv:2606.24093v1 Announce Type: cross Abstract: We ask whether the geographic origin of Tang-dynasty poets leaves a detectable linguistic trace in their work. Aggregating every poem attributed to each author in the Complete Tang Poems (Quan Tang Shi) and linking poets to their administrative...

What Happened

Researchers have applied computational linguistics to the Complete Tang Poems (Quan Tang Shi), asking whether a poet’s geographic origin leaves a measurable linguistic fingerprint in their verse. By aggregating all attributed poems per author and linking each poet to their administrative birthplace, the study treats the anthology's nearly 50,000 poems as a dataset for regional dialect and stylistic analysis. The abstract excerpt does not name the method; it likely involves vectorizing character sequences, n-gram frequencies, or tonal patterns, then training classifiers to predict origin from text alone.
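The pipeline described above (aggregate by author, count character n-grams, classify by origin) can be sketched in a few lines. This is a minimal illustration, not the paper's method: the choice of character bigrams, the multinomial Naive Bayes classifier, the function names, the author names, the region labels, and the toy strings (which are not real poems) are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def author_features(poems, n=2):
    """Sum n-gram counts over every poem attributed to each author."""
    feats = defaultdict(Counter)
    for author, text in poems:
        feats[author].update(char_ngrams(text, n))
    return dict(feats)


def train_nb(features, origins, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    region_counts = defaultdict(Counter)
    region_authors = Counter()
    for author, counts in features.items():
        region = origins[author]
        region_counts[region].update(counts)
        region_authors[region] += 1
    vocab = set()
    for counts in region_counts.values():
        vocab.update(counts)
    total_authors = sum(region_authors.values())
    model = {}
    for region, counts in region_counts.items():
        denom = sum(counts.values()) + alpha * len(vocab)
        model[region] = {
            "prior": math.log(region_authors[region] / total_authors),
            "loglik": {g: math.log((counts[g] + alpha) / denom) for g in vocab},
        }
    return model


def predict(model, counts):
    """Return the region with the highest posterior log-probability.

    N-grams never seen in training are skipped.
    """
    def score(region):
        params = model[region]
        return params["prior"] + sum(
            c * params["loglik"][g]
            for g, c in counts.items()
            if g in params["loglik"]
        )
    return max(model, key=score)


# Toy placeholder data: invented strings and hypothetical author/region labels.
poems = [
    ("author_a", "江南春水"),
    ("author_a", "江南煙雨"),
    ("author_b", "塞北風沙"),
    ("author_b", "塞北寒雪"),
]
origins = {"author_a": "south", "author_b": "north"}

model = train_nb(author_features(poems), origins)
print(predict(model, Counter(char_ngrams("江南煙水"))))
```

Summing counts per poem, rather than concatenating poems first, avoids creating spurious n-grams that straddle the boundary between two poems.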

This is not a trivial task: Classical Chinese poetry is highly formalized, with strict prosodic rules and a shared literary lexicon that might seem to erase regional variation. If the reported signal holds up, it suggests that even within a rigid poetic tradition, subconscious habits (choice of function words, dialect-influenced rhyme, or topical preferences) persist.

Why It Matters

This work sits at the intersection of digital humanities, sociolinguistics, and AI. For historians, it offers a quantitative tool for testing disputed attributions and for narrowing the provenance of anonymous poems, a long-standing problem in classical Chinese scholarship. If a poem’s linguistic profile strongly matches a known region, that evidence could narrow the field of candidate authors.

For NLP researchers, the study demonstrates that domain-specific, low-resource languages (here, literary Chinese) can still yield signal when treated with careful feature engineering. It also highlights a methodological caution: if regional fingerprints survive in such constrained poetry, then modern NLP models trained on text from diverse geographic sources may encode unintended regional biases—even when the topic is universal.
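One way to make such an audit concrete is to check whether region predictions on held-out authors beat what chance alone would produce. The sketch below is an assumption-laden illustration, not a procedure taken from the paper: it compares accuracy with a majority-class baseline and runs a simple label-permutation test. The function names and the toy labels are hypothetical.

```python
import random
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def majority_baseline(actual):
    """Accuracy obtained by always guessing the most common label."""
    return Counter(actual).most_common(1)[0][1] / len(actual)


def permutation_p_value(predicted, actual, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels match the predictions at least as well."""
    rng = random.Random(seed)
    observed = accuracy(predicted, actual)
    shuffled = list(actual)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if accuracy(predicted, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Toy placeholder labels: 20 held-out authors, 4 of them misclassified.
actual = ["north"] * 10 + ["south"] * 10
predicted = ["north"] * 8 + ["south"] * 2 + ["south"] * 8 + ["north"] * 2

print("accuracy:", accuracy(predicted, actual))
print("majority baseline:", majority_baseline(actual))
print("permutation p-value:", permutation_p_value(predicted, actual))
```

A small p-value here only says that the predictions track the region labels better than chance; it does not say which features carry the signal or whether topic, rather than dialect, is responsible.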

Implications for AI Practitioners

1. Regional bias in language models may run deeper than topic. Most bias detection focuses on demographic or topical skew. This study implies that syntax and prosody can carry geographic markers as well. Practitioners fine-tuning models on Chinese text (or any language with regional variation) should audit for dialectal leakage that could affect downstream tasks like sentiment analysis or translation.
2. Metadata enrichment for low-resource historical text. The approach of aggregating all works per author and linking to metadata is a template for other historical corpora. AI teams working on ancient Greek, medieval Latin, or pre-modern Japanese could adopt similar methods to enrich sparse datasets with provenance labels.
3. Interpretability over black-box classification. The paper likely uses logistic regression or tree-based models rather than deep learning, since interpretability matters in humanities research. Practitioners should note that for many real-world problems, especially where domain experts need to trust the output, simpler, explainable models are often preferable to opaque neural nets.
4. Cross-lingual transfer potential. If regional linguistic fingerprints exist in Tang poetry, similar patterns may exist in other poetic traditions (e.g., Homeric Greek, Old English). AI tools for authorship attribution or dialect identification could be adapted across languages by reusing the same feature extraction logic, although the models themselves would need retraining on each corpus.
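To illustrate the interpretability point, a domain expert can inspect which features separate two regions without any trained model at all. The sketch below ranks n-grams by a smoothed log-ratio of their relative frequencies. It is an assumed illustration rather than the paper's analysis; the function name, the smoothing constant, and the toy counts are hypothetical.

```python
from collections import Counter
import math


def log_ratio(counts_a, counts_b, alpha=0.5):
    """Smoothed log-ratio of relative frequencies per feature.

    Positive scores mark features more typical of region A,
    negative scores mark features more typical of region B.
    """
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    scores = {}
    for g in vocab:
        p_a = (counts_a.get(g, 0) + alpha) / total_a
        p_b = (counts_b.get(g, 0) + alpha) / total_b
        scores[g] = math.log(p_a / p_b)
    return scores


# Toy placeholder counts, not drawn from any real corpus.
south = Counter({"江南": 5, "春水": 3, "明月": 2})
north = Counter({"塞北": 5, "風沙": 3, "明月": 2})

scores = log_ratio(south, north)
ranked = sorted(scores, key=scores.get, reverse=True)
print("most south-typical:", ranked[0])
print("most north-typical:", ranked[-1])
```

A feature shared equally by both regions scores zero, so the ranked list surfaces only the markers a philologist would want to examine by hand.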

Key Takeaways

Tang dynasty poems appear to contain detectable regional linguistic markers despite formalized literary conventions, which would allow birthplace prediction from verse alone.
The study provides a reproducible methodology for linking author metadata to aggregated text, applicable to other historical corpora.
AI practitioners should audit language models for geographic bias that persists even in topic-neutral text, especially in high-stakes cultural or historical applications.
Interpretable machine learning models remain valuable in fields where domain experts must validate and trust the output, rather than relying on black-box predictions.

