Research · 2026-06-19

Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards

arXiv:2606.19352v1 (announce type: cross). Abstract: Sign languages are expressive visual languages used by Deaf and Hard-of-Hearing (DHH) communities. Despite substantial progress in sign-language recognition, translation, and production, advances remain constrained by fragmented datasets,...

The Data Fragmentation Problem in Sign Language AI

A new survey on arXiv (arXiv:2606.19352v1) catalogues the state of sign-language datasets and identifies a critical bottleneck: the field’s progress is constrained by fragmented, inconsistent, and often siloed data resources. The paper examines the available benchmarks, the annotation standards in use, and the gaps that prevent scalable model development.

What the Survey Reveals

The survey’s core finding is that sign-language AI research suffers from a classic “Tower of Babel” problem. Datasets differ widely in size, signing style, vocabulary coverage, and, most critically, annotation methodology. Some datasets use gloss-based annotations (labelling each sign with a written word), others use phonetic or pose-based systems, and many lack standardized metadata about signer demographics, recording conditions, or linguistic context. This heterogeneity makes cross-dataset training, model comparison, and reproducible benchmarking very difficult.
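To make the heterogeneity concrete, here is a minimal Python sketch of mapping two differently structured sources onto one shared record. It is an illustration only: the `SignRecord` schema, both dataset layouts, and every field name are assumptions made for this example, not formats described in the survey.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SignRecord:
    """Hypothetical shared record; every field name is an assumption."""
    dataset: str
    language: str                   # ISO 639-3 code, e.g. "ase" for ASL
    signer_id: str                  # prefixed with the dataset to avoid collisions
    glosses: tuple[str, ...]
    start_ms: int
    end_ms: int
    lighting: Optional[str] = None  # None when the source records no such metadata


def from_dataset_a(row: dict) -> SignRecord:
    """Hypothetical dataset A: space-separated glosses, times in seconds."""
    return SignRecord(
        dataset="A",
        language=row["lang"],
        signer_id=f"A:{row['signer']}",
        glosses=tuple(g.upper() for g in row["gloss"].split()),
        start_ms=round(row["start_s"] * 1000),
        end_ms=round(row["end_s"] * 1000),
        lighting=row.get("lighting"),
    )


def from_dataset_b(row: dict) -> SignRecord:
    """Hypothetical dataset B: gloss list, times as frame indices at a known fps."""
    ms_per_frame = 1000 / row["fps"]
    return SignRecord(
        dataset="B",
        language=row["language"],
        signer_id=f"B:{row['signer_id']}",
        glosses=tuple(g.upper() for g in row["glosses"]),
        start_ms=round(row["first_frame"] * ms_per_frame),
        end_ms=round(row["last_frame"] * ms_per_frame),
    )


# Two rows in different source layouts end up in one comparable form.
merged = [
    from_dataset_a({"lang": "ase", "signer": 3, "gloss": "book give",
                    "start_s": 1.25, "end_s": 2.0, "lighting": "studio"}),
    from_dataset_b({"language": "ase", "signer_id": "s7", "glosses": ["Book"],
                    "first_frame": 25, "last_frame": 50, "fps": 25}),
]
```

Missing metadata is kept as an explicit `None` rather than guessed, so downstream code can tell "not recorded" apart from a real value. Normalizing gloss case and time units is the easy part; reconciling gloss vocabularies that differ between datasets is a harder problem that this sketch does not attempt.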

The paper also highlights a severe geographic and linguistic imbalance. The vast majority of datasets focus on American Sign Language (ASL) and a handful of European sign languages, while hundreds of other sign languages, each with its own grammar and vocabulary, remain virtually unrepresented. Models trained on this data inherit the same bias.

Why This Matters

For the Deaf and Hard-of-Hearing (DHH) communities, the promise of real-time sign-language translation, accessible AI assistants, and inclusive communication tools depends on robust, generalizable models. Current fragmentation means that a model trained on one dataset may fail catastrophically on even slightly different signing styles or regional variations. This is not merely an academic inconvenience: it directly limits the real-world utility and safety of deployed systems.

For AI researchers, the survey underscores a deeper structural issue: sign languages are full natural languages, each with its own syntax, morphology, and spatial grammar. Treating them as a simple “gesture recognition” problem leads to datasets that capture surface-level motion patterns while missing the linguistic structure. Without annotation standards that reflect linguistic reality, models will continue to plateau.

Implications for AI Practitioners

Practitioners should view this survey as a call for deliberate data strategy. If you are building sign-language applications, investing in a single dataset is a high-risk bet. The survey suggests that future progress will depend on:

Adopting interoperable annotation frameworks that allow merging multiple datasets.
Prioritizing linguistic annotation (e.g., gloss, non-manual markers, spatial relations) over raw pose or video alone.
Actively seeking out or funding data collection for underrepresented sign languages to avoid perpetuating ASL-centric bias.
Building evaluation benchmarks that test for generalization across signers, lighting conditions, and dialects.
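The last point can be sketched in a few lines of Python. The example below is an assumption-laden illustration, not a method from the survey: it splits data so that no signer appears in both train and test, then reports accuracy per group (for instance per lighting condition or dialect) instead of a single pooled number. The record keys `signer_id`, `label`, `prediction`, and `lighting` are hypothetical.

```python
import random
from collections import defaultdict


def signer_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split records so that no signer appears in both train and test."""
    signers = sorted({r["signer_id"] for r in records})
    rng = random.Random(seed)          # seeded for a reproducible split
    rng.shuffle(signers)
    n_test = max(1, round(len(signers) * test_fraction))
    test_signers = set(signers[:n_test])
    train = [r for r in records if r["signer_id"] not in test_signers]
    test = [r for r in records if r["signer_id"] in test_signers]
    return train, test


def accuracy_by_group(examples, key):
    """Accuracy per value of `key`, e.g. per lighting condition or dialect."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        totals[ex[key]] += 1
        hits[ex[key]] += int(ex["label"] == ex["prediction"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy data: 5 signers with 2 clips each (entirely made up).
records = [{"signer_id": f"s{i}", "clip": j} for i in range(5) for j in range(2)]
train, test = signer_disjoint_split(records)

# Toy predictions grouped by a hypothetical "lighting" field.
scored = [
    {"lighting": "studio", "label": "BOOK", "prediction": "BOOK"},
    {"lighting": "studio", "label": "GIVE", "prediction": "GIVE"},
    {"lighting": "dim", "label": "BOOK", "prediction": "GIVE"},
    {"lighting": "dim", "label": "GIVE", "prediction": "GIVE"},
]
per_lighting = accuracy_by_group(scored, "lighting")
```

Splitting by signer rather than by clip matters because clips from the same person share appearance and signing habits, so a random clip-level split tends to overstate how well a model handles unseen signers. A per-group breakdown makes a weak condition visible even when the pooled accuracy looks acceptable.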

The survey does not offer a silver bullet, but it provides a comprehensive map of the problem space. For any team serious about sign-language AI, this paper should be required reading before collecting a single frame of video.

Key Takeaways

Sign-language AI is held back by fragmented, inconsistently annotated datasets that prevent model generalization and fair benchmarking.
Most existing resources focus on ASL and a few European sign languages, leaving the vast majority of sign languages unrepresented.
Annotation standards must evolve to capture linguistic structure (grammar, non-manual signals) rather than just surface-level motion.
AI practitioners should prioritize interoperable data strategies, invest in underrepresented languages, and build evaluation benchmarks that test real-world robustness.

Source: arXiv cs.AI
