Research, 2026-07-03

How Should Transformers Encode Numeric Values in Electronic Health Records?

Originally published by arXiv cs.AI

arXiv:2607.01391v1 (announce type: cross). Abstract: How do we encode numeric values in transformer-based sequence processing, particularly in electronic health record (EHR) data? We systematically compare discrete, continuous, and hybrid value encoding strategies using synthetic arithmetic tasks...

This new arXiv paper tackles a deceptively simple question with real consequences for clinical AI: how should transformer models represent numbers in electronic health records (EHRs)? The research systematically compares discrete tokenization (treating numbers like words), continuous encoding (keeping them as floats), and hybrid strategies, using synthetic arithmetic tasks as a controlled testbed.

What the Research Actually Did

The authors constructed synthetic tasks mimicking real EHR arithmetic, such as calculating a change in lab values or computing a risk score from multiple inputs. They then fed these tasks to transformer models using different numeric encoding schemes. The goal was to isolate the encoding as the only variable and measure which approach let the model do the arithmetic most accurately.
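To make the idea of a synthetic "change in lab value" task concrete, here is a minimal sketch of a generator for such examples. The prompt format, the value range, and the function names (`make_delta_example`, `make_dataset`) are assumptions for illustration only; the paper's actual task construction is not described in this summary.

```python
import random


def make_delta_example(rng, low=0.5, high=5.0, decimals=1):
    """Build one synthetic example: two lab readings and their difference.

    The prompt layout and value range are illustrative assumptions,
    not the format used in the paper.
    """
    before = round(rng.uniform(low, high), decimals)
    after = round(rng.uniform(low, high), decimals)
    prompt = f"lab_before={before} lab_after={after} delta=?"
    target = round(after - before, decimals)
    return prompt, target


def make_dataset(n, seed=0):
    """Return n (prompt, target) pairs, reproducible for a given seed."""
    rng = random.Random(seed)
    return [make_delta_example(rng) for _ in range(n)]


if __name__ == "__main__":
    for prompt, target in make_dataset(3, seed=42):
        print(prompt, "->", target)
```

Because the data is generated rather than collected, the same seed reproduces the same examples, and only the way the numbers are encoded for the model needs to change between runs.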

The core finding is that naive tokenization, which simply breaks "37.5" into the tokens ["3", "7", ".", "5"], performs poorly on even basic arithmetic. Continuous encodings (like learned embeddings for scalar values) fared better, but hybrid approaches that combine discrete positional information with continuous value representations showed the strongest performance. This suggests that transformers need both the magnitude of a number (continuous) and its contextual role (discrete) to reason effectively about clinical data.
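The three families of encoding can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation. The continuous encoder is assumed to be a linear projection of the scalar, the hybrid encoder is assumed to add a bin-indexed lookup vector to that projection, and the weights, bin edges, and function names are all hypothetical stand-ins for parameters a real model would learn.

```python
import bisect


def discrete_tokens(text):
    """Discrete: one token per character, so the magnitude is only implicit."""
    return list(text)


def continuous_embedding(value, weight, bias):
    """Continuous: project the scalar into a vector (assumed linear map)."""
    return [w * value + b for w, b in zip(weight, bias)]


def hybrid_embedding(value, bin_edges, bin_table, weight, bias):
    """Hybrid (assumed form): discrete bin vector plus continuous projection."""
    bin_id = bisect.bisect_right(bin_edges, value)  # index of the value's bin
    token_vec = bin_table[bin_id]                   # discrete lookup
    value_vec = continuous_embedding(value, weight, bias)
    return [t + v for t, v in zip(token_vec, value_vec)]


if __name__ == "__main__":
    # Hypothetical parameters; a trained model would learn these.
    weight, bias = [0.5, 2.0], [0.0, 1.0]
    bin_edges = [36.0, 38.0]  # three bins: below 36, 36 to 38, 38 and above
    bin_table = [[0.0, 0.0], [1.0, -1.0], [2.0, -2.0]]

    print(discrete_tokens("37.5"))
    print(continuous_embedding(37.5, weight, bias))
    print(hybrid_embedding(37.5, bin_edges, bin_table, weight, bias))
```

In the discrete case the model must reconstruct the magnitude from a character sequence, while the continuous and hybrid vectors carry it directly.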

Why This Matters for Clinical AI

EHR data is dense with numeric values: blood pressure readings, medication dosages, lab results, age, BMI. Each of these requires the model to understand not just that a number is present, but what it means in relation to other numbers. A systolic BP of 180 is not just the token "180"; it crosses a critical threshold that triggers clinical action.

Many EHR-based transformer models currently tokenize numbers as plain text, often with poor results on anything that requires numerical reasoning. A model that reads "the patient's creatinine rose from 0.8 to 2.1" as a meaningful change rather than a sequence of unrelated tokens should produce far more reliable clinical predictions. This research provides a rigorous framework for evaluating those encoding choices before deploying models in high-stakes medical settings.
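One possible preprocessing step that keeps numbers from being split into unrelated tokens is to mask each number with a placeholder and carry the values in a parallel list, so that a value encoder can handle them separately. The sketch below is an assumption about how that could look, not something taken from the paper; the `[NUM]` placeholder and the `split_numbers` helper are hypothetical.

```python
import re

# Matches unsigned integers and decimals such as "180" or "0.8".
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def split_numbers(text, placeholder="[NUM]"):
    """Return (masked_text, values): numbers replaced by a placeholder,
    with their float values kept in order of appearance."""
    values = [float(match) for match in NUMBER.findall(text)]
    masked = NUMBER.sub(placeholder, text)
    return masked, values


if __name__ == "__main__":
    note = "the patient's creatinine rose from 0.8 to 2.1"
    masked, values = split_numbers(note)
    print(masked)
    print(values)
    print("change:", round(values[1] - values[0], 2))
```

With the values kept as floats, the change between the two readings is a single subtraction rather than something the model has to infer from digit tokens.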

Implications for AI Practitioners

For developers building clinical NLP systems, this paper offers a clear warning: your choice of numeric encoding is not a minor implementation detail. It is a first-order architectural decision that directly impacts model performance on tasks requiring numerical reasoning. The hybrid approach suggested here, which preserves both the continuous value and its discrete context, aligns well with recent work on position encoding and relative representations.

Practically, this means practitioners should benchmark their numeric encoding strategies on simple arithmetic probes before scaling to full EHR datasets. A model that fails at "add 5 to 7" is unlikely to succeed at "calculate the change in eGFR over three visits." The paper's synthetic task methodology is itself a valuable contribution, offering a reusable evaluation framework.
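A simple arithmetic probe of this kind can be sketched as follows. The probe wording, the tolerance, and the `predict` callable (any function that maps a prompt string to a number) are assumptions for illustration; `oracle` is a stand-in that only exists to show the harness running, and a real evaluation would wrap the model under test instead.

```python
import random


def make_addition_probes(n, seed=0, max_operand=20):
    """Generate n (prompt, target) pairs of the form 'add a to b'."""
    rng = random.Random(seed)
    probes = []
    for _ in range(n):
        a = rng.randint(0, max_operand)
        b = rng.randint(0, max_operand)
        probes.append((f"add {a} to {b}", a + b))
    return probes


def probe_accuracy(predict, probes, tol=1e-6):
    """Fraction of probes where predict(prompt) is within tol of the target.

    Answers that cannot be converted to a float count as wrong.
    """
    if not probes:
        return 0.0
    correct = 0
    for prompt, target in probes:
        try:
            guess = float(predict(prompt))
        except (TypeError, ValueError):
            continue
        if abs(guess - target) <= tol:
            correct += 1
    return correct / len(probes)


def oracle(prompt):
    """Stand-in 'model' that parses the prompt and adds exactly."""
    _, a, _, b = prompt.split()
    return int(a) + int(b)


if __name__ == "__main__":
    probes = make_addition_probes(100, seed=1)
    print("oracle accuracy:", probe_accuracy(oracle, probes))
```

Running the same probes against each encoding variant gives a cheap, comparable score before any patient data is involved.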

Key Takeaways

Naive tokenization of numbers as text strings degrades arithmetic reasoning in transformers, which is critical for clinical applications involving lab values, dosages, and risk scores.
Hybrid encoding strategies that combine discrete token representations with continuous value embeddings consistently outperform pure discrete or pure continuous approaches on synthetic arithmetic tasks.
EHR-specific models should be validated on numerical reasoning probes before deployment; a model's ability to understand "180" as a blood pressure value is not guaranteed by its ability to understand "hypertension" as a concept.
The synthetic arithmetic task methodology provides a reusable benchmark for any team developing clinical transformers, enabling rapid iteration on encoding strategies without requiring access to sensitive patient data.
