Research 2026-06-19

Toten: Knowledge-Based Ontological Tokenization Of Physical Quantities And Technical Notation In Brazilian Portuguese

arXiv:2606.19626v1. Abstract: Byte-Pair Encoding tokenization is statistically efficient for vocabulary compression, but semantically blind to structured technical entities, fragmenting physical quantities, numbers, units, and symbolic expressions into lexically arbitrary...

Tokenization is the silent gatekeeper of language model comprehension. Most practitioners treat it as a solved preprocessing step, but a new preprint from Brazilian researchers exposes a critical blind spot: when tokenizers encounter structured technical notation (units like “kg·m/s²,” numbers with decimal separators, or symbolic expressions), they can fragment it into semantically meaningless pieces. The paper introduces Toten, a knowledge-based ontological tokenizer designed specifically for physical quantities and technical notation in Brazilian Portuguese.

What Happened

The authors argue that Byte-Pair Encoding (BPE), the dominant tokenization algorithm behind models like GPT and Llama, achieves admirable vocabulary compression but is “semantically blind” to structured entities. A physical quantity such as “5.0×10³ N·m” might be split across multiple tokens, breaking the relationship between the coefficient, the exponent, and the unit. Toten replaces this statistical approach with an ontology-driven pipeline: it first identifies technical entities using domain-specific grammars and unit ontologies, then tokenizes them as atomic or structured units that preserve their mathematical and physical meaning. The system is tailored to Brazilian Portuguese, addressing locale-specific notation such as decimal commas, along with variant spellings of the same unit (e.g., “km/h” versus “km·h⁻¹”).
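The two-stage idea described above (detect technical entities first, then keep them whole) can be sketched in a few lines. This is only an illustrative sketch, not Toten's implementation: the `UNIT_ONTOLOGY` table, the regular expressions, and the `pretokenize` function are all hypothetical, and whitespace splitting stands in for the general-purpose subword tokenizer.

```python
import re

# Hypothetical miniature "ontology": unit symbol -> physical dimension.
# A real system would use a far richer unit ontology and grammar.
UNIT_ONTOLOGY = {
    "kg": "mass",
    "km": "length",
    "m": "length",
    "s": "time",
    "h": "time",
    "N": "force",
}

# A number with an optional decimal comma or point, e.g. "9,81" or "5.0".
NUMBER = r"\d+(?:[.,]\d+)?"
# Longest symbols first so "km" is tried before "m".
_symbols = "|".join(sorted(map(re.escape, UNIT_ONTOLOGY), key=len, reverse=True))
# One unit symbol with an optional superscript exponent, e.g. "s²".
UNIT_ATOM = rf"(?:{_symbols})[²³]?"
# Unit atoms joined by "·" or "/", e.g. "kg·m/s²".
UNIT_EXPR = rf"{UNIT_ATOM}(?:[·/]{UNIT_ATOM})*"
# The lookahead rejects ordinary words such as "meses" after a number.
QUANTITY = re.compile(rf"({NUMBER})\s?({UNIT_EXPR})(?![A-Za-zÀ-ÿ])")


def pretokenize(text):
    """Emit each detected quantity as one structured token.

    Text outside quantities is split on whitespace, which is only a
    placeholder for a general-purpose tokenizer such as BPE.
    """
    tokens = []
    pos = 0
    for match in QUANTITY.finditer(text):
        tokens.extend(text[pos:match.start()].split())
        tokens.append(("QUANTITY", match.group(1), match.group(2)))
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens
```

Under these assumptions, `pretokenize("A aceleração é 9,81 m/s² aqui")` keeps the measurement together as `("QUANTITY", "9,81", "m/s²")`, while `"durou 5 meses"` is left as ordinary words because "m" is followed by more letters.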

Why It Matters

This work strikes at a fundamental tension in modern NLP: statistical efficiency versus semantic fidelity. For general text, BPE’s compromises are acceptable because the semantic units are often words or subwords that align reasonably well with token boundaries. But in technical domains such as physics, engineering, chemistry, and finance, the cost of fragmentation is high. A model that cannot reliably parse “9.81 m/s²” as a single acceleration value is likely to struggle with unit conversions, dimensional analysis, or even basic arithmetic reasoning. The problem compounds in retrieval-augmented generation (RAG) systems, where chunking technical documents can split equations and measurements apart.

The focus on Brazilian Portuguese is also strategically important. Most tokenization research centers on English, but technical notation is not language-neutral: decimal separators, unit conventions, and number formatting vary significantly. A tokenizer that works for “1.5 kg” in English may fail on “1,5 kg” in Portuguese. Toten demonstrates that locale-aware tokenization is not a luxury but a necessity for deploying LLMs in scientific and industrial contexts outside the Anglosphere.
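The decimal-separator point can be made concrete with a small parsing sketch. The `parse_decimal` helper and its two-entry convention table are hypothetical and exist only to show how the same string yields different values under different locale assumptions.

```python
from decimal import Decimal


def parse_decimal(text, locale="pt_BR"):
    """Parse a numeric string under a hand-written locale convention.

    The convention table is a hypothetical illustration:
    locale -> (thousands separator, decimal separator).
    """
    conventions = {"pt_BR": (".", ","), "en_US": (",", ".")}
    thousands, decimal_sep = conventions[locale]
    # Drop grouping separators, then normalize the decimal mark to ".".
    normalized = text.replace(thousands, "").replace(decimal_sep, ".")
    return Decimal(normalized)
```

With the Portuguese convention, `parse_decimal("1,5")` gives `Decimal("1.5")`; reading the same string with the English convention treats the comma as a grouping separator and gives `Decimal("15")`, a tenfold error.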

Implications for AI Practitioners

For developers building domain-specific applications (scientific assistants, engineering chatbots, or financial analysis tools), Toten offers a template for moving beyond one-size-fits-all tokenization. The immediate takeaway is that off-the-shelf models may mishandle technical notation, and fine-tuning alone may not fix the root cause. Practitioners should audit their tokenizer’s behavior on domain-specific inputs and consider hybrid approaches: use BPE for general language, but overlay a rule-based or ontology-driven tokenizer for structured technical entities.
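A tokenizer audit of the kind suggested here can start very small: run representative inputs through the tokenizer and count the pieces each one becomes. In the sketch below, `audit` accepts any callable that maps a string to a list of tokens. `naive_subword_tokenizer` is a hypothetical stand-in, not a real BPE; in practice you would pass in your model's own tokenizer.

```python
import re


def naive_subword_tokenizer(text):
    """Stand-in tokenizer, NOT a real BPE.

    Splits text into runs of ASCII letters, runs of ASCII digits, and
    single other non-space characters, to mimic heavy fragmentation.
    """
    return re.findall(r"[A-Za-z]+|[0-9]+|\S", text)


def audit(tokenize, samples):
    """Map each sample to the number of tokens it is split into."""
    return {sample: len(tokenize(sample)) for sample in samples}


if __name__ == "__main__":
    samples = ["9,81 m/s²", "1,5 kg"]
    for sample, count in audit(naive_subword_tokenizer, samples).items():
        print(f"{sample!r} -> {count} tokens")
```

With this stand-in, "9,81 m/s²" becomes seven pieces and "1,5 kg" becomes four. Inputs with high counts relative to their length are candidates for a rule-based or ontology-driven overlay.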

The broader implication is that tokenization is becoming a differentiation point. As models commoditize, the quality of domain adaptation will hinge on how well the input representation captures domain semantics. Toten points toward a future where tokenizers are not black-box statistical tools but engineered components that embed domain knowledge.

Key Takeaways

BPE tokenization fragments physical quantities and technical notation, breaking semantic relationships that models need for accurate reasoning.
Toten uses ontological knowledge and locale-specific rules to tokenize units, numbers, and symbols as coherent entities, demonstrated for Brazilian Portuguese.
Domain-specific applications (science, engineering, finance) require tokenization strategies that preserve technical semantics, not just statistical compression.
Practitioners should evaluate tokenizer behavior on structured technical inputs and consider hybrid pipelines that combine general-purpose and ontology-driven tokenization.

Read Original Article on Arxiv CS.AI
