BeClaude
Research · 2026-06-26

Speaking Numbers to LLMs: Multi-Wavelet Number Embeddings for Time Series Forecasting

Source: arXiv cs.AI

arXiv:2606.26487v1. Abstract: Large language models (LLMs) are attractive for context-aware time series forecasting because they can integrate heterogeneous textual signals, yet their discrete, language-oriented tokenization and embedding interfaces are misaligned with...

The Number Problem: Why LLMs Struggle with Time Series Data

A new preprint on arXiv (2606.26487) tackles a fundamental friction point in applying large language models to time series forecasting: the mismatch between how LLMs process language and how numerical data needs to be processed. The proposed solution, Multi-Wavelet Number Embeddings, aims to bridge this gap by representing numbers not as discrete tokens but as continuous, multi-resolution signals.

The core issue is well known to practitioners. LLMs tokenize numbers into discrete pieces: "123" might become "1", "2", "3" or "12", "3", depending on the tokenizer. This obscures numerical relationships. The numbers 100 and 101 may share no tokens, while 100 and 1000 may share the token "100" even though they differ by an order of magnitude. For time series forecasting, where trends, seasonality, and numerical precision matter, this tokenization is actively harmful.
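To make the mismatch concrete, here is a minimal sketch using a hypothetical toy tokenizer that splits a digit string left to right into fixed-width chunks. Real tokenizers (BPE and similar) learn their splits from data and will behave differently; `chunk_digits` and `shared_tokens` are illustrative names, not part of any library or of the paper.

```python
def chunk_digits(text: str, width: int = 3) -> list[str]:
    """Toy tokenizer (an assumption for illustration): split a digit
    string left to right into chunks of at most `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def shared_tokens(a: int, b: int, width: int = 3) -> set[str]:
    """Return the tokens that the two numbers have in common."""
    return set(chunk_digits(str(a), width)) & set(chunk_digits(str(b), width))


# The same number splits differently under different chunk widths.
print(chunk_digits("123", width=1))  # ['1', '2', '3']
print(chunk_digits("123", width=2))  # ['12', '3']

# Token overlap does not track numeric closeness.
print(shared_tokens(100, 101))   # set()
print(shared_tokens(100, 1000))  # {'100'}
```

Under this toy scheme, two values that differ by 1 look unrelated at the token level, while two values that differ by 900 look partly identical.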

The authors' approach uses wavelet transforms to embed numerical values into a representation that preserves both local and global structure. Wavelets decompose a signal into frequency components at different scales; think of them as mathematical lenses that can zoom in on fine-grained patterns while maintaining awareness of the broader context. Applied to number embeddings, the idea is that the model can distinguish a gradual upward trend in sales figures from random noise, even if the raw token sequences look similar.
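The multi-resolution idea can be illustrated with a small sketch. This is not the paper's construction, which the abstract excerpt above does not specify. It assumes values have already been normalized to [0, 1], picks the Ricker ("Mexican hat") wavelet arbitrarily, and evaluates dilated and shifted copies of it at the value to be embedded. The names `ricker`, `embed_number`, and `distance` are hypothetical.

```python
import math


def ricker(t: float) -> float:
    """Ricker ("Mexican hat") wavelet, without its normalizing constant."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def embed_number(x: float, levels: int = 5) -> list[float]:
    """Illustrative multi-scale embedding of one scalar x in [0, 1].

    Level j places 2**j wavelets of width 2**-j across the unit interval,
    so coarse levels respond to rough magnitude and fine levels respond
    to small differences. This is an assumption-laden sketch, not the
    method from the preprint.
    """
    features = []
    for j in range(levels):
        n = 2 ** j
        for k in range(n):
            # Wavelet at scale 2**-j centered at (k + 0.5) / n.
            features.append(ricker(n * x - (k + 0.5)))
    return features


def distance(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


near = distance(embed_number(0.100), embed_number(0.101))
far = distance(embed_number(0.100), embed_number(0.900))
print(f"near={near:.4f} far={far:.4f}")
```

Because every feature is a smooth function of the input, nearby values map to nearby vectors and distant values map to distant ones, which is the property that discrete digit tokens lack.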

Why This Matters

This research addresses a practical bottleneck. Many organizations want to use LLMs for forecasting because they can incorporate unstructured text—news articles, earnings calls, social media sentiment—alongside numerical data. But current approaches either convert numbers to text (losing precision) or use separate numerical encoders that don't integrate well with the language model's attention mechanisms.

If Multi-Wavelet Number Embeddings prove effective, they could enable truly multimodal forecasting where an LLM simultaneously processes a quarterly earnings report and the associated stock price history, understanding both the narrative and the numbers in a unified representation.

Implications for AI Practitioners

For those building forecasting systems, this work suggests several practical considerations:

  • Tokenization choices matter more than assumed. Default LLM tokenizers are not neutral for numerical data. Practitioners should evaluate how their models handle numbers, especially for tasks requiring fine-grained numerical reasoning.
  • Wavelet-based embeddings offer a principled alternative. Learned embeddings must pick up numerical relationships from training data, whereas a wavelet basis is fixed mathematically and encodes scale and locality by construction. This could reduce the data requirements for fine-tuning forecasting models.
  • The integration of text and numbers remains the hard problem. Even with better number embeddings, the challenge of aligning textual context with numerical patterns persists. This paper tackles one piece of the puzzle, not the whole.
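As a starting point for the evaluation suggested in the first bullet, the sketch below summarizes how a tokenizer fragments formatted numbers. It accepts any callable that maps a string to a list of tokens, so a real tokenizer can be plugged in. The `toy_tokenize` function is a stand-in assumption for demonstration, and `audit_number_tokenization` is a hypothetical helper, not an existing library function.

```python
import re
from collections import Counter
from typing import Callable, Iterable


def audit_number_tokenization(
    tokenize: Callable[[str], list[str]],
    values: Iterable[float],
    fmt: str = "{:.2f}",
) -> dict:
    """Report how many tokens each formatted number is split into."""
    histogram: Counter = Counter()
    examples: dict = {}
    for value in values:
        text = fmt.format(value)
        tokens = tokenize(text)
        histogram[len(tokens)] += 1
        # Keep the first example seen for each token count.
        examples.setdefault(len(tokens), (text, tokens))
    total = sum(histogram.values())
    mean = sum(n * c for n, c in histogram.items()) / total
    return {
        "mean_tokens_per_number": mean,
        "length_histogram": dict(histogram),
        "examples": examples,
    }


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer (an assumption): digit runs of up to three
    characters, with every non-digit character as its own token."""
    return re.findall(r"\d{1,3}|\D", text)


report = audit_number_tokenization(toy_tokenize, [7.25, 99.9, 1234.5])
print(report["length_histogram"])
print(report["examples"])
```

A wide spread in the histogram for values of similar magnitude is a sign that the model sees inconsistent representations of comparable numbers.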

Key Takeaways

  • LLMs' default tokenization obscures numerical relationships, which makes unmodified models poorly suited for time series forecasting
  • Multi-Wavelet Number Embeddings are designed to preserve both the local precision and the global structure of numerical values, offering a mathematically principled alternative
  • This approach could enable unified processing of heterogeneous data (text + numbers) in forecasting applications
  • Practitioners should evaluate their models' numerical handling and consider specialized embeddings for quantitative tasks