Real-Time Hard Negative Sampling via LLM-based Clustering for Large-Scale Two-Tower Retrieval
arXiv:2607.00448v1 (announce type: cross)
Abstract: The two-tower model has been widely used for large-scale recommendation systems, particularly in the retrieval stage. Industry standards for training two-tower models typically involve in-batch and/or out-of-batch negative sampling. However, these...
A Smarter Way to Find Hard Negatives in Two-Tower Retrieval
A new arXiv paper proposes a method for real-time hard negative sampling in two-tower retrieval models, addressing a long-standing bottleneck in large-scale recommendation systems. The core innovation uses LLM-based clustering to dynamically identify and sample hard negatives during training, rather than relying on the static or semi-random approaches common in industry.
What the Research Proposes
The two-tower architecture, in which separate user and item towers each produce an embedding and the two are matched via similarity search, is the backbone of many modern recommendation and retrieval systems. The standard training approach uses in-batch negatives (other items in the same mini-batch) or out-of-batch negatives (randomly sampled items). The problem is that these negatives are often too easy: they don't force the model to learn fine-grained distinctions between similar items.
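To make the in-batch setup concrete, here is a minimal sketch in plain Python of a softmax loss with in-batch negatives. It illustrates the general technique, not the paper's implementation; the function names `dot` and `in_batch_softmax_loss` and the use of raw dot products as similarity scores are assumptions made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors given as lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_vecs, item_vecs):
    """Mean cross-entropy over a batch of (user, positive item) pairs.

    Row i of user_vecs is paired with row i of item_vecs as its positive.
    Every other item in the batch acts as a negative for user i.
    """
    losses = []
    for i, u in enumerate(user_vecs):
        logits = [dot(u, v) for v in item_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the positive item.
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)
```

Because the negatives here are simply whatever other items landed in the batch, most of them score far below the positive and contribute little to the loss, which is the "too easy" problem described above.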
This paper introduces a clustering step powered by large language models (LLMs) to group semantically similar items. During training, the system can then sample hard negatives from the same cluster as the positive item, ensuring the model must learn to differentiate between genuinely confusable candidates. The key advance is that this clustering happens in real time, adapting as the model's embeddings evolve, which the paper presents as a significant improvement over static, precomputed clusters.
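The same-cluster sampling idea can be sketched as follows. This assumes cluster assignments already exist as a plain item-to-cluster mapping; how the paper produces and refreshes them with an LLM is not shown here. All names (`build_cluster_index`, `sample_hard_negatives`, and the arguments) are hypothetical.

```python
import random
from collections import defaultdict


def build_cluster_index(item_to_cluster):
    """Invert an item -> cluster mapping into cluster -> list of items."""
    index = defaultdict(list)
    for item, cluster in item_to_cluster.items():
        index[cluster].append(item)
    return index


def sample_hard_negatives(positive, item_to_cluster, cluster_index, k, rng):
    """Sample up to k items that share the positive item's cluster.

    The positive itself is excluded. If the cluster holds fewer than k
    other items, all of them are returned.
    """
    cluster = item_to_cluster[positive]
    candidates = [item for item in cluster_index[cluster] if item != positive]
    return rng.sample(candidates, min(k, len(candidates)))
```

A caller would build the index once per clustering refresh, then call `sample_hard_negatives` with a `random.Random` instance for each training example. A production version would likely also filter out items the user has interacted with, since a same-cluster item can be a true positive rather than a negative.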
Why This Matters
The practical impact falls on the retrieval stage, which in production recommendation systems often filters millions of candidates down to hundreds. If the model can't distinguish between a user's true preference and a very similar but irrelevant item, the downstream ranking stage is burdened with noise. Hard negative sampling directly targets this discrimination ability.
The use of LLMs for clustering is the distinctive part of the design. Traditional clustering methods (like k-means on embeddings) can fail to capture nuanced semantic relationships. LLMs bring natural language understanding to the clustering process, enabling the system to group items based on conceptual similarity (for example, "horror movies from the 1990s" versus "horror movies with psychological twists") rather than just embedding distance.
Implications for AI Practitioners
For engineers building retrieval systems, this approach offers a practical upgrade path. The method can likely be integrated into existing two-tower pipelines without overhauling the architecture. The main cost is the LLM clustering step, but the paper suggests this can be made efficient enough for real-time use.
However, practitioners should consider the trade-offs. LLM-based clustering adds latency and computational overhead. The approach may be most valuable in domains where items have rich semantic descriptions (e.g., product catalogs, content libraries) rather than purely behavioral signals. Teams should also evaluate whether their current negative sampling strategy is already adequate: if in-batch negatives are already challenging, the marginal benefit may be small.
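One rough way to check whether existing negatives are already challenging is to measure how often a negative scores close to its positive. The sketch below is one possible diagnostic, not something taken from the paper; the function name `hard_negative_rate`, the `margin` parameter, and its default value are assumptions chosen for illustration.

```python
def hard_negative_rate(pos_scores, neg_scores, margin=0.1):
    """Fraction of negatives scoring within `margin` of, or above, their positive.

    pos_scores: one model score per training example (the positive item).
    neg_scores: one list of negative-item scores per training example.
    """
    total = 0
    hard = 0
    for pos, negs in zip(pos_scores, neg_scores):
        for score in negs:
            total += 1
            if score >= pos - margin:
                hard += 1
    return hard / total if total else 0.0
```

A rate near zero on logged training batches would suggest the current negatives are mostly easy and that harder sampling has room to help. A rate that is already high would point the other way. The right threshold depends on the score scale, so `margin` needs tuning per model.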
Key Takeaways
- Dynamic hard negatives via LLM clustering: The method uses real-time LLM-based clustering to sample negatives from the same semantic group as positives, forcing the model to learn finer distinctions.
- Improves two-tower retrieval quality: Better negative sampling directly enhances the model's ability to rank relevant items above confusable but irrelevant ones in the retrieval stage.
- Practical for production but with costs: The approach integrates into existing two-tower pipelines but requires additional compute for LLM clustering, making it best suited for domains with rich semantic item data.
- Addresses a known industry pain point: Hard negative sampling has been a persistent challenge in large-scale recommendation systems, and this paper offers a principled, adaptive solution.