Industry | 2026-06-25

I was curious why MTP affects PP TPS in llama.cpp. My PoC recovers it?

I've been running Qwen3.6-35B-A3B locally on llama.cpp and noticed that prompt processing throughput gets too low with MTP. I got nerd-sniped. I'm not a C++ dev, I know almost nothing about ML, and I'm only scratching the surface of how LLMs work. What started as curiosity turned into...

The Nerd-Snipe That Exposed MTP's Hidden Cost

A developer's curiosity about Multi-Token Prediction (MTP) in llama.cpp has surfaced a performance cost that many AI practitioners may have overlooked. The developer noticed that enabling MTP, a technique in which a model predicts several future tokens at once, lowered prompt processing throughput (PP TPS) when running Qwen3.6-35B-A3B locally. The observation grew into a proof-of-concept patch aimed at recovering that throughput, which suggests that the MTP implementation in llama.cpp adds unexpected overhead during the prompt processing phase.

What Actually Happened

The developer, self-described as "not a C++ dev" and with limited ML knowledge, found that enabling MTP in llama.cpp adds overhead to the prompt processing pipeline and lowers its throughput. The issue appears to stem from how MTP interacts with the batched processing of prompt tokens. In standard inference, prompt processing is highly parallelizable: the model processes prompt tokens in large batches to build the key-value cache. MTP, however, may introduce additional sequential dependencies or redundant computation during this phase, reducing the effective parallelism. The developer's PoC (Proof of Concept) reportedly recovers some of this lost performance, though the exact mechanism remains under discussion.

Why This Matters

MTP is a promising technique for reducing inference latency, particularly when its extra predictions serve as the draft in speculative decoding. Models like Qwen3.6-35B-A3B are designed to leverage MTP for faster generation. If MTP also degrades prompt processing, the phase that accounts for a large portion of latency in interactive applications like chatbots or code assistants, the net benefit of MTP becomes questionable. For practitioners running local models on consumer hardware, a hidden throughput penalty in prompt processing means a longer wait before the model begins generating tokens, which undermines the speed advantage MTP promises.

Implications for AI Practitioners

This finding underscores a critical lesson: inference optimization techniques often have hidden trade-offs. MTP's benefits during token generation may come at the cost of slower prompt ingestion. Practitioners should:

Benchmark both phases separately: Measure prompt processing TPS and generation TPS independently when enabling MTP. A speedup in generation may mask a slowdown in prompt processing.
Consider workload characteristics: For applications with long prompts (e.g., document analysis, RAG), the prompt processing penalty could dominate. For short prompts with long generations, MTP's generation speedup may still win.
Monitor llama.cpp updates: The developer's PoC may lead to upstream fixes. Following the llama.cpp repository for MTP-related performance patches is prudent.
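
The trade-off described in the points above can be made concrete with a little arithmetic. What follows is a minimal sketch, not a measurement: every throughput figure in it is a hypothetical placeholder rather than a result from the original post, and the names Throughput, total_latency, and break_even_gen_tokens are invented for illustration. It assumes that end-to-end latency is simply prompt tokens divided by PP TPS plus generated tokens divided by generation TPS, ignoring model load time and sampling overhead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Throughput:
    """Separately measured throughput for one configuration (tokens/second)."""
    pp_tps: float  # prompt processing
    tg_tps: float  # token generation


def total_latency(t: Throughput, prompt_tokens: int, gen_tokens: int) -> float:
    """Seconds to ingest the prompt and then generate gen_tokens tokens."""
    return prompt_tokens / t.pp_tps + gen_tokens / t.tg_tps


def break_even_gen_tokens(
    base: Throughput, mtp: Throughput, prompt_tokens: int
) -> Optional[float]:
    """Generated-token count above which the MTP configuration is faster overall.

    Returns None if MTP does not speed up generation at all, and 0.0 if MTP
    is not slower on the prompt (so it wins for any generation length).
    """
    extra_prompt_time = prompt_tokens * (1 / mtp.pp_tps - 1 / base.pp_tps)
    saved_per_token = 1 / base.tg_tps - 1 / mtp.tg_tps
    if saved_per_token <= 0:
        return None
    return max(0.0, extra_prompt_time / saved_per_token)


# HYPOTHETICAL placeholder numbers: replace with your own measurements.
BASE = Throughput(pp_tps=1000.0, tg_tps=50.0)
MTP = Throughput(pp_tps=500.0, tg_tps=100.0)

if __name__ == "__main__":
    for prompt, gen in [(8000, 200), (200, 2000)]:
        base_s = total_latency(BASE, prompt, gen)
        mtp_s = total_latency(MTP, prompt, gen)
        print(f"prompt={prompt} gen={gen}: base={base_s:.1f}s mtp={mtp_s:.1f}s")
    print("break-even for an 8000-token prompt:",
          break_even_gen_tokens(BASE, MTP, 8000))
```

With these placeholder figures, an 8000-token prompt needs roughly 800 generated tokens before the MTP configuration comes out ahead, while a 200-token prompt favors MTP almost immediately. To apply the sketch to a real setup, substitute the prompt processing and generation throughput that a tool such as llama-bench reports for each configuration.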

Key Takeaways

MTP in llama.cpp can reduce prompt processing throughput, potentially offsetting its generation speed benefits.
The bottleneck likely stems from reduced parallelism during the prompt processing phase when MTP is active.
AI practitioners should benchmark prompt processing and generation separately when evaluating MTP's net impact.
Community-driven fixes may emerge, highlighting the value of open-source scrutiny in uncovering performance regressions.

Read Original Article on Hacker News
