SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference
arXiv:2606.26587v1 (cross-listed). Abstract: Low-bit floating-point formats and semi-structured sparsity are increasingly supported by modern accelerators, yet combining them for LLM activation compression remains challenging: activations contain input-dependent outliers that dominate block...
What Happened
Researchers have introduced SharQ, a method that combines activation sparsity with FP4 quantization for large language model inference. The paper addresses a persistent bottleneck: modern hardware increasingly supports low-bit floating-point formats (such as FP4) and semi-structured sparsity patterns, but applying both at once to LLM activations has been problematic. The core challenge is that activations contain input-dependent outliers, values far larger than their neighbors. When values are quantized in blocks that share one scale, a single outlier sets the scale for its whole block, so it dominates the quantization error and degrades model quality under aggressive compression.
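To make the outlier problem concrete, the sketch below fake-quantizes a small block to the FP4 E2M1 value grid using one shared scale, once without and once with a single large value. It is an illustration only: the block size, the sample values, and the helper names (`quantize_block_fp4`, `mean_abs_error`) are assumptions chosen for the example, not details taken from the paper.

```python
# Representable magnitudes of FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_fp4(block):
    """Fake-quantize one block: scale so the largest magnitude maps to 6.0,
    round every value to the nearest grid point, then scale back."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / FP4_E2M1_GRID[-1]
    out = []
    for v in block:
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def mean_abs_error(block):
    """Average absolute difference between a block and its quantized version."""
    quantized = quantize_block_fp4(block)
    return sum(abs(a - b) for a, b in zip(block, quantized)) / len(block)


smooth = [0.9, -1.1, 0.4, 1.3, -0.7, 1.0, -1.2, 0.6]
spiky = smooth[:-1] + [48.0]  # same block with one outlier in the last slot

print("error without outlier:", mean_abs_error(smooth))
print("error with outlier:   ", mean_abs_error(spiky))
print("quantized spiky block:", quantize_block_fp4(spiky))
```

With the outlier present, the shared scale becomes so coarse that every ordinary value in the block rounds to zero, and the mean error rises from about 0.075 to about 0.825 for these sample values.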
SharQ proposes a technique to identify and isolate these outlier activations. It applies a mixed-precision scheme in which critical outliers are preserved in higher precision while the majority of activations are compressed to FP4 with 2:4 semi-structured sparsity (at most two nonzero values in every group of four). This selective treatment reportedly reduces the activation memory footprint by up to 4×, and with it the memory traffic, without the severe accuracy loss that naive joint compression would cause.
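The general idea of the split can be sketched as follows. This is a generic illustration under stated assumptions, not SharQ's actual algorithm: the magnitude threshold, the per-tensor scale, the magnitude-based 2:4 pruning rule, and all function names are hypothetical choices made for the example.

```python
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def prune_2_of_4(values):
    """Zero the two smallest-magnitude entries in each consecutive group of four."""
    out = list(values)
    for start in range(0, len(out), 4):
        group = range(start, min(start + 4, len(out)))
        keep = sorted(group, key=lambda i: abs(out[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out


def fake_quant_fp4(values):
    """Round values to the FP4 E2M1 grid using one shared scale (simulation only)."""
    amax = max((abs(v) for v in values), default=0.0)
    if amax == 0.0:
        return list(values)
    scale = amax / FP4_E2M1_GRID[-1]
    out = []
    for v in values:
        mag = min(FP4_E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def compress(values, outlier_threshold):
    """Hypothetical split: entries above the threshold are kept as-is in a side
    table; everything else is 2:4 pruned and then fake-quantized to FP4."""
    outliers = {i: v for i, v in enumerate(values) if abs(v) > outlier_threshold}
    dense = [0.0 if i in outliers else v for i, v in enumerate(values)]
    return fake_quant_fp4(prune_2_of_4(dense)), outliers


def reconstruct(compressed, outliers):
    """Merge the high-precision outliers back into the low-precision values."""
    out = list(compressed)
    for i, v in outliers.items():
        out[i] = v
    return out


x = [0.9, -1.1, 0.4, 48.0, -0.7, 1.0, -1.2, 0.6]
compressed, outliers = compress(x, outlier_threshold=8.0)
restored = reconstruct(compressed, outliers)
print(compressed, outliers, restored)
```

Because the outlier is removed before the scale is computed, the remaining values keep a fine-grained scale, and the low-precision part still satisfies the 2:4 pattern. A real implementation would also have to store the outlier positions compactly, which this sketch ignores.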
Why It Matters
This development is significant for three reasons. First, LLM inference is increasingly memory-bound rather than compute-bound: the bottleneck is often moving data between memory and processing units rather than performing calculations. Activation compression attacks this bottleneck directly by reducing the data that must be transferred. Second, combining sparsity with low-precision quantization is a long-sought goal because each technique alone offers limited gains, while together their savings could in theory multiply. Previous attempts found that activation outliers made the combination impractical, and SharQ appears to offer a practical workaround.
Third, the timing aligns with hardware trends. NVIDIA's Hopper and AMD's CDNA 3 accelerators support FP8 and structured sparsity, and NVIDIA's Blackwell and AMD's CDNA 4 generations add native FP4 tensor operations. SharQ targets a hardware landscape where low-bit floating point and semi-structured sparsity coexist on the same device, and it offers a compression strategy designed to use both.
Implications for AI Practitioners
For engineers deploying LLMs in production, SharQ suggests a path to lower serving cost and latency. The gains are likely largest for long-context and large-batch workloads, because activation volume grows with the number of tokens processed while weight size stays fixed. Practitioners should note that the technique requires careful calibration on representative data to identify outlier channels; this adds a preprocessing step, but one that is already standard practice in quantization workflows.
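The summary does not say how the calibration pass selects outlier channels, so the following is only a sketch of one common heuristic: record each channel's peak magnitude over a calibration set and flag channels whose peak far exceeds the typical one. The function name `find_outlier_channels`, the `ratio` parameter, and the sample data are all hypothetical.

```python
from statistics import median


def find_outlier_channels(calibration_batches, ratio=8.0):
    """Return indices of channels whose peak magnitude over the calibration
    data exceeds `ratio` times the median per-channel peak."""
    num_channels = len(calibration_batches[0][0])
    peak = [0.0] * num_channels
    for batch in calibration_batches:      # each batch is a list of token vectors
        for token in batch:                # each token is a list of channel values
            for c, v in enumerate(token):
                peak[c] = max(peak[c], abs(v))
    typical = median(peak)
    return [c for c, p in enumerate(peak) if p > ratio * typical]


# Toy calibration set: two batches of token vectors with four channels each.
batches = [
    [[0.5, -0.8, 30.0, 0.2], [0.7, 0.6, -42.0, -0.3]],
    [[-0.4, 0.9, 25.0, 0.1]],
]
outlier_channels = find_outlier_channels(batches)
print(outlier_channels)
```

In this toy data only channel 2 is flagged. A fixed, offline channel list like this cannot capture outliers that move with the input, so it should be read as a starting point rather than a description of the paper's procedure.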
However, practical adoption depends on hardware and kernel support. Native FP4 tensor operations are available on recent accelerators such as NVIDIA Blackwell, but older GPUs would have to rely on software emulation or dequantization to FP8/FP16, which limits real-world speedups. Activation sparsity adds a further cost that weight sparsity does not have: the 2:4 pattern must be selected and packed at runtime for every input, and that overhead has to be repaid by the faster matrix multiply. The full payoff therefore depends on hardware and kernels that execute FP4 operations together with sparsity patterns efficiently.
Key Takeaways
- SharQ enables joint activation sparsity and FP4 quantization by selectively preserving outlier activations in higher precision, avoiding the accuracy collapse that previously plagued such combinations.
- The technique can reduce activation memory footprint by up to 4×, directly addressing the memory-bound nature of LLM inference, especially for long-context workloads.
- Practical speedups depend on native FP4 support, which is available on recent accelerators such as NVIDIA Blackwell, and on kernels that handle runtime activation sparsity; on older hardware, software workarounds may limit throughput gains.
- The method adds a calibration step to identify outlier channels, but this is a manageable preprocessing cost similar to existing quantization workflows.
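For a rough sense of where a figure near 4× can come from, the back-of-the-envelope calculation below counts storage bits per activation element under a few layouts. The layout parameters are assumptions for illustration (an 8-bit shared scale per 32-element block and 2 bits of position metadata per kept value in a 2:4 group); the summary does not state how the reported number is computed, and the side storage for preserved outliers is not counted here.

```python
def bits_per_element(value_bits, kept_per_group=4, group=4,
                     index_bits=0, scale_bits=0, block=32):
    """Average storage bits per original element.

    kept_per_group / group : how many values of each group are stored
    index_bits             : position metadata stored per kept value
    scale_bits / block     : shared scale amortized over one block
    """
    payload = kept_per_group * (value_bits + index_bits) / group
    return payload + scale_bits / block


fp16_dense = bits_per_element(16)
fp4_dense = bits_per_element(4, scale_bits=8)
fp4_sparse = bits_per_element(4, kept_per_group=2, index_bits=2, scale_bits=8)

for name, bits in [("FP16 dense", fp16_dense),
                   ("FP4 dense", fp4_dense),
                   ("FP4 + 2:4", fp4_sparse)]:
    print(f"{name}: {bits} bits/element, {fp16_dense / bits:.2f}x smaller than FP16")
```

Under these assumptions, dense FP4 with block scales comes to 4.25 bits per element and FP4 with 2:4 sparsity to 3.25 bits per element, before any overhead for outliers kept in higher precision.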