Research2026-06-24

CAVEWOMAN: How Large Language Models Behave Under Linguistic Input and Output Compression

arXiv:2606.24083v1 Announce Type: cross Abstract: "Talk short. Drop grammar. Save token." This caveman style is widely promoted as a way to cut inference cost, but whether it actually saves anything depends on which channel (the user's prompt or the model's response) is being compressed. We present...

The recent preprint "CAVEWOMAN" from arXiv takes a rigorous, empirical look at a practice that has become almost folk wisdom in the AI community: the “caveman” prompt. The idea is simple—strip articles, drop grammar, use terse keywords—to reduce token counts and thus lower inference costs. The paper’s authors challenge this assumption by asking a critical question: where does the compression actually occur?

The research systematically tests the impact of compressing both the user’s input (the prompt) and the model’s output (the response). The core finding is that the cost savings are highly asymmetric. Aggressively compressing the prompt—using the “caveman” style—often yields minimal token reduction relative to the total conversation length, especially once the model generates a verbose response. Conversely, compressing the output by instructing the model to be brief can produce significant savings, but at the risk of degrading response quality, coherence, and accuracy. The paper introduces a formal framework to measure these trade-offs, showing that the optimal strategy depends heavily on the ratio of input to output tokens in a given task.

Why This Matters

This work is important because it punctures a common but unexamined optimization heuristic. Many practitioners and API users have adopted “caveman” prompting as a default cost-saving measure, often without measuring its actual impact. The paper’s data suggests that for many real-world use cases—particularly those involving long-form generation, analysis, or reasoning—the token savings from a compressed prompt are negligible compared to the output cost. More critically, the study finds that forcing a model to “talk short” can reduce the quality of complex reasoning chains, potentially leading to more errors that require costly retries or human oversight.

The research also highlights a deeper issue: the conflation of prompt engineering with cost optimization. While concise prompts can improve clarity and reduce ambiguity, they are not a reliable substitute for proper output length control. The paper provides a more nuanced toolkit: it suggests that practitioners should profile their specific workload’s input/output ratio before applying compression, and that output compression should be handled via explicit system instructions (e.g., “limit your response to 50 words”) rather than by degrading the prompt’s linguistic structure.

Implications for AI Practitioners

For developers and engineers deploying LLMs at scale, the takeaway is clear: stop applying compression blindly. The paper recommends a two-step approach. First, measure the token distribution of your typical queries and responses. If your use case is prompt-heavy (e.g., few-shot classification with long examples), input compression may help. If it is output-heavy (e.g., summarization, report generation), focus on output length constraints. Second, test the quality impact of any compression strategy. The paper’s methodology provides a template for doing this systematically, using metrics like task accuracy and response coherence alongside token cost.

The “CAVEWOMAN” paper is a timely corrective. It reminds the field that not all optimization heuristics are created equal, and that the cheapest token is not always the best token.

Key Takeaways

Aggressive “caveman” prompt compression often yields minimal cost savings because the output typically dominates total token usage.
Compressing the model’s output (e.g., via length constraints) can reduce costs but may degrade reasoning quality and accuracy.
Practitioners should profile their specific input/output token ratio before choosing a compression strategy.
The paper provides a formal framework for measuring the cost-quality trade-off of linguistic compression, enabling data-driven optimization.

Read Original Article on Arxiv CS.AI

arxivpapers