Research 2026-06-30

SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Originally published by arXiv cs.AI

arXiv:2511.06090v3. Abstract: Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather...

What Happened

This research paper introduces SWE-fficiency, a new benchmark for evaluating language models on real-world software optimization tasks. Unlike prior benchmarks that focus on bug detection or code completion, it tests whether LLMs can optimize existing code repositories for performance, reducing runtime while maintaining correctness. The authors built workloads from actual repositories and measured each model's ability to identify bottlenecks, propose more efficient alternatives, and generate patches that still pass the existing test suites.

The key innovation is the shift from “what to fix” (bug localization) to “how to improve” (performance optimization). The benchmark includes diverse repositories with real computational workloads, requiring models to understand both the codebase structure and the runtime characteristics of specific functions.
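
To make the "faster but still correct" criterion concrete, here is a minimal sketch of such a check. It is not the benchmark's actual harness, which this summary does not describe; the function names, the toy workload, and the choice of best-of-N wall-clock timing are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Sequence


def best_runtime(fn: Callable[[], Any], repeats: int = 5) -> float:
    """Return the fastest wall-clock time (in seconds) over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def evaluate_patch(baseline, patched, workload_input, repeats: int = 5) -> dict:
    """Compare a patched function against the baseline on one workload.

    Hypothetical helper: correctness here means "same output as the
    baseline"; a real repository would run its own test suite instead.
    """
    correct = baseline(workload_input) == patched(workload_input)
    t_base = best_runtime(lambda: baseline(workload_input), repeats)
    t_patch = best_runtime(lambda: patched(workload_input), repeats)
    speedup = t_base / t_patch if t_patch > 0 else float("inf")
    return {"correct": correct, "baseline_s": t_base,
            "patched_s": t_patch, "speedup": speedup}


# Toy stand-ins for "original" and "optimized" repository code (assumed).
def count_duplicates_slow(values: Sequence[int]) -> int:
    # Quadratic: rescans the prefix of the list for every element.
    return sum(1 for i, v in enumerate(values) if v in values[:i])


def count_duplicates_fast(values: Sequence[int]) -> int:
    # Linear: remembers previously seen values in a set.
    seen, duplicates = set(), 0
    for v in values:
        if v in seen:
            duplicates += 1
        else:
            seen.add(v)
    return duplicates


if __name__ == "__main__":
    workload = [i % 500 for i in range(2000)]
    print(evaluate_patch(count_duplicates_slow, count_duplicates_fast, workload))
```

A patch only counts as an improvement in this sketch when `correct` is true and `speedup` is above 1. Taking the minimum over several runs is one common way to reduce timing noise, not the only one.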

Why It Matters

This research addresses a critical gap in AI-assisted software engineering. Current LLM benchmarks overwhelmingly focus on correctness—can the model generate code that compiles and passes tests? But professional software engineering demands more: code that runs efficiently under real-world conditions. A model that can write correct but slow code has limited utility for production systems.

The implications are significant for three reasons:

First, performance optimization requires deeper reasoning than bug fixing. Models must understand algorithmic complexity, data structure trade-offs, and system-level interactions, not just syntax. This benchmark reveals whether current models possess genuine engineering judgment or merely pattern-match common optimizations.

Second, the focus on real repositories with existing test suites creates a more realistic evaluation. Many benchmarks use synthetic problems or isolated functions, which don’t capture the complexity of optimizing code that interacts with other modules, databases, or external APIs.

Third, this work highlights the gap between “code generation” and “code improvement.” Most LLM applications today generate new code from scratch. But in practice, engineers spend more time refactoring and optimizing existing codebases. A model that can efficiently optimize legacy code would be far more valuable than one that writes clean code from scratch.

Implications for AI Practitioners

For teams building AI-assisted development tools, this research suggests several practical considerations:

Evaluation metrics must evolve. Accuracy on code generation benchmarks doesn’t predict performance optimization capability. Teams should develop separate benchmarks for optimization tasks, ideally using their own codebases with real performance constraints.
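
As one possible starting point for an in-house optimization metric, the sketch below aggregates per-task speedups with a geometric mean and scores any patch that breaks the tests as 1.0 (no improvement). Both choices are assumptions for illustration, not the scoring used by the paper, and `aggregate_speedup` is a hypothetical name.

```python
from statistics import geometric_mean
from typing import List, Tuple


def aggregate_speedup(results: List[Tuple[float, bool]]) -> float:
    """Summarize (speedup, tests_passed) pairs as a single score.

    Assumed policy: a patch that fails the tests earns no credit and is
    scored as 1.0, i.e. equivalent to leaving the code unchanged.
    """
    scores = [speedup if passed else 1.0 for speedup, passed in results]
    return geometric_mean(scores)


if __name__ == "__main__":
    # Three hypothetical tasks: a 4x win, no change, and a fast but broken patch.
    print(aggregate_speedup([(4.0, True), (1.0, True), (10.0, False)]))
```

A geometric mean is used here because speedups are ratios: a 2x gain on one task and a 0.5x regression on another cancel out to 1.0, which an arithmetic mean would not show.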

Context window limitations remain a bottleneck. Optimizing a large repository requires understanding the full codebase structure, not just the target function. Current models struggle with the context needed for holistic optimization.

Safety and correctness verification is harder for optimizations. A bug fix either works or doesn’t; an optimization can be correct but suboptimal, or correct for most cases but fail on edge cases. Practitioners need better validation pipelines for performance changes.
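
One way to catch an optimization that is "correct for most cases" is differential testing: run the original and the optimized code on the same inputs, including hand-picked edge cases, and report the first disagreement. The sketch below is a minimal illustration under assumed names; the deliberately buggy `max_optimized_buggy` is a made-up example, not code from the paper.

```python
import random
from typing import Callable, Iterable, Iterator, List, Optional


def generate_cases(seed: int = 0, n_random: int = 200) -> Iterator[List[int]]:
    """Yield hand-picked edge cases first, then seeded random inputs."""
    yield []
    yield [0]
    yield [-1, -1]
    yield [-5, -2, -9]
    rng = random.Random(seed)
    for _ in range(n_random):
        size = rng.randint(1, 20)
        yield [rng.randint(-50, 50) for _ in range(size)]


def find_mismatch(reference: Callable, optimized: Callable,
                  cases: Iterable[List[int]]) -> Optional[List[int]]:
    """Return the first input where the two implementations disagree."""
    for case in cases:
        # Pass copies so neither implementation can mutate the shared input.
        if reference(list(case)) != optimized(list(case)):
            return case
    return None


def max_reference(values: List[int]) -> Optional[int]:
    return max(values) if values else None


def max_optimized_buggy(values: List[int]) -> Optional[int]:
    # Hypothetical "optimized" rewrite that wrongly starts from 0,
    # so it fails whenever every value is negative.
    if not values:
        return None
    best = 0
    for v in values:
        if v > best:
            best = v
    return best


if __name__ == "__main__":
    print(find_mismatch(max_reference, max_optimized_buggy, generate_cases()))
```

Most random inputs contain at least one positive value, so the buggy version passes them; the all-negative edge cases are what expose it. This is why listing edge cases explicitly, rather than relying on random inputs alone, is a sensible design choice for validating performance patches.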

Domain-specific fine-tuning may be necessary. General-purpose LLMs show limited ability to optimize code for specific hardware architectures, database systems, or latency constraints. Specialized models trained on optimization traces could outperform general models.

Key Takeaways

This benchmark shifts focus from code correctness to code performance, revealing that current LLMs struggle with real-world optimization tasks requiring deep engineering reasoning.
Performance optimization is fundamentally harder than bug fixing for AI systems, requiring understanding of algorithmic complexity and system interactions beyond syntax.
AI practitioners should develop separate evaluation pipelines for optimization tasks and invest in context-handling improvements for large codebases.
Domain-specific fine-tuning on optimization traces may be necessary before LLMs can reliably improve production code performance.

