Research · 2026-07-01

Loc2Repair: A Framework for Evaluating the Impact of File-Level Issue Localization in Repo-Level LLM Repair

Originally published by Arxiv CS.AI

arXiv:2606.30963v1 (announce type: cross)

Abstract: Repository-grounded automated repair is often reported as a single end-to-end capability, which hides distinct failure modes such as poor file targeting, incorrect patch synthesis, and failed iterative debugging. We present Loc2Repair, a modular...

Deconstructing the Black Box of Automated Repair

A new research paper, "Loc2Repair," introduces a modular evaluation framework that separates the file localization step from the patch synthesis step in repository-level automated program repair. Isolating these components addresses a problem the abstract names directly: a single end-to-end result hides distinct failure modes. A reported success rate can mask weak file targeting that other parts of the pipeline compensate for. The decomposition lets researchers pinpoint where a system breaks down: whether it fails to identify the correct file, generates an incorrect patch, or struggles with iterative debugging.

Why This Matters

Automated repair results often conflate two fundamentally different capabilities: finding the right file and fixing the bug within it. When repair is reported as a single pass/fail metric, critical failure modes are obscured. For example, a system might achieve high repair rates largely because it has a strong file retriever while its patch generation is weak, or vice versa. Loc2Repair's modular approach provides a diagnostic tool that can attribute performance gains to specific components, enabling targeted improvements rather than blind optimization.

This is particularly important as AI-assisted coding tools move from single-file fixes to repository-level repairs, where the search space expands dramatically. Without this decomposition, practitioners cannot know whether a model is genuinely reasoning about code or merely pattern-matching on file names and function signatures. The framework also highlights the iterative nature of real debugging: many repairs require multiple attempts, and a single pass/fail metric does not record that process.
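One way to make the iterative side of debugging visible is to report how many tasks are solved within a given attempt budget rather than a single pass/fail number. The sketch below is an illustration only, not the metric defined in the paper: the function name `solved_within` and the sample data are hypothetical, and it assumes each task records the 1-indexed attempt at which a passing patch first appeared (or `None` if it never did).

```python
from typing import Optional, Sequence


def solved_within(first_success_attempt: Sequence[Optional[int]], k: int) -> float:
    """Fraction of tasks whose first passing patch came at attempt <= k (1-indexed)."""
    if not first_success_attempt:
        raise ValueError("need at least one task")
    solved = sum(1 for a in first_success_attempt if a is not None and a <= k)
    return solved / len(first_success_attempt)


# Hypothetical data: attempt at which each task was first solved, None if never.
attempts = [1, 3, None, 2, 1, None, 5, 1]

# Success rate as the attempt budget grows.
curve = {k: solved_within(attempts, k) for k in (1, 2, 3, 5)}
print(curve)
```

A curve that rises steeply with the budget suggests the system leans on retries; a flat one suggests that extra attempts add little.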

Implications for AI Practitioners

For developers building automated repair systems, Loc2Repair offers a practical methodology for stress-testing each pipeline stage independently. If your system fails on a repository-level task, you can now determine whether the bottleneck is in retrieval, synthesis, or validation. This shifts the debugging process from guesswork to data-driven optimization.
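A minimal sketch of this kind of stage attribution follows. It is not the Loc2Repair implementation: the `RepairRecord` fields, the outcome labels, and the sample records are all hypothetical, and it assumes you can log, per task, the files the system chose, whether it produced a patch that applies, and whether the tests passed. Here "retrieval" maps to `localization_miss`, "synthesis" to `patch_invalid`, and "validation" to `tests_failed`.

```python
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class RepairRecord:
    """One repair attempt; every field name here is hypothetical."""
    task_id: str
    gold_files: FrozenSet[str]       # files changed by the reference fix
    predicted_files: FrozenSet[str]  # files the system chose to edit
    patch_applied: bool              # did the system emit a patch that applies cleanly?
    tests_passed: bool               # did the patched repo pass the task's tests?


def classify(record: RepairRecord) -> str:
    """Assign a record to the first pipeline stage that explains its outcome."""
    if record.tests_passed:
        return "repaired"
    if not (record.gold_files & record.predicted_files):
        return "localization_miss"  # no overlap with the reference files
    if not record.patch_applied:
        return "patch_invalid"      # right file, but no usable patch
    return "tests_failed"           # right file, patch applies, tests still fail


def stage_breakdown(records: Iterable[RepairRecord]) -> Counter:
    """Count how many records fall into each outcome label."""
    return Counter(classify(r) for r in records)


# Hypothetical log of four tasks.
records = [
    RepairRecord("t1", frozenset({"a.py"}), frozenset({"a.py"}), True, True),
    RepairRecord("t2", frozenset({"b.py"}), frozenset({"c.py"}), True, False),
    RepairRecord("t3", frozenset({"d.py"}), frozenset({"d.py"}), False, False),
    RepairRecord("t4", frozenset({"e.py"}), frozenset({"e.py", "f.py"}), True, False),
]
breakdown = stage_breakdown(records)
print(breakdown)
```

Checking `tests_passed` first is a deliberate choice in this sketch: a fix that passes the tests counts as repaired even if it touched different files from the reference patch.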

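For teams evaluating third-party repair tools, the framework provides a more honest assessment of capability. A single headline repair rate says little about where a tool is strong. If a repair requires both the right file and a correct patch, a tool with 90% file localization accuracy but only 50% patch correctness on correctly localized files would succeed end to end only about 45% of the time, while a tool reporting 80% overall with 90% localization must be patching roughly 89% of localized files correctly. That distinction is critical when deciding whether to trust the tool in production.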
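The relationship between stage-level rates and a headline number can be checked with a little arithmetic. The sketch below rests on a simplifying assumption that is ours, not the paper's: a repair succeeds only when the right file is found and the patch is then correct, unless a nonzero success rate on mislocalized tasks is supplied. The function names are hypothetical.

```python
import math


def end_to_end_rate(localization_acc: float,
                    patch_rate_given_hit: float,
                    patch_rate_given_miss: float = 0.0) -> float:
    """Combine stage-level rates into an end-to-end success rate."""
    return (localization_acc * patch_rate_given_hit
            + (1.0 - localization_acc) * patch_rate_given_miss)


def required_patch_rate(target_rate: float, localization_acc: float) -> float:
    """Patch correctness needed on localized files to reach a target end-to-end rate,
    assuming mislocalized tasks never succeed."""
    if not 0.0 < localization_acc <= 1.0:
        raise ValueError("localization_acc must be in (0, 1]")
    return target_rate / localization_acc


# 90% localization with 50% patch correctness gives about 45% end to end.
combined = end_to_end_rate(0.90, 0.50)

# Reaching 80% end to end with 90% localization needs about 89% patch correctness.
needed = required_patch_rate(0.80, 0.90)

print(f"end to end: {combined:.2f}, patch rate needed for 80%: {needed:.3f}")
```

Under this assumption, a tool with 90% localization and 50% patch correctness lands near 45% end to end, well short of an 80% headline figure.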

The research also underscores a broader lesson: as AI systems become more complex, we must resist the temptation to report aggregate metrics that hide failure modes. The modular evaluation philosophy behind Loc2Repair could extend beyond code repair to other multi-step AI tasks, such as question answering over documents or multi-turn code generation.

Key Takeaways

Isolate failure modes: Loc2Repair shows that end-to-end repair metrics conflate file localization and patch synthesis, hiding which component actually fails.
Targeted debugging: Practitioners can now attribute performance issues to specific pipeline stages, enabling more efficient optimization.
Honest benchmarking: The framework provides a more transparent evaluation method for comparing automated repair systems.
Broader applicability: The modular evaluation approach could be adapted to other multi-step AI tasks where intermediate failures are currently masked by aggregate scores.
