SWE-Doctor: Guiding Software Engineering Agents with Runtime Diagnosis from Multi-Faceted Bug Reproduction Tests
arXiv:2607.00990v1 (announce type: cross)
Abstract: Large language model (LLM)-based software engineering agents are increasingly developed to resolve software issues by generating patches from issue reports and code repositories. Bug reproduction tests (BRTs) are an important building block for such...
What Happened
Researchers have introduced SWE-Doctor, a framework that guides LLM-based software engineering agents with runtime diagnosis from multi-faceted bug reproduction tests (BRTs). Rather than relying on static code analysis alone, the agent executes tests that reproduce the bug and uses the observed runtime behavior (stack traces, variable states, and execution paths) to guide patch generation. BRTs thus shift from being validation tools to being diagnostic instruments that inform the entire debugging pipeline.
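To make "runtime diagnosis" concrete, here is a minimal Python sketch of turning a failing reproduction test into structured facts an agent could read. It is an illustration only, not SWE-Doctor's implementation: `buggy_mean`, `reproduction_test`, and `collect_diagnostics` are hypothetical names, and the report layout is an assumption.

```python
import traceback


def buggy_mean(values):
    # Hypothetical defect: no guard for an empty list.
    total = sum(values)
    return total / len(values)


def reproduction_test():
    # Hypothetical BRT: it fails while the defect is present.
    assert buggy_mean([]) == 0.0


def collect_diagnostics(test_fn):
    """Run a test; on failure, return the call chain and local variables."""
    try:
        test_fn()
    except Exception as exc:
        frames = []
        # Skip this collector's own frame, then walk from the test
        # down to the frame where the exception was raised.
        for frame, lineno in traceback.walk_tb(exc.__traceback__.tb_next):
            frames.append({
                "function": frame.f_code.co_name,
                "line": lineno,
                "locals": {k: repr(v) for k, v in frame.f_locals.items()},
            })
        return {
            "passed": False,
            "error_type": type(exc).__name__,
            "message": str(exc),
            "frames": frames,
        }
    return {"passed": True, "error_type": None, "message": "", "frames": []}


report = collect_diagnostics(reproduction_test)
for entry in report["frames"]:
    print(entry["function"], entry["line"], entry["locals"])
```

The last frame in the report is where the exception was raised, and its captured locals (`values` is an empty list) point at the missing guard rather than at the assertion that surfaced the failure.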
Why It Matters
Current LLM-based coding agents often rely on superficial pattern matching from training data, leading to patches that appear correct syntactically but fail semantically. SWE-Doctor addresses a fundamental limitation: the inability of many agents to understand why a bug occurs, not just where it occurs. By feeding runtime diagnostics back into the agent's reasoning loop, the system can distinguish between root causes and correlated symptoms, reducing false fixes.
This is particularly significant for complex, multi-file bugs where the error manifests in one location but originates elsewhere. Traditional agents might patch the symptom while leaving the root cause intact. SWE-Doctor's multi-faceted BRTs, which include unit tests, integration tests, and regression tests, provide a richer signal for the agent to triangulate the actual defect.
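One way to picture how several tests can triangulate a defect is spectrum-based fault localization with the Ochiai score. This is a standard technique used here purely as an illustration; the source does not say SWE-Doctor uses it. The test names, file locations, and coverage data below are all hypothetical.

```python
import math


def ochiai_scores(coverage, outcomes):
    """Score each code location by how strongly it correlates with failures.

    coverage: test name -> set of locations the test executed
    outcomes: test name -> True if the test passed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    locations = set().union(*coverage.values())
    scores = {}
    for loc in locations:
        failed_cov = sum(
            1 for t, cov in coverage.items() if loc in cov and not outcomes[t]
        )
        passed_cov = sum(
            1 for t, cov in coverage.items() if loc in cov and outcomes[t]
        )
        denom = math.sqrt(total_failed * (failed_cov + passed_cov))
        scores[loc] = failed_cov / denom if denom else 0.0
    return scores


# Hypothetical coverage from three tests at different granularities.
coverage = {
    "unit_parse": {"parser.py:10", "parser.py:22"},
    "integration_load": {"loader.py:5", "parser.py:22", "report.py:40"},
    "regression_report": {"loader.py:5", "report.py:40"},
}
outcomes = {
    "unit_parse": False,
    "integration_load": False,
    "regression_report": True,
}

scores = ochiai_scores(coverage, outcomes)
for loc, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{loc}: {score:.2f}")
```

In this toy data, `parser.py:22` is executed by both failing tests and by no passing test, so it ranks highest. `report.py:40`, where a failure might surface in the integration test, ranks lower because a passing test also executes it.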
The approach also tackles the "overfitting" problem common in LLM-generated patches, where agents produce code that passes specific test cases but breaks other functionality. By using runtime diagnosis to verify that patches don't introduce new failures across the test suite, SWE-Doctor improves patch robustness.
Implications for AI Practitioners
For teams building or using AI coding assistants, SWE-Doctor highlights several actionable insights:
- Test infrastructure matters more than model size. The framework's gains come from better use of existing test suites, not from larger models. Practitioners should invest in comprehensive BRTs before chasing the latest LLM release.
- Runtime feedback loops are underutilized. Most current agents operate on static code representations. Integrating execution traces, even from simple test runs, can improve diagnostic accuracy without requiring architectural changes to the underlying LLM.
- Multi-faceted testing is a force multiplier. A single unit test provides limited signal; combining reproduction tests at different granularities gives agents the context needed to distinguish surface-level bugs from systemic issues.
- Evaluation metrics need updating. SWE-Doctor suggests that patch pass rates on reproduction tests alone are insufficient. Practitioners should also measure whether patches pass broader regression suites and avoid introducing new bugs, both of which runtime diagnosis naturally supports.
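The evaluation point in the last bullet can be sketched as a small metric calculation. This is a hedged illustration, not a metric defined by the paper: `PatchResult`, `summarize`, the metric names, and the sample results are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PatchResult:
    patch_id: str
    brt_passed: bool   # all reproduction tests pass after the patch
    regressions: int   # previously passing tests that now fail


def summarize(results):
    """Compare the BRT-only pass rate with a stricter, regression-aware rate."""
    n = len(results)
    if n == 0:
        return {"brt_pass_rate": 0.0, "robust_pass_rate": 0.0, "overfit_rate": 0.0}
    brt_only = sum(r.brt_passed for r in results)
    robust = sum(r.brt_passed and r.regressions == 0 for r in results)
    return {
        "brt_pass_rate": brt_only / n,
        "robust_pass_rate": robust / n,
        # Patches that satisfy the reproduction tests but break something else.
        "overfit_rate": (brt_only - robust) / n,
    }


# Hypothetical results for four candidate patches.
results = [
    PatchResult("p1", brt_passed=True, regressions=0),
    PatchResult("p2", brt_passed=True, regressions=2),
    PatchResult("p3", brt_passed=False, regressions=0),
    PatchResult("p4", brt_passed=True, regressions=0),
]

summary = summarize(results)
print(summary)
```

In this made-up sample, three of four patches pass the reproduction tests, but only two do so without regressions. The gap between the two rates is the overfitting that a BRT-only metric would hide.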
Key Takeaways
- SWE-Doctor uses runtime execution data from multi-faceted bug reproduction tests to guide LLM agents toward more accurate patches, moving beyond static code analysis.
- The framework reduces overfitting and misdiagnosis by feeding stack traces, variable states, and test outcomes back into the agent's reasoning process.
- For AI practitioners, investing in comprehensive test suites and runtime feedback loops may yield greater improvements than upgrading to larger models.
- The approach underscores that effective AI-assisted debugging requires not just code generation capability, but structured diagnostic infrastructure that mirrors human debugging workflows.