BeClaude
Research · 2026-06-30

SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Originally published by arXiv cs.AI

arXiv:2606.29957v1. Abstract: Most coding-agent benchmarks are static: an agent receives a complete task description up front and is judged only by its final code. Real coding assistance is interactive, with users clarifying goals, adding constraints, and correcting mistakes...

The Shift from Static to Interactive Benchmarks

A new research paper, "SWE-Together: Evaluating Coding Agents in Interactive User Sessions," addresses a fundamental blind spot in current AI coding evaluation. Most existing benchmarks—like SWE-bench or HumanEval—test agents in a static, one-shot manner: give them a complete specification, let them generate code, and score the output. This bears little resemblance to how developers actually use AI coding assistants. In real workflows, users iteratively clarify requirements, add constraints mid-task, and correct the agent's misunderstandings. SWE-Together introduces a dynamic evaluation framework that simulates this interactive process, requiring agents to handle multi-turn conversations where the task evolves based on user feedback.
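To make the contrast concrete, here is a minimal sketch of the two evaluation shapes described above. It is not SWE-Together's actual harness: `ScriptedUser`, `run_static`, `run_session`, and `toy_agent` are hypothetical names invented for illustration, and the "agent" is a stand-in function rather than a real model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Message = Tuple[str, str]               # (role, text)
Agent = Callable[[List[Message]], str]  # maps the history so far to a reply


@dataclass
class ScriptedUser:
    """Hypothetical simulated user that reveals the task one turn at a time."""
    turns: List[str]
    cursor: int = 0

    def next_message(self) -> Optional[str]:
        # Return the next scripted message, or None when the script is done.
        if self.cursor >= len(self.turns):
            return None
        message = self.turns[self.cursor]
        self.cursor += 1
        return message


def run_static(agent: Agent, full_spec: str) -> str:
    """Static shape: one complete specification in, one answer out."""
    return agent([("user", full_spec)])


def run_session(agent: Agent, user: ScriptedUser) -> List[Message]:
    """Interactive shape: the agent replies after every user message."""
    history: List[Message] = []
    while (message := user.next_message()) is not None:
        history.append(("user", message))
        history.append(("agent", agent(history)))
    return history


def toy_agent(history: List[Message]) -> str:
    """Stand-in agent (assumption): restates every user request seen so far."""
    requests = [text for role, text in history if role == "user"]
    return "plan covers: " + "; ".join(requests)


session = run_session(
    toy_agent,
    ScriptedUser(["parse the log file", "actually, make that function async"]),
)
```

In the interactive shape the full transcript is available for scoring, not just the final reply, so an evaluator can check whether the second request was absorbed without losing the first.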

Why This Matters

The gap between static benchmarks and real-world performance has significant consequences. A coding agent that scores 90% on SWE-bench might still be unusable in practice if it cannot adapt when a user says "actually, make that function async" or "I forgot to mention the input needs validation." SWE-Together's methodology forces agents to demonstrate not just code generation ability, but conversational grounding—the capacity to track changing context, ask clarifying questions, and recover from errors. This mirrors the human pair-programming experience far more accurately.

For the AI industry, this represents a maturation of evaluation methodology. Early LLM benchmarks focused on isolated capabilities (reasoning, knowledge recall). Then came agentic benchmarks testing multi-step tasks. Now we are seeing the emergence of interactive benchmarks that test collaboration. This progression suggests that the next frontier for coding agents is not raw coding skill, but human-AI interaction design.

Implications for AI Practitioners

For developers building coding agents: The SWE-Together framework provides a more realistic testing ground. Teams should consider supplementing static benchmarks with interactive evaluations that measure: (1) how well the agent tracks conversation history, (2) its ability to ask for clarification when instructions are ambiguous, and (3) its robustness to mid-task requirement changes. Tools like this can reveal failure modes that static tests miss, for instance agents that confidently overwrite previous work when given new constraints.

For product managers and UX designers: This research underscores that the user experience of coding agents depends heavily on interaction design. An agent that requires perfectly specified prompts to function is not ready for production. Features like undo, confirmation dialogs, and explicit state tracking become critical. The benchmark's structure suggests that successful products will need to support "conversational debugging," where users can point to specific lines of generated code and say "fix this part."

For the research community: SWE-Together highlights the need for more dynamic evaluation datasets. Static benchmarks are cheaper to create but may be actively misleading about real-world performance. Future work should explore whether performance on interactive benchmarks correlates better with user satisfaction than current static metrics.
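The three measurements suggested for developers could be computed from an annotated session transcript roughly as follows. This is a sketch under stated assumptions, not the paper's scoring method: the `Turn` record, its fields, and the metric names `retention`, `clarification`, and `adoption` are all hypothetical, and the sketch assumes each turn has already been labeled (by tests or annotators) with the requirements the agent's code satisfies.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Turn:
    """Hypothetical per-turn annotation of an interactive session."""
    new_requirement: str        # requirement the user introduced this turn
    ambiguous: bool             # label: was the instruction ambiguous?
    asked_clarification: bool   # did the agent ask before acting?
    satisfied: FrozenSet[str]   # requirements the code meets after this turn


def score_session(turns: List[Turn]) -> Dict[str, float]:
    seen = set()                # requirements introduced in earlier turns
    retained = adopted = asked = ambiguous = 0
    for turn in turns:
        # (1) History tracking: all earlier requirements still hold.
        if seen <= turn.satisfied:
            retained += 1
        # (3) Robustness to change: the newly added requirement is met.
        if turn.new_requirement in turn.satisfied:
            adopted += 1
        # (2) Clarification: count questions only on ambiguous turns.
        if turn.ambiguous:
            ambiguous += 1
            asked += int(turn.asked_clarification)
        seen.add(turn.new_requirement)
    n = len(turns)
    return {
        "retention": retained / n if n else 0.0,
        "clarification": asked / ambiguous if ambiguous else 1.0,
        "adoption": adopted / n if n else 0.0,
    }


# Illustrative data: the third turn overwrites earlier work.
turns = [
    Turn("parse csv", False, False, frozenset({"parse csv"})),
    Turn("validate input", True, True,
         frozenset({"parse csv", "validate input"})),
    Turn("make async", False, False, frozenset({"make async"})),
]
scores = score_session(turns)
```

On this made-up session the agent adopts every new requirement but drops the earlier ones on the last turn, so `retention` falls below 1.0 while `adoption` does not. That is the "confidently overwrites previous work" failure mode that a final-code-only check on the last requirement would miss.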

Key Takeaways

  • SWE-Together introduces a new evaluation paradigm that tests coding agents in multi-turn, interactive sessions where task requirements evolve based on user input, unlike static one-shot benchmarks.
  • The research highlights a critical gap: agents that excel on static benchmarks may still fail in real-world scenarios requiring conversational adaptation and error recovery.
  • For practitioners, this means prioritizing interaction design features (undo, clarification requests, state tracking) and incorporating interactive testing into development pipelines.
  • The shift toward interactive benchmarks signals that the coding agent market is maturing from raw capability competition to usability and collaboration quality.