Who Gets the Reward & Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents
arXiv:2511.10687v3. Abstract: Large Language Models (LLMs) in multi-agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system-level evaluation with agent- and message-level learning. We propose a...
The Credit Assignment Problem in Multi-Agent LLM Systems
A new preprint on arXiv tackles one of the most stubborn challenges in multi-agent AI systems: how to attribute credit and blame when multiple LLMs collaborate. The paper proposes a framework for connecting system-level evaluation (did the team succeed or fail?) to individual agent actions and even to the specific messages that contributed to the outcome.
Today's training methods for multi-agent LLM systems are blunt instruments. Developers often treat the entire agent team as a black box, applying the same reinforcement learning signal to every agent or relying on hand-crafted heuristics to adjust individual agents. This approach fails when a single misleading message from one agent derails the whole system, or when a valuable insight from a subordinate agent goes unrewarded because the final output was mediocre.
The proposed method offers a principled way to propagate evaluation signals backward through the agent communication graph. By decomposing system-level rewards into per-agent and per-message contributions, the framework enables more granular training. This is loosely analogous to backpropagation in neural networks, which assigns each parameter its share of responsibility for the final loss, except that here the signal must pass through discrete, symbolic interactions between LLM agents rather than differentiable layers.
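One generic way to turn a single system-level score into per-message and per-agent signals is leave-one-out (counterfactual) attribution: re-score the conversation with each message removed and credit the message with the change in score. The sketch below illustrates that idea only and is not claimed to be the paper's method. `Message`, `toy_evaluate`, `leave_one_out_credit`, and the `tag` labels are hypothetical names invented for this example, and the toy evaluator stands in for whatever system-level evaluation a real pipeline would use.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass(frozen=True)
class Message:
    msg_id: str
    agent: str
    tag: str  # hypothetical label: "helpful", "misleading", or "neutral"


def toy_evaluate(messages: Sequence[Message]) -> float:
    """Toy system-level score (an assumption for illustration only).

    Any misleading message derails the run (score 0.0); otherwise each
    helpful message adds 0.5, capped at 1.0.
    """
    if any(m.tag == "misleading" for m in messages):
        return 0.0
    helpful = sum(1 for m in messages if m.tag == "helpful")
    return min(1.0, 0.5 * helpful)


def leave_one_out_credit(
    messages: Sequence[Message],
    evaluate: Callable[[Sequence[Message]], float],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Credit each message with (full score) - (score without that message)."""
    full_score = evaluate(messages)
    per_message: Dict[str, float] = {}
    per_agent: Dict[str, float] = defaultdict(float)
    for i, msg in enumerate(messages):
        without = list(messages[:i]) + list(messages[i + 1:])
        credit = full_score - evaluate(without)
        per_message[msg.msg_id] = credit
        per_agent[msg.agent] += credit  # agent credit = sum over its messages
    return per_message, dict(per_agent)


conversation = [
    Message("m1", "planner", "helpful"),
    Message("m2", "coder", "helpful"),
    Message("m3", "reviewer", "misleading"),
]
per_message, per_agent = leave_one_out_credit(conversation, toy_evaluate)
print(per_message)
print(per_agent)
```

In this toy run the reviewer's misleading message receives all of the blame, while the two helpful messages get zero credit because removing either one alone does not rescue the failed run. That blind spot, together with the one re-evaluation per message that leave-one-out requires, hints at why finer-grained attribution is a research problem rather than a solved one.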
Why This Matters
The implications are significant for three reasons. First, the work addresses a fundamental bottleneck in scaling multi-agent systems. As organizations deploy teams of specialized LLMs for tasks like software development, legal analysis, or scientific research, the ability to fine-tune individual agents based on collective outcomes becomes critical. Without this capability, debugging and improving multi-agent systems remains an art rather than a science.
Second, the approach could reduce the data and compute requirements for training effective multi-agent systems. Instead of needing extensive labeled data for every possible agent interaction, developers could leverage system-level outcomes—which are often easier to obtain—to generate training signals for individual components.
Third, this work highlights a growing recognition that multi-agent LLM systems are not just about better prompting or tool use. They require fundamentally new training paradigms that account for emergent behaviors, information bottlenecks, and coordination failures that don't exist in single-agent settings.
Implications for AI Practitioners
For engineers building multi-agent systems today, this research suggests several practical considerations. Teams should design their agent architectures with credit assignment in mind—maintaining clear message provenance, logging intermediate outputs, and structuring agent roles to make attribution tractable. The paper also implies that current approaches using uniform reward signals across all agents may be leaving significant performance on the table.
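As a rough illustration of what "clear message provenance" could mean in practice, the sketch below logs each message together with the earlier messages it was written in response to, so that the chain leading to any output can be traced later. Everything here is an assumption made for the example: `MessageRecord`, `ProvenanceLog`, and the `parent_ids` field are hypothetical names, not structures described in the paper.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class MessageRecord:
    msg_id: str
    sender: str
    content: str
    parent_ids: Tuple[str, ...] = ()  # messages this one was produced from


class ProvenanceLog:
    """Append-only log that can trace which messages fed into a given one."""

    def __init__(self) -> None:
        self._records: Dict[str, MessageRecord] = {}

    def add(self, record: MessageRecord) -> None:
        # Reject dangling references so every chain can be followed to its root.
        for pid in record.parent_ids:
            if pid not in self._records:
                raise KeyError(f"unknown parent message: {pid}")
        self._records[record.msg_id] = record

    def ancestors(self, msg_id: str) -> Set[str]:
        """Return the ids of all messages upstream of msg_id."""
        visited: Set[str] = set()
        stack = list(self._records[msg_id].parent_ids)
        while stack:
            pid = stack.pop()
            if pid in visited:
                continue
            visited.add(pid)
            stack.extend(self._records[pid].parent_ids)
        return visited


log = ProvenanceLog()
log.add(MessageRecord("m1", "planner", "Split the task into two steps."))
log.add(MessageRecord("m2", "coder", "Implemented step one.", ("m1",)))
log.add(MessageRecord("m3", "reviewer", "Step one looks correct.", ("m2",)))
log.add(MessageRecord("m4", "researcher", "Unrelated background note."))

upstream_of_review = log.ancestors("m3")
print(sorted(upstream_of_review))
```

With a log like this, any attribution scheme has a defined set of upstream messages to consider for a given output, and messages that never fed into it (such as `m4` here) can be excluded before any credit is computed.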
However, practitioners should note that this is still early-stage research. The computational overhead of computing per-message credit assignments, especially in systems with many agents and long conversation histories, could be substantial. Additionally, the framework assumes a relatively stable agent topology, which may not hold in dynamic systems where agents are added or removed mid-task.
Key Takeaways
- Granular credit assignment is the next frontier for multi-agent LLM systems, moving beyond treating agent teams as monolithic black boxes during training.
- System-level outcomes can be decomposed into per-agent and per-message training signals, enabling more targeted improvements without requiring exhaustive labeled data.
- Architecture matters for trainability—practitioners should design agent communication patterns and logging systems that facilitate future credit attribution.
- The approach is promising but unproven at scale; computational costs and dynamic agent topologies remain open challenges for real-world deployment.