Research 2026-05-06
Automated Interpretability and Feature Discovery in Language Models with Agents
Source: arXiv cs.AI
arXiv:2605.01555v1 Announce Type: cross
Abstract: We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent...
arxiv, papers, agents