Research · 2026-05-06

Automated Interpretability and Feature Discovery in Language Models with Agents

arXiv:2605.01555v1 Announce Type: cross

Abstract: We introduce an autonomous multiagent framework for mechanistic interpretability that automates both explaining and finding internal features in large language models. The system runs two coupled loops: (1) explanation refinement, where an agent...

Read Original Article on Arxiv CS.AI

arxiv, papers, agents