ADAG: Automatically Describing Attribution Graphs
Abstract: In language model interpretability research, \textbf{circuit tracing} aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit-tracing work has relied on ad hoc human interpretation of the role each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the feature activates on. We introduce \textbf{ADAG}, a fully automated end-to-end pipeline for describing these attribution graphs. To achieve this, we introduce \textit{attribution profiles}, which quantify the functional role of a feature via its input and output gradient effects. We then develop a novel clustering algorithm for grouping features, and an LLM explainer–simulator setup that generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show that ADAG can find steerable clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct.

Subjects: Computation and Language (cs.CL)
ACM classes: I.2.7
Cite as: arXiv:2604.07615 [cs.CL] (or arXiv:2604.07615v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.07615 (arXiv-issued DOI via DataCite, pending registration)

Submission history
From: Aryaman Arora
[v1] Wed, 8 Apr 2026 21:34:37 UTC (2,880 KB)
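The abstract does not spell out how attribution profiles or the clustering step are computed. The sketch below is one plausible reading, not the paper's method: it assumes a profile is the concatenation of a feature's input-side and output-side gradient effects over the nodes of the attribution graph, and it substitutes off-the-shelf cosine-distance agglomerative clustering for the paper's own (unspecified) clustering algorithm. All names and sizes are illustrative.

```python
# Hypothetical sketch of "attribution profiles" and feature clustering.
# Assumptions (not from the paper): a profile concatenates a feature's
# gradient effects from upstream nodes and onto downstream nodes, and
# scipy's agglomerative clustering stands in for the paper's algorithm.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

n_features, n_nodes = 32, 100  # toy sizes, purely illustrative

# Assumed inputs: for each feature, the gradient-based effect of every
# upstream node on it (input side) and its effect on every downstream
# node (output side), i.e. edge weights in the attribution graph.
input_effects = rng.normal(size=(n_features, n_nodes))
output_effects = rng.normal(size=(n_features, n_nodes))

# Attribution profile: concatenate both sides and L2-normalise, so that
# clustering compares the direction of a feature's causal role rather
# than its magnitude.
profiles = np.concatenate([input_effects, output_effects], axis=1)
profiles /= np.linalg.norm(profiles, axis=1, keepdims=True)

# Group features whose profiles point the same way (cosine distance,
# average-linkage agglomerative clustering, threshold chosen ad hoc).
dists = pdist(profiles, metric="cosine")
tree = linkage(dists, method="average")
labels = fcluster(tree, t=0.5, criterion="distance")

for c in np.unique(labels):
    members = np.flatnonzero(labels == c)
    print(f"cluster {c}: features {members.tolist()}")
```

An explainer–simulator loop in the style of prior automated-interpretability work could then label each cluster: an explainer LLM proposes a natural-language description from the cluster's profile and top activating examples, and the description is scored by how well a simulator LLM predicts held-out activations from it. Whether ADAG scores explanations this way is an assumption here.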