2025 "model interpretability" Papers

16 papers found

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution

Fengyuan Liu, Nikhil Kandpal, Colin Raffel

ICLR 2025posterarXiv:2411.15102
12
citations

Cognitive Mirrors: Exploring the Diverse Functional Roles of Attention Heads in LLM Reasoning

Xueqi Ma, Jun Wang, Yanbei Jiang et al.

NeurIPS 2025posterarXiv:2512.10978
1
citations

Concept Bottleneck Language Models For Protein Design

Aya Ismail, Tuomas Oikarinen, Amy Wang et al.

ICLR 2025posterarXiv:2411.06090
13
citations

Data-centric Prediction Explanation via Kernelized Stein Discrepancy

Mahtab Sarvmaili, Hassan Sajjad, Ga Wu

ICLR 2025posterarXiv:2403.15576
2
citations

Dataset Distillation for Pre-Trained Self-Supervised Vision Models

George Cazenavette, Antonio Torralba, Vincent Sitzmann

NeurIPS 2025posterarXiv:2511.16674

Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning

Chen Qian, Dongrui Liu, Hao Wen et al.

NeurIPS 2025arXiv:2506.02867
22
citations

Discovering Influential Neuron Path in Vision Transformers

Yifan Wang, Yifei Liu, Yingdong Shi et al.

ICLR 2025posterarXiv:2503.09046
4
citations

From Search to Sampling: Generative Models for Robust Algorithmic Recourse

Prateek Garg, Lokesh Nagalapatti, Sunita Sarawagi

ICLR 2025posterarXiv:2505.07351
2
citations

LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching

Zhuo Cao, Xuan Zhao, Lena Krieger et al.

NeurIPS 2025posterarXiv:2510.14623
1
citations

Looking Inward: Language Models Can Learn About Themselves by Introspection

Felix Jedidja Binder, James Chua, Tomek Korbak et al.

ICLR 2025oralarXiv:2410.13787
40
citations

Manipulating Feature Visualizations with Gradient Slingshots

Dilyara Bareeva, Marina Höhne, Alexander Warnecke et al.

NeurIPS 2025posterarXiv:2401.06122
6
citations

Register and [CLS] tokens induce a decoupling of local and global features in large ViTs

Alexander Lappe, Martin Giese

NeurIPS 2025poster

SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries

Darin Tsui, Aryan Musharaf, Yigit Efe Erginbas et al.

NeurIPS 2025posterarXiv:2410.19236
2
citations

Smoothed Differentiation Efficiently Mitigates Shattered Gradients in Explanations

Adrian Hill, Neal McKee, Johannes Maeß et al.

NeurIPS 2025poster

Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties

Gouki Minegishi, Hiroki Furuta, Takeshi Kojima et al.

NeurIPS 2025posterarXiv:2506.05744
13
citations

Unveiling Concept Attribution in Diffusion Models

Nguyen Hung-Quang, Hoang Phan, Khoa D Doan

NeurIPS 2025posterarXiv:2412.02542
4
citations