"interpretability" Papers

12 papers found

Enhancing Uncertainty Estimation and Interpretability with Bayesian Non-negative Decision Layer

Xinyue Hu, Zhibin Duan, Bo Chen et al.

ICLR 2025 poster arXiv:2505.22199
2 citations

Interpreting the linear structure of vision-language model embedding spaces

Isabel Papadimitriou, Huangyuan Su, Thomas Fel et al.

COLM 2025 paper arXiv:2504.11695
14 citations

One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs

Jacob Dunefsky, Arman Cohan

COLM 2025 paper arXiv:2502.18862
8 citations

On Mechanistic Circuits for Extractive Question-Answering

Samyadeep Basu, Vlad I Morariu, Ryan A. Rossi et al.

COLM 2025 paper arXiv:2502.08059

On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions

Dang Nguyen, Chenhao Tan

COLM 2025 paper arXiv:2504.06303
3 citations

Probing then Editing Response Personality of Large Language Models

Tianjie Ju, Zhenyu Shao, Bowen Wang et al.

COLM 2025 paper arXiv:2504.10227
3 citations

The Dual-Route Model of Induction

Sheridan Feucht, Eric Todd, Byron C Wallace et al.

COLM 2025 paper arXiv:2504.03022
15 citations

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2025 poster arXiv:2405.08366
63 citations

Truth-value judgment in language models: ‘truth directions’ are context sensitive

Stefan F. Schouten, Peter Bloem, Ilia Markov et al.

COLM 2025 paper

UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?

Mukund Choudhary, KV Aditya Srivatsa, Gaurja Aeron et al.

COLM 2025 paper

CF-OPT: Counterfactual Explanations for Structured Prediction

Germain Vivier-Ardisson, Alexandre Forel, Axel Parmentier et al.

ICML 2024 poster arXiv:2405.18293

Revisiting Document-Level Relation Extraction with Context-Guided Link Prediction

Monika Jain, Raghava Mutharaju, Ramakanth Kavuluru et al.

AAAI 2024 paper arXiv:2401.11800
17 citations