"interpretability" Papers
12 papers found
Enhancing Uncertainty Estimation and Interpretability with Bayesian Non-negative Decision Layer
Xinyue Hu, Zhibin Duan, Bo Chen et al.
ICLR 2025, poster, arXiv:2505.22199
2 citations
Interpreting the linear structure of vision-language model embedding spaces
Isabel Papadimitriou, Huangyuan Su, Thomas Fel et al.
COLM 2025, paper, arXiv:2504.11695
14 citations
One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
Jacob Dunefsky, Arman Cohan
COLM 2025, paper, arXiv:2502.18862
8 citations
On Mechanistic Circuits for Extractive Question-Answering
Samyadeep Basu, Vlad I Morariu, Ryan A. Rossi et al.
COLM 2025, paper, arXiv:2502.08059
On the Effectiveness and Generalization of Race Representations for Debiasing High-Stakes Decisions
Dang Nguyen, Chenhao Tan
COLM 2025, paper, arXiv:2504.06303
3 citations
Probing then Editing Response Personality of Large Language Models
Tianjie Ju, Zhenyu Shao, Bowen Wang et al.
COLM 2025, paper, arXiv:2504.10227
3 citations
The Dual-Route Model of Induction
Sheridan Feucht, Eric Todd, Byron C Wallace et al.
COLM 2025, paper, arXiv:2504.03022
15 citations
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda
ICLR 2025, poster, arXiv:2405.08366
63 citations
Truth-value judgment in language models: ‘truth directions’ are context sensitive
Stefan F. Schouten, Peter Bloem, Ilia Markov et al.
COLM 2025, paper
UNVEILING: What Makes Linguistics Olympiad Puzzles Tricky for LLMs?
Mukund Choudhary, KV Aditya Srivatsa, Gaurja Aeron et al.
COLM 2025, paper
CF-OPT: Counterfactual Explanations for Structured Prediction
Germain Vivier-Ardisson, Alexandre Forel, Axel Parmentier et al.
ICML 2024, poster, arXiv:2405.18293
Revisiting Document-Level Relation Extraction with Context-Guided Link Prediction
Monika Jain, Raghava Mutharaju, Ramakanth Kavuluru et al.
AAAI 2024, paper, arXiv:2401.11800
17 citations