"activation patching" Papers
2 papers found
Mechanistic Interpretability Meets Vision Language Models: Insights and Limitations
Yiming Liu, Yuhui Zhang, Serena Yeung
ICLR 2025poster
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
NeurIPS 2025posterarXiv:2406.14144
24
citations