Papers matching "activation steering"
2 papers found
Conference
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
Zouying Cao, Yifei Yang, Hai Zhao
AAAI 2025, arXiv:2408.11491
23 citations
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey, David Evans
COLM 2025
10 citations