CARE: Decoding-Time Safety Alignment via Rollback and Introspection Intervention


Abstract

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring the safety of their outputs during decoding has become a critical challenge. However, existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. In this work, we propose CARE, a novel framework for decoding-time safety alignment that integrates three key components: (1) a guard model for real-time safety monitoring, enabling detection of potentially unsafe content; (2) a rollback mechanism with a token buffer that corrects unsafe outputs efficiently at an early stage without disrupting the user experience; and (3) a novel introspection-based intervention strategy, in which the model generates self-reflective critiques of its previous outputs and incorporates these reflections into the context to guide subsequent decoding steps. The framework achieves a superior safety-quality trade-off by using the guard model for precise interventions, the rollback mechanism for timely corrections, and the introspection method for effective self-correction. Experimental results demonstrate that our framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience while maintaining high response quality.
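For concreteness, the following is a minimal Python sketch of the decoding loop the abstract describes. The interfaces (generate_token, detokenize, tokenize, self_critique, is_unsafe) and the buffer policy are assumptions for illustration; the paper's actual implementation is not specified here.

```python
def care_decode(model, guard, prompt_tokens, max_tokens=256, buffer_size=16):
    """Sketch of a CARE-style decoding loop: guard monitoring, rollback
    over a token buffer, and introspection-based re-steering.

    `model` and `guard` are hypothetical objects; their methods below
    are placeholders, not a real library API.
    """
    context = list(prompt_tokens)  # tokens already committed (visible to user)
    buffer = []                    # recent tokens held back from the user

    while len(context) + len(buffer) < max_tokens:
        token = model.generate_token(context + buffer)
        buffer.append(token)

        # (1) Guard model monitors the buffered continuation in real time.
        if guard.is_unsafe(context, buffer):
            # (2) Rollback: discard the unsafe buffered tokens before they
            #     are streamed out, so the user experience is preserved.
            unsafe_text = model.detokenize(buffer)
            buffer = []

            # (3) Introspection: the model critiques its own unsafe draft,
            #     and the critique is appended to the context to steer
            #     subsequent decoding steps away from the unsafe content.
            critique = model.self_critique(unsafe_text)
            context += model.tokenize(critique)
            continue

        # Tokens older than the buffer window are committed (streamed out).
        while len(buffer) > buffer_size:
            context.append(buffer.pop(0))

    return context + buffer
```

The key design point this sketch illustrates is that the guard operates on a held-back buffer rather than on already-streamed text, which is what allows corrections to happen "at an earlier stage" without visible retractions.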
