Scalable Extraction of Training Data from Aligned, Production Language Models

ICLR 2025
Abstract

Large language models are prone to memorizing some of their training data. Memorized (and possibly sensitive) samples can then be extracted at generation time by adversarial or benign users. There is hope that model alignment (a standard training process that tunes a model to harmlessly follow user instructions) would mitigate the risk of extraction. However, we develop two novel attacks that undo a language model's alignment and recover thousands of training examples from popular proprietary aligned models such as OpenAI's ChatGPT. Our work highlights the limitations of existing safeguards in preventing training data leakage from production language models.
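
As a rough illustration of what "extraction" means operationally, the sketch below checks whether a model generation reproduces a long span of a reference corpus verbatim. It is a minimal, hypothetical harness that assumes you already have generations and candidate training text in hand; the n-gram length, the helper names, and the toy data are placeholders, and this is not the attack developed in the paper.

```python
# Illustrative only: a simple verbatim-overlap check between model outputs and a
# reference text corpus. This is NOT the paper's attack; it only sketches how one
# might decide whether a generation reproduces training text word-for-word.

def ngrams(tokens, n):
    """Return the set of all contiguous n-grams of a token list as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_memorized(output: str, corpus: list[str], n: int = 50) -> bool:
    """Flag an output if it shares any n-token span verbatim with the corpus.

    The n-token threshold is a tunable assumption, not a number taken from the
    abstract; longer thresholds make the check more conservative.
    """
    out_grams = ngrams(output.split(), n)
    for doc in corpus:
        if out_grams & ngrams(doc.split(), n):
            return True
    return False

if __name__ == "__main__":
    # Placeholder data standing in for model generations and known training text.
    reference_corpus = ["the quick brown fox jumps over the lazy dog " * 20]
    model_outputs = ["unrelated text", reference_corpus[0][:400]]
    for out in model_outputs:
        print(is_memorized(out, reference_corpus, n=10))
```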
