SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

1 citation · #1167 of 5858 papers in NeurIPS 2025 · 7 authors

Abstract

Watermarking LLM-generated text is critical for content attribution and misinformation prevention, yet existing methods compromise text quality and require white-box model access with logit manipulation or training, excluding API-based models and multilingual scenarios. We propose SAEMark, an inference-time framework for multi-bit watermarking that embeds personalized information through feature-based rejection sampling. The approach is fundamentally different from logit-based or rewriting-based methods: it does not modify model outputs directly and requires only black-box access, while naturally supporting multi-bit message embedding and generalizing across diverse languages and domains. We instantiate the framework using Sparse Autoencoders as deterministic feature extractors and provide a theoretical worst-case analysis relating watermark accuracy to computational budget. Experiments across 4 datasets demonstrate strong watermarking performance on English, Chinese, and code while preserving text quality. SAEMark establishes a new paradigm for scalable, quality-preserving watermarks that work seamlessly with closed-source LLMs across languages and domains.
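The core mechanism described above, feature-based rejection sampling with a deterministic feature extractor, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the SAE feature extractor is replaced by a hash-based 1-bit feature, and `sample_candidate` stands in for a black-box LLM API call; all names here are hypothetical.

```python
import hashlib
import random


def extract_bit(text: str) -> int:
    # Deterministic feature extractor. The paper uses Sparse Autoencoder
    # features; a hash parity is used here purely as a reproducible stand-in.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return digest[0] & 1


def sample_candidate(prompt: str, rng: random.Random) -> str:
    # Stand-in for a black-box LLM: prompt plus a random continuation.
    # No logits are modified, matching the black-box-access setting.
    suffix = "".join(rng.choice("abcdefgh") for _ in range(8))
    return f"{prompt} {suffix}"


def embed_bit(prompt: str, bit: int, budget: int = 64, seed: int = 0):
    # Feature-based rejection sampling: resample candidates until the
    # extracted feature bit equals the message bit, up to `budget` tries.
    # A larger budget raises the worst-case embedding success probability,
    # echoing the paper's accuracy-vs-compute trade-off.
    rng = random.Random(seed)
    for _ in range(budget):
        candidate = sample_candidate(prompt, rng)
        if extract_bit(candidate) == bit:
            return candidate
    return None  # budget exhausted without a matching candidate


def detect_bit(text: str) -> int:
    # Detection reruns the same deterministic extractor on the final text,
    # so no model access is needed at detection time.
    return extract_bit(text)
```

In this toy setup each candidate matches the target bit with probability about 1/2, so a budget of 64 succeeds almost surely; multi-bit messages would repeat the procedure per text segment.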

Citation History

- Jan 26, 2026: 0
- Jan 26, 2026: 1 (+1)
- Jan 27, 2026: 1