Wayward Concepts In Multimodal Models

0 citations

#2008 of 3827 papers in ICLR 2025

Authors

Brandon Trabucco, Max Gurinas, Kyle Doherty, Russ Salakhutdinov

Abstract

Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after optimizing just the prompt. How are prompt embeddings for visual concepts found by prompt tuning methods different from typical discrete prompts? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that prompts optimized to represent new visual concepts are akin to an adversarial attack on the text encoder. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball of any prompt that reprogram models to generate, detect, and classify arbitrary subjects. These perturbations target the final layers in text encoders and steer pooling tokens towards the subject. We explore the transferability of these prompts and find that perturbations reprogramming multimodal models are initialization-specific and model-specific. Code for reproducing our work is available at the following site: https://wayward-concepts.github.io.
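To make the phrase "perturbations within an $\epsilon$-ball" concrete, the following is a minimal sketch of how an optimized prompt embedding could be constrained to stay close to an anchor prompt embedding. It is an illustration only, not the authors' code: the abstract does not state which norm is used, so the L2 norm here is an assumption, and the function name `project_to_epsilon_ball` and the toy three-dimensional vectors are hypothetical.

```python
import math


def project_to_epsilon_ball(embedding, anchor, epsilon):
    """Return the point nearest to `embedding` that lies within
    L2 distance `epsilon` of `anchor` (assumed norm, see lead-in)."""
    # Perturbation of the candidate embedding relative to the anchor prompt.
    delta = [e - a for e, a in zip(embedding, anchor)]
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= epsilon:
        # Already inside the ball: leave the embedding unchanged.
        return list(embedding)
    # Otherwise rescale the perturbation so its length is exactly epsilon.
    scale = epsilon / norm
    return [a + scale * d for a, d in zip(anchor, delta)]


# Toy example: a hypothetical 3-dimensional "prompt embedding".
anchor = [1.0, 0.0, 0.0]
candidate = [1.0, 3.0, 4.0]  # distance 5.0 from the anchor
projected = project_to_epsilon_ball(candidate, anchor, epsilon=0.5)
```

In a prompt tuning loop, a projection of this kind would typically be applied after each gradient step, so that the learned embedding never drifts further than $\epsilon$ from the prompt it started from.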
