Concept Bottleneck Large Language Models

22 citations

arXiv:2412.07992 Project

Citation rank: #485 of 3827 papers in ICLR 2025

Authors

Chung-En Sun, Tuomas Oikarinen, Berk Ustun, Tsui-Wei Weng

Topics

concept bottleneck models, interpretable large language models, text classification, text generation, controlled generation, harmful content detection, concept detection, model unlearning

Abstract

We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models. Our code is available at https://github.com/Trustworthy-ML-Lab/CB-LLMs.
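
To make the idea in the abstract concrete, here is a minimal sketch of a generic concept bottleneck classifier in plain Python: backbone features are mapped to a small layer of named concept neurons, and the prediction is a linear function of those neurons only. It is not the authors' implementation. The concept names, class names, weights, and the helper functions `concept_layer`, `classify`, and `intervene` are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch of a generic concept bottleneck classifier.
# ASSUMPTION: every name, concept, and weight below is hypothetical and is
# not taken from the CB-LLMs paper or its code repository.
from typing import Dict, List, Tuple

CONCEPTS = ["mentions refund", "positive tone", "threatening language"]
CLASSES = ["benign", "harmful"]

# Hypothetical final linear layer: CLASS_WEIGHTS[c][k] links concept k to class c.
CLASS_WEIGHTS = [
    [0.5, 1.0, -2.0],   # benign
    [-0.5, -1.0, 2.0],  # harmful
]


def concept_layer(features: List[float],
                  concept_weights: List[List[float]]) -> List[float]:
    """Map backbone features to non-negative concept activations (ReLU)."""
    return [
        max(0.0, sum(w * f for w, f in zip(row, features)))
        for row in concept_weights
    ]


def classify(concepts: List[float]) -> Tuple[str, Dict[str, float]]:
    """Predict a class from concept activations alone.

    Returns the predicted label and the per-concept contribution to that
    label's logit, which serves as the explanation.
    """
    logits = [sum(w * a for w, a in zip(row, concepts)) for row in CLASS_WEIGHTS]
    best = max(range(len(CLASSES)), key=lambda c: logits[c])
    contributions = {
        name: CLASS_WEIGHTS[best][k] * concepts[k]
        for k, name in enumerate(CONCEPTS)
    }
    return CLASSES[best], contributions


def intervene(concepts: List[float], index: int, value: float = 0.0) -> List[float]:
    """Return a copy of the activations with one concept neuron overridden."""
    edited = list(concepts)
    edited[index] = value
    return edited


if __name__ == "__main__":
    # Hypothetical 2-dimensional backbone features and concept projection.
    features = [1.0, 2.0]
    concept_weights = [[1.0, 0.0], [0.0, -1.0], [0.0, 1.0]]

    concepts = concept_layer(features, concept_weights)
    label, why = classify(concepts)
    print(label, why)

    # Suppress the "threatening language" neuron and classify again.
    label_after, _ = classify(intervene(concepts, 2, 0.0))
    print(label_after)
```

Because the final layer sees only the concept activations, each prediction can be read off as a sum of per-concept contributions, and overriding a single neuron changes the output in a traceable way. This is the general mechanism behind the steering and unlearning described in the abstract, shown here on a toy classifier and not on a generative model.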

Citation History

Tracked dates: Jan 26, 2026; Jan 27, 2026; Feb 1, 2026 (22 citations)