"model alignment" Papers

24 papers found

An Auditing Test to Detect Behavioral Shift in Language Models

Leo Richter, Xuanli He, Pasquale Minervini et al.

ICLR 2025 (oral) · arXiv:2410.19406
2 citations

Anyprefer: An Agentic Framework for Preference Data Synthesis

Yiyang Zhou, Zhaoyang Wang, Tianle Wang et al.

ICLR 2025 · arXiv:2504.19276
11 citations

Composition and Alignment of Diffusion Models using Constrained Learning

Shervin Khalafi, Ignacio Hounie, Dongsheng Ding et al.

NeurIPS 2025 · arXiv:2508.19104
2 citations

HelpSteer2-Preference: Complementing Ratings with Preferences

Zhilin Wang, Alexander Bukharin, Olivier Delalleau et al.

ICLR 2025 · arXiv:2410.01257
112 citations

Improving Large Vision and Language Models by Learning from a Panel of Peers

Jefferson Hernandez, Jing Shi, Simon Jenni et al.

ICCV 2025 · arXiv:2509.01610
1 citation

Information Theoretic Text-to-Image Alignment

Chao Wang, Giulio Franzese, Alessandro Finamore et al.

ICLR 2025 · arXiv:2405.20759
4 citations

Jailbreaking as a Reward Misspecification Problem

Zhihui Xie, Jiahui Gao, Lei Li et al.

ICLR 2025 · arXiv:2406.14393
11 citations

Large Language Models Assume People are More Rational than We Really are

Ryan Liu, Jiayi Geng, Joshua Peterson et al.

ICLR 2025 · arXiv:2406.17055
43 citations

LLaVA-Critic: Learning to Evaluate Multimodal Models

Tianyi Xiong, Xiyao Wang, Dong Guo et al.

CVPR 2025 · arXiv:2410.02712
103 citations

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Yushi Bai, Jiajie Zhang, Xin Lv et al.

ICLR 2025 · arXiv:2408.07055
103 citations

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Zhangchen Xu, Fengqing Jiang, Luyao Niu et al.

ICLR 2025 · arXiv:2406.08464
276 citations

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Zhaorun Chen, Zichen Wen, Yichao Du et al.

NeurIPS 2025 · arXiv:2407.04842
60 citations

RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization

Hanyang Zhao, Genta Winata, Anirban Das et al.

ICLR 2025 · arXiv:2410.04203
19 citations

SAS: Segment Any 3D Scene with Integrated 2D Priors

Zhuoyuan Li, Jiahao Lu, Jiacheng Deng et al.

ICCV 2025 · arXiv:2503.08512
2 citations

Scalable Ranked Preference Optimization for Text-to-Image Generation

Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata et al.

ICCV 2025 · arXiv:2410.18013
23 citations

Self-Boosting Large Language Models with Synthetic Preference Data

Qingxiu Dong, Li Dong, Xingxing Zhang et al.

ICLR 2025 · arXiv:2410.06961
32 citations

The Best Instruction-Tuning Data are Those That Fit

Dylan Zhang, Qirun Dai, Hao Peng

NeurIPS 2025 (spotlight) · arXiv:2502.04194
23 citations

Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

Wenhong Zhu, Zhiwei He, Xiaofeng Wang et al.

ICLR 2025 · arXiv:2410.18640
15 citations

Active Preference Learning for Large Language Models

William Muldrew, Peter Hayes, Mingtian Zhang et al.

ICML 2024 · arXiv:2402.08114
46 citations

Bridge Past and Future: Overcoming Information Asymmetry in Incremental Object Detection

Qijie Mo, Yipeng Gao, Shenghao Fu et al.

ECCV 2024 · arXiv:2407.11499
14 citations

Learning and Forgetting Unsafe Examples in Large Language Models

Jiachen Zhao, Zhun Deng, David Madras et al.

ICML 2024 (oral) · arXiv:2312.12736
25 citations

Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff et al.

ICML 2024 (spotlight) · arXiv:2402.01306
871 citations

Recovering the Pre-Fine-Tuning Weights of Generative Models

Eliahu Horwitz, Jonathan Kahana, Yedid Hoshen

ICML 2024 · arXiv:2402.10208
13 citations

Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Collin Burns, Pavel Izmailov, Jan Kirchner et al.

ICML 2024 · arXiv:2312.09390
406 citations