🧬 Ethics & Safety

Fairness in ML

Fair and unbiased models

100 papers · 3,744 total citations
947 papers, Feb '24 to Jan '26
Also includes: fairness, fair machine learning, bias mitigation, algorithmic fairness

Top Papers

#1

From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline

Tianle Li, Wei-Lin Chiang, Evan Frick et al.

ICML 2025
329 citations
#2

OpenChat: Advancing Open-source Language Models with Mixed-Quality Data

Guan Wang, Sijie Cheng, Xianyuan Zhan et al.

ICLR 2024
309 citations
#3

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

Chongyu Fan, Jiancheng Liu, Yihua Zhang et al.

ICLR 2024
263 citations
#4

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge

Jiayi Ye, Yanbo Wang, Yue Huang et al.

ICLR 2025 · arXiv:2410.02736
llm-as-a-judge · bias quantification · automated evaluation framework · language model evaluation +2
207 citations
#5

JudgeBench: A Benchmark for Evaluating LLM-Based Judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.

ICLR 2025 · arXiv:2410.12784
llm-based judges · evaluation framework · preference labeling · objective correctness +4
150 citations
#6

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke et al.

ICML 2025
110 citations
#7

Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties

Taylor Sorensen, Liwei Jiang, Jena Hwang et al.

AAAI 2024 · arXiv:2309.00779
value pluralism · human values modeling · contextualized value generation · multi-task language model +4
91 citations
#8

Reliable Conflictive Multi-View Learning

Cai Xu, Jiajun Si, Ziyu Guan et al.

AAAI 2024 · arXiv:2402.16897
multi-view learning · conflictive instances · evidential learning · opinion aggregation +2
88 citations
#9

Finetuning Text-to-Image Diffusion Models for Fairness

Xudong Shen, Chao Du, Tianyu Pang et al.

ICLR 2024
85 citations
#10

Human Feedback is not Gold Standard

Tom Hosking, Phil Blunsom, Max Bartolo

ICLR 2024
83 citations
#11

Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution

Zhiyuan You, Xin Cai, Jinjin Gu et al.

CVPR 2025
81 citations
#12

OpenBias: Open-set Bias Detection in Text-to-Image Generative Models

Moreno D'Incà, Elia Peruzzo et al.

CVPR 2024
69 citations
#13

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

Hritik Bansal, Arian Hosseini, Rishabh Agarwal et al.

ICLR 2025
63 citations
#14

Position: The No Free Lunch Theorem, Kolmogorov Complexity, and the Role of Inductive Biases in Machine Learning

Micah Goldblum, Marc Finzi, Keefer Rowan et al.

ICML 2024
no free lunch theorems · kolmogorov complexity · inductive biases · supervised learning +4
60 citations
#15

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.

ICLR 2025
58 citations
#16

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Zhaorun Chen, Zichen Wen, Yichao Du et al.

NeurIPS 2025 · arXiv:2407.04842
multimodal reward models · text-to-image generation · preference dataset · image generation models +4
57 citations
#17

A Decade's Battle on Dataset Bias: Are We There Yet?

Zhuang Liu, Kaiming He

ICLR 2025
52 citations
#18

SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples

Phillip Howard, Avinash Madasu, Tiep Le et al.

CVPR 2024
48 citations
#19

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.

ICML 2025
48 citations
#20

FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"

Yifei Ming, Senthil Purushwalkam, Shrey Pandit et al.

ICLR 2025
faithfulness hallucination · retrieval-augmented generation · contextual evaluation benchmark · unanswerable context handling +3
45 citations
#21

Debiasing Multimodal Sarcasm Detection with Contrastive Learning

Mengzhao Jia, Can Xie, Liqiang Jing

AAAI 2024 · arXiv:2312.10493
multimodal sarcasm detection · contrastive learning · out-of-distribution generalization · debiasing methods +4
43 citations
#22

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

Jaehun Jung, Faeze Brahman, Yejin Choi

ICLR 2025
42 citations
#23

A-Bench: Are LMMs Masters at Evaluating AI-generated Images?

Zicheng Zhang, Haoning Wu, Chunyi Li et al.

ICLR 2025
40 citations
#24

No Prejudice! Fair Federated Graph Neural Networks for Personalized Recommendation

Nimesh Agrawal, Anuj Sirohi, Sandeep Kumar et al.

AAAI 2024 · arXiv:2312.10080
federated learning · graph neural networks · recommendation systems · fairness constraints +3
39 citations
#25

Gradient Reweighting: Towards Imbalanced Class-Incremental Learning

Jiangpeng He

CVPR 2024
39 citations
#26

Spurious Feature Diversification Improves Out-of-distribution Generalization

Yong Lin, Lu Tan, Yifan Hao et al.

ICLR 2024
33 citations
#27

FairSIN: Achieving Fairness in Graph Neural Networks through Sensitive Information Neutralization

Cheng Yang, Jixi Liu, Yunhe Yan et al.

AAAI 2024 · arXiv:2403.12474
graph neural networks · fairness in gnns · sensitive information neutralization · bias mitigation +3
33 citations
#28

Training Unbiased Diffusion Models From Biased Dataset

Yeongmin Kim, Byeonghu Na, Minsang Park et al.

ICLR 2024
32 citations
#29

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.

ICCV 2025 · arXiv:2501.04931
jailbreak attacks · multimodal large language models · safety mechanism vulnerabilities · shuffle inconsistency +4
28 citations
#30

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Benjamin Feuer, Micah Goldblum, Teresa Datta et al.

ICLR 2025
27 citations
#31

FairDomain: Achieving Fairness in Cross-Domain Medical Image Segmentation and Classification

Yu Tian, Congcong Wen, Min Shi et al.

ECCV 2024
26 citations
#32

Unmasking and Improving Data Credibility: A Study with Datasets for Training Harmless Language Models

Zhaowei Zhu, Jialu Wang, Hao Cheng et al.

ICLR 2024
26 citations
#33

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Baichuan Zhou, Haote Yang, Dairong Chen et al.

AAAI 2025
26 citations
#34

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

Lijun Li, Zhelun Shi, Xuhao Hu et al.

CVPR 2025
25 citations
#35

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning

Xin Yi, Shunfan Zheng, Linlin Wang et al.

AAAI 2025
25 citations
#36

DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification

Wenhui Zhu, Xiwen Chen, Peijie Qiu et al.

ECCV 2024 · arXiv:2407.03575
multiple instance learning · whole slide image classification · attention mechanism · diversity modeling +3
24 citations
#37

Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing

Xinghe Fu, Zhiyuan Yan, Taiping Yao et al.

AAAI 2025
24 citations
#38

Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation

Xinpeng Wang, Chengzhi (Martin) Hu, Paul Röttger et al.

ICLR 2025 · arXiv:2410.03415
false refusal mitigation · single vector ablation · language model safety · refusal behavior calibration +3
24 citations
#39

Inverse Constitutional AI: Compressing Preferences into Principles

Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier et al.

ICLR 2025
24 citations
#40

Facial Affective Behavior Analysis with Instruction Tuning

Yifan Li, Anh Dao, Wentao Bao et al.

ECCV 2024
23 citations
#41

Would Deep Generative Models Amplify Bias in Future Models?

Tianwei Chen, Yusuke Hirota, Mayu Otani et al.

CVPR 2024
23 citations
#42

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NeurIPS 2025
23 citations
#43

Truthful Aggregation of LLMs with an Application to Online Advertising

Ermis Soumalias, Michael Curry, Sven Seuken

NeurIPS 2025 · arXiv:2405.05905
auction mechanism design · truthful reporting · preference aggregation · online advertising +3
22 citations
#44

Pathologies of Predictive Diversity in Deep Ensembles

Geoff Pleiss, Taiga Abe, E. Kelly Buchanan et al.

ICLR 2024
21 citations
#45

Diverse Preference Learning for Capabilities and Alignment

Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell

ICLR 2025 · arXiv:2511.08594
preference learning · kl divergence regularizer · llm output diversity · alignment algorithms +3
21 citations
#46

Debiasing Algorithm through Model Adaptation

Tomasz Limisiewicz, David Mareček, Tomáš Musil

ICLR 2024
21 citations
#47

Is Your Multimodal Language Model Oversensitive to Safe Queries?

Xirui Li, Hengguang Zhou, Ruochen Wang et al.

ICLR 2025 · arXiv:2406.17806
multimodal large language models · safety mechanisms · cognitive distortions · oversensitivity benchmark +4
20 citations
#48

Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples

Chengqian Gao, Haonan Li, Liu Liu et al.

ICML 2025
20 citations
#49

Investigating Non-Transitivity in LLM-as-a-Judge

Yi Xu, Laura Ruis, Tim Rocktäschel et al.

ICML 2025
19 citations
#50

First-Person Fairness in Chatbots

Tyna Eloundou, Alex Beutel, David Robinson et al.

ICLR 2025
19 citations
#51

Weighted Envy-Freeness for Submodular Valuations

Luisa Montanari, Ulrike Schmidt-Kraepelin, Warut Suksompong et al.

AAAI 2024 · arXiv:2209.06437
fair allocation · indivisible goods · submodular valuations · weighted envy-freeness +4
19 citations
#52

Project-Fair and Truthful Mechanisms for Budget Aggregation

Rupert Freeman, Ulrike Schmidt-Kraepelin

AAAI 2024 · arXiv:2309.02613
budget aggregation problem · moving phantom mechanisms · truthful mechanisms · project fairness +4
18 citations
#53

Diversity-Aware Policy Optimization for Large Language Model Reasoning

Jian Yao, Ran Cheng, Xingyu Wu et al.

NeurIPS 2025
18 citations
#54

Repeated Fair Allocation of Indivisible Items

Ayumi Igarashi, Martin Lackner, Oliviero Nardi et al.

AAAI 2024 · arXiv:2304.01644
fair allocation · indivisible items · envy-freeness · proportionality +4
18 citations
#55

Unprocessing Seven Years of Algorithmic Fairness

André F. Cruz, Moritz Hardt

ICLR 2024
18 citations
#56

Fair-VPT: Fair Visual Prompt Tuning for Image Classification

Sungho Park, Hyeran Byun

CVPR 2024
18 citations
#57

Aioli: A Unified Optimization Framework for Language Model Data Mixing

Mayee Chen, Michael Hu, Nicholas Lourie et al.

ICLR 2025
16 citations
#58

Model Equality Testing: Which Model is this API Serving?

Irena Gao, Percy Liang, Carlos Guestrin

ICLR 2025
16 citations
#59

Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs

Yuzhe Gu, Wenwei Zhang, Chengqi Lyu et al.

ICLR 2025
15 citations
#60

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning

Jaehun Jung, Seungju Han, Ximing Lu et al.

NeurIPS 2025 · arXiv:2505.20161
gradient-based diversification · data diversity metrics · out-of-distribution generalization · synthetic data generation +3
15 citations
#61

Apollo-MILP: An Alternating Prediction-Correction Neural Solving Framework for Mixed-Integer Linear Programming

Haoyang Liu, Jie Wang, Zijie Geng et al.

ICLR 2025 · arXiv:2503.01129
mixed-integer linear programming · neural solving framework · trust-region search · problem reduction +4
15 citations
#62

Position: Editing Large Language Models Poses Serious Safety Risks

Paul Youssef, Zhixue Zhao, Daniel Braun et al.

ICML 2025
15 citations
#63

BaCon: Boosting Imbalanced Semi-supervised Learning via Balanced Feature-Level Contrastive Learning

Qianhan Feng, Lujing Xie, Shijie Fang et al.

AAAI 2024 · arXiv:2403.12986
semi-supervised learning · class imbalance · contrastive learning · feature-level regularization +4
15 citations
#64

ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

Xiangyuan Xue, Zeyu Lu, Di Huang et al.

CVPR 2025
15 citations
#65

RocketEval: Efficient automated LLM evaluation via grading checklist

Tianjun Wei, Wei Wen, Ruizhi Qiao et al.

ICLR 2025
15
citations
#66

MFABA: A More Faithful and Accelerated Boundary-Based Attribution Method for Deep Neural Networks

Zhiyu Zhu, Huaming Chen, Jiayu Zhang et al.

AAAI 2024 · arXiv:2312.13630
attribution methods · model interpretability · boundary-based attribution · sensitivity axiom +3
14 citations
#67

Optimizing Temperature for Language Models with Multi-Sample Inference

Weihua Du, Yiming Yang, Sean Welleck

ICML 2025
14 citations
#68

Towards Fair Graph Federated Learning via Incentive Mechanisms

Chenglu Pan, Jiarong Xu, Yue Yu et al.

AAAI 2024 · arXiv:2312.13306
graph federated learning · incentive mechanisms · agent valuation · gradient alignment +4
14 citations
#69

Regroup Median Loss for Combating Label Noise

Fengpeng Li, Kemou Li, Jinyu Tian et al.

AAAI 2024 · arXiv:2312.06273
label noise · small-loss criterion · robust loss estimation · semi-supervised learning +3
14 citations
#70

Mix Data or Merge Models? Balancing the Helpfulness, Honesty, and Harmlessness of Large Language Model via Model Merging

Jinluan Yang, Dingnan Jin, Anke Tang et al.

NeurIPS 2025 · arXiv:2502.06876
model merging · 3h optimization · large language model alignment · parameter-level conflict resolution +4
13 citations
#71

Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset

Yingzi Ma, Jiongxiao Wang, Fei Wang et al.

ICLR 2025
13 citations
#72

CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models

Song Wang, Peng Wang, Tong Zhou et al.

ICLR 2025 · arXiv:2407.02408
bias evaluation · large language models · fairness benchmarks · compositional taxonomy +3
13 citations
#73

How Contaminated Is Your Benchmark? Measuring Dataset Leakage in Large Language Models with Kernel Divergence

Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei et al.

ICML 2025
13 citations
#74

Discover and Mitigate Multiple Biased Subgroups in Image Classifiers

Zeliang Zhang, Mingqian Feng, Zhiheng Li et al.

CVPR 2024
12 citations
#75

Imputation for prediction: beware of diminishing returns.

Marine Le Morvan, Gael Varoquaux

ICLR 2025 · arXiv:2407.19804
missing value imputation · predictive modeling · missingness indicators · imputation accuracy +4
12 citations
#76

Dissecting Submission Limit in Desk-Rejections: A Mathematical Analysis of Fairness in AI Conference Policies

Yuefan Cao, Xiaoyu Li, Yingyu Liang et al.

ICML 2025
12 citations
#77

Minimum-Norm Interpolation Under Covariate Shift

Neil Mallinar, Austin Zane, Spencer Frei et al.

ICML 2024
transfer learning · covariate shift · benign overfitting · linear interpolation +3
12 citations
#78

Backdoor Cleaning without External Guidance in MLLM Fine-tuning

Xuankun Rong, Wenke Huang, Jian Liang et al.

NeurIPS 2025
12 citations
#79

Weighted-Reward Preference Optimization for Implicit Model Fusion

Ziyi Yang, Fanqi Wan, Longguang Zhong et al.

ICLR 2025
12 citations
#80

Differentiable Optimization of Similarity Scores Between Models and Brains

Nathan Cloos, Moufan Li, Markus Siegel et al.

ICLR 2025
11 citations
#81

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video

Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.

AAAI 2024 · arXiv:2401.07567
temporal sentence grounding · dataset bias · adversarial training · multimodal alignment +4
11 citations
#82

Dropout-Based Rashomon Set Exploration for Efficient Predictive Multiplicity Estimation

Hsiang Hsu, Guihong Li, Shaohan Hu et al.

ICLR 2024
11 citations
#83

Causal Fairness under Unobserved Confounding: A Neural Sensitivity Framework

Maresa Schröder, Dennis Frauen, Stefan Feuerriegel

ICLR 2024
11 citations
#84

Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models

Thomas Zollo, Todd Morrill, Zhun Deng et al.

ICLR 2024
11 citations
#85

CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning

Hyuck Lee, Heeyoung Kim

CVPR 2024
11 citations
#86

The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization

Jae-Won Chung, Jeff J. Ma, Ruofan Wu et al.

NeurIPS 2025 · arXiv:2505.06371
inference energy consumption · energy measurement benchmark · generative ai services · automated optimization recommendations +2
10 citations
#87

Emerging Property of Masked Token for Effective Pre-training

Hyesong Choi, Hunsang Lee, Seyoung Joung et al.

ECCV 2024 · arXiv:2404.08330
masked image modeling · self-supervised learning · masked token optimization · pre-training efficiency +3
10 citations
#88

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Minghe Gao, Xuqi Liu, Zhongqi Yue et al.

ICCV 2025
10 citations
#89

Consistency Checks for Language Model Forecasters

Daniel Paleka, Abhimanyu Pallavi Sudhir, Alejandro Alvarez et al.

ICLR 2025 · arXiv:2412.18544
language model forecasting · consistency checks · automated evaluation system · arbitrage-based metrics +3
10 citations
#90

PLeaS - Merging Models with Permutations and Least Squares

Anshul Nasery, Jonathan Hayase, Pang Wei Koh et al.

CVPR 2025 · arXiv:2407.02447
model merging · permutation symmetries · least squares approximation · fine-tuned models +3
10 citations
#91

HEIE: MLLM-Based Hierarchical Explainable AIGC Image Implausibility Evaluator

Fan Yang, Ru Zhen, Jianing Wang et al.

CVPR 2025
10 citations
#92

Adaptive Self-improvement LLM Agentic System for ML Library Development

Genghan Zhang, Weixin Liang, Olivia Hsu et al.

ICML 2025
10 citations
#93

Reliable and Efficient Amortized Model-based Evaluation

Sang Truong, Yuheng Tu, Percy Liang et al.

ICML 2025
10 citations
#94

Post-hoc bias scoring is optimal for fair classification

Wenlong Chen, Yegor Klochkov, Yang Liu

ICLR 2024
10 citations
#95

PAL: Sample-Efficient Personalized Reward Modeling for Pluralistic Alignment

Daiwei Chen, Yi Chen, Aniket Rege et al.

ICLR 2025
reward modeling · pluralistic alignment · personalized preferences · few-shot learning +3
9 citations
#96

Constrained Fair and Efficient Allocations

Benjamin Cookson, Soroush Ebadian, Nisarg Shah

AAAI 2025
9 citations
#97

FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication

Eric Slyman, Stefan Lee, Scott Cohen et al.

CVPR 2024
9 citations
#98

PowerMLP: An Efficient Version of KAN

Ruichen Qiu, Yibo Miao, Shiwen Wang et al.

AAAI 2025
9 citations
#99

MIB: A Mechanistic Interpretability Benchmark

Aaron Mueller, Atticus Geiger, Sarah Wiegreffe et al.

ICML 2025
9 citations
#100

The Non-Linear Representation Dilemma: Is Causal Abstraction Enough for Mechanistic Interpretability?

Denis Sutter, Julian Minder, Thomas Hofmann et al.

NeurIPS 2025
9 citations