Research Alpha Leak - Rising Stars in Research

#1

WorldSimBench: Towards Video Generation Models as World Simulators

Yiran Qin, Zhelun Shi, Jiwen Yu et al.

ICML 2025

806 citations

#2

From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline

Tianle Li, Wei-Lin Chiang, Evan Frick et al.

ICML 2025

329 citations

#3

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang, Chun-Kai Fan, Junpeng Ma et al.

ICML 2025

190 citations

#4

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu, Zekun Wang, Junli Wang et al.

ICML 2025

165 citations

#5

Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan, Xingyao Wang, Graham Neubig et al.

ICML 2025

130 citations

#6

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Anselm Paulus, Arman Zharmagambetov, Chuan Guo et al.

ICML 2025

123 citations

#7

Layer by Layer: Uncovering Hidden Representations in Language Models

Oscar Skean, Md Rifat Arefin, Dan Zhao et al.

ICML 2025

118 citations

#8

Imagine While Reasoning in Space: Multimodal Visualization-of-Thought

Chengzu Li, Wenshan Wu, Huanyu Zhang et al.

ICML 2025

115 citations

#9

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke et al.

ICML 2025

110 citations

#10

Taming Rectified Flow for Inversion and Editing

Jiangshan Wang, Junfu Pu, Zhongang Qi et al.

ICML 2025

110 citations

#11

A General Framework for Inference-time Scaling and Steering of Diffusion Models

Raghav Singhal, Zachary Horvitz, Ryan Teehan et al.

ICML 2025

103 citations

#12

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Xiong Wang, Yangze Li, Chaoyou Fu et al.

ICML 2025

103 citations

#13

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Zhengxuan Wu, Aryaman Arora, Atticus Geiger et al.

ICML 2025

100 citations

#14

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Yuxin Zuo, Shang Qu, Yifei Li et al.

ICML 2025

98 citations

#15

OR-Bench: An Over-Refusal Benchmark for Large Language Models

Jiaxing Cui, Wei-Lin Chiang, Ion Stoica et al.

ICML 2025

97 citations

#16

Theoretical guarantees on the best-of-n alignment policy

Ahmad Beirami, Alekh Agarwal, Jonathan Berant et al.

ICML 2025

89 citations

#17

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Dongzhi Jiang, Renrui Zhang, Ziyu Guo et al.

ICML 2025

88 citations

#18

Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction

Xiang Fu, Brandon Wood, Luis Barroso-Luque et al.

ICML 2025

87 citations

#19

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.

ICML 2025

72 citations

#20

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine et al.

ICML 2025

68 citations
