Most Cited 2025 "frequency-wise self-attention" Papers

22,274 papers found • Page 2 of 112

#201

How Far Is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang, Yang Yue, Rui Lu et al.

ICML 2025arXiv:2411.02385
126
citations
#202

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Gaoyue Zhou, Hengkai Pan, Yann LeCun et al.

ICML 2025oralarXiv:2411.04983
126
citations
#203

SmolVLM: Redefining small and efficient multimodal models

Andrés Marafioti, Orr Zohar, Miquel Farré et al.

COLM 2025paperarXiv:2504.05299
125
citations
#204

Interpreting Emergent Planning in Model-Free Reinforcement Learning

Thomas Bush, Stephen Chung, Usman Anwar et al.

ICLR 2025arXiv:1901.03559
125
citations
#205

Scaling up Masked Diffusion Models on Text

Shen Nie, Fengqi Zhu, Chao Du et al.

ICLR 2025oralarXiv:2410.18514
124
citations
#206

ToolACE: Winning the Points of LLM Function Calling

Weiwen Liu, Xu Huang, Xingshan Zeng et al.

ICLR 2025arXiv:2409.00920
124
citations
#207

MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds

Jiahui Lei, Yijia Weng, Adam W Harley et al.

CVPR 2025highlightarXiv:2405.17421
124
citations
#208

Data Scaling Laws in Imitation Learning for Robotic Manipulation

Fanqi Lin, Yingdong Hu, Pingyue Sheng et al.

ICLR 2025arXiv:2410.18647
123
citations
#209

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Senqiao Yang, Yukang Chen, Zhuotao Tian et al.

CVPR 2025arXiv:2412.04467
123
citations
#210

CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

Mohammadreza Pourreza, Hailong Li, Ruoxi Sun et al.

ICLR 2025arXiv:2410.01943
122
citations
#211

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

Liliang Ren, Yang Liu, Yadong Lu et al.

ICLR 2025arXiv:2406.07522
122
citations
#212

Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

Sili Chen, Hengkai Guo, Shengnan Zhu et al.

CVPR 2025highlightarXiv:2501.12375
121
citations
#213

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Zehan Qi, Xiao Liu, Iat Long Iong et al.

ICLR 2025arXiv:2411.02337
121
citations
#214

BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval

Hongjin SU, Howard Yen, Mengzhou Xia et al.

ICLR 2025arXiv:2407.12883
120
citations
#215

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Fei Wang, XINGYU FU, James Y. Huang et al.

ICLR 2025oralarXiv:2406.09411
120
citations
#216

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, brian ichter, Michael Equi et al.

ICML 2025arXiv:2502.19417
120
citations
#217

SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

Rui-Jie Zhu, Qihang Zhao, Jason Eshraghian et al.

ICLR 2025arXiv:2302.13939
120
citations
#218

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Lijie Fan, Tianhong Li, Siyang Qin et al.

ICLR 2025arXiv:2410.13863
119
citations
#219

A General Framework for Inference-time Scaling and Steering of Diffusion Models

Raghav Singhal, Zachary Horvitz, Ryan Teehan et al.

ICML 2025arXiv:2501.06848
119
citations
#220

Taming Rectified Flow for Inversion and Editing

Jiangshan Wang, Junfu Pu, Zhongang Qi et al.

ICML 2025arXiv:2411.04746
119
citations
#221

No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images

Botao Ye, Sifei Liu, Haofei Xu et al.

ICLR 2025arXiv:2410.24207
119
citations
#222

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Zhengxuan Wu, Aryaman Arora, Atticus Geiger et al.

ICML 2025spotlightarXiv:2501.17148
118
citations
#223

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Jiasheng Ye, Peiju Liu, Tianxiang Sun et al.

ICLR 2025arXiv:2403.16952
118
citations
#224

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Xeron Du, Yifan Yao, Kaijing Ma et al.

NEURIPS 2025arXiv:2502.14739
118
citations
#225

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin et al.

ICLR 2025oralarXiv:2407.12781
118
citations
#226

AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen et al.

NEURIPS 2025arXiv:2505.24298
117
citations
#227

Free Process Rewards without Process Labels

Lifan Yuan, Wendi Li, Huayu Chen et al.

ICML 2025arXiv:2412.01981
117
citations
#228

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Shaolei Zhang, Qingkai Fang, Yang et al.

ICLR 2025arXiv:2501.03895
117
citations
#229

MASt3R-SLAM: Real-Time Dense SLAM with 3D Reconstruction Priors

Riku Murai, Eric Dexheimer, Andrew J. Davison

CVPR 2025highlightarXiv:2412.12392
117
citations
#230

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Ziyan Jiang, Rui Meng, Xinyi Yang et al.

ICLR 2025arXiv:2410.05160
117
citations
#231

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Min Shi, Fuxiao Liu, Shihao Wang et al.

ICLR 2025arXiv:2408.15998
116
citations
#232

ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

Yuchen Lin, Ronan Le Bras, Kyle Richardson et al.

ICML 2025arXiv:2502.01100
116
citations
#233

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Zeyuan Allen-Zhu, Yuanzhi Li

ICLR 2025arXiv:2404.05405
116
citations
#234

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Hanrong Zhang, Jingyuan Huang, Kai Mei et al.

ICLR 2025arXiv:2410.02644
116
citations
#235

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

Senqiao Yang, Jiaming Liu, Renrui Zhang et al.

AAAI 2025paperarXiv:2312.14074
116
citations
#236

Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling

Kaiwen Zheng, Yongxin Chen, Hanzi Mao et al.

ICLR 2025arXiv:2409.02908
116
citations
#237

Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Zhenghao Zhang, Junchao Liao, Menghao Li et al.

CVPR 2025arXiv:2407.21705
115
citations
#238

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

Xiaogeng Liu, Peiran Li, G. Edward Suh et al.

ICLR 2025arXiv:2410.05295
115
citations
#239

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang et al.

NEURIPS 2025arXiv:2503.01840
115
citations
#240

Rethinking Joint Maximum Mean Discrepancy for Visual Domain Adaptation

Wei Wang, Haifeng Xia, Chao Huang et al.

NEURIPS 2025oral
115
citations
#241

Agent-as-a-Judge: Evaluate Agents with Agents

Mingchen Zhuge, Changsheng Zhao, Dylan Ashley et al.

ICML 2025arXiv:2410.10934
114
citations
#242

Tamper-Resistant Safeguards for Open-Weight LLMs

Rishub Tamirisa, Bhrugu Bharathi, Long Phan et al.

ICLR 2025arXiv:2408.00761
113
citations
#243

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Quanfeng Lu, Wenqi Shao, Zitao Liu et al.

ICCV 2025arXiv:2406.08451
113
citations
#244

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang et al.

ICML 2025spotlightarXiv:2412.14803
113
citations
#245

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Xiong Wang, Yangze Li, Chaoyou Fu et al.

ICML 2025arXiv:2411.00774
112
citations
#246

A Sanity Check for AI-generated Image Detection

Shilin Yan, Ouxiang Li, Jiayin Cai et al.

ICLR 2025arXiv:2406.19435
112
citations
#247

HelpSteer2-Preference: Complementing Ratings with Preferences

Zhilin Wang, Alexander Bukharin, Olivier Delalleau et al.

ICLR 2025arXiv:2410.01257
112
citations
#248

SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

Enze Xie, Junsong Chen, Yuyang Zhao et al.

ICML 2025arXiv:2501.18427
111
citations
#249

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

Yuchen Duan, Weiyun Wang, Zhe Chen et al.

ICLR 2025arXiv:2403.02308
111
citations
#250

EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIVACY LEAKAGE

Zeyi Liao, Lingbo Mo, Chejian Xu et al.

ICLR 2025arXiv:2409.11295
111
citations
#251

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee et al.

CVPR 2025arXiv:2409.17146
111
citations
#252

RM-Bench: Benchmarking Reward Models of Language Models with Subtlety and Style

Yantao Liu, Zijun Yao, Rui Min et al.

ICLR 2025arXiv:2410.16184
110
citations
#253

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

Feng Liu, Shiwei Zhang, Xiaofeng Wang et al.

CVPR 2025highlightarXiv:2411.19108
110
citations
#254

Autoregressive Video Generation without Vector Quantization

Haoge Deng, Ting Pan, Haiwen Diao et al.

ICLR 2025oralarXiv:2412.14169
110
citations
#255

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Jianhong Bai, Menghan Xia, Xiao Fu et al.

ICCV 2025arXiv:2503.11647
110
citations
#256

Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Han Zhao, Min Zhang, Wei Zhao et al.

AAAI 2025paperarXiv:2403.14520
110
citations
#257

MoBA: Mixture of Block Attention for Long-Context LLMs

Enzhe Lu, Zhejun Jiang, Jingyuan Liu et al.

NEURIPS 2025spotlightarXiv:2502.13189
109
citations
#258

On the self-verification limitations of large language models on reasoning and planning tasks

Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati

ICLR 2025arXiv:2402.08115
109
citations
#259

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke et al.

ICML 2025oralarXiv:2502.17424
108
citations
#260

SV4D: Dynamic 3D Content Generation with Multi-Frame and Multi-View Consistency

Yiming Xie, Chun-Han Yao, Vikram Voleti et al.

ICLR 2025oralarXiv:2407.17470
108
citations
#261

SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?

John Yang, Carlos E Jimenez, Alex Zhang et al.

ICLR 2025arXiv:2410.03859
108
citations
#262

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Saaket Agashe, Jiuzhou Han, Shuyu Gan et al.

ICLR 2025arXiv:2410.08164
107
citations
#263

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

Rundi Wu, Ruiqi Gao, Ben Poole et al.

CVPR 2025arXiv:2411.18613
107
citations
#264

IMAGDressing-v1: Customizable Virtual Dressing

Fei Shen, Xin Jiang, Xin He et al.

AAAI 2025paperarXiv:2407.12705
107
citations
#265

OmniRe: Omni Urban Scene Reconstruction

Ziyu Chen, Jiawei Yang, Jiahui Huang et al.

ICLR 2025arXiv:2408.16760
107
citations
#266

DEIM: DETR with Improved Matching for Fast Convergence

Shihua Huang, Zhichao Lu, Xiaodong Cun et al.

CVPR 2025arXiv:2412.04234
107
citations
#267

Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

Litu Rout, Yujia Chen, Nataniel Ruiz et al.

ICLR 2025arXiv:2410.10792
106
citations
#268

MS-Diffusion: Multi-subject Zero-shot Image Personalization with Layout Guidance

Xierui Wang, Siming Fu, Qihan Huang et al.

ICLR 2025arXiv:2406.07209
106
citations
#269

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng, Junlin Lv, Yukun Cao et al.

NEURIPS 2025arXiv:2407.11550
106
citations
#270

VideoPhy: Evaluating Physical Commonsense for Video Generation

Hritik Bansal, Zongyu Lin, Tianyi Xie et al.

ICLR 2025arXiv:2406.03520
106
citations
#271

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Huanjin Yao, Jiaxing Huang, Wenhao Wu et al.

NEURIPS 2025spotlightarXiv:2412.18319
106
citations
#272

Show-o2: Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, Mike Zheng Shou

NEURIPS 2025oralarXiv:2506.15564
106
citations
#273

DPO Meets PPO: Reinforced Token Optimization for RLHF

Han Zhong, Zikang Shan, Guhao Feng et al.

ICML 2025spotlightarXiv:2404.18922
106
citations
#274

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

Xianjie Wu, Jian Yang, Linzheng Chai et al.

AAAI 2025paperarXiv:2408.09174
105
citations
#275

Group-in-Group Policy Optimization for LLM Agent Training

Lang Feng, Zhenghai Xue, Tingcong Liu et al.

NEURIPS 2025arXiv:2505.10978
105
citations
#276

RegMix: Data Mixture as Regression for Language Model Pre-training

Qian Liu, Xiaosen Zheng, Niklas Muennighoff et al.

ICLR 2025arXiv:2407.01492
105
citations
#277

MLVU: Benchmarking Multi-task Long Video Understanding

Junjie Zhou, Yan Shu, Bo Zhao et al.

CVPR 2025arXiv:2406.04264
105
citations
#278

FoundationStereo: Zero-Shot Stereo Matching

Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.

CVPR 2025arXiv:2501.09898
105
citations
#279

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Yuxin Zuo, Shang Qu, Yifei Li et al.

ICML 2025arXiv:2501.18362
105
citations
#280

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Frank (Fangzheng) Xu, Yufan Song, Boxuan Li et al.

NEURIPS 2025arXiv:2412.14161
105
citations
#281

AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark

Wenhao Chai, Enxin Song, Yilun Du et al.

ICLR 2025oralarXiv:2410.03051
105
citations
#282

OR-Bench: An Over-Refusal Benchmark for Large Language Models

Jiaxing Cui, Wei-Lin Chiang, Ion Stoica et al.

ICML 2025arXiv:2405.20947
104
citations
#283

Long-Context LLMs Meet RAG: Overcoming Challenges for Long Inputs in RAG

Bowen Jin, Jinsung Yoon, Jiawei Han et al.

ICLR 2025arXiv:2410.05983
104
citations
#284

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Mingjie Liu, Shizhe Diao, Ximing Lu et al.

NEURIPS 2025arXiv:2505.24864
104
citations
#285

Transformers without Normalization

Jiachen Zhu, Xinlei Chen, Kaiming He et al.

CVPR 2025arXiv:2503.10622
103
citations
#286

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Shenghai Yuan, Jinfa Huang, Xianyi He et al.

CVPR 2025highlightarXiv:2411.17440
103
citations
#287

LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs

Yushi Bai, Jiajie Zhang, Xin Lv et al.

ICLR 2025arXiv:2408.07055
103
citations
#288

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

Wenkai Yang, Shuming Ma, Yankai Lin et al.

NEURIPS 2025arXiv:2502.18080
103
citations
#289

LLaVA-Critic: Learning to Evaluate Multimodal Models

Tianyi Xiong, Xiyao Wang, Dong Guo et al.

CVPR 2025arXiv:2410.02712
103
citations
#290

MeshAnything: Artist-Created Mesh Generation with Autoregressive Transformers

Yiwen Chen, Tong He, Di Huang et al.

ICLR 2025arXiv:2406.10163
102
citations
#291

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Gen Luo, Yiyi Zhou, Yuxin Zhang et al.

ICLR 2025arXiv:2403.03003
102
citations
#292

Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Rogerio Bonatti, Dan Zhao, Francesco Bonacci et al.

ICML 2025arXiv:2409.08264
102
citations
#293

The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning

Shivam Agarwal, Zimin Zhang, Lifan Yuan et al.

NEURIPS 2025arXiv:2505.15134
102
citations
#294

Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions

Jaeyeon Kim, Kulin Shah, Vasilis Kontonis et al.

ICML 2025oralarXiv:2502.06768
102
citations
#295

Not All Language Model Features Are One-Dimensionally Linear

Josh Engels, Eric Michaud, Isaac Liao et al.

ICLR 2025arXiv:2405.14860
101
citations
#296

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas et al.

ICCV 2025arXiv:2412.08629
101
citations
#297

T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT

Dongzhi JIANG, Ziyu Guo, Renrui Zhang et al.

NEURIPS 2025arXiv:2505.00703
100
citations
#298

Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer

Zechuan Zhang, Ji Xie, Yu Lu et al.

NEURIPS 2025arXiv:2504.20690
100
citations
#299

TimeCMA: Towards LLM-Empowered Multivariate Time Series Forecasting via Cross-Modality Alignment

Chenxi Liu, Qianxiong Xu, Hao Miao et al.

AAAI 2025paperarXiv:2406.01638
100
citations
#300

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Haotian Tang, Yecheng Wu, Shang Yang et al.

ICLR 2025arXiv:2410.10812
100
citations
#301

Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark

Yunzhuo Hao, Jiawei Gu, Huichen Wang et al.

ICML 2025oralarXiv:2501.05444
100
citations
#302

Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

Tian Ye, Zicheng Xu, Yuanzhi Li et al.

ICLR 2025arXiv:2407.20311
100
citations
#303

PaperBench: Evaluating AI’s Ability to Replicate AI Research

Giulio Starace, Oliver Jaffe, Dane Sherburn et al.

ICML 2025oralarXiv:2504.01848
99
citations
#304

Magma: A Foundation Model for Multimodal AI Agents

Jianwei Yang, Reuben Tan, Qianhui Wu et al.

CVPR 2025arXiv:2502.13130
99
citations
#305

FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu et al.

CVPR 2025arXiv:2502.12138
99
citations
#306

Consistency Models Made Easy

Zhengyang Geng, Ashwini Pokle, Weijian Luo et al.

ICLR 2025arXiv:2406.14548
99
citations
#307

ImgEdit: A Unified Image Editing Dataset and Benchmark

Yang Ye, Xianyi He, Zongjian Li et al.

NEURIPS 2025arXiv:2505.20275
98
citations
#308

WebDancer: Towards Autonomous Information Seeking Agency

Jialong Wu, Baixuan Li, Runnan Fang et al.

NEURIPS 2025arXiv:2505.22648
98
citations
#309

Adam-mini: Use Fewer Learning Rates To Gain More

Yushun Zhang, Congliang Chen, Ziniu Li et al.

ICLR 2025arXiv:2406.16793
98
citations
#310

SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models

Muyang Li, Yujun Lin, Zhekai Zhang et al.

ICLR 2025arXiv:2411.05007
98
citations
#311

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Rui Yang, Hanyang(Jeremy) Chen, Junyu Zhang et al.

ICML 2025oralarXiv:2502.09560
98
citations
#312

When Attention Sink Emerges in Language Models: An Empirical View

Xiangming Gu, Tianyu Pang, Chao Du et al.

ICLR 2025arXiv:2410.10781
98
citations
#313

Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content

Qiuheng Wang, Yukai Shi, Jiarong Ou et al.

CVPR 2025arXiv:2410.08260
97
citations
#314

Motion Prompting: Controlling Video Generation with Motion Trajectories

Daniel Geng, Charles Herrmann, Junhwa Hur et al.

CVPR 2025arXiv:2412.02700
96
citations
#315

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

Kaiyue Sun, Kaiyi Huang, Xian Liu et al.

CVPR 2025arXiv:2407.14505
96
citations
#316

Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction

Xiang Fu, Brandon Wood, Luis Barroso-Luque et al.

ICML 2025oralarXiv:2502.12147
96
citations
#317

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

Haian Jin, Hanwen Jiang, Hao Tan et al.

ICLR 2025arXiv:2410.17242
96
citations
#318

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong, Zuyan Liu, Hai-Long Sun et al.

CVPR 2025highlightarXiv:2411.14432
96
citations
#319

Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

Jianwen Jiang, Chao Liang, Jiaqi Yang et al.

ICLR 2025oralarXiv:2409.02634
95
citations
#320

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Yuang Peng, Yuxin Cui, Haomiao Tang et al.

ICLR 2025arXiv:2406.16855
95
citations
#321

Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

Ryan Liu, Jiayi Geng, Addison J. Wu et al.

ICML 2025arXiv:2410.21333
95
citations
#322

Weak-to-Strong Jailbreaking on Large Language Models

Xuandong Zhao, Xianjun Yang, Tianyu Pang et al.

ICML 2025arXiv:2401.17256
95
citations
#323

CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding

Jiquan Wang, Sha Zhao, Zhiling Luo et al.

ICLR 2025oralarXiv:2412.07236
95
citations
#324

CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning

Peiyuan Liu, Hang Guo, Tao Dai et al.

AAAI 2025paperarXiv:2403.07300
95
citations
#325

Theoretical guarantees on the best-of-n alignment policy

Ahmad Beirami, Alekh Agarwal, Jonathan Berant et al.

ICML 2025arXiv:2401.01879
95
citations
#326

Deconstructing Denoising Diffusion Models for Self-Supervised Learning

Xinlei Chen, Zhuang Liu, Saining Xie et al.

ICLR 2025arXiv:2401.14404
95
citations
#327

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

Jintao Zhang, Jia wei, Pengle Zhang et al.

ICLR 2025arXiv:2410.02367
95
citations
#328

Remarkable Robustness of LLMs: Stages of Inference?

Vedang Lad, Jin Hwa Lee, Wes Gurnee et al.

NEURIPS 2025oralarXiv:2406.19384
95
citations
#329

RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

Yuheng Ji, Huajie Tan, Jiayu Shi et al.

CVPR 2025arXiv:2502.21257
94
citations
#330

ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse, Hugues Sibille, Tony Wu et al.

ICLR 2025arXiv:2407.01449
94
citations
#331

AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation

Yuning Cui, Syed Waqas Zamir, Salman Khan et al.

ICLR 2025arXiv:2403.14614
94
citations
#332

Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs

Minh Nguyen, Andrew Baker, Clement Neo et al.

ICLR 2025arXiv:2407.01082
94
citations
#333

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Dongzhi Jiang, Renrui Zhang, Ziyu Guo et al.

ICML 2025arXiv:2502.09621
94
citations
#334

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Sreyan Ghosh, Arushi Goel, Jaehyeon Kim et al.

NEURIPS 2025spotlightarXiv:2507.08128
94
citations
#335

Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

Yang Tian, Sizhe Yang, Jia Zeng et al.

ICLR 2025arXiv:2412.15109
93
citations
#336

Randomized Autoregressive Visual Generation

Qihang Yu, Ju He, Xueqing Deng et al.

ICCV 2025arXiv:2411.00776
93
citations
#337

TimeMixer++: A General Time Series Pattern Machine for Universal Predictive Analysis

Shiyu Wang, Jiawei LI, Xiaoming Shi et al.

ICLR 2025oralarXiv:2410.16032
93
citations
#338

Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution

Zhiyuan You, Xin Cai, Jinjin Gu et al.

CVPR 2025arXiv:2501.11561
92
citations
#339

Kolmogorov-Arnold Transformer

Xingyi Yang, Xinchao Wang

ICLR 2025arXiv:2409.10594
92
citations
#340

Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing

Bingliang Zhang, Wenda Chu, Julius Berner et al.

CVPR 2025arXiv:2407.01521
92
citations
#341

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

gaojie lin, Jianwen Jiang, Jiaqi Yang et al.

ICCV 2025highlightarXiv:2502.01061
91
citations
#342

OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision

Cong Wei, Zheyang Xiong, Weiming Ren et al.

ICLR 2025arXiv:2411.07199
91
citations
#343

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Jonas Gehring, Kunhao Zheng, Jade Copet et al.

ICML 2025spotlightarXiv:2410.02089
91
citations
#344

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

Chengke Zou, Xingang Guo, Rui Yang et al.

ICLR 2025arXiv:2411.00836
91
citations
#345

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

shaojin wu, Mengqi Huang, wenxu wu et al.

ICCV 2025arXiv:2504.02160
90
citations
#346

OGBench: Benchmarking Offline Goal-Conditioned RL

Seohong Park, Kevin Frans, Benjamin Eysenbach et al.

ICLR 2025arXiv:2410.20092
90
citations
#347

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay et al.

CVPR 2025arXiv:2411.16537
90
citations
#348

Agent Workflow Memory

Zhiruo Wang, Jiayuan Mao, Daniel Fried et al.

ICML 2025arXiv:2409.07429
90
citations
#349

Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-based Decoding

Xiner Li, Yulai Zhao, Chenyu Wang et al.

NEURIPS 2025arXiv:2408.08252
90
citations
#350

RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

Di Liu, Meng Chen, Baotong Lu et al.

NEURIPS 2025arXiv:2409.10516
90
citations
#351

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

Zhihang Lin, Mingbao Lin, Luxi Lin et al.

AAAI 2025paperarXiv:2405.05803
90
citations
#352

Remasking Discrete Diffusion Models with Inference-Time Scaling

Guanghan Wang, Yair Schiff, Subham Sahoo et al.

NEURIPS 2025arXiv:2503.00307
90
citations
#353

Scalable Best-of-N Selection for Large Language Models via Self-Certainty

Zhewei Kang, Xuandong Zhao, Dawn Song

NEURIPS 2025arXiv:2502.18581
89
citations
#354

The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning

Xinyu Zhu, Mengzhou Xia, Zhepei Wei et al.

NEURIPS 2025arXiv:2506.01347
89
citations
#355

Making Text Embedders Few-Shot Learners

Chaofan Li, Minghao Qin, Shitao Xiao et al.

ICLR 2025arXiv:2409.15700
89
citations
#356

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation

Yiyang Ma, Xingchao Liu, Xiaokang Chen et al.

CVPR 2025arXiv:2411.07975
89
citations
#357

Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning

Zhenni Bi, Kai Han, Chuanjian Liu et al.

ICML 2025arXiv:2412.09078
89
citations
#358

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion

Wenqiang Sun, Shuo Chen, Fangfu Liu et al.

ICCV 2025
89
citations
#359

SWE-smith: Scaling Data for Software Engineering Agents

John Yang, Kilian Lieret, Carlos Jimenez et al.

NEURIPS 2025spotlightarXiv:2504.21798
89
citations
#360

Unlocking Guidance for Discrete State-Space Diffusion and Flow Models

Hunter Nisonoff, Junhao Xiong, Stephan Allenspach et al.

ICLR 2025arXiv:2406.01572
88
citations
#361

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

jiarui zhang, Mahyar Khayatkhoei, Prateek Chhikara et al.

ICLR 2025arXiv:2502.17422
88
citations
#362

Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

Sreyan Ghosh, Zhifeng Kong, Sonal Kumar et al.

ICML 2025arXiv:2503.03983
88
citations
#363

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

Colin White, Samuel Dooley, Manley Roberts et al.

ICLR 2025arXiv:2406.19314
88
citations
#364

Learning Temporally Consistent Video Depth from Video Diffusion Priors

Jiahao Shao, Yuanbo Yang, Hongyu Zhou et al.

CVPR 2025arXiv:2406.01493
87
citations
#365

Vision-LSTM: xLSTM as Generic Vision Backbone

Benedikt Alkin, Maximilian Beck, Korbinian Pöppel et al.

ICLR 2025arXiv:2406.04303
87
citations
#366

AnalogCoder: Analog Circuit Design via Training-Free Code Generation

Yao Lai, Sungyoung Lee, Guojin Chen et al.

AAAI 2025paperarXiv:2405.14918
87
citations
#367

Stable Virtual Camera: Generative View Synthesis with Diffusion Models

Jensen Zhou, Hang Gao, Vikram Voleti et al.

ICCV 2025arXiv:2503.14489
87
citations
#368

d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Siyan Zhao, Devaansh Gupta, Qinqing Zheng et al.

NEURIPS 2025spotlightarXiv:2504.12216
87
citations
#369

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want

Weifeng Lin, Xinyu Wei, Ruichuan An et al.

ICLR 2025arXiv:2403.20271
87
citations
#370

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

Rosie Zhao, Alexandru Meterez, Sham M. Kakade et al.

COLM 2025paperarXiv:2504.07912
87
citations
#371

OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation

Mengkang Hu, Yuhang Zhou, Wendong Fan et al.

NEURIPS 2025arXiv:2505.23885
87
citations
#372

DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation

Guosheng Zhao, Chaojun Ni, Xiaofeng Wang et al.

CVPR 2025arXiv:2410.13571
87
citations
#373

GraphRouter: A Graph-based Router for LLM Selections

Tao Feng, Yanzhen Shen, Jiaxuan You

ICLR 2025arXiv:2410.03834
87
citations
#374

General-Reasoner: Advancing LLM Reasoning Across All Domains

Xueguang Ma, Qian Liu, Dongfu Jiang et al.

NEURIPS 2025arXiv:2505.14652
86
citations
#375

MM-EMBED: UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi et al.

ICLR 2025arXiv:2411.02571
86
citations
#376

LMFusion: Adapting Pretrained Language Models for Multimodal Generation

Weijia Shi, Xiaochuang Han, Chunting Zhou et al.

NEURIPS 2025arXiv:2412.15188
86
citations
#377

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

Shihao Wang, Zhiding Yu, Xiaohui Jiang et al.

CVPR 2025arXiv:2504.04348
86
citations
#378

Safety Layers in Aligned Large Language Models: The Key to LLM Security

Shen Li, Liuyi Yao, Lan Zhang et al.

ICLR 2025arXiv:2408.17003
86
citations
#379

Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances

Shilin Lu, Zihan Zhou, Jiayou Lu et al.

ICLR 2025arXiv:2410.18775
86
citations
#380

MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Pengyang Ling, Jiazi Bu, Pan Zhang et al.

ICLR 2025oralarXiv:2406.05338
86
citations
#381

REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers

Xingjian Leng, Jaskirat Singh, Yunzhong Hou et al.

ICCV 2025arXiv:2504.10483
85
citations
#382

MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang et al.

CVPR 2025arXiv:2412.06974
85
citations
#383

Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan et al.

ICLR 2025arXiv:2411.14257
85
citations
#384

AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation

Jiafei Duan, Wilbert Pumacay, Nishanth Kumar et al.

ICLR 2025arXiv:2410.00371
85
citations
#385

Programming Refusal with Conditional Activation Steering

Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy et al.

ICLR 2025arXiv:2409.05907
85
citations
#386

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Xiaojun Jia, Tianyu Pang, Chao Du et al.

ICLR 2025arXiv:2405.21018
85
citations
#387

Point Cloud Mamba: Point Cloud Learning via State Space Model

Tao Zhang, Haobo Yuan, Lu Qi et al.

AAAI 2025paperarXiv:2403.00762
84
citations
#388

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng, Xinyuan Chang, Mengwei Xie et al.

NEURIPS 2025oralarXiv:2505.17685
84
citations
#389

TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

Jingang QU, David Holzmüller, Gael Varoquaux et al.

ICML 2025arXiv:2502.05564
84
citations
#390

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer

Jiahao Cui, Hui Li, Qingkun Su et al.

CVPR 2025arXiv:2412.00733
84
citations
#391

Training-free Camera Control for Video Generation

Chen Hou, Zhibo Chen

ICLR 2025arXiv:2406.10126
84
citations
#392

Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

Jiacheng Ye, Jiahui Gao, Shansan Gong et al.

ICLR 2025arXiv:2410.14157
84
citations
#393

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

Lutfi Erdogan, Hiroki Furuta, Sehoon Kim et al.

ICML 2025arXiv:2503.09572
84
citations
#394

MambaIRv2: Attentive State Space Restoration

Hang Guo, Yong Guo, Yaohua Zha et al.

CVPR 2025arXiv:2411.15269
84
citations
#395

CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models

Hyungjin Chung, Jeongsol Kim, Geon Yeong Park et al.

ICLR 2025arXiv:2406.08070
83
citations
#396

Multi-agent Architecture Search via Agentic Supernet

Guibin Zhang, Luyang Niu, Junfeng Fang et al.

ICML 2025oralarXiv:2502.04180
83
citations
#397

WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration

Yao Zhang, Zijian Ma, Yunpu Ma et al.

AAAI 2025paperarXiv:2408.15978
83
citations
#398

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

Peng Xia, Kangyu Zhu, Haoran Li et al.

ICLR 2025arXiv:2410.13085
83
citations
#399

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Akshara Prabhakar, Zuxin Liu, Ming Zhu et al.

NEURIPS 2025arXiv:2504.03601
82
citations
#400

DepthFM: Fast Generative Monocular Depth Estimation with Flow Matching

Ming Gui, Johannes Schusterbauer, Ulrich Prestel et al.

AAAI 2025paper
82
citations