Most Cited 2025 "stochastic perturbation" Papers

22,274 papers found • Page 4 of 112

#601

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness

Tianyu Yu, Haoye Zhang, Qiming Li et al.

CVPR 2025 • highlight • arXiv:2405.17220 • 60 citations
#602

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Tiansheng Huang, Sihao Hu, Fatih Ilhan et al.

ICLR 2025 • arXiv:2409.01586 • 60 citations
#603

Self-Improvement in Language Models: The Sharpening Mechanism

Audrey Huang, Adam Block, Dylan Foster et al.

ICLR 2025 • arXiv:2412.01951 • 60 citations
#604

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Xubing Ye, Yukang Gan, Xiaoke Huang et al.

CVPR 2025 • arXiv:2406.12275 • 60 citations
#605

Stable Flow: Vital Layers for Training-Free Image Editing

Omri Avrahami, Or Patashnik, Ohad Fried et al.

CVPR 2025 • arXiv:2411.14430 • 60 citations
#606

Repetition Improves Language Model Embeddings

Jacob Springer, Suhas Kotha, Daniel Fried et al.

ICLR 2025 • arXiv:2402.15449 • 60 citations
#607

AutoML-Agent: A Multi-Agent LLM Framework for Full-Pipeline AutoML

Patara Trirat, Wonyong Jeong, Sung Ju Hwang

ICML 2025 • arXiv:2410.02958 • 60 citations
#608

Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

Yu Fu, Zefan Cai, Abedelkadir Asi et al.

ICLR 2025 • arXiv:2410.19258 • 60 citations
#609

Long Context Tuning for Video Generation

Yuwei Guo, Ceyuan Yang, Ziyan Yang et al.

ICCV 2025 • arXiv:2503.10589 • 60 citations
#610

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

Xuemeng Yang, Licheng Wen, Tiantian Wei et al.

ICCV 2025 • arXiv:2408.00415 • 60 citations
#611

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.

NeurIPS 2025 • spotlight • arXiv:2505.13227 • 60 citations
#612

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Wenwen Zhuang, Xin Huang, Xiantao Zhang et al.

AAAI 2025 • paper • arXiv:2408.08640 • 60 citations
#613

Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens

Zhangqi Jiang, Junkai Chen, Beier Zhu et al.

CVPR 2025 • arXiv:2411.16724 • 59 citations
#614

Tell me about yourself: LLMs are aware of their learned behaviors

Jan Betley, Xuchan Bao, Martín Soto et al.

ICLR 2025 • oral • arXiv:2501.11120 • 59 citations
#615

Unlocking the Power of LSTM for Long Term Time Series Forecasting

Yaxuan Kong, Zepu Wang, Yuqi Nie et al.

AAAI 2025 • paper • arXiv:2408.10006 • 59 citations
#616

Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models

Lvmin Zhang, Shengqu Cai, Muyang Li et al.

NeurIPS 2025 • spotlight • arXiv:2504.12626 • 59 citations
#617

Golden Noise for Diffusion Models: A Learning Framework

Zikai Zhou, Shitong Shao, Lichen Bai et al.

ICCV 2025 • arXiv:2411.09502 • 59 citations
#618

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen et al.

NeurIPS 2025 • arXiv:2507.16815 • 59 citations
#619

Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment

Lijie Liu, Tianxiang Ma, Bingchuan Li et al.

ICCV 2025 • highlight • arXiv:2502.11079 • 59 citations
#620

Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

Edward LOO, Tianyu HUANG, Peng Li et al.

CVPR 2025 • highlight • arXiv:2412.03079 • 59 citations
#621

GRIT: Teaching MLLMs to Think with Images

Yue Fan, Xuehai He, Diji Yang et al.

NeurIPS 2025 • arXiv:2505.15879 • 59 citations
#622

Normalizing Flows are Capable Generative Models

Shuangfei Zhai, Ruixiang Zhang, Preetum Nakkiran et al.

ICML 2025 • oral • arXiv:2412.06329 • 59 citations
#623

Learning Multi-Level Features with Matryoshka Sparse Autoencoders

Bart Bussmann, Noa Nabeshima, Adam Karvonen et al.

ICML 2025 • arXiv:2503.17547 • 58 citations
#624

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

Enshen Zhou, Jingkun An, Cheng Chi et al.

NeurIPS 2025 • arXiv:2506.04308 • 58 citations
#625

FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

Zhengyao Lyu, Chenyang Si, Junhao Song et al.

ICLR 2025 • oral • arXiv:2410.19355 • 58 citations
#626

NeuroLM: A Universal Multi-task Foundation Model for Bridging the Gap between Language and EEG Signals

Wei-Bang Jiang, Yansen Wang, Bao-liang Lu et al.

ICLR 2025 • oral • arXiv:2409.00101 • 58 citations
#627

Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation

Mufei Li, Siqi Miao, Pan Li

ICLR 2025 • arXiv:2410.20724 • 58 citations
#628

ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration

Chaojun Ni, Guosheng Zhao, Xiaofeng Wang et al.

CVPR 2025 • arXiv:2411.19548 • 58 citations
#629

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability

Adam Karvonen, Can Rager, Johnny Lin et al.

ICML 2025 • arXiv:2503.09532 • 58 citations
#630

Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

Chen Ziwen, Hao Tan, Kai Zhang et al.

ICCV 2025 • highlight • arXiv:2410.12781 • 58 citations
#631

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Qi Qin, Le Zhuo, Yi Xin et al.

ICCV 2025 • arXiv:2503.21758 • 58 citations
#632

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Xiang Li, Cristina Mata, Jongwoo Park et al.

ICLR 2025 • arXiv:2406.20095 • 58 citations
#633

EdgeRunner: Auto-regressive Auto-encoder for Artistic Mesh Generation

Jiaxiang Tang, Max Li, Zekun Hao et al.

ICLR 2025 • arXiv:2409.18114 • 58 citations
#634

Towards Semantic Equivalence of Tokenization in Multimodal LLM

Shengqiong Wu, Hao Fei, Xiangtai Li et al.

ICLR 2025 • arXiv:2406.05127 • 58 citations
#635

How to Evaluate Reward Models for RLHF

Evan Frick, Tianle Li, Connor Chen et al.

ICLR 2025 • arXiv:2410.14872 • 58 citations
#636

WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild

Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng et al.

CVPR 2025 • arXiv:2409.12259 • 58 citations
#637

What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?

Guangkai Xu, Yongtao Ge, Mingyu Liu et al.

ICLR 2025 • arXiv:2403.06090 • 58 citations
#638

Hymba: A Hybrid-head Architecture for Small Language Models

Xin Dong, Yonggan Fu, Shizhe Diao et al.

ICLR 2025 • arXiv:2411.13676 • 58 citations
#639

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

Junxian Li, Di Zhang, Xunzhi Wang et al.

AAAI 2025 • paper • arXiv:2408.07246 • 58 citations
#640

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Mingyang Chen, Linzhuang Sun, Tianpeng Li et al.

NeurIPS 2025 • arXiv:2503.19470 • 57 citations
#641

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

Yongxin Guo, Jingyu Liu, Mingda Li et al.

AAAI 2025 • paper • arXiv:2405.13382 • 57 citations
#642

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Songhao Han, Wei Huang, Hairong Shi et al.

CVPR 2025 • arXiv:2411.14794 • 57 citations
#643

Physics-Informed Diffusion Models

Jan-Hendrik Bastek, WaiChing Sun, Dennis Kochmann

ICLR 2025 • arXiv:2403.14404 • 57 citations
#644

MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix

Ziyang Ma, Yinghao Ma, Yanqiao Zhu et al.

NeurIPS 2025 • arXiv:2505.13032 • 57 citations
#645

SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos

Yuzheng Liu, Siyan Dong, Shuzhe Wang et al.

CVPR 2025 • highlight • arXiv:2412.09401 • 57 citations
#646

NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation

Xiangyan Liu, Jinjie Ni, Zijian Wu et al.

NeurIPS 2025 • arXiv:2504.13055 • 57 citations
#647

Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers

Lei Chen, Yuan Meng, Chen Tang et al.

CVPR 2025 • arXiv:2406.17343 • 57 citations
#648

NoLiMa: Long-Context Evaluation Beyond Literal Matching

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt et al.

ICML 2025 • arXiv:2502.05167 • 57 citations
#649

Controlling Space and Time with Diffusion Models

Daniel Watson, Saurabh Saxena, Lala Li et al.

ICLR 2025 • arXiv:2407.07860 • 57 citations
#650

DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari et al.

CVPR 2025 • arXiv:2503.02175 • 57 citations
#651

LLM-SR: Scientific Equation Discovery via Programming with Large Language Models

Parshin Shojaee, Kazem Meidani, Shashank Gupta et al.

ICLR 2025 • arXiv:2404.18400 • 57 citations
#652

LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch

Caigao Jiang, Xiang Shu, Hong Qian et al.

ICLR 2025 • arXiv:2410.13213 • 57 citations
#653

Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent

Yangning Li, Yinghui Li, Xinyu Wang et al.

ICLR 2025 • arXiv:2411.02937 • 56 citations
#654

DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving

Wencheng Han, Dongqian Guo, Cheng-Zhong Xu et al.

AAAI 2025 • paper • arXiv:2401.03641 • 56 citations
#655

Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection

Jia Guo, Shuai Lu, Weihang Zhang et al.

CVPR 2025 • arXiv:2405.14325 • 56 citations
#656

RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph

Siru Ouyang, Wenhao Yu, Kaixin Ma et al.

ICLR 2025 • arXiv:2410.14684 • 56 citations
#657

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing

Subhash Kantamneni, Josh Engels, Senthooran Rajamanoharan et al.

ICML 2025 • arXiv:2502.16681 • 56 citations
#658

Towards Interpreting Visual Information Processing in Vision-Language Models

Clement Neo, Luke Ong, Philip Torr et al.

ICLR 2025 • arXiv:2410.07149 • 56 citations
#659

An Illusion of Progress? Assessing the Current State of Web Agents

Tianci Xue, Weijian Qi, Tianneng Shi et al.

COLM 2025 • paper • arXiv:2504.01382 • 56 citations
#660

Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

Seyedmorteza Sadat, Otmar Hilliges, Romann Weber

ICLR 2025 • arXiv:2410.02416 • 56 citations
#661

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.

ICML 2025 • arXiv:2410.01679 • 56 citations
#662

CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

Guy Tevet, Sigal Raab, Setareh Cohan et al.

ICLR 2025 • arXiv:2410.03441 • 56 citations
#663

3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting

Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei et al.

CVPR 2025 • arXiv:2412.12507 • 56 citations
#664

HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos

Prithviraj Banerjee, Sindi Shkodrani, Pierre Moulon et al.

CVPR 2025 • highlight • arXiv:2411.19167 • 56 citations
#665

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

Shuangrui Ding, Rui Qian, Xiaoyi Dong et al.

ICCV 2025 • arXiv:2410.16268 • 56 citations
#666

FlowPolicy: Enabling Fast and Robust 3D Flow-Based Policy via Consistency Flow Matching for Robot Manipulation

Qinglun Zhang, Zhen Liu, Haoqiang Fan et al.

AAAI 2025 • paper • arXiv:2412.04987 • 56 citations
#667

Simplifying Deep Temporal Difference Learning

Matteo Gallici, Mattie Fellows, Benjamin Ellis et al.

ICLR 2025 • oral • arXiv:2407.04811 • 56 citations
#668

VisionTS: Visual Masked Autoencoders Are Free-Lunch Zero-Shot Time Series Forecasters

Mouxiang Chen, Lefei Shen, Zhuo Li et al.

ICML 2025 • arXiv:2408.17253 • 56 citations
#669

TabM: Advancing tabular deep learning with parameter-efficient ensembling

Yury Gorishniy, Akim Kotelnikov, Artem Babenko

ICLR 2025 • arXiv:2410.24210 • 56 citations
#670

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Zhihang Lin, Mingbao Lin, Yuan Xie et al.

NeurIPS 2025 • arXiv:2503.22342 • 56 citations
#671

Wonderland: Navigating 3D Scenes from a Single Image

Hanwen Liang, Junli Cao, Vidit Goel et al.

CVPR 2025 • arXiv:2412.12091 • 55 citations
#672

Model merging with SVD to tie the Knots

George Stoica, Pratik Ramesh, Boglarka Ecsedi et al.

ICLR 2025 • arXiv:2410.19735 • 55 citations
#673

EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation

Rang Meng, Xingyu Zhang, Yuming Li et al.

CVPR 2025 • arXiv:2411.10061 • 55 citations
#674

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Tao Wu, Yong Zhang, Xintao Wang et al.

AAAI 2025 • paper • arXiv:2408.13239 • 55 citations
#675

SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints

Jianhong Bai, Menghan Xia, Xintao WANG et al.

ICLR 2025 • arXiv:2412.07760 • 55 citations
#676

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Orr Zohar, Xiaohan Wang, Yann Dubois et al.

CVPR 2025 • arXiv:2412.10360 • 55 citations
#677

GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving

Zebin Xing, Xingyu Zhang, Yang Hu et al.

CVPR 2025 • arXiv:2503.05689 • 55 citations
#678

Multiple Object Tracking as ID Prediction

Ruopeng Gao, Ji Qi, Limin Wang

CVPR 2025 • arXiv:2403.16848 • 55 citations
#679

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Kaijing Ma, Xeron Du, Yunran Wang et al.

ICLR 2025 • arXiv:2410.06526 • 55 citations
#680

Energy-Based Diffusion Language Models for Text Generation

Minkai Xu, Tomas Geffner, Karsten Kreis et al.

ICLR 2025 • arXiv:2410.21357 • 55 citations
#681

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang et al.

ICLR 2025 • arXiv:2410.07087 • 55 citations
#682

End-to-End Driving with Online Trajectory Evaluation via BEV World Model

Yingyan Li, Yuqi Wang, Yang Liu et al.

ICCV 2025 • arXiv:2504.01941 • 55 citations
#683

LLM Unlearning via Loss Adjustment with Only Forget Data

Yaxuan Wang, Jiaheng Wei, Yuhao Liu et al.

ICLR 2025 • arXiv:2410.11143 • 55 citations
#684

Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization

Audrey Huang, Wenhao Zhan, Tengyang Xie et al.

ICLR 2025 • arXiv:2407.13399 • 54 citations
#685

AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

Yiheng Xu, Dunjie Lu, Zhennan Shen et al.

ICLR 2025 • arXiv:2412.09605 • 54 citations
#686

EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

Wangbo Yu, Chaoran Feng, Jianing Li et al.

ICCV 2025 • arXiv:2405.20224 • 54 citations
#687

DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agent

Taiyi Wang, Zhihao Wu, Jianheng Liu et al.

ICLR 2025 • arXiv:2410.14803 • 54 citations
#688

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori et al.

NeurIPS 2025 • arXiv:2504.18575 • 54 citations
#689

LoRA-Pro: Are Low-Rank Adapters Properly Optimized?

Zhengbo Wang, Jian Liang, Ran He et al.

ICLR 2025 • arXiv:2407.18242 • 54 citations
#690

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

Chao Pang, Xingxing Weng, Jiang Wu et al.

AAAI 2025 • paper • arXiv:2403.20213 • 54 citations
#691

SEAL: Safety-enhanced Aligned LLM Fine-tuning via Bilevel Data Selection

Han Shen, Pin-Yu Chen, Payel Das et al.

ICLR 2025 • arXiv:2410.07471 • 54 citations
#692

TRACE: Temporal Grounding Video LLM via Causal Event Modeling

Yongxin Guo, Jingyu Liu, Mingda Li et al.

ICLR 2025 • oral • arXiv:2410.05643 • 54 citations
#693

Goku: Flow Based Video Generative Foundation Models

Shoufa Chen, Chongjian GE, Yuqi Zhang et al.

CVPR 2025 • highlight • arXiv:2502.04896 • 54 citations
#694

MindSearch: Mimicking Human Minds Elicits Deep AI Searcher

Zehui Chen, Kuikun Liu, Qiuchen Wang et al.

ICLR 2025 • arXiv:2407.20183 • 54 citations
#695

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

Junfei Wu, Jian Guan, Kaituo Feng et al.

NeurIPS 2025 • arXiv:2506.09965 • 54 citations
#696

Inference Scaling for Long-Context Retrieval Augmented Generation

Zhenrui Yue, Honglei Zhuang, Aijun Bai et al.

ICLR 2025 • arXiv:2410.04343 • 54 citations
#697

MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation

Chenxi Wang, Xiang Chen, Ningyu Zhang et al.

ICLR 2025 • arXiv:2410.11779 • 54 citations
#698

Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking

Heli Ben-Hamu, Itai Gat, Daniel Severo et al.

NeurIPS 2025 • arXiv:2505.24857 • 54 citations
#699

Think-on-Graph 2.0: Deep and Faithful Large Language Model Reasoning with Knowledge-guided Retrieval Augmented Generation

Shengjie Ma, Chengjin Xu, Xuhui Jiang et al.

ICLR 2025 • arXiv:2407.10805 • 54 citations
#700

BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.

ICLR 2025 • arXiv:2407.14622 • 53 citations
#701

Organize the Web: Constructing Domains Enhances Pre-Training Data Curation

Alexander Wettig, Kyle Lo, Sewon Min et al.

ICML 2025 • arXiv:2502.10341 • 53 citations
#702

Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Guanting Dong, Keming Lu, Chengpeng Li et al.

ICLR 2025 • arXiv:2406.13542 • 53 citations
#703

MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling

Yifang Men, Yuan Yao, Miaomiao Cui et al.

CVPR 2025 • arXiv:2409.16160 • 53 citations
#704

Arbitrary-steps Image Super-resolution via Diffusion Inversion

Zongsheng Yue, Kang Liao, Chen Change Loy

CVPR 2025 • arXiv:2412.09013 • 53 citations
#705

Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

Sucheng Ren, Qihang Yu, Ju He et al.

ICCV 2025 • arXiv:2502.20388 • 53 citations
#706

WorldMem: Long-term Consistent World Simulation with Memory

Zeqi Xiao, Yushi LAN, Yifan Zhou et al.

NeurIPS 2025 • oral • arXiv:2504.12369 • 53 citations
#707

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Noam Razin, Zixuan Wang, Hubert Strauss et al.

NeurIPS 2025 • spotlight • arXiv:2503.15477 • 53 citations
#708

RADIOv2.5: Improved Baselines for Agglomerative Vision Foundation Models

Greg Heinrich, Mike Ranzinger, Danny Yin et al.

CVPR 2025 • arXiv:2412.07679 • 53 citations
#709

NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer

Meng YOU, Zhiyu Zhu, Hui LIU et al.

ICLR 2025 • arXiv:2405.15364 • 53 citations
#710

TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining

Wanchao Liang, Tianyu Liu, Less Wright et al.

ICLR 2025 • 53 citations
#711

A Decade's Battle on Dataset Bias: Are We There Yet?

Zhuang Liu, Kaiming He

ICLR 2025 • arXiv:2403.08632 • 53 citations
#712

Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better

Danny Driess, Jost Springenberg, Brian Ichter et al.

NeurIPS 2025 • spotlight • arXiv:2505.23705 • 53 citations
#713

DPLM-2: A Multimodal Diffusion Protein Language Model

Xinyou Wang, Zaixiang Zheng, Fei YE et al.

ICLR 2025 • arXiv:2410.13782 • 53 citations
#714

Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

SHEN FEI, Cong Wang, Junyao Gao et al.

ICML 2025 • oral • arXiv:2502.09533 • 53 citations
#715

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Yikun Liu, Yajie Zhang, Jiayin Cai et al.

CVPR 2025 • arXiv:2412.01720 • 53 citations
#716

OmniBench: Towards The Future of Universal Omni-Language Models

Yizhi Li, Ge Zhang, Yinghao Ma et al.

NeurIPS 2025 • arXiv:2409.15272 • 53 citations
#717

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025 • arXiv:2504.16072 • 53 citations
#718

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Xiang Yue, Yueqi Song, Akari Asai et al.

ICLR 2025 • arXiv:2410.16153 • 53 citations
#719

Dual Diffusion for Unified Image Generation and Understanding

Zijie Li, Henry Li, Yichun Shi et al.

CVPR 2025 • arXiv:2501.00289 • 52 citations
#720

OpenCodeReasoning: Advancing Data Distillation for Competitive Coding

Wasi Uddin Ahmad, Sean Narenthiran, Somshubra Majumdar et al.

COLM 2025 • paper • 52 citations
#721

How Two-Layer Neural Networks Learn, One (Giant) Step at a Time

Yatin Dandi, Florent Krzakala, Bruno Loureiro et al.

ICLR 2025 • arXiv:2305.18270 • 52 citations
#722

An Undetectable Watermark for Generative Image Models

Samuel Gunn, Xuandong Zhao, Dawn Song

ICLR 2025 • arXiv:2410.07369 • 52 citations
#723

HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models

Hayk Manukyan, Andranik Sargsyan, Barsegh Atanyan et al.

ICLR 2025 • arXiv:2312.14091 • 52 citations
#724

Calibrating Large Language Models with Sample Consistency

Qing Lyu, Kumar Shridhar, Chaitanya Malaviya et al.

AAAI 2025 • paper • arXiv:2402.13904 • 52 citations
#725

Timer-XL: Long-Context Transformers for Unified Time Series Forecasting

Yong Liu, Guo Qin, Xiangdong Huang et al.

ICLR 2025 • oral • arXiv:2410.04803 • 52 citations
#726

MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Ruiyuan Gao, Kai Chen, Bo Xiao et al.

ICCV 2025 • arXiv:2411.13807 • 52 citations
#727

Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

Zhenting Qi, Hanlin Zhang, Eric P Xing et al.

ICLR 2025 • arXiv:2402.17840 • 52 citations
#728

Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning

Andy (DiJia) Su, Hanlin Zhu, Yingchen Xu et al.

ICML 2025 • arXiv:2502.03275 • 52 citations
#729

S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models

Muzhi Dai, Chenxu Yang, Qingyi Si

NeurIPS 2025 • oral • arXiv:2505.07686 • 52 citations
#730

Masked Autoencoders Are Effective Tokenizers for Diffusion Models

Hao Chen, Yujin Han, Fangyi Chen et al.

ICML 2025 • spotlight • arXiv:2502.03444 • 52 citations
#731

Bootstrapping Language Models with DPO Implicit Rewards

Changyu Chen, Zichen Liu, Chao Du et al.

ICLR 2025 • arXiv:2406.09760 • 51 citations
#732

TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers

Chuanrui Zhang, Yingshuang Zou, Zhuoling Li et al.

AAAI 2025 • paper • arXiv:2408.13770 • 51 citations
#733

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM’s Reasoning Capability

Zicheng Lin, Tian Liang, Jiahao Xu et al.

ICML 2025 • arXiv:2411.19943 • 51 citations
#734

RNNs are not Transformers (Yet): The Key Bottleneck on In-Context Retrieval

Kaiyue Wen, Xingyu Dang, Kaifeng Lyu

ICLR 2025 • arXiv:2402.18510 • 51 citations
#735

Language Model Can Listen While Speaking

Ziyang Ma, Yakun Song, Chenpeng Du et al.

AAAI 2025 • paper • arXiv:2408.02622 • 51 citations
#736

DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang

ICCV 2025 • arXiv:2412.18607 • 51 citations
#737

Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

Siyan Zhao, Mingyi Hong, Yang Liu et al.

ICLR 2025 • arXiv:2502.09597 • 51 citations
#738

Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

Yongliang Wu, Shiji Zhou, Mingzhuo Yang et al.

AAAI 2025 • paper • arXiv:2405.15304 • 51 citations
#739

StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization

Zhuoqun Li, Xuanang Chen, Haiyang Yu et al.

ICLR 2025 • arXiv:2410.08815 • 51 citations
#740

An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels

Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain et al.

ICLR 2025 • arXiv:2406.09415 • 51 citations
#741

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Noam Razin, Sadhika Malladi, Adithya Bhaskar et al.

ICLR 2025 • arXiv:2410.08847 • 51 citations
#742

FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language

Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec et al.

COLM 2025 • paper • arXiv:2506.20920 • 51 citations
#743

EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling

Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris et al.

ICML 2025 • arXiv:2502.09509 • 51 citations
#744

G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks

Guibin Zhang, Yanwei Yue, Xiangguo Sun et al.

ICML 2025 • spotlight • arXiv:2410.11782 • 51 citations
#745

NETS: A Non-equilibrium Transport Sampler

Michael Albergo, Eric Vanden-Eijnden

ICML 2025 • arXiv:2410.02711 • 51 citations
#746

Sonic: Shifting Focus to Global Audio Perception in Portrait Animation

Xiaozhong Ji, Xiaobin Hu, Zhihong Xu et al.

CVPR 2025 • arXiv:2411.16331 • 51 citations
#747

FlexTok: Resampling Images into 1D Token Sequences of Flexible Length

Roman Bachmann, Jesse Allardice, David Mizrahi et al.

ICML 2025 • arXiv:2502.13967 • 51 citations
#748

SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment

Katrin Renz, Long Chen, Elahe Arani et al.

CVPR 2025 • highlight • arXiv:2503.09594 • 51 citations
#749

BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments

Yusuf Roohani, Andrew Lee, Qian Huang et al.

ICLR 2025 • arXiv:2405.17631 • 51 citations
#750

Does Spatial Cognition Emerge in Frontier Models?

Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Krähenbühl et al.

ICLR 2025 • arXiv:2410.06468 • 51 citations
#751

Lean-STaR: Learning to Interleave Thinking and Proving

Haohan Lin, Zhiqing Sun, Sean Welleck et al.

ICLR 2025 • arXiv:2407.10040 • 51 citations
#752

Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design

Zhi Zheng, Zhuoliang Xie, Zhenkun Wang et al.

ICML 2025 • arXiv:2501.08603 • 50 citations
#753

Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging

Chongjie Ye, Yushuang Wu, Ziteng Lu et al.

ICCV 2025 • arXiv:2503.22236 • 50 citations
#754

Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

Tian Ye, Zicheng Xu, Yuanzhi Li et al.

ICLR 2025 • arXiv:2408.16293 • 50 citations
#755

RRM: Robust Reward Model Training Mitigates Reward Hacking

Tianqi Liu, Wei Xiong, Jie Ren et al.

ICLR 2025 • arXiv:2409.13156 • 50 citations
#756

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel et al.

ICCV 2025 • arXiv:2504.11171 • 50 citations
#757

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

Han Lin, Jaemin Cho, Abhay Zala et al.

ICLR 2025 • oral • arXiv:2404.09967 • 50 citations
#758

Learn Your Reference Model for Real Good Alignment

Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov et al.

ICLR 2025 • arXiv:2404.09656 • 50 citations
#759

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

Hang Hua, Yunlong Tang, Chenliang Xu et al.

AAAI 2025 • paper • arXiv:2404.12353 • 50 citations
#760

Aether: Geometric-Aware Unified World Modeling

Haoyi Zhu, Yifan Wang, Jianjun Zhou et al.

ICCV 2025 • arXiv:2503.18945 • 50 citations
#761

LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Fan-Yun Sun, Weiyu Liu, Siyi Gu et al.

CVPR 2025 • arXiv:2412.02193 • 50 citations
#762

All-atom Diffusion Transformers: Unified generative modelling of molecules and materials

Chaitanya Joshi, Xiang Fu, Yi-Lun Liao et al.

ICML 2025 • arXiv:2503.03965 • 50 citations
#763

MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation

Jinfeng Xu, Zheyu Chen, Shuo Yang et al.

AAAI 2025 • paper • arXiv:2402.19407 • 50 citations
#764

Eliminating Position Bias of Language Models: A Mechanistic Approach

Ziqi Wang, Hanlin Zhang, Xiner Li et al.

ICLR 2025 • arXiv:2407.01100 • 50 citations
#765

Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow

Fu-Yun Wang, Ling Yang, Zhaoyang Huang et al.

ICLR 2025 • arXiv:2410.07303 • 50 citations
#766

PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks

Matthew Chang, Gunjan Chhablani, Alexander Clegg et al.

ICLR 2025 • oral • arXiv:2411.00081 • 50 citations
#767

GAM-Agent: Game-Theoretic and Uncertainty-Aware Collaboration for Complex Visual Reasoning

Jusheng Zhang, Yijia Fan, Wenjun Lin et al.

NeurIPS 2025 • arXiv:2505.23399 • 50 citations
#768

AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Yang Chen, Zhuolin Yang, Zihan Liu et al.

NeurIPS 2025 • arXiv:2505.16400 • 50 citations
#769

Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation

Haotong Lin, Sida Peng, Jingxiao Chen et al.

CVPR 2025 • arXiv:2412.14015 • 49 citations
#770

Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders

Rui Chen, Jianfeng Zhang, Yixun Liang et al.

CVPR 2025 • arXiv:2412.17808 • 49 citations
#771

MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs

Xuannan Liu, Zekun Li, Pei Li et al.

ICLR 2025 • arXiv:2406.08772 • 49 citations
#772

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang, Ziheng Wang, Boshen Xu et al.

NeurIPS 2025 • oral • arXiv:2503.13377 • 49 citations
#773

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Qingyun Li, Zhe Chen, Weiyun Wang et al.

ICLR 2025 • arXiv:2406.08418 • 49 citations
#774

Catastrophic Failure of LLM Unlearning via Quantization

Zhiwei Zhang, Fali Wang, Xiaomin Li et al.

ICLR 2025 • arXiv:2410.16454 • 49 citations
#775

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Xingyu Fu, Minqian Liu, Zhengyuan Yang et al.

ICML 2025 • arXiv:2501.05452 • 49 citations
#776

Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Yinlam Chow, Guy Tennenholtz, Izzeddin Gur et al.

ICLR 2025 • arXiv:2412.15287 • 49 citations
#777

Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

Yongxin Zhu, Bocheng Li, Yifei Xin et al.

ICCV 2025 • arXiv:2411.02038 • 49 citations
#778

LLM Generated Persona is a Promise with a Catch

Leon Li, Haozhe Chen, Hongseok Namkoong et al.

NeurIPS 2025 • arXiv:2503.16527 • 49 citations
#779

Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach

Lingchen Sun, Rongyuan Wu, Zhiyuan Ma et al.

CVPR 2025 • arXiv:2412.03017 • 49 citations
#780

Atom of Thoughts for Markov LLM Test-Time Scaling

Fengwei Teng, Quan Shi, Zhaoyang Yu et al.

NeurIPS 2025 • arXiv:2502.12018 • 49 citations
#781

X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

Salman Rahman, Liwei Jiang, James Shiffer et al.

COLM 2025 • paper • arXiv:2504.13203 • 49 citations
#782

Vision Language Models are In-Context Value Learners

Yecheng Jason Ma, Joey Hejna, Chuyuan Fu et al.

ICLR 2025 • oral • arXiv:2411.04549 • 49 citations
#783

ALLaM: Large Language Models for Arabic and English

M Saiful Bari, Yazeed Alnumay, Norah Alzahrani et al.

ICLR 2025 • arXiv:2407.15390 • 49 citations
#784

SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models

Arijit Ray, Jiafei Duan, Ellis L Brown II et al.

COLM 2025 • paper • arXiv:2412.07755 • 49 citations
#785

Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation

Yuyang Ye, Zhi Zheng, Yishan Shen et al.

AAAI 2025 • paper • arXiv:2408.09698 • 49 citations
#786

SPA-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation

Jingxuan Chen, Derek Yuen, Bin Xie et al.

ICLR 2025 • arXiv:2410.15164 • 49 citations
#787

Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues

Riccardo Grazzi, Julien Siems, Arber Zela et al.

ICLR 2025 • arXiv:2411.12537 • 49 citations
#788

NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions

Weizhe Yuan, Jane Yu, Song Jiang et al.

NeurIPS 2025 • arXiv:2502.13124 • 49 citations
#789

Learning Adaptive Parallel Reasoning with Language Models

Jiayi Pan, Xiuyu Li, Long Lian et al.

COLM 2025 • paper • arXiv:2504.15466 • 49 citations
#790

EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers

Daiheng Gao, Shilin Lu, Wenbo Zhou et al.

ICML 2025 • arXiv:2412.20413 • 49 citations
#791

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Yibin Wang, Li Zhimin, Yuhang Zang et al.

NeurIPS 2025 • arXiv:2505.03318 • 49 citations
#792

You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma, Huachen Gao, Haoge Deng et al.

CVPR 2025 • highlight • arXiv:2412.06699 • 49 citations
#793

The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control

Ruili Feng, Han Zhang, Zhilei Shu et al.

NeurIPS 2025 • arXiv:2412.03568 • 48 citations
#794

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen, Yunhao Gou, Runhui Huang et al.

CVPR 2025 • arXiv:2409.18042 • 48 citations
#795

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Fanqing Meng, Jin Wang, Chuanhao Li et al.

ICLR 2025 • arXiv:2408.02718 • 48 citations
#796

STAIR: Improving Safety Alignment with Introspective Reasoning

Yichi Zhang, Siyuan Zhang, Yao Huang et al.

ICML 2025 • oral • arXiv:2502.02384 • 48 citations
#797

Scaling Mesh Generation via Compressive Tokenization

Haohan Weng, Zibo Zhao, Biwen Lei et al.

CVPR 2025 • arXiv:2411.07025 • 48 citations
#798

Affordances-Oriented Planning Using Foundation Models for Continuous Vision-Language Navigation

Jiaqi Chen, Bingqian Lin, Xinmin Liu et al.

AAAI 2025 • paper • arXiv:2407.05890 • 48 citations
#799

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models

Keisuke Kamahori, Tian Tang, Yile Gu et al.

ICLR 2025 • arXiv:2402.07033 • 48 citations
#800

Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Mingfei Han, Linjie Yang, Xiaojun Chang et al.

ICLR 2025 • arXiv:2312.10300 • 48 citations