Most Cited 2025 "image-text interleaved architecture" Papers

22,274 papers found • Page 3 of 112

#401

Soft Merging of Experts with Adaptive Routing

Haokun Liu, Muqeeth Mohammed, Colin Raffel

ICLR 2025arXiv:2306.03745
82
citations
#402

Diffusion-Based Planning for Autonomous Driving with Flexible Guidance

Yinan Zheng, Ruiming Liang, Kexin ZHENG et al.

ICLR 2025arXiv:2501.15564
82
citations
#403

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian et al.

CVPR 2025arXiv:2411.18673
82
citations
#404

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Akshara Prabhakar, Zuxin Liu, Ming Zhu et al.

NEURIPS 2025arXiv:2504.03601
82
citations
#405

Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving

Yong Lin, Shange Tang, Bohan Lyu et al.

COLM 2025paperarXiv:2502.07640
82
citations
#406

Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

Chongyu Fan, Jiancheng Liu, Licong Lin et al.

NEURIPS 2025arXiv:2410.07163
81
citations
#407

Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models

Wenbin Wang, Liang Ding, Minyan Zeng et al.

AAAI 2025paperarXiv:2408.15556
81
citations
#408

Dissecting Adversarial Robustness of Multimodal LM Agents

Chen Wu, Rishi Shah, Jing Yu Koh et al.

ICLR 2025arXiv:2406.12814
81
citations
#409

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Xiyao Wang, Zhengyuan Yang, Chao Feng et al.

NEURIPS 2025spotlightarXiv:2504.07934
81
citations
#410

D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement

Yansong Peng, Hebei Li, Peixi Wu et al.

ICLR 2025arXiv:2410.13842
81
citations
#411

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

David Chanin, James Wilken-Smith, Tomáš Dulka et al.

NEURIPS 2025oralarXiv:2409.14507
81
citations
#412

Real-Time Video Generation with Pyramid Attention Broadcast

Xuanlei Zhao, Xiaolong Jin, Kai Wang et al.

ICLR 2025arXiv:2408.12588
81
citations
#413

Mamba YOLO: A Simple Baseline for Object Detection with State Space Model

Zeyu Wang, Chen Li, Huiying Xu et al.

AAAI 2025paperarXiv:2406.05835
80
citations
#414

Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data

Xinyi Wang, Antonis Antoniades, Yanai Elazar et al.

ICLR 2025arXiv:2407.14985
80
citations
#415

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Jiahao Cui, Hui Li, Yao Yao et al.

ICLR 2025oralarXiv:2410.07718
80
citations
#416

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua et al.

ICLR 2025arXiv:2502.13595
80
citations
#417

Improving Instruction-Following in Language Models through Activation Steering

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi et al.

ICLR 2025arXiv:2410.12877
80
citations
#418

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

Renrui Zhang, Xinyu Wei, Dongzhi Jiang et al.

ICLR 2025arXiv:2407.08739
79
citations
#419

Reasoning with Latent Thoughts: On the Power of Looped Transformers

Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li et al.

ICLR 2025arXiv:2502.17416
79
citations
#420

Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar et al.

ICLR 2025arXiv:2403.08540
79
citations
#421

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Tengyang Xie, Dylan Foster, Akshay Krishnamurthy et al.

ICLR 2025arXiv:2405.21046
79
citations
#422

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa et al.

CVPR 2025arXiv:2412.15322
79
citations
#423

Eliciting Human Preferences with Language Models

Belinda Li, Alex Tamkin, Noah Goodman et al.

ICLR 2025oralarXiv:2310.11589
79
citations
#424

HVI: A New Color Space for Low-light Image Enhancement

Qingsen Yan, Yixu Feng, Cheng Zhang et al.

CVPR 2025arXiv:2502.20272
79
citations
#425

UniTok: a Unified Tokenizer for Visual Generation and Understanding

Chuofan Ma, Yi Jiang, Junfeng Wu et al.

NEURIPS 2025spotlightarXiv:2502.20321
79
citations
#426

dKV-Cache: The Cache for Diffusion Language Models

Xinyin Ma, Runpeng Yu, Gongfan Fang et al.

NEURIPS 2025arXiv:2505.15781
79
citations
#427

ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data

Chengsen Wang, Qi Qi, Jingyu Wang et al.

AAAI 2025paperarXiv:2412.11376
78
citations
#428

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Yilun Zhao, Lujing Xie, Haowei Zhang et al.

CVPR 2025arXiv:2501.12380
78
citations
#429

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025arXiv:2409.12822
78
citations
#430

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Zilong (Ryan) Wang, Zifeng Wang, Long Le et al.

ICLR 2025arXiv:2407.08223
78
citations
#431

RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen et al.

ICLR 2025
78
citations
#432

Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

Qingyang Zhang, Haitao Wu, Changqing Zhang et al.

NEURIPS 2025spotlightarXiv:2504.05812
78
citations
#433

VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool

Chia-Tung Ho, Haoxing Ren, Brucek Khailany

AAAI 2025paperarXiv:2408.08927
78
citations
#434

Multimodal Autoregressive Pre-training of Large Vision Encoders

Enrico Fini, Mustafa Shukor, Xiujun Li et al.

CVPR 2025highlightarXiv:2411.14402
77
citations
#435

Enhancing End-to-End Autonomous Driving with Latent World Model

Yingyan Li, Lue Fan, Jiawei He et al.

ICLR 2025arXiv:2406.08481
77
citations
#436

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

Baohao Liao, Yuhui Xu, Hanze Dong et al.

ICML 2025arXiv:2501.19324
77
citations
#437

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng et al.

NEURIPS 2025oralarXiv:2505.06708
77
citations
#438

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Xi Chen, Zhifei Zhang, He Zhang et al.

CVPR 2025highlightarXiv:2412.07774
77
citations
#439

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung et al.

NEURIPS 2025spotlightarXiv:2505.23747
77
citations
#440

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Zhipei Xu, Xuanyu Zhang, Runyi Li et al.

ICLR 2025arXiv:2410.02761
77
citations
#441

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

ICCV 2025arXiv:2412.03632
76
citations
#442

Fine-tuning can cripple your foundation model; preserving features may be the solution

Philip Torr, Puneet Dokania, Jishnu Mukhoti et al.

ICLR 2025arXiv:2308.13320
76
citations
#443

Round and Round We Go! What makes Rotary Positional Encodings useful?

Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos et al.

ICLR 2025arXiv:2410.06205
76
citations
#444

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.

ICML 2025arXiv:2410.05363
76
citations
#445

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Bahare Fatemi, Seyed Mehran Kazemi, Anton Tsitsulin et al.

ICLR 2025oralarXiv:2406.09170
76
citations
#446

Simple Guidance Mechanisms for Discrete Diffusion Models

Yair Schiff, Subham Sahoo, Hao Phung et al.

ICLR 2025arXiv:2412.10193
76
citations
#447

InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales

Zhepei Wei, Wei-Lin Chen, Yu Meng

ICLR 2025arXiv:2406.13629
76
citations
#448

RWKV-7 "Goose" with Expressive Dynamic State Evolution

Bo Peng, Ruichong Zhang, Daniel Goldstein et al.

COLM 2025paper
76
citations
#449

Sundial: A Family of Highly Capable Time Series Foundation Models

Yong Liu, Guo Qin, Zhiyuan Shi et al.

ICML 2025oralarXiv:2502.00816
76
citations
#450

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

Sang Choe, Hwijeen Ahn, Juhan Bae et al.

NEURIPS 2025arXiv:2405.13954
76
citations
#451

HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation

Yi Li, Yuquan Deng, Jesse Zhang et al.

ICLR 2025arXiv:2502.05485
75
citations
#452

MaskBit: Embedding-free Image Generation via Bit Tokens

Mark Weber, Lijun Yu, Qihang Yu et al.

ICLR 2025arXiv:2409.16211
75
citations
#453

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Zhen Ye, Peiwen Sun, Jiahe Lei et al.

AAAI 2025paperarXiv:2408.17175
75
citations
#454

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song et al.

ICCV 2025arXiv:2503.07027
75
citations
#455

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Zewei Zhou, Tianhui Cai, Seth Zhao et al.

NEURIPS 2025arXiv:2506.13757
75
citations
#456

Diffusion Adversarial Post-Training for One-Step Video Generation

Shanchuan Lin, Xin Xia, Yuxi Ren et al.

ICML 2025arXiv:2501.08316
75
citations
#457

UMA: A Family of Universal Models for Atoms

Brandon Wood, Misko Dzamba, Xiang Fu et al.

NEURIPS 2025spotlightarXiv:2506.23971
75
citations
#458

VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Runtao Liu, Haoyu Wu, Zheng Ziqiang et al.

CVPR 2025arXiv:2412.14167
75
citations
#459

History-Guided Video Diffusion

Kiwhan Song, Boyuan Chen, Max Simchowitz et al.

ICML 2025oralarXiv:2502.06764
75
citations
#460

Cradle: Empowering Foundation Agents towards General Computer Control

Weihao Tan, Wentao Zhang, Xinrun Xu et al.

ICML 2025arXiv:2403.03186
75
citations
#461

BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward et al.

ICLR 2025arXiv:2411.13543
74
citations
#462

The Geometry of Categorical and Hierarchical Concepts in Large Language Models

Kiho Park, Yo Joong Choe, Yibo Jiang et al.

ICLR 2025arXiv:2406.01506
74
citations
#463

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.

ICCV 2025arXiv:2501.04003
74
citations
#464

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

shiduo zhang, Zhe Xu, Peiju Liu et al.

ICCV 2025arXiv:2412.18194
74
citations
#465

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

Wei Chow, Jiageng Mao, Boyi Li et al.

ICLR 2025arXiv:2501.16411
74
citations
#466

Adaptive Keyframe Sampling for Long Video Understanding

Xi Tang, Jihao Qiu, Lingxi Xie et al.

CVPR 2025arXiv:2502.21271
73
citations
#467

Planning in Natural Language Improves LLM Search for Code Generation

Evan Wang, Federico Cassano, Catherine Wu et al.

ICLR 2025arXiv:2409.03733
73
citations
#468

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Yunfei Xie, Ce Zhou, Lang Gao et al.

ICLR 2025arXiv:2408.02900
73
citations
#469

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu et al.

COLM 2025paperarXiv:2504.00906
73
citations
#470

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Yongdong Luo, Xiawu Zheng, Guilin Li et al.

NEURIPS 2025arXiv:2411.13093
73
citations
#471

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine et al.

ICML 2025spotlightarXiv:2502.12118
73
citations
#472

FreDF: Learning to Forecast in the Frequency Domain

Hao Wang, Lichen Pan, Yuan Shen et al.

ICLR 2025arXiv:2402.02399
73
citations
#473

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu, Zhebin Kuang, Jiajun Song et al.

NEURIPS 2025arXiv:2501.00321
73
citations
#474

Offline Actor-Critic for Average Reward MDPs

William Powell, Jeongyeol Kwon, Qiaomin Xie et al.

NEURIPS 2025
73
citations
#475

DiT4Edit: Diffusion Transformer for Image Editing

Kunyu Feng, Yue Ma, Bingyuan Wang et al.

AAAI 2025paperarXiv:2411.03286
73
citations
#476

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang, Sicheng Xu, Yue Dong et al.

NEURIPS 2025arXiv:2507.02546
72
citations
#477

MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization

Yiwen Chen, Yikai Wang, Yihao Luo et al.

ICCV 2025arXiv:2408.02555
72
citations
#478

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

Zachary Ankner, Cody Blakeney, Kartik Sreenivasan et al.

ICLR 2025arXiv:2405.20541
72
citations
#479

Training Deep Learning Models with Norm-Constrained LMOs

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos et al.

ICML 2025spotlightarXiv:2502.07529
72
citations
#480

Improving Text-to-Image Consistency via Automatic Prompt Optimization

Melissa Hall, Michal Drozdzal, Oscar Mañas et al.

ICLR 2025arXiv:2403.17804
71
citations
#481

XAttention: Block Sparse Attention with Antidiagonal Scoring

Ruyi Xu, Guangxuan Xiao, Haofeng Huang et al.

ICML 2025arXiv:2503.16428
71
citations
#482

Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

Xu Liu, Juncheng Liu, Gerald Woo et al.

ICML 2025arXiv:2410.10469
71
citations
#483

FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering

Guofeng Feng, Siyan Chen, Rong Fu et al.

CVPR 2025arXiv:2408.07967
71
citations
#484

A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao et al.

COLM 2025paperarXiv:2504.07086
71
citations
#485

Weak to Strong Generalization for Large Language Models with Multi-capabilities

Yucheng Zhou, Jianbing Shen, Yu Cheng

ICLR 2025
70
citations
#486

Thinkless: LLM Learns When to Think

Gongfan Fang, Xinyin Ma, Xinchao Wang

NEURIPS 2025arXiv:2505.13379
70
citations
#487

SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Antonis Antoniades, Albert Örwall, Kexun Zhang et al.

ICLR 2025arXiv:2410.20285
70
citations
#488

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Duo Zheng, Shijia Huang, Liwei Wang

CVPR 2025arXiv:2412.00493
70
citations
#489

DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving

Xiaosong Jia, Junqi You, Zhiyuan Zhang et al.

ICLR 2025oralarXiv:2503.07656
70
citations
#490

Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics

Yaniv Nikankin, Anja Reusch, Aaron Mueller et al.

ICLR 2025arXiv:2410.21272
70
citations
#491

GameGen-X: Interactive Open-world Game Video Generation

Haoxuan Che, Xuanhua He, Quande Liu et al.

ICLR 2025arXiv:2411.00769
70
citations
#492

GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu, Yiran Qin, Xintao Wang et al.

ICCV 2025highlightarXiv:2501.08325
70
citations
#493

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Xiao Liu, Tianjie Zhang, Yu Gu et al.

ICLR 2025arXiv:2408.06327
70
citations
#494

LoRA vs Full Fine-tuning: An Illusion of Equivalence

Reece Shuttleworth, Jacob Andreas, Antonio Torralba et al.

NEURIPS 2025arXiv:2410.21228
70
citations
#495

GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning

Zhen Xiang, Linzhi Zheng, Yanjie Li et al.

ICML 2025
69
citations
#496

Learn-by-interact: A Data-Centric Framework For Self-Adaptive Agents in Realistic Environments

Hongjin SU, Ruoxi Sun, Jinsung Yoon et al.

ICLR 2025arXiv:2501.10893
69
citations
#497

Accelerating Diffusion Transformers with Token-wise Feature Caching

Chang Zou, Xuyang Liu, Ting Liu et al.

ICLR 2025arXiv:2410.05317
69
citations
#498

HPSv3: Towards Wide-Spectrum Human Preference Score

Yuhang Ma, Keqiang Sun, Xiaoshi Wu et al.

ICCV 2025arXiv:2508.03789
69
citations
#499

MagicPIG: LSH Sampling for Efficient LLM Generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye et al.

ICLR 2025arXiv:2410.16179
69
citations
#500

Does Refusal Training in LLMs Generalize to the Past Tense?

Maksym Andriushchenko, Nicolas Flammarion

ICLR 2025arXiv:2407.11969
69
citations
#501

VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Lei Li, wei yuancheng, Zhihui Xie et al.

CVPR 2025highlightarXiv:2411.17451
69
citations
#502

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Tianwei Lin, Wenqiao Zhang, Sijing Li et al.

ICML 2025spotlightarXiv:2502.09838
69
citations
#503

What If We Recaption Billions of Web Images with LLaMA-3?

Xianhang Li, Haoqin Tu, Mude Hui et al.

ICML 2025arXiv:2406.08478
69
citations
#504

Augmenting Math Word Problems via Iterative Question Composing

Haoxiong Liu, Yifan Zhang, Yifan Luo et al.

AAAI 2025paperarXiv:2401.09003
69
citations
#505

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Cheng Yang, Chufan Shi, Yaxin Liu et al.

ICLR 2025arXiv:2406.09961
69
citations
#506

CycleResearcher: Improving Automated Research via Automated Review

Yixuan Weng, Minjun Zhu, Guangsheng Bao et al.

ICLR 2025arXiv:2411.00816
69
citations
#507

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

Ke Yang, Yao Liu, Sapana Chaudhary et al.

ICLR 2025arXiv:2410.13825
69
citations
#508

VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Hila Chefer, Uriel Singer, Amit Zohar et al.

ICML 2025oralarXiv:2502.02492
68
citations
#509

CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

Zheng Chong, Xiao Dong, Haoxiang Li et al.

ICLR 2025arXiv:2407.15886
68
citations
#510

CATCH: Channel-Aware Multivariate Time Series Anomaly Detection via Frequency Patching

Xingjian Wu, Xiangfei Qiu, Zhengyu Li et al.

ICLR 2025arXiv:2410.12261
68
citations
#511

ImageFolder: Autoregressive Image Generation with Folded Tokens

Xiang Li, Kai Qiu, Hao Chen et al.

ICLR 2025arXiv:2410.01756
68
citations
#512

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

Fushuo Huo, Wenchao Xu, Zhong Zhang et al.

ICLR 2025arXiv:2408.02032
68
citations
#513

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Weize Chen, Ziming You, Ran Li et al.

ICLR 2025arXiv:2407.07061
68
citations
#514

ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

Zhongxiang Sun, Xiaoxue Zang, Kai Zheng et al.

ICLR 2025arXiv:2410.11414
68
citations
#515

Scaling Laws for Precision

Tanishq Kumar, Zachary Ankner, Benjamin Spector et al.

ICLR 2025arXiv:2411.04330
68
citations
#516

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Luo, Xue Yang, Wenhan Dou et al.

CVPR 2025arXiv:2410.08202
68
citations
#517

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

Tianchen Zhao, Tongcheng Fang, Haofeng Huang et al.

ICLR 2025arXiv:2406.02540
68
citations
#518

T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

Zhenyu Hou, Xin Lv, Rui Lu et al.

ICML 2025arXiv:2501.11651
68
citations
#519

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

Yongting Zhang, Lu Chen, Guodong Zheng et al.

CVPR 2025arXiv:2406.12030
68
citations
#520

Long Context Compression with Activation Beacon

Peitian Zhang, Zheng Liu, Shitao Xiao et al.

ICLR 2025arXiv:2401.03462
68
citations
#521

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Snell et al.

ICML 2025oralarXiv:2311.18232
67
citations
#522

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Andy K Zhang, Neil Perry, Riya Dulepet et al.

ICLR 2025arXiv:2408.08926
67
citations
#523

One-Minute Video Generation with Test-Time Training

Jiarui Xu, Shihao Han, Karan Dalal et al.

CVPR 2025arXiv:2504.05298
67
citations
#524

Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations

Nick Jiang, Anish Kachinthaya, Suzanne Petryk et al.

ICLR 2025arXiv:2410.02762
67
citations
#525

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Xiangyu Zeng, Kunchang Li, Chenting Wang et al.

ICLR 2025oralarXiv:2410.19702
67
citations
#526

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang, Mingfei Gao, Zhe Gan et al.

ICLR 2025arXiv:2409.20566
67
citations
#527

Process Reward Model with Q-value Rankings

Wendi Li, Yixuan Li

ICLR 2025arXiv:2410.11287
67
citations
#528

Learning Dynamics of LLM Finetuning

YI REN, Danica Sutherland

ICLR 2025arXiv:2407.10490
67
citations
#529

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.

ICLR 2025arXiv:2406.07358
67
citations
#530

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.

ICCV 2025arXiv:2503.19755
66
citations
#531

ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering

Yakun Song, Zhuo Chen, Xiaofei Wang et al.

AAAI 2025paperarXiv:2401.07333
66
citations
#532

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Zhenglin Huang, Jinwei Hu, Yiwei He et al.

CVPR 2025arXiv:2412.04292
66
citations
#533

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

Hritik Bansal, Arian Hosseini, Rishabh Agarwal et al.

ICLR 2025arXiv:2408.16737
66
citations
#534

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Samuel Miserendino, Michele Wang, Tejal Patwardhan et al.

ICML 2025oralarXiv:2502.12115
66
citations
#535

Image and Video Tokenization with Binary Spherical Quantization

Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl

ICLR 2025arXiv:2406.07548
66
citations
#536

SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization

Jintao Zhang, Haofeng Huang, Pengle Zhang et al.

ICML 2025arXiv:2411.10958
66
citations
#537

XCOT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning

Linzheng Chai, Jian Yang, Tao Sun et al.

AAAI 2025paperarXiv:2401.07037
66
citations
#538

Inductive Moment Matching

Linqi (Alex) Zhou, Stefano Ermon, Jiaming Song

ICML 2025oralarXiv:2503.07565
65
citations
#539

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences

Canyu Zhao, Mingyu Liu, Wen Wang et al.

ICLR 2025arXiv:2407.16655
65
citations
#540

RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts

Hjalmar Wijk, Tao Lin, Joel Becker et al.

ICML 2025spotlightarXiv:2411.15114
65
citations
#541

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang et al.

ICCV 2025arXiv:2412.07825
65
citations
#542

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2025arXiv:2405.08366
65
citations
#543

Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Hyungjoo Chae, Namyoung Kim, Kai Ong et al.

ICLR 2025arXiv:2410.13232
65
citations
#544

Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning

Yiming Huang, Xiao Liu, Yeyun Gong et al.

AAAI 2025paperarXiv:2403.02333
65
citations
#545

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Hanshi Sun, Li-Wen Chang, Wenlei Bao et al.

ICML 2025spotlightarXiv:2410.21465
65
citations
#546

Flow Q-Learning

Seohong Park, Qiyang Li, Sergey Levine

ICML 2025arXiv:2502.02538
65
citations
#547

DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?

Liqiang Jing, Zhehui Huang, Xiaoyang Wang et al.

ICLR 2025arXiv:2409.07703
65
citations
#548

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang et al.

ICLR 2025arXiv:2411.05361
64
citations
#549

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Shuyuan Tu, Zhen Xing, Xintong Han et al.

CVPR 2025arXiv:2411.17697
64
citations
#550

Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems

Guibin Zhang, Yanwei Yue, Zhixun Li et al.

ICLR 2025oralarXiv:2410.02506
64
citations
#551

Fast Video Generation with Sliding Tile Attention

Peiyuan Zhang, Yongqi Chen, Runlong Su et al.

ICML 2025oralarXiv:2502.04507
64
citations
#552

Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

Fei Shen, Hu Ye, Sibo Liu et al.

AAAI 2025paperarXiv:2407.02482
64
citations
#553

UniScene: Unified Occupancy-centric Driving Scene Generation

Bohan Li, Jiazhe Guo, Hongsi Liu et al.

CVPR 2025arXiv:2412.05435
64
citations
#554

An analytic theory of creativity in convolutional diffusion models

Mason Kamb, Surya Ganguli

ICML 2025oralarXiv:2412.20292
64
citations
#555

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen et al.

ICLR 2025arXiv:2408.11049
64
citations
#556

StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation

Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.

ICCV 2025arXiv:2312.12491
64
citations
#557

Task Singular Vectors: Reducing Task Interference in Model Merging

Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli et al.

CVPR 2025arXiv:2412.00081
64
citations
#558

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

Weihao Ye, Qiong Wu, Wenhao Lin et al.

AAAI 2025paperarXiv:2409.10197
64
citations
#559

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Shuai Tan, Biao Gong, Xiang Wang et al.

ICLR 2025oralarXiv:2410.10306
64
citations
#560

KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang, Simon Guo, Simran Arora et al.

ICML 2025arXiv:2502.10517
63
citations
#561

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Mark Boss, Zixuan Huang, Aaryaman Vasishta et al.

CVPR 2025arXiv:2408.00653
63
citations
#562

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Vighnesh Subramaniam, Yilun Du, Joshua B Tenenbaum et al.

ICLR 2025arXiv:2501.05707
63
citations
#563

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Ziqi Pang, Tianyuan Zhang, Fujun Luan et al.

CVPR 2025arXiv:2412.01827
63
citations
#564

MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

Kaixuan Huang, Jiacheng Guo, Zihao Li et al.

ICML 2025arXiv:2502.06453
63
citations
#565

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong et al.

CVPR 2025highlightarXiv:2409.12957
63
citations
#566

Matryoshka Multimodal Models

Mu Cai, Jianwei Yang, Jianfeng Gao et al.

ICLR 2025arXiv:2405.17430
63
citations
#567

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Yi-Fan Zhang, Tao Yu, Haochen Tian et al.

ICML 2025arXiv:2502.10391
63
citations
#568

Why do LLMs attend to the first token?

Federico Barbero, Alvaro Arroyo, Xiangming Gu et al.

COLM 2025paperarXiv:2504.02732
63
citations
#569

FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

Yao Xiao, Tingfa Xu, Yu Xin et al.

AAAI 2025paperarXiv:2504.20670
62
citations
#570

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Xunhao Lai, Jianqiao Lu, Yao Luo et al.

ICLR 2025arXiv:2502.20766
62
citations
#571

CSGO: Content-Style Composition in Text-to-Image Generation

Peng Xing, Haofan Wang, Yanpeng Sun et al.

NEURIPS 2025arXiv:2408.16766
62
citations
#572

Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

Ryan Burgert, Yuancheng Xu, Wenqi Xian et al.

CVPR 2025arXiv:2501.08331
62
citations
#573

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads

Hanlin Tang, Yang Lin, Jing Lin et al.

ICLR 2025arXiv:2407.15891
62
citations
#574

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al.

NEURIPS 2025arXiv:2504.07954
62
citations
#575

AIOS: LLM Agent Operating System

Kai Mei, Xi Zhu, Wujiang Xu et al.

COLM 2025paperarXiv:2403.16971
62
citations
#576

PnP-Flow: Plug-and-Play Image Restoration with Flow Matching

Ségolène Martin, Anne Gagneux, Paul Hagemann et al.

ICLR 2025arXiv:2410.02423
62
citations
#577

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

Jaehong Yoon, Shoubin Yu, Vaidehi Ramesh Patil et al.

ICLR 2025arXiv:2410.12761
62
citations
#578

Q-Insight: Understanding Image Quality via Visual Reinforcement Learning

Weiqi Li, Xuanyu Zhang, Shijie Zhao et al.

NEURIPS 2025spotlightarXiv:2503.22679
62
citations
#579

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Shicong Cen, Jincheng Mei, Katayoon Goshvadi et al.

ICLR 2025arXiv:2405.19320
62
citations
#580

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian Parker, Anton Smirnov, Jordi Pons et al.

ICLR 2025arXiv:2411.19842
62
citations
#581

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

Tao Yuan, Xuefei Ning, Dong Zhou et al.

COLM 2025paperarXiv:2402.05136
62
citations
#582

AgentSquare: Automatic LLM Agent Search in Modular Design Space

Yu Shang, Yu Li, Keyu Zhao et al.

ICLR 2025arXiv:2410.06153
61
citations
#583

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan et al.

AAAI 2025paperarXiv:2404.14233
61
citations
#584

DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki et al.

CVPR 2025arXiv:2503.01774
61
citations
#585

Proteina: Scaling Flow-based Protein Structure Generative Models

Tomas Geffner, Kieran Didi, Zuobai Zhang et al.

ICLR 2025arXiv:2503.00710
61
citations
#586

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Ziru Chen, Shijie Chen, Yuting Ning et al.

ICLR 2025arXiv:2410.05080
61
citations
#587

Building Math Agents with Multi-Turn Iterative Preference Learning

Wei Xiong, Chengshuai Shi, Jiaming Shen et al.

ICLR 2025arXiv:2409.02392
61
citations
#588

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Daoguang Zan, Zhirong Huang, Wei Liu et al.

NEURIPS 2025arXiv:2504.02605
61
citations
#589

SafeDiffuser: Safe Planning with Diffusion Probabilistic Models

Wei Xiao, Johnson (Tsun-Hsuan) Wang, Chuang Gan et al.

ICLR 2025arXiv:2306.00148
61
citations
#590

MUSt3R: Multi-view Network for Stereo 3D Reconstruction

Yohann Cabon, Lucas Stoffl, Leonid Antsfeld et al.

CVPR 2025highlightarXiv:2503.01661
61
citations
#591

Automatically Interpreting Millions of Features in Large Language Models

Gonçalo Paulo, Alex Mallen, Caden Juang et al.

ICML 2025arXiv:2410.13928
61
citations
#592

C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection

Chuangchuang Tan, Renshuai Tao, Huan Liu et al.

AAAI 2025paperarXiv:2408.09647
61
citations
#593

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi et al.

NEURIPS 2025arXiv:2507.04447
61
citations
#594

See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

ICLR 2025arXiv:2503.03321
61
citations
#595

Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model

Chunming He, Chengyu Fang, Yulun Zhang et al.

ICLR 2025arXiv:2311.11638
61
citations
#596

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Linyi Jin, Richard Tucker, Zhengqi Li et al.

CVPR 2025arXiv:2412.09621
60
citations
#597

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Tiansheng Huang, Sihao Hu, Fatih Ilhan et al.

ICLR 2025arXiv:2409.01586
60
citations
#598

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Wenwen Zhuang, Xin Huang, Xiantao Zhang et al.

AAAI 2025paperarXiv:2408.08640
60
citations
#599

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.

NEURIPS 2025spotlightarXiv:2505.13227
60
citations
#600

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

Rongyao Fang, Chengqi Duan, Kun Wang et al.

NEURIPS 2025
60
citations