Most Cited 2025 "image-text interleaved architecture" Papers

22,274 papers found • Page 3 of 112


#401

Soft Merging of Experts with Adaptive Routing

Haokun Liu, Muqeeth Mohammed, Colin Raffel

ICLR 2025 • arXiv:2306.03745

#402

Diffusion-Based Planning for Autonomous Driving with Flexible Guidance

Yinan Zheng, Ruiming Liang, Kexin Zheng et al.

ICLR 2025 • arXiv:2501.15564

#403

AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian et al.

CVPR 2025 • arXiv:2411.18673

#404

APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

Akshara Prabhakar, Zuxin Liu, Ming Zhu et al.

NEURIPS 2025 • arXiv:2504.03601

#405

Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving

Yong Lin, Shange Tang, Bohan Lyu et al.

COLM 2025 • paper • arXiv:2502.07640

#406

Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

Chongyu Fan, Jiancheng Liu, Licong Lin et al.

NEURIPS 2025 • arXiv:2410.07163

#407

Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models

Wenbin Wang, Liang Ding, Minyan Zeng et al.

AAAI 2025 • paper • arXiv:2408.15556

#408

Dissecting Adversarial Robustness of Multimodal LM Agents

Chen Wu, Rishi Shah, Jing Yu Koh et al.

ICLR 2025 • arXiv:2406.12814

#409

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

Xiyao Wang, Zhengyuan Yang, Chao Feng et al.

NEURIPS 2025 • spotlight • arXiv:2504.07934

#410

D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement

Yansong Peng, Hebei Li, Peixi Wu et al.

ICLR 2025 • arXiv:2410.13842

#411

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

David Chanin, James Wilken-Smith, Tomáš Dulka et al.

NEURIPS 2025 • oral • arXiv:2409.14507

#412

Real-Time Video Generation with Pyramid Attention Broadcast

Xuanlei Zhao, Xiaolong Jin, Kai Wang et al.

ICLR 2025 • arXiv:2408.12588

#413

Mamba YOLO: A Simple Baseline for Object Detection with State Space Model

Zeyu Wang, Chen Li, Huiying Xu et al.

AAAI 2025 • paper • arXiv:2406.05835

#414

Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data

Xinyi Wang, Antonis Antoniades, Yanai Elazar et al.

ICLR 2025 • arXiv:2407.14985

#415

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

Jiahao Cui, Hui Li, Yao Yao et al.

ICLR 2025 • oral • arXiv:2410.07718

#416

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua et al.

ICLR 2025 • arXiv:2502.13595

#417

Improving Instruction-Following in Language Models through Activation Steering

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi et al.

ICLR 2025 • arXiv:2410.12877

#418

MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

Renrui Zhang, Xinyu Wei, Dongzhi Jiang et al.

ICLR 2025 • arXiv:2407.08739

#419

Reasoning with Latent Thoughts: On the Power of Looped Transformers

Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li et al.

ICLR 2025 • arXiv:2502.17416

#420

Language models scale reliably with over-training and on downstream tasks

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar et al.

ICLR 2025 • arXiv:2403.08540

#421

Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF

Tengyang Xie, Dylan Foster, Akshay Krishnamurthy et al.

ICLR 2025 • arXiv:2405.21046

#422

MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Ho Kei Cheng, Masato Ishii, Akio Hayakawa et al.

CVPR 2025 • arXiv:2412.15322

#423

Eliciting Human Preferences with Language Models

Belinda Li, Alex Tamkin, Noah Goodman et al.

ICLR 2025 • oral • arXiv:2310.11589

#424

HVI: A New Color Space for Low-light Image Enhancement

Qingsen Yan, Yixu Feng, Cheng Zhang et al.

CVPR 2025 • arXiv:2502.20272

#425

UniTok: a Unified Tokenizer for Visual Generation and Understanding

Chuofan Ma, Yi Jiang, Junfeng Wu et al.

NEURIPS 2025 • spotlight • arXiv:2502.20321

#426

dKV-Cache: The Cache for Diffusion Language Models

Xinyin Ma, Runpeng Yu, Gongfan Fang et al.

NEURIPS 2025 • arXiv:2505.15781

#427

ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data

Chengsen Wang, Qi Qi, Jingyu Wang et al.

AAAI 2025 • paper • arXiv:2412.11376

#428

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Yilun Zhao, Lujing Xie, Haowei Zhang et al.

CVPR 2025 • arXiv:2501.12380

#429

Language Models Learn to Mislead Humans via RLHF

Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.

ICLR 2025 • arXiv:2409.12822

#430

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Zilong (Ryan) Wang, Zifeng Wang, Long Le et al.

ICLR 2025 • arXiv:2407.08223

#431

RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen et al.

ICLR 2025


#432

Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization

Qingyang Zhang, Haitao Wu, Changqing Zhang et al.

NEURIPS 2025 • spotlight • arXiv:2504.05812

#433

VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool

Chia-Tung Ho, Haoxing Ren, Brucek Khailany

AAAI 2025 • paper • arXiv:2408.08927

#434

Multimodal Autoregressive Pre-training of Large Vision Encoders

Enrico Fini, Mustafa Shukor, Xiujun Li et al.

CVPR 2025 • highlight • arXiv:2411.14402

#435

Enhancing End-to-End Autonomous Driving with Latent World Model

Yingyan Li, Lue Fan, Jiawei He et al.

ICLR 2025 • arXiv:2406.08481

#436

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

Baohao Liao, Yuhui Xu, Hanze Dong et al.

ICML 2025 • arXiv:2501.19324

#437

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng et al.

NEURIPS 2025 • oral • arXiv:2505.06708

#438

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

Xi Chen, Zhifei Zhang, He Zhang et al.

CVPR 2025 • highlight • arXiv:2412.07774

#439

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Diankun Wu, Fangfu Liu, Yi-Hsin Hung et al.

NEURIPS 2025 • spotlight • arXiv:2505.23747

#440

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Zhipei Xu, Xuanyu Zhang, Runyi Li et al.

ICLR 2025 • arXiv:2410.02761

#441

MV-Adapter: Multi-View Consistent Image Generation Made Easy

Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.

ICCV 2025 • arXiv:2412.03632

#442

Fine-tuning can cripple your foundation model; preserving features may be the solution

Philip Torr, Puneet Dokania, Jishnu Mukhoti et al.

ICLR 2025 • arXiv:2308.13320

#443

Round and Round We Go! What makes Rotary Positional Encodings useful?

Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos et al.

ICLR 2025 • arXiv:2410.06205

#444

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.

ICML 2025 • arXiv:2410.05363

#445

Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Bahare Fatemi, Seyed Mehran Kazemi, Anton Tsitsulin et al.

ICLR 2025 • oral • arXiv:2406.09170

#446

Simple Guidance Mechanisms for Discrete Diffusion Models

Yair Schiff, Subham Sahoo, Hao Phung et al.

ICLR 2025 • arXiv:2412.10193

#447

InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales

Zhepei Wei, Wei-Lin Chen, Yu Meng

ICLR 2025 • arXiv:2406.13629

#448

RWKV-7 "Goose" with Expressive Dynamic State Evolution

Bo Peng, Ruichong Zhang, Daniel Goldstein et al.

COLM 2025 • paper

#449

Sundial: A Family of Highly Capable Time Series Foundation Models

Yong Liu, Guo Qin, Zhiyuan Shi et al.

ICML 2025 • oral • arXiv:2502.00816

#450

What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

Sang Choe, Hwijeen Ahn, Juhan Bae et al.

NEURIPS 2025 • arXiv:2405.13954

#451

HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation

Yi Li, Yuquan Deng, Jesse Zhang et al.

ICLR 2025 • arXiv:2502.05485

#452

MaskBit: Embedding-free Image Generation via Bit Tokens

Mark Weber, Lijun Yu, Qihang Yu et al.

ICLR 2025 • arXiv:2409.16211

#453

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model

Zhen Ye, Peiwen Sun, Jiahe Lei et al.

AAAI 2025 • paper • arXiv:2408.17175

#454

EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer

Yuxuan Zhang, Yirui Yuan, Yiren Song et al.

ICCV 2025 • arXiv:2503.07027

#455

AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

Zewei Zhou, Tianhui Cai, Seth Zhao et al.

NEURIPS 2025 • arXiv:2506.13757

#456

Diffusion Adversarial Post-Training for One-Step Video Generation

Shanchuan Lin, Xin Xia, Yuxi Ren et al.

ICML 2025 • arXiv:2501.08316

#457

UMA: A Family of Universal Models for Atoms

Brandon Wood, Misko Dzamba, Xiang Fu et al.

NEURIPS 2025 • spotlight • arXiv:2506.23971

#458

VideoDPO: Omni-Preference Alignment for Video Diffusion Generation

Runtao Liu, Haoyu Wu, Zheng Ziqiang et al.

CVPR 2025 • arXiv:2412.14167

#459

History-Guided Video Diffusion

Kiwhan Song, Boyuan Chen, Max Simchowitz et al.

ICML 2025 • oral • arXiv:2502.06764

#460

Cradle: Empowering Foundation Agents towards General Computer Control

Weihao Tan, Wentao Zhang, Xinrun Xu et al.

ICML 2025 • arXiv:2403.03186

#461

BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games

Davide Paglieri, Bartłomiej Cupiał, Samuel Coward et al.

ICLR 2025 • arXiv:2411.13543

#462

The Geometry of Categorical and Hierarchical Concepts in Large Language Models

Kiho Park, Yo Joong Choe, Yibo Jiang et al.

ICLR 2025 • arXiv:2406.01506

#463

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives

Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.

ICCV 2025 • arXiv:2501.04003

#464

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

Shiduo Zhang, Zhe Xu, Peiju Liu et al.

ICCV 2025 • arXiv:2412.18194

#465

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

Wei Chow, Jiageng Mao, Boyi Li et al.

ICLR 2025 • arXiv:2501.16411

#466

Adaptive Keyframe Sampling for Long Video Understanding

Xi Tang, Jihao Qiu, Lingxi Xie et al.

CVPR 2025 • arXiv:2502.21271

#467

Planning in Natural Language Improves LLM Search for Code Generation

Evan Wang, Federico Cassano, Catherine Wu et al.

ICLR 2025 • arXiv:2409.03733

#468

MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine

Yunfei Xie, Ce Zhou, Lang Gao et al.

ICLR 2025 • arXiv:2408.02900

#469

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Saaket Agashe, Kyle Wong, Vincent Tu et al.

COLM 2025 • paper • arXiv:2504.00906

#470

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension

Yongdong Luo, Xiawu Zheng, Guilin Li et al.

NEURIPS 2025 • arXiv:2411.13093

#471

Scaling Test-Time Compute Without Verification or RL is Suboptimal

Amrith Setlur, Nived Rajaraman, Sergey Levine et al.

ICML 2025 • spotlight • arXiv:2502.12118

#472

FreDF: Learning to Forecast in the Frequency Domain

Hao Wang, Lichen Pan, Yuan Shen et al.

ICLR 2025 • arXiv:2402.02399

#473

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

Ling Fu, Zhebin Kuang, Jiajun Song et al.

NEURIPS 2025 • arXiv:2501.00321

#474

Offline Actor-Critic for Average Reward MDPs

William Powell, Jeongyeol Kwon, Qiaomin Xie et al.

NEURIPS 2025


#475

DiT4Edit: Diffusion Transformer for Image Editing

Kunyu Feng, Yue Ma, Bingyuan Wang et al.

AAAI 2025 • paper • arXiv:2411.03286

#476

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Ruicheng Wang, Sicheng Xu, Yue Dong et al.

NEURIPS 2025 • arXiv:2507.02546

#477

MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization

Yiwen Chen, Yikai Wang, Yihao Luo et al.

ICCV 2025 • arXiv:2408.02555

#478

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

Zachary Ankner, Cody Blakeney, Kartik Sreenivasan et al.

ICLR 2025 • arXiv:2405.20541

#479

Training Deep Learning Models with Norm-Constrained LMOs

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos et al.

ICML 2025 • spotlight • arXiv:2502.07529

#480

Improving Text-to-Image Consistency via Automatic Prompt Optimization

Melissa Hall, Michal Drozdzal, Oscar Mañas et al.

ICLR 2025 • arXiv:2403.17804

#481

XAttention: Block Sparse Attention with Antidiagonal Scoring

Ruyi Xu, Guangxuan Xiao, Haofeng Huang et al.

ICML 2025 • arXiv:2503.16428

#482

Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

Xu Liu, Juncheng Liu, Gerald Woo et al.

ICML 2025 • arXiv:2410.10469

#483

FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering

Guofeng Feng, Siyan Chen, Rong Fu et al.

CVPR 2025 • arXiv:2408.07967

#484

A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao et al.

COLM 2025 • paper • arXiv:2504.07086

#485

Weak to Strong Generalization for Large Language Models with Multi-capabilities

Yucheng Zhou, Jianbing Shen, Yu Cheng

ICLR 2025


#486

Thinkless: LLM Learns When to Think

Gongfan Fang, Xinyin Ma, Xinchao Wang

NEURIPS 2025 • arXiv:2505.13379

#487

SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement

Antonis Antoniades, Albert Örwall, Kexun Zhang et al.

ICLR 2025 • arXiv:2410.20285

#488

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Duo Zheng, Shijia Huang, Liwei Wang

CVPR 2025 • arXiv:2412.00493

#489

DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving

Xiaosong Jia, Junqi You, Zhiyuan Zhang et al.

ICLR 2025 • oral • arXiv:2503.07656

#490

Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics

Yaniv Nikankin, Anja Reusch, Aaron Mueller et al.

ICLR 2025 • arXiv:2410.21272

#491

GameGen-X: Interactive Open-world Game Video Generation

Haoxuan Che, Xuanhua He, Quande Liu et al.

ICLR 2025 • arXiv:2411.00769

#492

GameFactory: Creating New Games with Generative Interactive Videos

Jiwen Yu, Yiran Qin, Xintao Wang et al.

ICCV 2025 • highlight • arXiv:2501.08325

#493

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Xiao Liu, Tianjie Zhang, Yu Gu et al.

ICLR 2025 • arXiv:2408.06327

#494

LoRA vs Full Fine-tuning: An Illusion of Equivalence

Reece Shuttleworth, Jacob Andreas, Antonio Torralba et al.

NEURIPS 2025 • arXiv:2410.21228

#495

GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning

Zhen Xiang, Linzhi Zheng, Yanjie Li et al.

ICML 2025


#496

Learn-by-interact: A Data-Centric Framework For Self-Adaptive Agents in Realistic Environments

Hongjin Su, Ruoxi Sun, Jinsung Yoon et al.

ICLR 2025 • arXiv:2501.10893

#497

Accelerating Diffusion Transformers with Token-wise Feature Caching

Chang Zou, Xuyang Liu, Ting Liu et al.

ICLR 2025 • arXiv:2410.05317

#498

HPSv3: Towards Wide-Spectrum Human Preference Score

Yuhang Ma, Keqiang Sun, Xiaoshi Wu et al.

ICCV 2025 • arXiv:2508.03789

#499

MagicPIG: LSH Sampling for Efficient LLM Generation

Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye et al.

ICLR 2025 • arXiv:2410.16179

#500

Does Refusal Training in LLMs Generalize to the Past Tense?

Maksym Andriushchenko, Nicolas Flammarion

ICLR 2025 • arXiv:2407.11969

#501

VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Lei Li, Wei Yuancheng, Zhihui Xie et al.

CVPR 2025 • highlight • arXiv:2411.17451

#502

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

Tianwei Lin, Wenqiao Zhang, Sijing Li et al.

ICML 2025 • spotlight • arXiv:2502.09838

#503

What If We Recaption Billions of Web Images with LLaMA-3?

Xianhang Li, Haoqin Tu, Mude Hui et al.

ICML 2025 • arXiv:2406.08478

#504

Augmenting Math Word Problems via Iterative Question Composing

Haoxiong Liu, Yifan Zhang, Yifan Luo et al.

AAAI 2025 • paper • arXiv:2401.09003

#505

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Cheng Yang, Chufan Shi, Yaxin Liu et al.

ICLR 2025 • arXiv:2406.09961

#506

CycleResearcher: Improving Automated Research via Automated Review

Yixuan Weng, Minjun Zhu, Guangsheng Bao et al.

ICLR 2025 • arXiv:2411.00816

#507

AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents

Ke Yang, Yao Liu, Sapana Chaudhary et al.

ICLR 2025 • arXiv:2410.13825

#508

VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Hila Chefer, Uriel Singer, Amit Zohar et al.

ICML 2025 • oral • arXiv:2502.02492

#509

CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

Zheng Chong, Xiao Dong, Haoxiang Li et al.

ICLR 2025 • arXiv:2407.15886

#510

CATCH: Channel-Aware Multivariate Time Series Anomaly Detection via Frequency Patching

Xingjian Wu, Xiangfei Qiu, Zhengyu Li et al.

ICLR 2025 • arXiv:2410.12261

#511

ImageFolder: Autoregressive Image Generation with Folded Tokens

Xiang Li, Kai Qiu, Hao Chen et al.

ICLR 2025 • arXiv:2410.01756

#512

Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

Fushuo Huo, Wenchao Xu, Zhong Zhang et al.

ICLR 2025 • arXiv:2408.02032

#513

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

Weize Chen, Ziming You, Ran Li et al.

ICLR 2025 • arXiv:2407.07061

#514

ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

Zhongxiang Sun, Xiaoxue Zang, Kai Zheng et al.

ICLR 2025 • arXiv:2410.11414

#515

Scaling Laws for Precision

Tanishq Kumar, Zachary Ankner, Benjamin Spector et al.

ICLR 2025 • arXiv:2411.04330

#516

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Gen Luo, Xue Yang, Wenhan Dou et al.

CVPR 2025 • arXiv:2410.08202

#517

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

Tianchen Zhao, Tongcheng Fang, Haofeng Huang et al.

ICLR 2025 • arXiv:2406.02540

#518

T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

Zhenyu Hou, Xin Lv, Rui Lu et al.

ICML 2025 • arXiv:2501.11651

#519

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

Yongting Zhang, Lu Chen, Guodong Zheng et al.

CVPR 2025 • arXiv:2406.12030

#520

Long Context Compression with Activation Beacon

Peitian Zhang, Zheng Liu, Shitao Xiao et al.

ICLR 2025 • arXiv:2401.03462

#521

LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models

Marwa Abdulhai, Isadora White, Charlie Snell et al.

ICML 2025 • oral • arXiv:2311.18232

#522

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models

Andy K Zhang, Neil Perry, Riya Dulepet et al.

ICLR 2025 • arXiv:2408.08926

#523

One-Minute Video Generation with Test-Time Training

Jiarui Xu, Shihao Han, Karan Dalal et al.

CVPR 2025 • arXiv:2504.05298

#524

Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations

Nick Jiang, Anish Kachinthaya, Suzanne Petryk et al.

ICLR 2025 • arXiv:2410.02762

#525

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

Xiangyu Zeng, Kunchang Li, Chenting Wang et al.

ICLR 2025 • oral • arXiv:2410.19702

#526

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang, Mingfei Gao, Zhe Gan et al.

ICLR 2025 • arXiv:2409.20566

#527

Process Reward Model with Q-value Rankings

Wendi Li, Yixuan Li

ICLR 2025 • arXiv:2410.11287

#528

Learning Dynamics of LLM Finetuning

Yi Ren, Danica Sutherland

ICLR 2025 • arXiv:2407.10490

#529

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.

ICLR 2025 • arXiv:2406.07358

#530

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.

ICCV 2025 • arXiv:2503.19755

#531

ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering

Yakun Song, Zhuo Chen, Xiaofei Wang et al.

AAAI 2025 • paper • arXiv:2401.07333

#532

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Zhenglin Huang, Jinwei Hu, Yiwei He et al.

CVPR 2025 • arXiv:2412.04292

#533

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

Hritik Bansal, Arian Hosseini, Rishabh Agarwal et al.

ICLR 2025 • arXiv:2408.16737

#534

SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Samuel Miserendino, Michele Wang, Tejal Patwardhan et al.

ICML 2025 • oral • arXiv:2502.12115

#535

Image and Video Tokenization with Binary Spherical Quantization

Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl

ICLR 2025 • arXiv:2406.07548

#536

SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization

Jintao Zhang, Haofeng Huang, Pengle Zhang et al.

ICML 2025 • arXiv:2411.10958

#537

XCOT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning

Linzheng Chai, Jian Yang, Tao Sun et al.

AAAI 2025 • paper • arXiv:2401.07037

#538

Inductive Moment Matching

Linqi (Alex) Zhou, Stefano Ermon, Jiaming Song

ICML 2025 • oral • arXiv:2503.07565

#539

MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences

Canyu Zhao, Mingyu Liu, Wen Wang et al.

ICLR 2025 • arXiv:2407.16655

#540

RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts

Hjalmar Wijk, Tao Lin, Joel Becker et al.

ICML 2025 • spotlight • arXiv:2411.15114

#541

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang et al.

ICCV 2025 • arXiv:2412.07825

#542

Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov, Georg Lange, Neel Nanda

ICLR 2025 • arXiv:2405.08366

#543

Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Hyungjoo Chae, Namyoung Kim, Kai Ong et al.

ICLR 2025 • arXiv:2410.13232

#544

Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning

Yiming Huang, Xiao Liu, Yeyun Gong et al.

AAAI 2025 • paper • arXiv:2403.02333

#545

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

Hanshi Sun, Li-Wen Chang, Wenlei Bao et al.

ICML 2025 • spotlight • arXiv:2410.21465

#546

Flow Q-Learning

Seohong Park, Qiyang Li, Sergey Levine

ICML 2025 • arXiv:2502.02538

#547

DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?

Liqiang Jing, Zhehui Huang, Xiaoyang Wang et al.

ICLR 2025 • arXiv:2409.07703

#548

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang et al.

ICLR 2025 • arXiv:2411.05361

#549

StableAnimator: High-Quality Identity-Preserving Human Image Animation

Shuyuan Tu, Zhen Xing, Xintong Han et al.

CVPR 2025 • arXiv:2411.17697

#550

Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems

Guibin Zhang, Yanwei Yue, Zhixun Li et al.

ICLR 2025 • oral • arXiv:2410.02506

#551

Fast Video Generation with Sliding Tile Attention

Peiyuan Zhang, Yongqi Chen, Runlong Su et al.

ICML 2025 • oral • arXiv:2502.04507

#552

Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models

Fei Shen, Hu Ye, Sibo Liu et al.

AAAI 2025 • paper • arXiv:2407.02482

#553

UniScene: Unified Occupancy-centric Driving Scene Generation

Bohan Li, Jiazhe Guo, Hongsi Liu et al.

CVPR 2025 • arXiv:2412.05435

#554

An analytic theory of creativity in convolutional diffusion models

Mason Kamb, Surya Ganguli

ICML 2025 • oral • arXiv:2412.20292

#555

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen et al.

ICLR 2025 • arXiv:2408.11049

#556

StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation

Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.

ICCV 2025 • arXiv:2312.12491

#557

Task Singular Vectors: Reducing Task Interference in Model Merging

Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli et al.

CVPR 2025 • arXiv:2412.00081

#558

Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

Weihao Ye, Qiong Wu, Wenhao Lin et al.

AAAI 2025 • paper • arXiv:2409.10197

#559

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Shuai Tan, Biao Gong, Xiang Wang et al.

ICLR 2025 • oral • arXiv:2410.10306

#560

KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang, Simon Guo, Simran Arora et al.

ICML 2025 • arXiv:2502.10517

#561

SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement

Mark Boss, Zixuan Huang, Aaryaman Vasishta et al.

CVPR 2025 • arXiv:2408.00653

#562

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Vighnesh Subramaniam, Yilun Du, Joshua B Tenenbaum et al.

ICLR 2025 • arXiv:2501.05707

#563

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Ziqi Pang, Tianyuan Zhang, Fujun Luan et al.

CVPR 2025 • arXiv:2412.01827

#564

MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

Kaixuan Huang, Jiacheng Guo, Zihao Li et al.

ICML 2025 • arXiv:2502.06453

#565

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong et al.

CVPR 2025 • highlight • arXiv:2409.12957

#566

Matryoshka Multimodal Models

Mu Cai, Jianwei Yang, Jianfeng Gao et al.

ICLR 2025 • arXiv:2405.17430

#567

MM-RLHF: The Next Step Forward in Multimodal LLM Alignment

Yi-Fan Zhang, Tao Yu, Haochen Tian et al.

ICML 2025 • arXiv:2502.10391

#568

Why do LLMs attend to the first token?

Federico Barbero, Alvaro Arroyo, Xiangming Gu et al.

COLM 2025 • paper • arXiv:2504.02732

#569

FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection

Yao Xiao, Tingfa Xu, Yu Xin et al.

AAAI 2025 • paper • arXiv:2504.20670

#570

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Xunhao Lai, Jianqiao Lu, Yao Luo et al.

ICLR 2025 • arXiv:2502.20766

#571

CSGO: Content-Style Composition in Text-to-Image Generation

Peng Xing, Haofan Wang, Yanpeng Sun et al.

NEURIPS 2025 • arXiv:2408.16766

#572

Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

Ryan Burgert, Yuancheng Xu, Wenqi Xian et al.

CVPR 2025 • arXiv:2501.08331

#573

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads

Hanlin Tang, Yang Lin, Jing Lin et al.

ICLR 2025 • arXiv:2407.15891

#574

Perception-R1: Pioneering Perception Policy with Reinforcement Learning

En Yu, Kangheng Lin, Liang Zhao et al.

NEURIPS 2025 • arXiv:2504.07954

#575

AIOS: LLM Agent Operating System

Kai Mei, Xi Zhu, Wujiang Xu et al.

COLM 2025 • paper • arXiv:2403.16971

#576

PnP-Flow: Plug-and-Play Image Restoration with Flow Matching

Ségolène Martin, Anne Gagneux, Paul Hagemann et al.

ICLR 2025 • arXiv:2410.02423

#577

SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation

Jaehong Yoon, Shoubin Yu, Vaidehi Ramesh Patil et al.

ICLR 2025 • arXiv:2410.12761

#578

Q-Insight: Understanding Image Quality via Visual Reinforcement Learning

Weiqi Li, Xuanyu Zhang, Shijie Zhao et al.

NEURIPS 2025 • spotlight • arXiv:2503.22679

#579

Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

Shicong Cen, Jincheng Mei, Katayoon Goshvadi et al.

ICLR 2025 • arXiv:2405.19320

#580

Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian Parker, Anton Smirnov, Jordi Pons et al.

ICLR 2025 • arXiv:2411.19842

#581

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

Tao Yuan, Xuefei Ning, Dong Zhou et al.

COLM 2025 • paper • arXiv:2402.05136

#582

AgentSquare: Automatic LLM Agent Search in Modular Design Space

Yu Shang, Yu Li, Keyu Zhao et al.

ICLR 2025 • arXiv:2410.06153

#583

Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback

Wenyi Xiao, Ziwei Huang, Leilei Gan et al.

AAAI 2025 • paper • arXiv:2404.14233

#584

DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki et al.

CVPR 2025 • arXiv:2503.01774

#585

Proteina: Scaling Flow-based Protein Structure Generative Models

Tomas Geffner, Kieran Didi, Zuobai Zhang et al.

ICLR 2025 • arXiv:2503.00710

#586

ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery

Ziru Chen, Shijie Chen, Yuting Ning et al.

ICLR 2025 • arXiv:2410.05080

#587

Building Math Agents with Multi-Turn Iterative Preference Learning

Wei Xiong, Chengshuai Shi, Jiaming Shen et al.

ICLR 2025 • arXiv:2409.02392

#588

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Daoguang Zan, Zhirong Huang, Wei Liu et al.

NEURIPS 2025 • arXiv:2504.02605

#589

SafeDiffuser: Safe Planning with Diffusion Probabilistic Models

Wei Xiao, Johnson (Tsun-Hsuan) Wang, Chuang Gan et al.

ICLR 2025 • arXiv:2306.00148

#590

MUSt3R: Multi-view Network for Stereo 3D Reconstruction

Yohann Cabon, Lucas Stoffl, Leonid Antsfeld et al.

CVPR 2025 • highlight • arXiv:2503.01661

#591

Automatically Interpreting Millions of Features in Large Language Models

Gonçalo Paulo, Alex Mallen, Caden Juang et al.

ICML 2025 • arXiv:2410.13928

#592

C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection

Chuangchuang Tan, Renshuai Tao, Huan Liu et al.

AAAI 2025 • paper • arXiv:2408.09647

#593

DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

Wenyao Zhang, Hongsi Liu, Zekun Qi et al.

NEURIPS 2025 • arXiv:2507.04447

#594

See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

ICLR 2025 • arXiv:2503.03321

#595

Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model

Chunming He, Chengyu Fang, Yulun Zhang et al.

ICLR 2025 • arXiv:2311.11638

#596

Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos

Linyi Jin, Richard Tucker, Zhengqi Li et al.

CVPR 2025 • arXiv:2412.09621

#597

Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation

Tiansheng Huang, Sihao Hu, Fatih Ilhan et al.

ICLR 2025 • arXiv:2409.01586

#598

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning

Wenwen Zhuang, Xin Huang, Xiantao Zhang et al.

AAAI 2025 • paper • arXiv:2408.08640

#599

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.

NEURIPS 2025 • spotlight • arXiv:2505.13227

#600

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

Rongyao Fang, Chengqi Duan, Kun Wang et al.

NEURIPS 2025

