Most Cited 2025 "pre-training" Papers
22,274 papers found • Page 3 of 112
Conference
Soft Merging of Experts with Adaptive Routing
Haokun Liu, Muqeeth Mohammed, Colin Raffel
Diffusion-Based Planning for Autonomous Driving with Flexible Guidance
Yinan Zheng, Ruiming Liang, Kexin ZHENG et al.
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian et al.
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
Akshara Prabhakar, Zuxin Liu, Ming Zhu et al.
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
Yong Lin, Shange Tang, Bohan Lyu et al.
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
Chongyu Fan, Jiancheng Liu, Licong Lin et al.
Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
Wenbin Wang, Liang Ding, Minyan Zeng et al.
Dissecting Adversarial Robustness of Multimodal LM Agents
Chen Wu, Rishi Shah, Jing Yu Koh et al.
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
Xiyao Wang, Zhengyuan Yang, Chao Feng et al.
D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement
Yansong Peng, Hebei Li, Peixi Wu et al.
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
David Chanin, James Wilken-Smith, Tomáš Dulka et al.
Real-Time Video Generation with Pyramid Attention Broadcast
Xuanlei Zhao, Xiaolong Jin, Kai Wang et al.
Mamba YOLO: A Simple Baseline for Object Detection with State Space Model
Zeyu Wang, Chen Li, Huiying Xu et al.
Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data
Xinyi Wang, Antonis Antoniades, Yanai Elazar et al.
Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation
Jiahao Cui, Hui Li, Yao Yao et al.
MMTEB: Massive Multilingual Text Embedding Benchmark
Kenneth Enevoldsen, Isaac Chung, Imene Kerboua et al.
Improving Instruction-Following in Language Models through Activation Steering
Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi et al.
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
Renrui Zhang, Xinyu Wei, Dongzhi Jiang et al.
Reasoning with Latent Thoughts: On the Power of Looped Transformers
Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li et al.
Language models scale reliably with over-training and on downstream tasks
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar et al.
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF
Tengyang Xie, Dylan Foster, Akshay Krishnamurthy et al.
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng, Masato Ishii, Akio Hayakawa et al.
Eliciting Human Preferences with Language Models
Belinda Li, Alex Tamkin, Noah Goodman et al.
HVI: A New Color Space for Low-light Image Enhancement
Qingsen Yan, Yixu Feng, Cheng Zhang et al.
UniTok: a Unified Tokenizer for Visual Generation and Understanding
Chuofan Ma, Yi Jiang, Junfeng Wu et al.
dKV-Cache: The Cache for Diffusion Language Models
Xinyin Ma, Runpeng Yu, Gongfan Fang et al.
ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data
Chengsen Wang, Qi Qi, Jingyu Wang et al.
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Yilun Zhao, Lujing Xie, Haowei Zhang et al.
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
Zilong (Ryan) Wang, Zifeng Wang, Long Le et al.
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen et al.
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization
Qingyang Zhang, Haitao Wu, Changqing Zhang et al.
VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool
Chia-Tung Ho, Haoxing Ren, Brucek Khailany
Multimodal Autoregressive Pre-training of Large Vision Encoders
Enrico Fini, Mustafa Shukor, Xiujun Li et al.
Enhancing End-to-End Autonomous Driving with Latent World Model
Yingyan Li, Lue Fan, Jiawei He et al.
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Baohao Liao, Yuhui Xu, Hanze Dong et al.
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu, Zekun Wang, Bo Zheng et al.
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Xi Chen, Zhifei Zhang, He Zhang et al.
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu, Fangfu Liu, Yi-Hsin Hung et al.
FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models
Zhipei Xu, Xuanyu Zhang, Runyi Li et al.
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.
Fine-tuning can cripple your foundation model; preserving features may be the solution
Philip Torr, Puneet Dokania, Jishnu Mukhoti et al.
Round and Round We Go! What makes Rotary Positional Encodings useful?
Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos et al.
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Bahare Fatemi, Seyed Mehran Kazemi, Anton Tsitsulin et al.
Simple Guidance Mechanisms for Discrete Diffusion Models
Yair Schiff, Subham Sahoo, Hao Phung et al.
InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales
Zhepei Wei, Wei-Lin Chen, Yu Meng
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Bo Peng, Ruichong Zhang, Daniel Goldstein et al.
Sundial: A Family of Highly Capable Time Series Foundation Models
Yong Liu, Guo Qin, Zhiyuan Shi et al.
What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions
Sang Choe, Hwijeen Ahn, Juhan Bae et al.
HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation
Yi Li, Yuquan Deng, Jesse Zhang et al.
MaskBit: Embedding-free Image Generation via Bit Tokens
Mark Weber, Lijun Yu, Qihang Yu et al.
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Zhen Ye, Peiwen Sun, Jiahe Lei et al.
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer
Yuxuan Zhang, Yirui Yuan, Yiren Song et al.
AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning
Zewei Zhou, Tianhui Cai, Seth Zhao et al.
Diffusion Adversarial Post-Training for One-Step Video Generation
Shanchuan Lin, Xin Xia, Yuxi Ren et al.
UMA: A Family of Universal Models for Atoms
Brandon Wood, Misko Dzamba, Xiang Fu et al.
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
Runtao Liu, Haoyu Wu, Zheng Ziqiang et al.
History-Guided Video Diffusion
Kiwhan Song, Boyuan Chen, Max Simchowitz et al.
Cradle: Empowering Foundation Agents towards General Computer Control
Weihao Tan, Wentao Zhang, Xinrun Xu et al.
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri, Bartłomiej Cupiał, Samuel Coward et al.
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park, Yo Joong Choe, Yibo Jiang et al.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
shiduo zhang, Zhe Xu, Peiju Liu et al.
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding
Wei Chow, Jiageng Mao, Boyi Li et al.
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang, Jihao Qiu, Lingxi Xie et al.
Planning in Natural Language Improves LLM Search for Code Generation
Evan Wang, Federico Cassano, Catherine Wu et al.
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
Yunfei Xie, Ce Zhou, Lang Gao et al.
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Saaket Agashe, Kyle Wong, Vincent Tu et al.
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Yongdong Luo, Xiawu Zheng, Guilin Li et al.
Scaling Test-Time Compute Without Verification or RL is Suboptimal
Amrith Setlur, Nived Rajaraman, Sergey Levine et al.
FreDF: Learning to Forecast in the Frequency Domain
Hao Wang, Lichen Pan, Yuan Shen et al.
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Ling Fu, Zhebin Kuang, Jiajun Song et al.
Offline Actor-Critic for Average Reward MDPs
William Powell, Jeongyeol Kwon, Qiaomin Xie et al.
DiT4Edit: Diffusion Transformer for Image Editing
Kunyu Feng, Yue Ma, Bingyuan Wang et al.
MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details
Ruicheng Wang, Sicheng Xu, Yue Dong et al.
MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization
Yiwen Chen, Yikai Wang, Yihao Luo et al.
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models
Zachary Ankner, Cody Blakeney, Kartik Sreenivasan et al.
Training Deep Learning Models with Norm-Constrained LMOs
Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos et al.
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Melissa Hall, Michal Drozdzal, Oscar Mañas et al.
XAttention: Block Sparse Attention with Antidiagonal Scoring
Ruyi Xu, Guangxuan Xiao, Haofeng Huang et al.
Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts
Xu Liu, Juncheng Liu, Gerald Woo et al.
FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering
Guofeng Feng, Siyan Chen, Rong Fu et al.
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao et al.
Weak to Strong Generalization for Large Language Models with Multi-capabilities
Yucheng Zhou, Jianbing Shen, Yu Cheng
Thinkless: LLM Learns When to Think
Gongfan Fang, Xinyin Ma, Xinchao Wang
SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
Antonis Antoniades, Albert Örwall, Kexun Zhang et al.
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Duo Zheng, Shijia Huang, Liwei Wang
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
Xiaosong Jia, Junqi You, Zhiyuan Zhang et al.
Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics
Yaniv Nikankin, Anja Reusch, Aaron Mueller et al.
GameGen-X: Interactive Open-world Game Video Generation
Haoxuan Che, Xuanhua He, Quande Liu et al.
GameFactory: Creating New Games with Generative Interactive Videos
Jiwen Yu, Yiran Qin, Xintao Wang et al.
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu, Tianjie Zhang, Yu Gu et al.
LoRA vs Full Fine-tuning: An Illusion of Equivalence
Reece Shuttleworth, Jacob Andreas, Antonio Torralba et al.
GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning
Zhen Xiang, Linzhi Zheng, Yanjie Li et al.
Learn-by-interact: A Data-Centric Framework For Self-Adaptive Agents in Realistic Environments
Hongjin SU, Ruoxi Sun, Jinsung Yoon et al.
Accelerating Diffusion Transformers with Token-wise Feature Caching
Chang Zou, Xuyang Liu, Ting Liu et al.
HPSv3: Towards Wide-Spectrum Human Preference Score
Yuhang Ma, Keqiang Sun, Xiaoshi Wu et al.
MagicPIG: LSH Sampling for Efficient LLM Generation
Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye et al.
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li, wei yuancheng, Zhihui Xie et al.
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
Tianwei Lin, Wenqiao Zhang, Sijing Li et al.
What If We Recaption Billions of Web Images with LLaMA-3?
Xianhang Li, Haoqin Tu, Mude Hui et al.
Augmenting Math Word Problems via Iterative Question Composing
Haoxiong Liu, Yifan Zhang, Yifan Luo et al.
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Cheng Yang, Chufan Shi, Yaxin Liu et al.
CycleResearcher: Improving Automated Research via Automated Review
Yixuan Weng, Minjun Zhu, Guangsheng Bao et al.
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Ke Yang, Yao Liu, Sapana Chaudhary et al.
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Hila Chefer, Uriel Singer, Amit Zohar et al.
CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models
Zheng Chong, Xiao Dong, Haoxiang Li et al.
CATCH: Channel-Aware Multivariate Time Series Anomaly Detection via Frequency Patching
Xingjian Wu, Xiangfei Qiu, Zhengyu Li et al.
ImageFolder: Autoregressive Image Generation with Folded Tokens
Xiang Li, Kai Qiu, Hao Chen et al.
Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models
Fushuo Huo, Wenchao Xu, Zhong Zhang et al.
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
Weize Chen, Ziming You, Ran Li et al.
ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability
Zhongxiang Sun, Xiaoxue Zang, Kai Zheng et al.
Scaling Laws for Precision
Tanishq Kumar, Zachary Ankner, Benjamin Spector et al.
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Luo, Xue Yang, Wenhan Dou et al.
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
Tianchen Zhao, Tongcheng Fang, Haofeng Huang et al.
T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
Zhenyu Hou, Xin Lv, Rui Lu et al.
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
Yongting Zhang, Lu Chen, Guodong Zheng et al.
Long Context Compression with Activation Beacon
Peitian Zhang, Zheng Liu, Shitao Xiao et al.
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models
Marwa Abdulhai, Isadora White, Charlie Snell et al.
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models
Andy K Zhang, Neil Perry, Riya Dulepet et al.
One-Minute Video Generation with Test-Time Training
Jiarui Xu, Shihao Han, Karan Dalal et al.
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Nick Jiang, Anish Kachinthaya, Suzanne Petryk et al.
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
Xiangyu Zeng, Kunchang Li, Chenting Wang et al.
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang, Mingfei Gao, Zhe Gan et al.
Process Reward Model with Q-value Rankings
Wendi Li, Yixuan Li
Learning Dynamics of LLM Finetuning
YI REN, Danica Sutherland
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.
ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering
Yakun Song, Zhuo Chen, Xiaofei Wang et al.
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
Zhenglin Huang, Jinwei Hu, Yiwei He et al.
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
Hritik Bansal, Arian Hosseini, Rishabh Agarwal et al.
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Samuel Miserendino, Michele Wang, Tejal Patwardhan et al.
Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
Jintao Zhang, Haofeng Huang, Pengle Zhang et al.
XCOT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning
Linzheng Chai, Jian Yang, Tao Sun et al.
Inductive Moment Matching
Linqi (Alex) Zhou, Stefano Ermon, Jiaming Song
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences
Canyu Zhao, Mingyu Liu, Wen Wang et al.
RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts
Hjalmar Wijk, Tao Lin, Joel Becker et al.
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Wufei Ma, Haoyu Chen, Guofeng Zhang et al.
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Hyungjoo Chae, Namyoung Kim, Kai Ong et al.
Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning
Yiming Huang, Xiao Liu, Yeyun Gong et al.
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun, Li-Wen Chang, Wenlei Bao et al.
Flow Q-Learning
Seohong Park, Qiyang Li, Sergey Levine
DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?
Liqiang Jing, Zhehui Huang, Xiaoyang Wang et al.
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
Chien-yu Huang, Wei-Chih Chen, Shu-wen Yang et al.
StableAnimator: High-Quality Identity-Preserving Human Image Animation
Shuyuan Tu, Zhen Xing, Xintong Han et al.
Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems
Guibin Zhang, Yanwei Yue, Zhixun Li et al.
Fast Video Generation with Sliding Tile Attention
Peiyuan Zhang, Yongqi Chen, Runlong Su et al.
Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models
Fei Shen, Hu Ye, Sibo Liu et al.
UniScene: Unified Occupancy-centric Driving Scene Generation
Bohan Li, Jiazhe Guo, Hongsi Liu et al.
An analytic theory of creativity in convolutional diffusion models
Mason Kamb, Surya Ganguli
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen et al.
StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation
Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.
Task Singular Vectors: Reducing Task Interference in Model Merging
Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli et al.
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
Weihao Ye, Qiong Wu, Wenhao Lin et al.
Animate-X: Universal Character Image Animation with Enhanced Motion Representation
Shuai Tan, Biao Gong, Xiang Wang et al.
KernelBench: Can LLMs Write Efficient GPU Kernels?
Anne Ouyang, Simon Guo, Simran Arora et al.
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
Mark Boss, Zixuan Huang, Aaryaman Vasishta et al.
Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains
Vighnesh Subramaniam, Yilun Du, Joshua B Tenenbaum et al.
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
Ziqi Pang, Tianyuan Zhang, Fujun Luan et al.
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
Kaixuan Huang, Jiacheng Guo, Zihao Li et al.
3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion
Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong et al.
Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao et al.
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
Yi-Fan Zhang, Tao Yu, Haochen Tian et al.
Why do LLMs attend to the first token?
Federico Barbero, Alvaro Arroyo, Xiangming Gu et al.
FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection
Yao Xiao, Tingfa Xu, Yu Xin et al.
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Xunhao Lai, Jianqiao Lu, Yao Luo et al.
CSGO: Content-Style Composition in Text-to-Image Generation
Peng Xing, Haofan Wang, Yanpeng Sun et al.
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
Ryan Burgert, Yuancheng Xu, Wenqi Xian et al.
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Hanlin Tang, Yang Lin, Jing Lin et al.
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu, Kangheng Lin, Liang Zhao et al.
AIOS: LLM Agent Operating System
Kai Mei, Xi Zhu, Wujiang Xu et al.
PnP-Flow: Plug-and-Play Image Restoration with Flow Matching
Ségolène Martin, Anne Gagneux, Paul Hagemann et al.
SAFREE: Training-Free and Adaptive Guard for Safe Text-to-Image And Video Generation
Jaehong Yoon, Shoubin Yu, Vaidehi Ramesh Patil et al.
Q-Insight: Understanding Image Quality via Visual Reinforcement Learning
Weiqi Li, Xuanyu Zhang, Shijie Zhao et al.
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Shicong Cen, Jincheng Mei, Katayoon Goshvadi et al.
Scaling Transformers for Low-Bitrate High-Quality Speech Coding
Julian Parker, Anton Smirnov, Jordi Pons et al.
LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
Tao Yuan, Xuefei Ning, Dong Zhou et al.
AgentSquare: Automatic LLM Agent Search in Modular Design Space
Yu Shang, Yu Li, Keyu Zhao et al.
Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback
Wenyi Xiao, Ziwei Huang, Leilei Gan et al.
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki et al.
Proteina: Scaling Flow-based Protein Structure Generative Models
Tomas Geffner, Kieran Didi, Zuobai Zhang et al.
ScienceAgentBench: Toward Rigorous Assessment of Language Agents for Data-Driven Scientific Discovery
Ziru Chen, Shijie Chen, Yuting Ning et al.
Building Math Agents with Multi-Turn Iterative Preference Learning
Wei Xiong, Chengshuai Shi, Jiaming Shen et al.
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Daoguang Zan, Zhirong Huang, Wei Liu et al.
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
Wei Xiao, Johnson (Tsun-Hsuan) Wang, Chuang Gan et al.
MUSt3R: Multi-view Network for Stereo 3D Reconstruction
Yohann Cabon, Lucas Stoffl, Leonid Antsfeld et al.
Automatically Interpreting Millions of Features in Large Language Models
Gonçalo Paulo, Alex Mallen, Caden Juang et al.
C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection
Chuangchuang Tan, Renshuai Tao, Huan Liu et al.
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
Wenyao Zhang, Hongsi Liu, Zekun Qi et al.
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang, Jinyeong Kim, Junhyeok Kim et al.
Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model
Chunming He, Chengyu Fang, Yulun Zhang et al.
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
Linyi Jin, Richard Tucker, Zhengqi Li et al.
Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
Tiansheng Huang, Sihao Hu, Fatih Ilhan et al.
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Wenwen Zhuang, Xin Huang, Xiantao Zhang et al.
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.
GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing
Rongyao Fang, Chengqi Duan, Kun Wang et al.