Most Cited 2025 "perfect matching" Papers
OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code
Maxence Faldor, Jenny Zhang, Antoine Cully et al.
Visual Agents as Fast and Slow Thinkers
Guangyan Sun, Mingyu Jin, Zhenting Wang et al.
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Egor Zverev, Sahar Abdelnabi, Soroush Tabesh et al.
Data Shapley in One Training Run
Jiachen (Tianhao) Wang, Prateek Mittal, Dawn Song et al.
MLGym: A New Framework and Benchmark for Advancing AI Research Agents
Deepak Nathani, Lovish Madaan, Nicholas Roberts et al.
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Keisuke Kamahori, Tian Tang, Yile Gu et al.
The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control
Ruili Feng, Han Zhang, Zhilei Shu et al.
STAIR: Improving Safety Alignment with Introspective Reasoning
Yichi Zhang, Siyuan Zhang, Yao Huang et al.
VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
Dominic Maggio, Hyungtae Lim, Luca Carlone
Causal Order: The Key to Leveraging Imperfect Experts in Causal Inference
Aniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar et al.
MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
Zonglin Yang, Wanhao Liu, Ben Gao et al.
SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning
Hojoon Lee, Dongyoon Hwang, Donghu Kim et al.
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing
Xiangyu Zhao, Peiyuan Zhang, Kexian Tang et al.
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.
Scaling FP8 training to trillion-token LLMs
Maxim Fishman, Brian Chmiel, Ron Banner et al.
FlipAttack: Jailbreak LLMs via Flipping
Yue Liu, Xiaoxin He, Miao Xiong et al.
SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
Peixian Ma, Xialie Zhuang, Chengjin Xu et al.
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
Sucheng Ren, Qihang Yu, Ju He et al.
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou et al.
AnyEdit: Edit Any Knowledge Encoded in Language Models
Houcheng Jiang, Junfeng Fang, Ningyu Zhang et al.
Test-time Alignment of Diffusion Models without Reward Over-optimization
Sunwoo Kim, Minkyu Kim, Dongmin Park
OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitting
Xing Hu, Yuan Cheng, Dawei Yang et al.
Learn Beneficial Noise as Graph Augmentation
Siqi Huang, Yanchen Xu, Hongyuan Zhang et al.
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Mingjie Pan, Jiyao Zhang, Tianshu Wu et al.
RMB: Comprehensively benchmarking reward models in LLM alignment
Enyu Zhou, Guodong Zheng, Binghai Wang et al.
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
Pan Wang, Qiang Zhou, Yawen Wu et al.
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
Andy (DiJia) Su, Sainbayar Sukhbaatar, Michael Rabbat et al.
Reloc3r: Large-Scale Training of Relative Camera Pose Regression for Generalizable, Fast, and Accurate Visual Localization
Siyan Dong, Shuzhe Wang, Shaohui Liu et al.
Selective Aggregation for Low-Rank Adaptation in Federated Learning
Pengxin Guo, Shuang Zeng, Yanran Wang et al.
Learning How Hard to Think: Input-Adaptive Allocation of LM Computation
Mehul Damani, Idan Shenfeld, Andi Peng et al.
Reasoning Gym: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Zafir Stojanovski, Oliver Stanley, Joe Sharratt et al.
4K4DGen: Panoramic 4D Generation at 4K Resolution
Renjie Li, Panwang Pan, Bangbang Yang et al.
Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
Orion Weller, Ben Van Durme, Dawn Lawrie et al.
Rank1: Test-Time Compute for Reranking in Information Retrieval
Orion Weller, Kathryn Ricci, Eugene Yang et al.
Generator Matching: Generative modeling with arbitrary Markov processes
Peter Holderrieth, Marton Havasi, Jason Yim et al.
Empirical Design in Reinforcement Learning
Andrew Patterson, Samuel F Neumann, Martha White et al.
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
Shiqi Chen, Tongyao Zhu, Ruochen Zhou et al.
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan et al.
HS-FPN: High Frequency and Spatial Perception FPN for Tiny Object Detection
Zican Shi, Jing Hu, Jie Ren et al.
Real-Time Execution of Action Chunking Flow Policies
Kevin Black, Manuel Galliker, Sergey Levine
WritingBench: A Comprehensive Benchmark for Generative Writing
Yuning Wu, Jiahao Mei, Ming Yan et al.
SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Kaixuan Huang, Xudong Guo, Mengdi Wang
Universal Actions for Enhanced Embodied Foundation Models
Jinliang Zheng, Jianxiong Li, Dongxiu Liu et al.
Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Vikranth Srivatsa, Zijian He, Reyna Abhyankar et al.
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Dahyun Kang, Piotr Bojanowski, Huy V. Vo et al.
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang et al.
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
Zihao Zhou, Shudong Liu, Maizhen Ning et al.
R2E-Gym: Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents
Naman Jain, Jaskirat Singh, Manish Shetty et al.
Image Conductor: Precision Control for Interactive Video Synthesis
Yaowei Li, Xintao Wang, Zhaoyang Zhang et al.
WorldScore: Unified Evaluation Benchmark for World Generation
Haoyi Duan, Hong-Xing Yu, Sirui Chen et al.
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Minghong Cai, Xiaodong Cun, Xiaoyu Li et al.
Depth Any Video with Scalable Synthetic Data
Honghui Yang, Di Huang, Wei Yin et al.
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Chenyang Zhu, Kai Li, Yue Ma et al.
M-LLM Based Video Frame Selection for Efficient Video Understanding
Kai Hu, Feng Gao, Xiaohan Nie et al.
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang, Ming Yin, Jieyu Zhang et al.
Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning
Seanie Lee, Minsu Kim, Lynn Cherif et al.
End-to-End Autonomous Driving Through V2X Cooperation
Haibao Yu, Wenxian Yang, Jiaru Zhong et al.
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Haiwen Feng, Junyi Zhang, Qianqian Wang et al.
What Can RL Bring to VLA Generalization? An Empirical Study
Jijia Liu, Feng Gao, Bingwen Wei et al.
HSEvo: Elevating Automatic Heuristic Design with Diversity-Driven Harmony Search and Genetic Algorithm Using LLMs
Pham Vu Tuan Dat, Long Doan, Huynh Thi Thanh Binh
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
Zhangheng Li, Keen You, Haotian Zhang et al.
Robust LLM safeguarding via refusal feature adversarial training
Lei Yu, Virginie Do, Karen Hambardzumyan et al.
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu et al.
MobileMamba: Lightweight Multi-Receptive Visual Mamba Network
Haoyang He, Jiangning Zhang, Yuxuan Cai et al.
Context is Key: A Benchmark for Forecasting with Essential Textual Information
Andrew Williams, Arjun Ashok, Étienne Marcotte et al.
Taming Overconfidence in LLMs: Reward Calibration in RLHF
Jixuan Leng, Chengsong Huang, Banghua Zhu et al.
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Hao Gao, Shaoyu Chen, Bo Jiang et al.
Learning 4D Embodied World Models
Haoyu Zhen, Qiao Sun, Hongxin Zhang et al.
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent
Jianke Zhang, Yanjiang Guo, Yucheng Hu et al.
The Surprising Effectiveness of Test-Time Training for Few-Shot Learning
Ekin Akyürek, Mehul Damani, Adam Zweiger et al.
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Jeremy Irvin, Emily Liu, Joyce Chen et al.
LLMs as Research Tools: A Large Scale Survey of Researchers’ Usage and Perceptions
Zhehui Liao, Maria Antoniak, Inyoung Cheong et al.
PEARL: Parallel Speculative Decoding with Adaptive Draft Length
Tianyu Liu, Yun Li, Qitan Lv et al.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Kumara Kahatapitiya, Haozhe Liu, Sen He et al.
How efficient is LLM-generated code? A rigorous & high-standard benchmark
Ruizhong Qiu, Weiliang Zeng, James Ezick et al.
Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design
Chenyu Wang, Masatoshi Uehara, Yichun He et al.
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Yingying Deng, Xiangyu He, Changwang Mei et al.
On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents
Jen-Tse Huang, Jiaxu Zhou, Tailin Jin et al.
FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"
Yifei Ming, Senthil Purushwalkam, Shrey Pandit et al.
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung, Faeze Brahman, Yejin Choi
Towards Practical Real-Time Neural Video Compression
Zhaoyang Jia, Bin Li, Jiahao Li et al.
Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving
Ziying Song, Caiyan Jia, Lin Liu et al.
Autoregressive Adversarial Post-Training for Real-Time Interactive Video Generation
Shanchuan Lin, Ceyuan Yang, Hao He et al.
To Code or Not To Code? Exploring Impact of Code in Pre-training
Viraat Aryabumi, Yixuan Su, Raymond Ma et al.
ReasonIR: Training Retrievers for Reasoning Tasks
Rulin Shao, Rui Qiao, Varsha Kishore et al.
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Yao Teng, Han Shi, Xian Liu et al.
ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu, Zhanming Jie, Hanze Dong et al.
Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding
Yiming Wang, Pei Zhang, Siyuan Huang et al.
Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang et al.
Scaling RL to Long Videos
Yukang Chen, Wei Huang, Baifeng Shi et al.
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Jinbin Bai, Tian Ye, Wei Chow et al.
MET3R: Measuring Multi-View Consistency in Generated Images
Mohammad Asim, Christopher Wewer, Thomas Wimmer et al.
Video World Models with Long-term Spatial Memory
Tong Wu, Shuai Yang, Ryan Po et al.
Accelerating Diffusion LLMs via Adaptive Parallel Decoding
Daniel Israel, Guy Van den Broeck, Aditya Grover
InterMimic: Towards Universal Whole-Body Control for Physics-Based Human-Object Interactions
Sirui Xu, Hung Yu Ling, Yu-Xiong Wang et al.
Thinking LLMs: General Instruction Following with Thought Generation
Tianhao Wu, Janice Lan, Weizhe Yuan et al.
QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning
Haoxuan Wang, Yuzhang Shang, Zhihang Yuan et al.
RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers
Min Zhao, Guande He, Yixiao Chen et al.
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Chengqi Lyu, Songyang Gao, Yuzhe Gu et al.
Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition
Kun Li, Dan Guo, Guoliang Chen et al.
Sonata: Self-Supervised Learning of Reliable Point Representations
Xiaoyang Wu, Daniel DeTone, Duncan Frost et al.
CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair
Mingjie Liu, Yun-Da Tsai, Wenfei Zhou et al.
One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation
Zhendong Wang, Max Li, Ajay Mandlekar et al.
AdaWorld: Learning Adaptable World Models with Latent Actions
Shenyuan Gao, Siyuan Zhou, Yilun Du et al.
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
Hao He, Ceyuan Yang, Shanchuan Lin et al.
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Junjie He, Yifeng Geng, Liefeng Bo
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix Jedidja Binder, James Chua, Tomek Korbak et al.
Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts
Haizhong Zheng, Yang Zhou, Brian Bartoldson et al.
AA-CLIP: Enhancing Zero-Shot Anomaly Detection via Anomaly-Aware CLIP
Wenxin Ma, Xu Zhang, Qingsong Yao et al.
TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark
Kush Jain, Gabriel Synnaeve, Baptiste Roziere
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Chunwei Wang, Guansong Lu, Junwei Yang et al.
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana et al.
Theory on Mixture-of-Experts in Continual Learning
Hongbo Li, Sen Lin, Lingjie Duan et al.
On the Optimization and Generalization of Multi-head Attention
Christos Thrampoulidis, Rouzbeh Ghaderi, Hossein Taheri et al.
An Architecture Search Framework for Inference-Time Techniques
Jon Saad-Falcon, Adrian Lafuente, Shlok Natarajan et al.
Detecting Data Deviations in Electronic Health Records
Kaiping Zheng, Horng-Ruey Chua, Beng Chin Ooi
The Semantic Hub Hypothesis: Language Models Share Semantic Representations Across Languages and Modalities
Zhaofeng Wu, Xinyan Yu, Dani Yogatama et al.
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Zhibo Yang, Jun Tang, Zhaohai Li et al.
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
Large Language Models Assume People are More Rational than We Really are
Ryan Liu, Jiayi Geng, Joshua Peterson et al.
PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang et al.
ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning
Jingyang Yi, Jiazheng Wang, Sida Li
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di, Zhelun Yu, Guanghao Zhang et al.
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents
Qianhui Wu, Kanzhi Cheng, Rui Yang et al.
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Guo Chen, Yicheng Liu, Yifei Huang et al.
Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment
Audrey Huang, Adam Block, Qinghua Liu et al.
VISTA3D: A Unified Segmentation Foundation Model For 3D Medical Imaging
Yufan He, Pengfei Guo, Yucheng Tang et al.
CollabLLM: From Passive Responders to Active Collaborators
Shirley Wu, Michel Galley, Baolin Peng et al.
Arctic-Embed 2.0: Multilingual Retrieval Without Compromise
Puxuan Yu, Luke Merrick, Gaurav Nuti et al.
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
Hongyu Li, Jinyu Chen, Ziyu Wei et al.
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Jintao Zhang, Chendong Xiang, Haofeng Huang et al.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.
5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks
Dongshuo Yin, Leiyi Hu, Bin Li et al.
TabArena: A Living Benchmark for Machine Learning on Tabular Data
Nick Erickson, Lennart Purucker, Andrej Tschalzev et al.
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier et al.
Towards Realistic Data Generation for Real-World Super-Resolution
Long Peng, Wenbo Li, Renjing Pei et al.
STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving
Kefan Dong, Tengyu Ma
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis
Alexander Mai, Peter Hedman, George Kopanas et al.
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
Wanggui He, Siming Fu, Mushui Liu et al.
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
Zhaorun Chen, Mintong Kang, Bo Li
ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Yi Ding, Bolian Li, Ruqi Zhang
How Does Critical Batch Size Scale in Pre-training?
Hanlin Zhang, Depen Morwani, Nikhil Vyas et al.
Self-Evolving Multi-Agent Collaboration Networks for Software Development
Yue Hu, Yuzhu Cai, Yaxin Du et al.
Learning Harmonized Representations for Speculative Sampling
Lefan Zhang, Xiaodan Wang, Yanhua Huang et al.
ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer
Zhen Han, Zeyinzi Jiang, Yulin Pan et al.
Point-SAM: Promptable 3D Segmentation Model for Point Clouds
Yuchen Zhou, Jiayuan Gu, Tung Chiang et al.
Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection
Zhiyuan Yan, Jiangming Wang, Peng Jin et al.
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, Micah Carroll, Adhyyan Narang et al.
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li et al.
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation
Hui Li, Mingwang Xu, Qingkun Su et al.
Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh et al.
Driving in the Occupancy World: Vision-Centric 4D Occupancy Forecasting and Planning via World Models for Autonomous Driving
Yu Yang, Jianbiao Mei, Yukai Ma et al.
YOLOE: Real-Time Seeing Anything
Ao Wang, Lihao Liu, Hui Chen et al.
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
Rong Li, Shijie Li, Lingdong Kong et al.
VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation
Wei Zhao, Pengxiang Ding, Zhang Min et al.
Competition Dynamics Shape Algorithmic Phases of In-Context Learning
Core Francisco Park, Ekdeep Singh Lubana, Hidenori Tanaka
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
Zhongshen Zeng, Pengguang Chen, Shu Liu et al.
Reasoning-SQL: Reinforcement Learning with SQL Tailored Partial Rewards for Reasoning-Enhanced Text-to-SQL
Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun et al.
A Distractor-Aware Memory for Visual Object Tracking with SAM2
Alan Lukezic, Jovana Videnović, Matej Kristan
Aligning Language Models with Demonstrated Feedback
Omar Shaikh, Michelle Lam, Joey Hejna et al.
Robust Function-Calling for On-Device Language Model via Function Masking
Qiqiang Lin, Muning Wen, Qiuying Peng et al.
Real2Code: Reconstruct Articulated Objects via Code Generation
Mandi Zhao, Yijia Weng, Dominik Bauer et al.
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate
Yubo Wang, Xiang Yue, Wenhu Chen
Why Does the Effective Context Length of LLMs Fall Short?
Chenxin An, Jun Zhang, Ming Zhong et al.
GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control
Mariam Hassan, Sebastian Stapf, Ahmad Rahimi et al.
Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification
Wenxuan Huang, Zijie Zhai, Yunhang Shen et al.
Frame-Voyager: Learning to Query Frames for Video Large Language Models
Sicheng Yu, Chengkai Jin, Huanyu Wang et al.
PhyT2V: LLM-Guided Iterative Self-Refinement for Physics-Grounded Text-to-Video Generation
Qiyao Xue, Xiangyu Yin, Boyuan Yang et al.
Aesthetic Post-Training Diffusion Models from Generic Preferences with Step-by-step Preference Optimization
Zhanhao Liang, Yuhui Yuan, Shuyang Gu et al.
TSD-SR: One-Step Diffusion with Target Score Distillation for Real-World Image Super-Resolution
Linwei Dong, Qingnan Fan, Yihong Guo et al.
Learning to Prompt with Text Only Supervision for Vision-Language Models
Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer et al.
Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking
Xiantao Hu, Ying Tai, Xu Zhao et al.
Faster Video Diffusion with Trainable Sparse Attention
Peiyuan Zhang, Yongqi Chen, Haofeng Huang et al.
GenXD: Generating Any 3D and 4D Scenes
Yuyang Zhao, Chung-Ching Lin, Kevin Lin et al.
Parallelized Autoregressive Visual Generation
Yuqing Wang, Shuhuai Ren, Zhijie Lin et al.
Agents' Room: Narrative Generation through Multi-step Collaboration
Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki et al.
PolaFormer: Polarity-aware Linear Attention for Vision Transformers
Weikang Meng, Yadan Luo, Xin Li et al.
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
David Junhao Zhang, Roni Paiss, Shiran Zada et al.
ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval
Zixu Li, Zhiwei Chen, Haokun Wen et al.
Transformer Layers as Painters
Qi Sun, Marc Pickett, Aakash Kumar Nain et al.
Diffusion Feedback Helps CLIP See Better
Wenxuan Wang, Quan Sun, Fan Zhang et al.
SymmCD: Symmetry-Preserving Crystal Generation with Diffusion Models
Daniel Levy, Siba Smarak Panigrahi, Sékou-Oumar Kaba et al.
SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
Heming Xia, Yongqi Li, Jun Zhang et al.
Diffusion Self-Distillation for Zero-Shot Customized Image Generation
Shengqu Cai, Eric Ryan Chan, Yunzhi Zhang et al.
On the expressiveness and spectral bias of KANs
Yixuan Wang, Jonathan Siegel, Ziming Liu et al.
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
Yaxi Lu, Shenzhi Yang, Cheng Qian et al.
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation
Koichi Namekata, Sherwin Bahmani, Ziyi Wu et al.
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
Xinshuai Song, Weixing Chen, Yang Liu et al.
OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels
Meng Lou, Yizhou Yu
TinySAM: Pushing the Envelope for Efficient Segment Anything Model
Han Shu, Wenshuo Li, Yehui Tang et al.
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Hang Yin, Xiuwei Xu, Linqing Zhao et al.
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
Junsong Chen, Shuchen Xue, Yuyang Zhao et al.
DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation
Dongya Jia, Zhuo Chen, Jiawei Chen et al.
Adjoint Sampling: Highly Scalable Diffusion Samplers via Adjoint Matching
Aaron Havens, Benjamin Kurt Miller, Bing Yan et al.
Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning
Zhiyuan Yan, Yandan Zhao, Shen Chen et al.
MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation
Zehuan Huang, Yuanchen Guo, Xingqiao An et al.
DELTA: Dense Efficient Long-Range 3D Tracking for Any Video
Tuan Ngo, Peiye Zhuang, Evangelos Kalogerakis et al.
Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond
Chongyu Fan, Jinghan Jia, Yihua Zhang et al.
KungfuBot: Physics-Based Humanoid Whole-Body Control for Learning Highly-Dynamic Skills
Weiji Xie, Jinrui Han, Jiakun Zheng et al.
Re-thinking Temporal Search for Long-Form Video Understanding
Jinhui Ye, Zihan Wang, Haosen Sun et al.