Most Cited 2025 "statistical belief estimation" Papers
22,274 papers found • Page 3 of 112
Simplifying Deep Temporal Difference Learning
Matteo Gallici, Mattie Fellows, Benjamin Ellis et al.
Task Singular Vectors: Reducing Task Interference in Model Merging
Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli et al.
EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images
Wangbo Yu, Chaoran Feng, Jianing Li et al.
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.
CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control
Guy Tevet, Sigal Raab, Setareh Cohan et al.
Multiple Object Tracking as ID Prediction
Ruopeng Gao, Ji Qi, Limin Wang
TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining
Wanchao Liang, Tianyu Liu, Less Wright et al.
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Swarnadeep Saha, Xian Li, Marjan Ghazvininejad et al.
Tell me about yourself: LLMs are aware of their learned behaviors
Jan Betley, Xuchan Bao, Martín Soto et al.
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
Weihao Ye, Qiong Wu, Wenhao Lin et al.
Flow Q-Learning
Seohong Park, Qiyang Li, Sergey Levine
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
Rang Meng, Xingyu Zhang, Yuming Li et al.
Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology
Xiangyu Wang, Donglin Yang, Ziqin Wang et al.
A Decade's Battle on Dataset Bias: Are We There Yet?
Zhuang Liu, Kaiming He
Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection
Jia Guo, Shuai Lu, Weihang Zhang et al.
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Qi Qin, Le Zhuo, Yi Xin et al.
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang, Jinyeong Kim, Junhyeok Kim et al.
Physics-Informed Diffusion Models
Jan-Hendrik Bastek, WaiChing Sun, Dennis Kochmann
MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Ziyang Ma, Yinghao Ma, Yanqiao Zhu et al.
RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics
Enshen Zhou, Jingkun An, Cheng Chi et al.
NoLiMa: Long-Context Evaluation Beyond Literal Matching
Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt et al.
How to Evaluate Reward Models for RLHF
Evan Frick, Tianle Li, Connor Chen et al.
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
Adam Karvonen, Can Rager, Johnny Lin et al.
Inference Scaling for Long-Context Retrieval Augmented Generation
Zhenrui Yue, Honglei Zhuang, Aijun Bai et al.
OmniBench: Towards The Future of Universal Omni-Language Models
Yizhi Li, Ge Zhang, Yinghao Ma et al.
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei et al.
FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
Xunhao Lai, Jianqiao Lu, Yao Luo et al.
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
Zebin Xing, Xingyu Zhang, Yang Hu et al.
BOND: Aligning LLMs with Best-of-N Distillation
Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot-Desenonges et al.
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Yiheng Xu, Dunjie Lu, Zhennan Shen et al.
Does Spatial Cognition Emerge in Frontier Models?
Santhosh Kumar Ramakrishnan, Erik Wijmans, Philipp Krähenbühl et al.
MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation
Jinfeng Xu, Zheyu Chen, Shuo Yang et al.
End-to-End Driving with Online Trajectory Evaluation via BEV World Model
Yingyan Li, Yuqi Wang, Yang Liu et al.
Describe Anything: Detailed Localized Image and Video Captioning
Long Lian, Yifan Ding, Yunhao Ge et al.
TranSplat: Generalizable 3D Gaussian Splatting from Sparse Multi-View Images with Transformers
Chuanrui Zhang, Yingshuang Zou, Zhuoling Li et al.
RepoGraph: Enhancing AI Software Engineering with Repository-level Code Graph
Siru Ouyang, Wenhao Yu, Kaixin Ma et al.
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
Baorui Ma, Huachen Gao, Haoge Deng et al.
Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient
Yongliang Wu, Shiji Zhou, Mingzhuo Yang et al.
BioDiscoveryAgent: An AI Agent for Designing Genetic Perturbation Experiments
Yusuf Roohani, Andrew Lee, Qian Huang et al.
NVS-Solver: Video Diffusion Model as Zero-Shot Novel View Synthesizer
Meng You, Zhiyu Zhu, Hui Liu et al.
MMFakeBench: A Mixed-Source Multimodal Misinformation Detection Benchmark for LVLMs
Xuannan Liu, Zekun Li, Pei Li et al.
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
Sucheng Ren, Qihang Yu, Ju He et al.
Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models
Seyedmorteza Sadat, Otmar Hilliges, Romann Weber
GRIT: Teaching MLLMs to Think with Images
Yue Fan, Xuehai He, Diji Yang et al.
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
Han Lin, Jaemin Cho, Abhay Zala et al.
Causal Order: The Key to Leveraging Imperfect Experts in Causal Inference
Aniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar et al.
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari et al.
Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems
Tian Ye, Zicheng Xu, Yuanzhi Li et al.
VinePPO: Refining Credit Assignment in RL Training of LLMs
Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance et al.
Energy-Based Diffusion Language Models for Text Generation
Minkai Xu, Tomas Geffner, Karsten Kreis et al.
RNNs are not Transformers (Yet): The Key Bottleneck on In-Context Retrieval
Kaiyue Wen, Xingyu Dang, Kaifeng Lyu
Calibrating Large Language Models with Sample Consistency
Qing Lyu, Kumar Shridhar, Chaitanya Malaviya et al.
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
Fanqing Meng, Jin Wang, Chuanhao Li et al.
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi et al.
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
Duy-Kien Nguyen, Mahmoud Assran, Unnat Jain et al.
WorldMem: Long-term Consistent World Simulation with Memory
Zeqi Xiao, Yushi Lan, Yifan Zhou et al.
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li, Zhe Chen, Weiyun Wang et al.
How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
Yatin Dandi, Florent Krzakala, Bruno Loureiro et al.
Aether: Geometric-Aware Unified World Modeling
Haoyi Zhu, Yifan Wang, Jianjun Zhou et al.
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Alexander Wettig, Kyle Lo, Sewon Min et al.
WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng et al.
Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
Zhenting Qi, Hanlin Zhang, Eric P Xing et al.
EraseAnything: Enabling Concept Erasure in Rectified Flow Transformers
Daiheng Gao, Shilin Lu, Wenbo Zhou et al.
CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models
Zhihang Lin, Mingbao Lin, Yuan Xie et al.
Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow
Fu-Yun Wang, Ling Yang, Zhaoyang Huang et al.
Language Model Can Listen While Speaking
Ziyang Ma, Yakun Song, Chenpeng Du et al.
LoRA-Pro: Are Low-Rank Adapters Properly Optimized?
Zhengbo Wang, Jian Liang, Ran He et al.
Eliminating Position Bias of Language Models: A Mechanistic Approach
Ziqi Wang, Hanlin Zhang, Xiner Li et al.
ALLaM: Large Language Models for Arabic and English
M Saiful Bari, Yazeed Alnumay, Norah Alzahrani et al.
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua, Yunlong Tang, Chenliang Xu et al.
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar et al.
Towards Interpreting Visual Information Processing in Vision-Language Models
Clement Neo, Luke Ong, Philip Torr et al.
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
Weizhe Yuan, Jane Yu, Song Jiang et al.
Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving
Daoguang Zan, Zhirong Huang, Wei Liu et al.
Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Mingfei Han, Linjie Yang, Xiaojun Chang et al.
MultiBooth: Towards Generating All Your Concepts in an Image from Text
Chenyang Zhu, Kai Li, Yue Ma et al.
The Matrix: Infinite-Horizon World Generation with Real-Time Moving Control
Ruili Feng, Han Zhang, Zhilei Shu et al.
Learn Your Reference Model for Real Good Alignment
Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov et al.
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
Matthew Chang, Gunjan Chhablani, Alexander Clegg et al.
SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
Peixian Ma, Xialie Zhuang, Chengjin Xu et al.
S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models
Muzhi Dai, Chenxu Yang, Qingyi Si
NETS: A Non-equilibrium Transport Sampler
Michael Albergo, Eric Vanden-Eijnden
WorldScore: Unified Evaluation Benchmark for World Generation
Haoyi Duan, Hong-Xing Yu, Sirui Chen et al.
Image Conductor: Precision Control for Interactive Video Synthesis
Yaowei Li, Xintao Wang, Zhaoyang Zhang et al.
WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori et al.
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan et al.
M-LLM Based Video Frame Selection for Efficient Video Understanding
Kai Hu, Feng Gao, Xiaohan Nie et al.
Model merging with SVD to tie the Knots
George Stoica, Pratik Ramesh, Boglarka Ecsedi et al.
Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
Danny Driess, Jost Springenberg, Brian Ichter et al.
StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization
Zhuoqun Li, Xuanang Chen, Haiyang Yu et al.
TEOChat: A Large Vision-Language Assistant for Temporal Earth Observation Data
Jeremy Irvin, Emily Liu, Joyce Chen et al.
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning
Andy (DiJia) Su, Hanlin Zhu, Yingchen Xu et al.
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
Junfei Wu, Jian Guan, Kaituo Feng et al.
EQ-VAE: Equivariance Regularized Latent Space for Improved Generative Image Modeling
Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris et al.
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
Rui Chen, Jianfeng Zhang, Yixun Liang et al.
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Keisuke Kamahori, Tian Tang, Yile Gu et al.
MOOSE-Chem: Large Language Models for Rediscovering Unseen Chemistry Scientific Hypotheses
Zonglin Yang, Wanhao Liu, Ben Gao et al.
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
Fan-Yun Sun, Weiyu Liu, Siyi Gu et al.
FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"
Yifei Ming, Senthil Purushwalkam, Shrey Pandit et al.
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
Katrin Renz, Long Chen, Elahe Arani et al.
TabM: Advancing tabular deep learning with parameter-efficient ensembling
Yury Gorishniy, Akim Kotelnikov, Artem Babenko
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Egor Zverev, Sahar Abdelnabi, Soroush Tabesh et al.
VGGT-SLAM: Dense RGB SLAM Optimized on the SL(4) Manifold
Dominic Maggio, Hyungtae Lim, Luca Carlone
Learning How Hard to Think: Input-Adaptive Allocation of LM Computation
Mehul Damani, Idan Shenfeld, Andi Peng et al.
Depth Any Video with Scalable Synthetic Data
Honghui Yang, Di Huang, Wei Yin et al.
End-to-End Autonomous Driving Through V2X Cooperation
Haibao Yu, Wenxian Yang, Jiaru Zhong et al.
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang
RRM: Robust Reward Model Training Mitigates Reward Hacking
Tianqi Liu, Wei Xiong, Jie Ren et al.
Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning
Seanie Lee, Minsu Kim, Lynn Cherif et al.
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
Haotong Lin, Sida Peng, Jingxiao Chen et al.
MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
Ruiyuan Gao, Kai Chen, Bo Xiao et al.
Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang et al.
Atom of Thoughts for Markov LLM Test-Time Scaling
Fengwei Teng, Quan Shi, Zhaoyang Yu et al.
Data Shapley in One Training Run
Jiachen (Tianhao) Wang, Prateek Mittal, Dawn Song et al.
On the Optimization and Generalization of Multi-head Attention
Christos Thrampoulidis, Rouzbeh Ghaderi, Hossein Taheri et al.
OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code
Maxence Faldor, Jenny Zhang, Antoine Cully et al.
Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Yongxin Zhu, Bocheng Li, Yifei Xin et al.
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Minghong Cai, Xiaodong Cun, Xiaoyu Li et al.
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang et al.
Visual Agents as Fast and Slow Thinkers
Guangyan Sun, Mingyu Jin, Zhenting Wang et al.
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Xingyu Fu, Minqian Liu, Zhengyuan Yang et al.
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang et al.
How efficient is LLM-generated code? A rigorous & high-standard benchmark
Ruizhong Qiu, Weiliang Zeng, James Ezick et al.
Vision Language Models are In-Context Value Learners
Yecheng Jason Ma, Joey Hejna, Chuyuan Fu et al.
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Mingjie Pan, Jiyao Zhang, Tianshu Wu et al.
Generator Matching: Generative modeling with arbitrary Markov processes
Peter Holderrieth, Marton Havasi, Jason Yim et al.
Detecting Data Deviations in Electronic Health Records
Kaiping Zheng, Horng-Ruey Chua, Beng Chin Ooi
Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models
Yinlam Chow, Guy Tennenholtz, Izzeddin Gur et al.
LLM Generated Persona is a Promise with a Catch
Leon Li, Haozhe Chen, Hongseok Namkoong et al.
MET3R: Measuring Multi-View Consistency in Generated Images
Mohammad Asim, Christopher Wewer, Thomas Wimmer et al.
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis
Pan Wang, Qiang Zhou, Yawen Wu et al.
ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer
Zhen Han, Zeyinzi Jiang, Yulin Pan et al.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Chunwei Wang, Guansong Lu, Junwei Yang et al.
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Roman Bachmann, Jesse Allardice, David Mizrahi et al.
Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms
Zhangheng Li, Keen You, Haotian Zhang et al.
Learning 4D Embodied World Models
Haoyu Zhen, Qiao Sun, Hongxin Zhang et al.
Selective Aggregation for Low-Rank Adaptation in Federated Learning
Pengxin Guo, Shuang Zeng, Yanran Wang et al.
Catastrophic Failure of LLM Unlearning via Quantization
Zhiwei Zhang, Fali Wang, Xiaomin Li et al.
Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models
Orion Weller, Ben Van Durme, Dawn Lawrie et al.
Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis
Jinbin Bai, Tian Ye, Wei Chow et al.
RAD: Training an End-to-End Driving Policy via Large-Scale 3DGS-based Reinforcement Learning
Hao Gao, Shaoyu Chen, Bo Jiang et al.
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Junjie He, Yifeng Geng, Liefeng Bo
RMB: Comprehensively benchmarking reward models in LLM alignment
Enyu Zhou, Guodong Zheng, Binghai Wang et al.
Frame-Voyager: Learning to Query Frames for Video Large Language Models
Sicheng Yu, Chengkai Jin, Huanyu Wang et al.
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning
Yibin Wang, Zhimin Li, Yuhang Zang et al.
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung, Faeze Brahman, Yejin Choi
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding
Yao Teng, Han Shi, Xian Liu et al.
Context is Key: A Benchmark for Forecasting with Essential Textual Information
Andrew Williams, Arjun Ashok, Étienne Marcotte et al.
Universal Actions for Enhanced Embodied Foundation Models
Jinliang Zheng, Jianxiong Li, Dongxiu Liu et al.
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou et al.
ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Yi Ding, Bolian Li, Ruqi Zhang
Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization
Audrey Huang, Wenhao Zhan, Tengyang Xie et al.
Real2Code: Reconstruct Articulated Objects via Code Generation
Mandi Zhao, Yijia Weng, Dominik Bauer et al.
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang, Ziheng Wang, Boshen Xu et al.
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Zhibo Yang, Jun Tang, Zhaohai Li et al.
On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback
Marcus Williams, Micah Carroll, Adhyyan Narang et al.
CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding
Guo Chen, Yicheng Liu, Yifei Huang et al.
GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control
Mariam Hassan, Sebastian Stapf, Ahmad Rahimi et al.
Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition
Kun Li, Dan Guo, Guoliang Chen et al.
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Haiwen Feng, Junyi Zhang, Qianqian Wang et al.
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier et al.
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Dahyun Kang, Piotr Bojanowski, Huy V. Vo et al.
WritingBench: A Comprehensive Benchmark for Generative Writing
Yuning Wu, Jiahao Mei, Ming Yan et al.
Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Vikranth Srivatsa, Zijian He, Reyna Abhyankar et al.
Video World Models with Long-term Spatial Memory
Tong Wu, Shuai Yang, Ryan Po et al.
Accelerated Sampling from Masked Diffusion Models via Entropy Bounded Unmasking
Heli Ben-Hamu, Itai Gat, Daniel Severo et al.
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities
Zheyuan Zhang, Fengyuan Hu, Jayjun Lee et al.
A Distractor-Aware Memory for Visual Object Tracking with SAM2
Alan Lukezic, Jovana Videnović, Matej Kristan
Self-Evolving Multi-Agent Collaboration Networks for Software Development
Yue Hu, Yuzhu Cai, Yaxin Du et al.
A-Bench: Are LMMs Masters at Evaluating AI-generated Images?
Zicheng Zhang, Haoning Wu, Chunyi Li et al.
SANA: Efficient High-Resolution Text-to-Image Synthesis with Linear Diffusion Transformers
Enze Xie, Junsong Chen, Junyu Chen et al.
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
Multi-subject Open-set Personalization in Video Generation
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.
Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning
Zhiyuan Yan, Yandan Zhao, Shen Chen et al.
Don't Shake the Wheel: Momentum-Aware Planning in End-to-End Autonomous Driving
Ziying Song, Caiyan Jia, Lin Liu et al.
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation
Xiao Fu, Xian Liu, Xintao Wang et al.
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation
Hui Li, Mingwang Xu, Qingkun Su et al.
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li et al.
Scaling Speech-Text Pre-training with Synthetic Interleaved Data
Aohan Zeng, Zhengxiao Du, Mingdao Liu et al.
To Code or Not To Code? Exploring Impact of Code in Pre-training
Viraat Aryabumi, Yixuan Su, Raymond Ma et al.
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
Rong Li, Shijie Li, Lingdong Kong et al.
The Surprising Effectiveness of Test-Time Training for Few-Shot Learning
Ekin Akyürek, Mehul Damani, Adam Zweiger et al.
Point-SAM: Promptable 3D Segmentation Model for Point Clouds
Yuchen Zhou, Jiayuan Gu, Tung Chiang et al.
On the expressiveness and spectral bias of KANs
Yixuan Wang, Jonathan Siegel, Ziming Liu et al.
Fine-Tuning Discrete Diffusion Models via Reward Optimization with Applications to DNA and Protein Design
Chenyu Wang, Masatoshi Uehara, Yichun He et al.
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di, Zhelun Yu, Guanghao Zhang et al.
Human-inspired Episodic Memory for Infinite Context LLMs
Zafeirios Fountas, Martin A Benfeghoul, Adnan Oomerjee et al.
Learning to Prompt with Text Only Supervision for Vision-Language Models
Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer et al.
Theory on Mixture-of-Experts in Continual Learning
Hongbo Li, Sen Lin, Lingjie Duan et al.
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi et al.
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix Jedidja Binder, James Chua, Tomek Korbak et al.
Trajectory attention for fine-grained video motion control
Zeqi Xiao, Wenqi Ouyang, Yifan Zhou et al.
Towards Realistic Data Generation for Real-World Super-Resolution
Long Peng, Wenbo Li, Renjing Pei et al.
PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
Minghua Liu, Mikaela Uy, Donglai Xiang et al.
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
Zhongshen Zeng, Pengguang Chen, Shu Liu et al.
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux et al.
DEFOM-Stereo: Depth Foundation Model Based Stereo Matching
Hualie Jiang, Zhiqiang Lou, Laiyan Ding et al.
TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark
Kush Jain, Gabriel Synnaeve, Baptiste Roziere
RATT: A Thought Structure for Coherent and Correct LLM Reasoning
Jinghan Zhang, Xiting Wang, Weijieying Ren et al.
HSEvo: Elevating Automatic Heuristic Design with Diversity-Driven Harmony Search and Genetic Algorithm Using LLMs
Pham Vu Tuan Dat, Long Doan, Huynh Thi Thanh Binh