Most Cited 2025 Spotlight "llm hallucinations" Papers
22,274 papers found
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu et al.
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Clemencia Siro, Guy Gur-Ari, Gaurav Mishra et al.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng et al.
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
Chaoyou Fu, Peixian Chen, Yunhang Shen et al.
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu et al.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, Han, Alex Gu et al.
YOLOv12: Attention-Centric Real-Time Object Detectors
Yunjie Tian, Qixiang Ye, David Doermann
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.
WorldSimBench: Towards Video Generation Models as World Simulators
Yiran Qin, Zhelun Shi, Jiwen Yu et al.
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
Javier Rando, Tony Wang, Stewart Slocum et al.
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li et al.
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Bowen Jin, Hansi Zeng, Zhenrui Yue et al.
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
Haipeng Luo, Qingfeng Sun, Can Xu et al.
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
Jipeng Zhang, Hanze Dong, Tong Zhang et al.
VGGT: Visual Geometry Grounded Transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev et al.
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue, Zhiqi Chen, Rui Lu et al.
Gymnasium: A Standard Interface for Reinforcement Learning Environments
Mark Towers, Ariel Kwiatkowski, John Balis et al.
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert, Jacob Morrison, Valentina Pyatkin et al.
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Jinheng Xie, Weijia Mao, Zechen Bai et al.
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Tianzhe Chu, Yuexiang Zhai, Jihan Yang et al.
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Iman Mirzadeh, Keivan Alizadeh-Vahid, Hooman Shahrokhi et al.
Structured 3D Latents for Scalable and Versatile 3D Generation
Jianfeng Xiang, Zelong Lv, Sicheng Xu et al.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Chankyu Lee, Rajarshi Roy, Mengyao Xu et al.
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo, Minh Chien Vu, Jenny Chim et al.
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
Songming Liu, Lingxuan Wu, Bangguo Li et al.
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You et al.
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
Chenhao Tan, Robert Ness, Amit Sharma et al.
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild
Weihao Zeng, Yuzhen Huang, Qian Liu et al.
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song et al.
LIMO: Less is More for Reasoning
Yixin Ye, Zhen Huang, Yang Xiao et al.
Generative Verifiers: Reward Modeling as Next-Token Prediction
Lunjun Zhang, Arian Hosseini, Hritik Bansal et al.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Ziang Wu et al.
From Crowdsourced Data to High-quality Benchmarks: Arena-Hard and Benchbuilder Pipeline
Tianle Li, Wei-Lin Chiang, Evan Frick et al.
Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang et al.
Training Large Language Models to Reason in a Continuous Latent Space
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su et al.
U-KAN Makes Strong Backbone for Medical Image Segmentation and Generation
Chenxin Li, Xinyu Liu, Wuyang Li et al.
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
Jingcheng Hu, Yinmin Zhang, Qi Han et al.
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang et al.
Scaling and evaluating sparse autoencoders
Leo Gao, Tom Dupre la Tour, Henk Tillman et al.
Training Language Models to Self-Correct via Reinforcement Learning
Aviral Kumar, Vincent Zhuang, Rishabh Agarwal et al.
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
Kanishk Gandhi, Ayush K Chakravarthy, Anikait Singh et al.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Chunting Zhou, Lili Yu, Arun Babu et al.
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
Alexey Bochkovskiy, Amaël Delaunoy, Hugo Germain et al.
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
Shenzhi Wang, Le Yu, Chang Gao et al.
Safety Alignment Should be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Yichen Gong, Delong Ran, Jinyuan Liu et al.
Mixture-of-Agents Enhances Large Language Model Capabilities
Junlin Wang, Jue Wang, Ben Athiwaratkun et al.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu et al.
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Chenglei Si, Diyi Yang, Tatsunori Hashimoto
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh-Vahid et al.
FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios
Ethan Chern, Steffi Chern, Shiqi Chen et al.
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
Junyi Zhang, Charles Herrmann, Junhwa Hur et al.
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Zhangchen Xu, Fengqing Jiang, Luyao Niu et al.
OmniGen: Unified Image Generation
Shitao Xiao, Yueze Wang, Junjie Zhou et al.
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Xinyu Guan, Li Lyna Zhang, Yifei Liu et al.
SpinQuant: LLM Quantization with Learned Rotations
Zechun Liu, Changsheng Zhao, Igor Fedorov et al.
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu, Xinggang Wang, Xinlong Wang
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Yi Yang, Xiaoxuan He, Hongkun Pan et al.
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Ali Hatamizadeh, Jan Kautz
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric Michaud et al.
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li et al.
Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents
Boyu Gou, Demi Ruohan Wang, Boyuan Zheng et al.
LoRA Learns Less and Forgets Less
Jonathan Frankle, Jose Javier Gonzalez Ortiz, Cody Blakeney et al.
Continuous 3D Perception Model with Persistent State
Qianqian Wang, Yifei Zhang, Aleksander Holynski et al.
A-Mem: Agentic Memory for LLM Agents
Wujiang Xu, Zujie Liang, Kai Mei et al.
L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Pranjal Aggarwal, Sean Welleck
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Zayne Sprague, Fangcong Yin, Juan Rodriguez et al.
Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models
Jimeng Sun, Shubhendu Trivedi, Zhen Lin
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Qingqing Zhao, Yao Lu, Moo Jin Kim et al.
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu et al.
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang, Mu Cai, Bingxin Xu et al.
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang et al.
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong et al.
Pyramidal Flow Matching for Efficient Video Generative Modeling
Yang Jin, Zhicheng Sun, Ningyuan Li et al.
OminiControl: Minimal and Universal Control for Diffusion Transformer
Zhenxiong Tan, Songhua Liu, Xingyi Yang et al.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.
Generative Representational Instruction Tuning
Niklas Muennighoff, Hongjin Su, Liang Wang et al.
Self-Play Preference Optimization for Language Model Alignment
Yue Wu, Zhiqing Sun, Rina Hughes et al.
Flow-GRPO: Training Flow Matching Models via Online RL
Jie Liu, Gongye Liu, Jiajun Liang et al.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma et al.
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Yukang Chen, Fuzhao Xue, Dacheng Li et al.
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen et al.
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Chris Rawles, Sarah Clinckemaillie, Yifan Chang et al.
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation
Kepan Nan, Rui Xie, Penghao Zhou et al.
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Zefan Cai, Yichi Zhang, Bofei Gao et al.
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri, Melissa Z Pan, Shuyi Yang et al.
Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data
Jingyang Ou, Shen Nie, Kaiwen Xue et al.
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Jian Han, Jinlai Liu, Yi Jiang et al.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Yu Sun, Xinhao Li, Karan Dalal et al.
WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Xiaoxi Li, Jiajie Jin, Guanting Dong et al.
Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
Marianne Arriola, Aaron Gokaslan, Justin Chiu et al.
Revisiting Feature Prediction for Learning Visual Representations from Video
Quentin Garrido, Yann LeCun, Michael Rabbat et al.
One Step Diffusion via Shortcut Models
Kevin Frans, Danijar Hafner, Sergey Levine et al.
Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts
Xiaoming Shi, Shiyu Wang, Yuqi Nie et al.
MambaOut: Do We Really Need Mamba for Vision?
Weihao Yu, Xinchao Wang
EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba
Xiaohuan Pei, Tao Huang, Chang Xu
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Yiping Wang, Qing Yang, Zhiyuan Zeng et al.
Inverse Scaling: When Bigger Isn't Better
Joe Cavanagh, Andrew Gritsevskiy, Najoung Kim et al.
Shape of Motion: 4D Reconstruction from a Single Video
Qianqian Wang, Vickie Ye, Hang Gao et al.
Mean Flows for One-step Generative Modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai et al.
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Jingfeng Yao, Bin Yang, Xinggang Wang
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao et al.
Advancing LLM Reasoning Generalists with Preference Trees
Lifan Yuan, Ganqu Cui, Hanbin Wang et al.
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
Haozhe Wang, Chao Qu, Zuming Huang et al.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
Yiheng Xu, Zekun Wang, Junli Wang et al.
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Jianing "Jed" Yang, Alexander Sax, Kevin Liang et al.
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Guangxuan Xiao, Jiaming Tang, Jingwei Zuo et al.
Training Language Models to Reason Efficiently
Daman Arora, Andrea Zanette
ToolRL: Reward is All Tool Learning Needs
Cheng Qian, Emre Can Acikgoz, Qi He et al.
Gated Delta Networks: Improving Mamba2 with Delta Rule
Songlin Yang, Jan Kautz, Ali Hatamizadeh
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Sakshi, Utkarsh Tyagi, Sonal Kumar et al.
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
Jiahui Gao, Renjie Pi, Jipeng Zhang et al.
Diffusion Models Are Real-Time Game Engines
Dani Valevski, Yaniv Leviathan, Moab Arar et al.
The Unreasonable Ineffectiveness of the Deeper Layers
Andrey Gromov, Kushal Tirumala, Hassan Shapourian et al.
EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions
Zhiyuan Chen, Jiajiong Cao, Zhiquan Chen et al.
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Amrith Setlur, Chirag Nagpal, Adam Fisch et al.
OLMoE: Open Mixture-of-Experts Language Models
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld et al.
SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression
Xin Wang, Yu Zheng, Zhongwei Wan et al.
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Weijia Shi, Jaechan Lee, Yangsibo Huang et al.
Latent Action Pretraining from Videos
Seonghyeon Ye, Joel Jang, Byeongguk Jeon et al.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
Bencheng Liao, Shaoyu Chen, Haoran Yin et al.
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Roberto Henschel, Levon Khachatryan, Hayk Poghosyan et al.
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
Ruicheng Wang, Sicheng Xu, Cassie Lee Dai et al.
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Yuancheng Wang, Haoyue Zhan, Liwei Liu et al.
MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
Yuang Zhang, Jiaxi Gu, Li-Wen Wang et al.
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Jonas Geiping, Sean McLeish, Neel Jain et al.
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Wenbo Hu, Xiangjun Gao, Xiaoyu Li et al.
NVILA: Efficient Frontier Visual Language Models
Zhijian Liu, Ligeng Zhu, Baifeng Shi et al.
Scaling Diffusion Language Models via Adaptation from Autoregressive Models
Shansan Gong, Shivam Agarwal, Yizhe Zhang et al.
Training Software Engineering Agents and Verifiers with SWE-Gym
Jiayi Pan, Xingyao Wang, Graham Neubig et al.
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin et al.
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Yuxiang Wei, Olivier Duchenne, Jade Copet et al.
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Zheng Liu, Peitian Zhang et al.
Simplifying, Stabilizing and Scaling Continuous-time Consistency Models
Cheng Lu, Yang Song
AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models
Junfeng Fang, Houcheng Jiang, Kun Wang et al.
AFlow: Automating Agentic Workflow Generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu et al.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang et al.
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng et al.
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu et al.
Navigation World Models
Amir Bar, Gaoyue Zhou, Danny Tran et al.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Shengbang Tong, David Fan, Jiachen Zhu et al.
Retrieval Head Mechanistically Explains Long-Context Factuality
Wenhao Wu, Yizhong Wang, Guangxuan Xiao et al.
World Model on Million-Length Video And Language With Blockwise RingAttention
Hao Liu, Wilson Yan, Matei Zaharia et al.
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models
Bofei Gao, Feifan Song, Zhe Yang et al.
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
Tianwei Yin, Qiang Zhang, Richard Zhang et al.
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Andrew Zhao, Yiran Wu, Yang Yue et al.
Titans: Learning to Memorize at Test Time
Ali Behrouz, Peilin Zhong, Vahab Mirrokni
MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
YiFan Zhang, Huanyu Zhang, Haochen Tian et al.
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing
Mude Hui, Siwei Yang, Bingchen Zhao et al.
DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation
Guosheng Zhao, Xiaofeng Wang, Zheng Zhu et al.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving
Yangzhen Wu, Zhiqing Sun, Shanda Li et al.
Scaling Large Language Model-based Multi-Agent Collaboration
Chen Qian, Zihao Xie, YiFei Wang et al.
Diffusion Policy Policy Optimization
Allen Ren, Justin Lidard, Lars Ankile et al.
Layer by Layer: Uncovering Hidden Representations in Language Models
Oscar Skean, Md Rifat Arefin, Dan Zhao et al.
Segment Any 3D Gaussians
Jiazhong Cen, Jiemin Fang, Chen Yang et al.
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
Xun Huang, Zhengqi Li, Guande He et al.
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Xuanchi Ren, Tianchang Shen, Jiahui Huang et al.
Physics of Language Models: Part 3.2, Knowledge Manipulation
Zeyuan Allen-Zhu, Yuanzhi Li
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian et al.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe et al.
DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search
Huajian Xin, Z.Z. Ren, Junxiao Song et al.
Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows
Fangyu Lei, Jixuan Chen, Yuxiao Ye et al.
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
Shubham Toshniwal, Wei Du, Ivan Moshkov et al.
SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery
Konstantin Klemmer, Esther Rolf, Caleb Robinson et al.
Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models
Junyu Chen, Han Cai, Junsong Chen et al.
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Chaoyou Fu, Haojia Lin, Xiong Wang et al.
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
Zuyan Liu, Yuhao Dong, Ziwei Liu et al.
Language Prompt for Autonomous Driving
Dongming Wu, Wencheng Han, Yingfei Liu et al.
OOTDiffusion: Outfitting Fusion Based Latent Diffusion for Controllable Virtual Try-On
Yuhao Xu, Tao Gu, Weifeng Chen et al.
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
Zhengqi Li, Richard Tucker, Forrester Cole et al.
C3oT: Generating Shorter Chain-of-Thought Without Compromising Effectiveness
Yu Kang, Xianghui Sun, Liangyu Chen et al.
AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea
Qifan Yu, Wei Chow, Zhongqi Yue et al.
IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities
Ziyang Li, Saikat Dutta, Mayur Naik
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Qingkai Fang, Shoutao Guo, Yan Zhou et al.
MMaDA: Multimodal Large Diffusion Language Models
Ling Yang, Ye Tian, Bowen Li et al.
Automated Design of Agentic Systems
Shengran Hu, Cong Lu, Jeff Clune
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo et al.
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Shengpeng Ji, Ziyue Jiang, Wen Wang et al.
Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control
Carles Domingo i Enrich, Michal Drozdzal, Brian Karrer et al.
DepthSplat: Connecting Gaussian Splatting and Depth
Haofei Xu, Songyou Peng, Fangjinhua Wang et al.
Imagine While Reasoning in Space: Multimodal Visualization-of-Thought
Chengzu Li, Wenshan Wu, Huanyu Zhang et al.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin Qinghong Lin, Linjie Li, Difei Gao et al.
LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations
Hadas Orgad, Michael Toker, Zorik Gekhman et al.
Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solver
Zhenting Qi, Mingyuan Ma, Jiahang Xu et al.
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun et al.
Learning to Reason under Off-Policy Guidance
Jianhao Yan, Yafu Li, Zican Hu et al.
Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection
Jiangnan Yang, Shuangli Liu, Jingjun Wu et al.
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Liao Qu, Huichao Zhang, Yiheng Liu et al.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
Di Wu, Hongwei Wang, Wenhao Yu et al.
WonderWorld: Interactive 3D Scene Generation from a Single Image
Hong-Xing Yu, Haoyi Duan, Charles Herrmann et al.
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu et al.
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
Chenming Zhu, Tai Wang, Wenwei Zhang et al.
Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction
Jing He, Haodong Li, Wei Yin et al.
Improving Video Generation with Human Feedback
Jie Liu, Gongye Liu, Jiajun Liang et al.