Most Cited ICCV Spotlight "integrated multimodal tasks" Papers
2,701 papers found • Page 1 of 14
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Ziang Wu et al.
Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang et al.
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Yi Yang, Xiaoxuan He, Hongkun Pan et al.
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang, Mu Cai, Bingxin Xu et al.
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong et al.
OminiControl: Minimal and Universal Control for Diffusion Transformer
Zhenxiong Tan, Songhua Liu, Xingyi Yang et al.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.
Shape of Motion: 4D Reconstruction from a Single Video
Qianqian Wang, Vickie Ye, Hang Gao et al.
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Shengbang Tong, David Fan, Jiachen Zhu et al.
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
Chenming Zhu, Tai Wang, Wenwei Zhang et al.
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Quanfeng Lu, Wenqi Shao, Zitao Liu et al.
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video
Jianhong Bai, Menghan Xia, Xiao Fu et al.
FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models
Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas et al.
Randomized Autoregressive Visual Generation
Qihang Yu, Ju He, Xueqing Deng et al.
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Gaojie Lin, Jianwen Jiang, Jiaqi Yang et al.
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
Shaojin Wu, Mengqi Huang, Wenxu Wu et al.
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion
Wenqiang Sun, Shuo Chen, Fangfu Liu et al.
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Jensen Zhou, Hang Gao, Vikram Voleti et al.
REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
Xingjian Leng, Jaskirat Singh, Yunzhong Hou et al.
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer
Yuxuan Zhang, Yirui Yuan, Yiren Song et al.
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
Shiduo Zhang, Zhe Xu, Peiju Liu et al.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.
MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization
Yiwen Chen, Yikai Wang, Yihao Luo et al.
GameFactory: Creating New Games with Generative Interactive Videos
Jiwen Yu, Yiran Qin, Xintao Wang et al.
HPSv3: Towards Wide-Spectrum Human Preference Score
Yuhang Ma, Keqiang Sun, Xiaoshi Wu et al.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark
Wufei Ma, Haoyu Chen, Guofeng Zhang et al.
StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation
Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.
Long Context Tuning for Video Generation
Yuwei Guo, Ceyuan Yang, Ziyan Yang et al.
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
Xuemeng Yang, Licheng Wen, Tiantian Wei et al.
Golden Noise for Diffusion Models: A Learning Framework
Zikai Zhou, Shitong Shao, Lichen Bai et al.
Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
Lijie Liu, Tianxiang Ma, Bingchuan Li et al.
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Qi Qin, Le Zhuo, Yi Xin et al.
Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
Chen Ziwen, Hao Tan, Kai Zhang et al.
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Shuangrui Ding, Rui Qian, Xiaoyi Dong et al.
End-to-End Driving with Online Trajectory Evaluation via BEV World Model
Yingyan Li, Yuqi Wang, Yang Liu et al.
EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images
Wangbo Yu, Chaoran Feng, Jianing Li et al.
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
Sucheng Ren, Qihang Yu, Ju He et al.
Describe Anything: Detailed Localized Image and Video Captioning
Long Lian, Yifan Ding, Yunhao Ge et al.
MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
Ruiyuan Gao, Kai Chen, Bo Xiao et al.
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang
Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging
Chongjie Ye, Yushuang Wu, Ziteng Lu et al.
Aether: Geometric-Aware Unified World Modeling
Haoyi Zhu, Yifan Wang, Jianjun Zhou et al.
TerraMind: Large-Scale Generative Multimodality for Earth Observation
Johannes Jakubik, Felix Yang, Benedikt Blumenstiel et al.
Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Yongxin Zhu, Bocheng Li, Yifei Xin et al.
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Xingyu Chen, Yue Chen, Yuliang Xiu et al.
WorldScore: Unified Evaluation Benchmark for World Generation
Haoyi Duan, Hong-Xing Yu, Sirui Chen et al.
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Haiwen Feng, Junyi Zhang, Qianqian Wang et al.
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu et al.
Learning 4D Embodied World Models
Haoyu Zhen, Qiao Sun, Hongxin Zhang et al.
Adaptive Caching for Faster Video Generation with Diffusion Transformers
Kumara Kahatapitiya, Haozhe Liu, Sen He et al.
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Junjie He, Yifeng Geng, Liefeng Bo
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
Hao He, Ceyuan Yang, Shanchuan Lin et al.
QuEST: Low-bit Diffusion Model Quantization via Efficient Selective Finetuning
Haoxuan Wang, Yuzhang Shang, Zhihang Yuan et al.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Chunwei Wang, Guansong Lu, Junwei Yang et al.
PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang et al.
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis
Alexander Mai, Peter Hedman, George Kopanas et al.
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Zhibo Yang, Jun Tang, Zhaohai Li et al.
YOLOE: Real-Time Seeing Anything
Ao Wang, Lihao Liu, Hui Chen et al.
Scaling Language-Free Visual Representation Learning
David Fan, Shengbang Tong, Jiachen Zhu et al.
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
Junsong Chen, Shuchen Xue, Yuyang Zhao et al.
PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
Minghua Liu, Mikaela Uy, Donglai Xiang et al.
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
Zhi Hou, Tianyi Zhang, Yuwen Xiong et al.
Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
Zeren Jiang, Chuanxia Zheng, Iro Laina et al.
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
Li Hu, Wang Yuan, Zhen Shen et al.
Human-Object Interaction from Human-Level Instructions
Zhen Wu, Jiaman Li, Pei Xu et al.
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Yun Li, Yiming Zhang, Tao Lin et al.
Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting
Yuqi Li, Chuanguang Yang, Hansheng Zeng et al.
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
Wanshui Gan, Fang Liu, Hongbin Xu et al.
From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu et al.
FaceXFormer: A Unified Transformer for Facial Analysis
Kartik Narayan, Vibashan VS, Rama Chellappa et al.
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu, Wenwei Zhang, Lumin Xu et al.
SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling
Xianglong He, Zi-Xin Zou, Chia Hao Chen et al.
MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs
Erik Daxberger, Nina Wenzel, David Griffiths et al.
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
Hui Zhang, Dexiang Hong, Yitong Wang et al.
LEGION: Learning to Ground and Explain for Synthetic Image Detection
Hengrui Kang, Siwei Wen, Zichen Wen et al.
Training-free and Adaptive Sparse Attention for Efficient Long Video Generation
Yifei Xia, Suhan Ling, Fangcheng Fu et al.
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
Mark Yu, Wenbo Hu, Jinbo Xing et al.
DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
Ruowen Zhao, James Jun Liang Chen Ye, Zhengyi Wang et al.
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models
Yifan Lu, Xuanchi Ren, Jiawei Yang et al.
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Ahmed Nassar, Matteo Omenetti, Maksym Lysak et al.
A₀: An Affordance-Aware Hierarchical Model for General Robotic Manipulation
Rongtao Xu, Jian Zhang, Minghao Guo et al.
3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt
Lukas Höllein, Aljaž Božič, Michael Zollhöfer et al.
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Haochen Wang, Yucheng Zhao, Tiancai Wang et al.
Improved Noise Schedule for Diffusion Training
Tiankai Hang, Shuyang Gu, Jianmin Bao et al.
Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics
Taowen Wang, Cheng Han, James Liang et al.
Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
Hailong Guo, Bohan Zeng, Yiren Song et al.
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
Le Zhuo, Liangbing Zhao, Sayak Paul et al.
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model
Yupeng Zheng, Pengxuan Yang, Zebin Xing et al.
Long-Context State-Space Video World Models
Ryan Po, Yotam Nitzan, Richard Zhang et al.
VCA: Video Curious Agent for Long Video Understanding
Zeyuan Yang, Delin Chen, Xueyang Yu et al.
VSSD: Vision Mamba with Non-Causal State Space Duality
Yuheng Shi, Mingjia Li, Minjing Dong et al.
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang, Jiawei Kong, Wenbo Yu et al.
RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
Teng Li, Guangcong Zheng, Rui Jiang et al.
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.
Epona: Autoregressive Diffusion World Model for Autonomous Driving
Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu et al.
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
Dongwon Kim, Ju He, Qihang Yu et al.
MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space
Lixing Xiao, Shunlin Lu, Huaijin Pi et al.
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation
Xin Zhou, Dingkang Liang, Sifan Tu et al.
Bolt3D: Generating 3D Scenes in Seconds
Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan et al.
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Liming Jiang, Qing Yan, Yumin Jia et al.
VistaDream: Sampling multiview consistent images for single-view scene reconstruction
Haiping Wang, Yuan Liu, Ziwei Liu et al.
SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
Chun-Han Yao, Yiming Xie, Vikram Voleti et al.
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
Yiren Song, Danze Chen, Mike Zheng Shou
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
Ziyu Zhu, Xilin Wang, Yixuan Li et al.
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Rui Xie, Yinhong Liu, Penghao Zhou et al.
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Yuxuan Cai, Jiangning Zhang, Haoyang He et al.
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.
KV-Edit: Training-Free Image Editing for Precise Background Preservation
Tianrui Zhu, Shiyi Zhang, Jiawei Shao et al.
EEdit: Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing
Zexuan Yan, Yue Ma, Chang Zou et al.
Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
Kyle Sargent, Kyle Hsu, Justin Johnson et al.
CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
Jungdae Lee, Taiki Miyanishi, Shuhei Kurita et al.
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
Rongchang Xie, Chen Du, Ping Song et al.
ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
Guosheng Zhao, Xiaofeng Wang, Chaojun Ni et al.
Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
Yufei Zhan, Shurong Zheng, Yousong Zhu et al.
Scaling Laws for Native Multimodal Models
Mustafa Shukor, Enrico Fini, Victor Guilherme Turrisi da Costa et al.
REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment
Haonan Han, Rui Yang, Huan Liao et al.
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Tianwei Xiong, Jun Hao Liew, Zilong Huang et al.
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
Yuxuan Luo, Zhengkun Rong, Lizhen Wang et al.
LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity
Walid Bousselham, Angie Boggust, Sofian Chaybouti et al.
RadGPT: Constructing 3D Image-Text Tumor Datasets
Pedro Bassi, Mehmet Yavuz, Ibrahim Ethem Hamamci et al.
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
Ming Hu, Kun Yuan, Yaling Shen et al.
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning
Zhangquan Chen, Xufang Luo, Dongsheng Li
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong, Zhuoming Liu, Yin Li et al.
CARP: Visuomotor Policy Learning via Coarse-to-Fine Autoregressive Prediction
Zhefei Gong, Pengxiang Ding, Shangke Lyu et al.
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models
Xianfu Cheng, Wei Zhang, Shiwei Zhang et al.
MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs
Yunqiu Xu, Linchao Zhu, Yi Yang
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
Zhen Xing, Qi Dai, Zejia Weng et al.
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang, Chengqi Duan, Kun Wang et al.
MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
Xinjie Zhang, Zhening Liu, Yifan Zhang et al.
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models
Tianyu Fu, Tengxuan Liu, Qinghao Han et al.
Dynamic Point Maps: A Versatile Representation for Dynamic 3D Reconstruction
Edgar Sucar, Zihang Lai, Eldar Insafutdinov et al.
FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
Taekyung Ki, Dongchan Min, Gyeongsu Chae
Go to Zero: Towards Zero-shot Motion Generation with Million-scale Data
Ke Fan, Shunlin Lu, Minyue Dai et al.
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
Massimiliano Viola, Kevin Qu, Nando Metzger et al.
Scalable Ranked Preference Optimization for Text-to-Image Generation
Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata et al.
Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
Tianhao Wu, Chuanxia Zheng, Frank Guan et al.
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Luca Barsellotti, Lorenzo Bianchi, Nicola Messina et al.
Dataset Distillation via the Wasserstein Metric
Haoyang Liu, Peiran Wang, Yijiang Li et al.
VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
Runjia Li, Philip Torr, Andrea Vedaldi et al.
DC-AE 1.5: Accelerating Diffusion Model Convergence with Structured Latent Space
Junyu Chen, Dongyun Zou, Wenkun He et al.
SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition
Yongkun Du, Zhineng Chen, Hongtao Xie et al.
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang, Jiahui Yang, Jianqin Yin et al.
Puppet-Master: Scaling Interactive Video Generation as a Motion Prior for Part-Level Dynamics
Ruining Li, Chuanxia Zheng, Christian Rupprecht et al.
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
Shufan Li, Konstantinos Kallidromitis, Akash Gokul et al.
MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
Yuechen Zhang, Yaoyang Liu, Bin Xia et al.
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
Yi Chen, Yuying Ge, Weiliang Tang et al.
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
Jiaqi Liao, Zhengyuan Yang, Linjie Li et al.
MM-IFEngine: Towards Multimodal Instruction Following
Shengyuan Ding, Wu Shenxi, Xiangyu Zhao et al.
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen, Zhengrong Yue, Siran Chen et al.
MotionFollower: Editing Video Motion via Score-Guided Diffusion
Shuyuan Tu, Qi Dai, Zihao Zhang et al.
IRASim: A Fine-Grained World Model for Robot Manipulation
Fangqi Zhu, Hongtao Wu, Song Guo et al.
CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs
Yihan Cao, Jiazhao Zhang, Zhinan Yu et al.
Radiant Foam: Real-Time Differentiable Ray Tracing
Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi et al.
Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars
Tobias Kirschstein, Javier Romero, Artem Sevastopolsky et al.
FonTS: Text Rendering With Typography and Style Controls
Wenda Shi, Yiren Song, Dengming Zhang et al.
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Weiming Ren, Wentao Ma, Huan Yang et al.
DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
Jiazhe Guo, Yikang Ding, Xiwu Chen et al.
VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Yating Wang, Haoyi Zhu, Mingyu Liu et al.
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Weitai Kang, Haifeng Huang, Yuzhang Shang et al.
OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
Ding Zhong, Xu Zheng, Chenfei Liao et al.
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
Ma Teng, Xiaojun Jia, Ranjie Duan et al.
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
Mark Endo, Xiaohan Wang, Serena Yeung-Levy
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
Yujie Zhou, Jiazi Bu, Pengyang Ling et al.
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Yue Li, Qi Ma, Runyi Yang et al.
STIV: Scalable Text and Image Conditioned Video Generation
Zongyu Lin, Wei Liu, Chen Chen et al.
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei, Jiacong Wang, Haochen Wang et al.
MMAD: Multi-label Micro-Action Detection in Videos
Kun Li, Pengyu Liu, Dan Guo et al.
V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction
Zewei Zhou, Hao Xiang, Zhaoliang Zheng et al.
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
Zhong-Yu Li, Ruoyi Du, Juncheng Yan et al.
Unraveling the Smoothness Properties of Diffusion Models: A Gaussian Mixture Perspective
Yingyu Liang, Zhizhou Sha, Zhenmei Shi et al.
RayZer: A Self-supervised Large View Synthesis Model
Hanwen Jiang, Hao Tan, Peng Wang et al.
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
Tian-Xing Xu, Xiangjun Gao, Wenbo Hu et al.
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation
Yuseung Lee, Jihyeon Je, Chanho Park et al.
Video-T1: Test-time Scaling for Video Generation
Fangfu Liu, Hanyang Wang, Yimo Cai et al.
Towards a Unified Copernicus Foundation Model for Earth Vision
Yi Wang, Zhitong Xiong, Chenying Liu et al.
CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.
WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
Chaojun Ni, Xiaofeng Wang, Zheng Zhu et al.
SynCity: Training-Free Generation of 3D Cities
Paul Engstler, Aleksandar Shtedritski, Iro Laina et al.
Revelio: Interpreting and leveraging semantic information in diffusion models
Dahye Kim, Xavier Thomas, Deepti Ghadiyaram
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
Quanhao Li, Zhen Xing, Rui Wang et al.
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
Fatemeh Ghezloo, Saygin Seyfioglu, Rustin Soraki et al.
InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction
Yuhui Wu, Liyi Chen, Ruibin Li et al.
VMBench: A Benchmark for Perception-Aligned Video Motion Generation
Xinran Ling, Chen Zhu, Meiqi Wu et al.
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
Tianming Liang, Kun-Yu Lin, Chaolei Tan et al.
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
Zhisheng Zhong, Chengyao Wang, Yuqi Liu et al.
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
Xiyao Wang, Zhengyuan Yang, Linjie Li et al.
FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion
Haonan Qiu, Shiwei Zhang, Yujie Wei et al.
NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
Xuan Yao, Junyu Gao, Changsheng Xu
CAD-Recode: Reverse Engineering CAD Code from Point Clouds
Danila Rukhovich, Elona Dupont, Dimitrios Mallis et al.
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning
Junming Liu, Siyuan Meng, Yanting Gao et al.
TAPNext: Tracking Any Point (TAP) as Next Token Prediction
Artem Zholus, Carl Doersch, Yi Yang et al.
PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation
Hengjia Li, Haonan Qiu, Shiwei Zhang et al.
Neighboring Autoregressive Modeling for Efficient Visual Generation
Yefei He, Yuanyu He, Shaoxuan He et al.
TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
Xiaowen Ma, Zhen-Liang Ni, Xinghao Chen
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Haiwen Diao, Xiaotong Li, Yufeng Cui et al.
Scalable Image Tokenization with Index Backpropagation Quantization
Fengyuan Shi, Zhuoyan Luo, Yixiao Ge et al.