Most Cited ICCV Oral "neural motion flow fields" Papers
2,701 papers found • Page 1 of 14
Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang et al.
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Ziang Wu et al.
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
Yi Yang, Xiaoxuan He, Hongkun Pan et al.
OminiControl: Minimal and Universal Control for Diffusion Transformer
Zhenxiong Tan, Songhua Liu, Xingyi Yang et al.
CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos
Nikita Karaev, Iurii Makarov, Jianyuan Wang et al.
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong et al.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.
VACE: All-in-One Video Creation and Editing
Zeyinzi Jiang, Zhen Han, Chaojie Mao et al.
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
Chenming Zhu, Tai Wang, Wenwei Zhang et al.
GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Quanfeng Lu, Wenqi Shao, Zitao Liu et al.
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion
Wenqiang Sun, Shuo Chen, Fangfu Liu et al.
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
Gaojie Lin, Jianwen Jiang, Jiaqi Yang et al.
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Jensen Zhou, Hang Gao, Vikram Voleti et al.
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.
REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
Xingjian Leng, Jaskirat Singh, Yunzhong Hou et al.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer
Yuxuan Zhang, Yirui Yuan, Yiren Song et al.
GameFactory: Creating New Games with Generative Interactive Videos
Jiwen Yu, Yiran Qin, Xintao Wang et al.
StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation
Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
Xuemeng Yang, Licheng Wen, Tiantian Wei et al.
Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
Chen Ziwen, Hao Tan, Kai Zhang et al.
Long Context Tuning for Video Generation
Yuwei Guo, Ceyuan Yang, Ziyan Yang et al.
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
Shuangrui Ding, Rui Qian, Xiaoyi Dong et al.
Phantom: Subject-Consistent Video Generation via Cross-Modal Alignment
Lijie Liu, Tianxiang Ma, Bingchuan Li et al.
EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images
Wangbo Yu, Chaoran Feng, Jianing Li et al.
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
Qi Qin, Le Zhuo, Yi Xin et al.
End-to-End Driving with Online Trajectory Evaluation via BEV World Model
Yingyan Li, Yuqi Wang, Yang Liu et al.
Describe Anything: Detailed Localized Image and Video Captioning
Long Lian, Yifan Ding, Yunhao Ge et al.
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
Sucheng Ren, Qihang Yu, Ju He et al.
Aether: Geometric-Aware Unified World Modeling
Haoyi Zhu, Yifan Wang, Jianjun Zhou et al.
WorldScore: Unified Evaluation Benchmark for World Generation
Haoyi Duan, Hong-Xing Yu, Sirui Chen et al.
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
Yuntao Chen, Yuqi Wang, Zhaoxiang Zhang
MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
Ruiyuan Gao, Kai Chen, Bo Xiao et al.
Addressing Representation Collapse in Vector Quantized Models with One Linear Layer
Yongxin Zhu, Bocheng Li, Yifei Xin et al.
UniPortrait: A Unified Framework for Identity-Preserving Single- and Multi-Human Image Personalization
Junjie He, Yifeng Geng, Liefeng Bo
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Chunwei Wang, Guansong Lu, Junwei Yang et al.
Learning 4D Embodied World Models
Haoyu Zhen, Qiao Sun, Hongxin Zhang et al.
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
Zhibo Yang, Jun Tang, Zhaohai Li et al.
St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World
Haiwen Feng, Junyi Zhang, Qianqian Wang et al.
PartField: Learning 3D Feature Fields for Part Segmentation and Beyond
Minghua Liu, Mikaela Uy, Donglai Xiang et al.
Scaling Language-Free Visual Representation Learning
David Fan, Shengbang Tong, Jiachen Zhu et al.
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis
Alexander Mai, Peter Hedman, George Kopanas et al.
PhysTwin: Physics-Informed Reconstruction and Simulation of Deformable Objects from Videos
Hanxiao Jiang, Hao-Yu Hsu, Kaifeng Zhang et al.
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance
Li Hu, Wang Yuan, Zhen Shen et al.
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting
Wanshui Gan, Fang Liu, Hongbin Xu et al.
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu et al.
SANA-Sprint: One-Step Diffusion with Continuous-Time Consistency Distillation
Junsong Chen, Shuchen Xue, Yuyang Zhao et al.
Human-Object Interaction from Human-Level Instructions
Zhen Wu, Jiaman Li, Pei Xu et al.
STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?
Yun Li, Yiming Zhang, Tao Lin et al.
Frequency-Aligned Knowledge Distillation for Lightweight Spatiotemporal Forecasting
Yuqi Li, Chuanguang Yang, Hansheng Zeng et al.
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
Zhi Hou, Tianyi Zhang, Yuwen Xiong et al.
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models
Mark Yu, Wenbo Hu, Jinbo Xing et al.
DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
Ruowen Zhao, James Jun Liang Chen Ye, Zhengyi Wang et al.
YOLOE: Real-Time Seeing Anything
Ao Wang, Lihao Liu, Hui Chen et al.
From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers
Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu et al.
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu, Wenwei Zhang, Lumin Xu et al.
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation
Hui Zhang, Dexiang Hong, Yitong Wang et al.
LEGION: Learning to Ground and Explain for Synthetic Image Detection
Hengrui Kang, Siwei Wen, Zichen Wen et al.
VCA: Video Curious Agent for Long Video Understanding
Zeyuan Yang, Delin Chen, Xueyang Yu et al.
Training-free and Adaptive Sparse Attention for Efficient Long Video Generation
Yifei Xia, Suhan Ling, Fangcheng Fu et al.
Any2AnyTryon: Leveraging Adaptive Position Embeddings for Versatile Virtual Clothing Tasks
Hailong Guo, Bohan Zeng, Yiren Song et al.
InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
Liming Jiang, Qing Yan, Yumin Jia et al.
Long-Context State-Space Video World Models
Ryan Po, Yotam Nitzan, Richard Zhang et al.
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
Le Zhuo, Liangbing Zhao, Sayak Paul et al.
Ross3D: Reconstructive Visual Instruction Tuning with 3D-Awareness
Haochen Wang, Yucheng Zhao, Tiancai Wang et al.
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.
VistaDream: Sampling multiview consistent images for single-view scene reconstruction
Haiping Wang, Yuan Liu, Ziwei Liu et al.
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
Yiren Song, Danze Chen, Mike Zheng Shou
CityNav: A Large-Scale Dataset for Real-World Aerial Navigation
Jungdae Lee, Taiki Miyanishi, Shuhei Kurita et al.
MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
Rongchang Xie, Chen Du, Ping Song et al.
LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity
Walid Bousselham, Angie Boggust, Sofian Chaybouti et al.
AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction
Zhen Xing, Qi Dai, Zejia Weng et al.
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks
Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Rongyao Fang, Chengqi Duan, Kun Wang et al.
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
Ziyu Zhu, Xilin Wang, Yixuan Li et al.
VSSD: Vision Mamba with Non-Causal State Space Duality
Yuheng Shi, Mingjia Li, Minjing Dong et al.
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
Ming Hu, Kun Yuan, Yaling Shen et al.
RadGPT: Constructing 3D Image-Text Tumor Datasets
Pedro Bassi, Mehmet Yavuz, Ibrahim Ethem Hamamci et al.
Epona: Autoregressive Diffusion World Model for Autonomous Driving
Kaiwen Zhang, Zhenyu Tang, Xiaotao Hu et al.
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models
Yuxuan Cai, Jiangning Zhang, Haoyang He et al.
Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
Kyle Sargent, Kyle Hsu, Justin Johnson et al.
Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang, Jiahui Yang, Jianqin Yin et al.
MotionFollower: Editing Video Motion via Score-Guided Diffusion
Shuyuan Tu, Qi Dai, Zihao Zhang et al.
GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
Tianwei Xiong, Jun Hao Liew, Zilong Huang et al.
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
Yuxuan Luo, Zhengkun Rong, Lizhen Wang et al.
STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution
Rui Xie, Yinhong Liu, Penghao Zhou et al.
ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
Guosheng Zhao, Xiaofeng Wang, Chaojun Ni et al.
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Luca Barsellotti, Lorenzo Bianchi, Nicola Messina et al.
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models
Tianyu Fu, Tengxuan Liu, Qinghao Han et al.
Amodal3R: Amodal 3D Reconstruction from Occluded 2D Images
Tianhao Wu, Chuanxia Zheng, Frank Guan et al.
FonTS: Text Rendering With Typography and Style Controls
Wenda Shi, Yiren Song, Dengming Zhang et al.
MagicMirror: ID-Preserved Video Generation in Video Diffusion Transformers
Yuechen Zhang, YaoYang Liu, Bin Xia et al.
MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes
Xinjie Zhang, Zhening Liu, Yifan Zhang et al.
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
Ma Teng, Xiaojun Jia, Ranjie Duan et al.
IRASim: A Fine-Grained World Model for Robot Manipulation
Fangqi Zhu, Hongtao Wu, Song Guo et al.
Scalable Ranked Preference Optimization for Text-to-Image Generation
Shyamgopal Karthik, Huseyin Coskun, Zeynep Akata et al.
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
Yiwu Zhong, Zhuoming Liu, Yin Li et al.
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
Jiaqi Liao, Zhengyuan Yang, Linjie Li et al.
InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction
Yuhui Wu, Liyi Chen, Ruibin Li et al.
Reflect-DiT: Inference-Time Scaling for Text-to-Image Diffusion Transformers via In-Context Reflection
Shufan Li, Konstantinos Kallidromitis, Akash Gokul et al.
DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation
Jiazhe Guo, Yikang Ding, Xiwu Chen et al.
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Boyu Chen, Zhengrong Yue, Siran Chen et al.
STIV: Scalable Text and Image Conditioned Video Generation
Zongyu Lin, Wei Liu, Chen Chen et al.
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning
Zhong-Yu Li, Ruoyi Du, Juncheng Yan et al.
V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction
Zewei Zhou, Hao Xiang, Zhaoliang Zheng et al.
VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory
Runjia Li, Philip Torr, Andrea Vedaldi et al.
Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
Massimiliano Viola, Kevin Qu, Nando Metzger et al.
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei, Jiacong Wang, Haochen Wang et al.
OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation
Ding Zhong, Xu Zheng, Chenfei Liao et al.
PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
Fatemeh Ghezloo, Saygin Seyfioglu, Rustin Soraki et al.
WonderTurbo: Generating Interactive 3D World in 0.72 Seconds
Chaojun Ni, Xiaofeng Wang, Zheng Zhu et al.
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Yue Li, Qi Ma, Runyi Yang et al.
CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
Tian-Xing Xu, Xiangjun Gao, Wenbo Hu et al.
Towards a Unified Copernicus Foundation Model for Earth Vision
Yi Wang, Zhitong Xiong, Chenying Liu et al.
MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
Quanhao Li, Zhen Xing, Rui Wang et al.
VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs
Qiucheng Wu, Handong Zhao, Michael Saxon et al.
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Haiwen Diao, Xiaotong Li, Yufeng Cui et al.
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction
Jiale Xu, Shenghua Gao, Ying Shan
CAD-Recode: Reverse Engineering CAD Code from Point Clouds
Danila Rukhovich, Elona Dupont, Dimitrios Mallis et al.
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
Yuqing Wang, Zhijie Lin, Yao Teng et al.
Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
Li, Nikolaos Tsagkas, Jifei Song et al.
LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement
Jieming Bian, Lei Wang, Letian Zhang et al.
VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
Xindi Yang, Baolu Li, Yiming Zhang et al.
3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
Xiaobiao Du, Yida Wang, Haiyang Sun et al.
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Xiangxiang Chu, Renda Li, Yong Wang
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Tianqi Liu, Zihao Huang, Zhaoxi Chen et al.
TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
Xiaowen Ma, Zhen-Liang Ni, Xinghao Chen
Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts
Hongcheng Gao, Tianyu Pang, Chao Du et al.
Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
Zhenxin Li, Shihao Wang, Shiyi Lan et al.
VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Yating Wang, Haoyi Zhu, Mingyu Liu et al.
DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Jiawei He, Danshi Li, Xinqiang Yu et al.
Neighboring Autoregressive Modeling for Efficient Visual Generation
Yefei He, Yuanyu He, Shaoxuan He et al.
Scalable Image Tokenization with Index Backpropagation Quantization
Fengyuan Shi, Zhuoyan Luo, Yixiao Ge et al.
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye, Honglin Lin, Leyan Ou et al.
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng, Lu Qi, Xi Chen et al.
NavMorph: A Self-Evolving World Model for Vision-and-Language Navigation in Continuous Environments
Xuan Yao, Junyu Gao, Changsheng Xu
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
Yuqi Wu, Wenzhao Zheng, Sicheng Zuo et al.
Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction
Yuanhao Cai, He Zhang, Kai Zhang et al.
UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
Haoxuan Wang, Jinlong Peng, Qingdong He et al.
Video-T1: Test-time Scaling for Video Generation
Fangfu Liu, Hanyang Wang, Yimo Cai et al.
AllTracker: Efficient Dense Point Tracking at High Resolution
Adam Harley, Yang You, Yang Zheng et al.
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Shijie Zhou, Alexander Vilesov, Xuehai He et al.
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
Dewei Zhou, Mingwei Li, Zongxin Yang et al.
BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis
David Svitov, Pietro Morerio, Lourdes Agapito et al.
VMBench: A Benchmark for Perception-Aligned Video Motion Generation
Xinran Ling, Chen Zhu, Meiqi Wu et al.
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
Simon Boeder, Fabian Gigengack, Benjamin Risse
An Empirical Study of Autoregressive Pre-training from Videos
Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar et al.
FOLDER: Accelerating Multi-Modal Large Language Models with Enhanced Performance
Haicheng Wang, Zhemeng Yu, Gabriele Spadaro et al.
OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
Yongsheng Yu, Ziyun Zeng, Haitian Zheng et al.
V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video
Jianqi Chen, Biao Zhang, Xiangjun Tang et al.
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
Jiacheng Ruan, Wenzhen Yuan, Xian Gao et al.
LONG3R: Long Sequence Streaming 3D Reconstruction
Zhuoguang Chen, Minghui Qin, Tianyuan Yuan et al.
Generating Multi-Image Synthetic Data for Text-to-Image Customization
Nupur Kumari, Xi Yin, Jun-Yan Zhu et al.
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen, Bohan Liu, Chenjia Li et al.
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye et al.
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos
Yi Chen, Yuying Ge, Weiliang Tang et al.
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Junsung Park, Jungbeom Lee, Jongyoon Song et al.
Unleashing Vecset Diffusion Model for Fast Shape Generation
Zeqiang Lai, Yunfei Zhao, Zibo Zhao et al.
CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance
Jinming Li, Yichen Zhu, Zhibin Tang et al.
Efficient Track Anything
Yunyang Xiong, Chong Zhou, Xiaoyu Xiang et al.
Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
Yingjie Chen, Yifang Men, Yuan Yao et al.
RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Feng Yan, Fanfan Liu, Yiyang Huang et al.
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
Jiarui Wang, Huiyu Duan, Yu Zhao et al.
Detect Anything 3D in the Wild
Hanxue Zhang, Haoran Jiang, Qingsong Yao et al.
AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
Ziyin Zhou, Yunpeng Luo, Yuanchen Wu et al.
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
Yuping Wang, Xiangyu Huang, Xiaokang Sun et al.
Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs
Soonbin Lee, Fangwen Shu, Yago Sanchez de la Fuente et al.
Referring to Any Person
Qing Jiang, Lin Wu, Zhaoyang Zeng et al.
VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
Jiale Cheng, Ruiliang Lyu, Xiaotao Gu et al.
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
Yue Li, Meng Tian, Zhenyu Lin et al.
LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
Juliette Marrie, Romain Menegaux, Michael Arbel et al.
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Sung-Yeon Park, Can Cui, Yunsheng Ma et al.
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo, Yingying Zhang, Xue Yang et al.
FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
Yabo Zhang, Xinpeng Zhou, Yihan Zeng et al.
Mobile Video Diffusion
Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas et al.
MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Hui Sun, Shiyin Lu, Huanyu Wang et al.
MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul et al.
BVINet: Unlocking Blind Video Inpainting with Zero Annotations
Zhiliang Wu, Kerui Chen, Kun Li et al.
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
Ming Li, Xin Gu, Fan Chen et al.
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Cong Wei, Yujie Zhong, Yingsen Zeng et al.
FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing
Tianyi Wei, Yifan Zhou, Dongdong Chen et al.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Xinyu Fang, Zhijian Chen, Kai Lan et al.
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han et al.
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Chin-Yang Lin, Cheng Sun, Fu-En Yang et al.
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction
Weirong Chen, Ganlin Zhang, Felix Wimbauer et al.
RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Yijing Lin, Mengqi Huang, Shuhan Zhuang et al.
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
Ziyan Guo, Zeyu Hu, Na Zhao et al.
ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing
Yulin Pan, Xiangteng He, Chaojie Mao et al.
Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection
Jiawen Zhu, Yew-Soon Ong, Chunhua Shen et al.
An OpenMind for 3D Medical Vision Self-supervised Learning
Tassilo Wald, Constantin Ulrich, Jonathan Suprijadi et al.
DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses
Yatian Pang, Bin Zhu, Bin Lin et al.
Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
Yuheng Shi, Minjing Dong, Chang Xu
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
Tsu-Jui Fu, Yusu Qian, Chen Chen et al.
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
Yue Fan, Xiaojian Ma, Rongpeng Su et al.
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
Zhijian Huang, Chengjian Feng, Baihui Xiao et al.
RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
Yiran Qin, Li Kang, Xiufeng Song et al.
LBM: Latent Bridge Matching for Fast Image-to-Image Translation
Clément Chadebec, Onur Tasar, Sanjeev Sreetharan et al.
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Rui Chen, Zehuan Wu, Yichen Liu et al.