Most Cited ICCV "3d dataset construction" Papers
2,701 papers found • Page 2 of 14
Conference
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Yun Wang, Longguang Wang, Chenghao Zhang et al.
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams
Haoji Zhang, Yiqin Wang, Yansong Tang et al.
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
Zhijian Huang, Chengjian Feng, Baihui Xiao et al.
Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
Hyeonho Jeong, Suhyeon Lee, Jong Ye
LBM: Latent Bridge Matching for Fast Image-to-Image Translation
Clément Chadebec, Onur Tasar, Sanjeev Sreetharan et al.
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu, Longxiang Tang, Bohao PENG et al.
Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
Yanzuo Lu, Yuxi Ren, Xin Xia et al.
Synthetic Video Enhances Physical Fidelity in Video Synthesis
Qi Zhao, Xingyu Ni, Ziyu Wang et al.
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Rui Chen, Zehuan Wu, Yichen Liu et al.
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
Yue Fan, Xiaojian Ma, Rongpeng Su et al.
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
Tatiana Zemskova, Dmitry Yudin
ViSpeak: Visual Instruction Feedback in Streaming Videos
Shenghao Fu, Qize Yang, Yuan-Ming Li et al.
Learning Few-Step Diffusion Models by Trajectory Distribution Matching
Yihong Luo, Tianyang Hu, Jiacheng Sun et al.
SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
Zhewei Dai, Shilei Zeng, Haotian Liu et al.
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
Yikang Zhou, Tao Zhang, Shilin Xu et al.
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
Hao He, Ceyuan Yang, Shanchuan Lin et al.
XTrack: Multimodal Training Boosts RGB-X Video Object Trackers
Yuedong Tan, Zongwei Wu, Yuqian Fu et al.
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Haiwen Huang, Anpei Chen, Volodymyr Havrylov et al.
DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
Jiangran Lyu, Ziming Li, Xuesong Shi et al.
Di[M]O: Distilling Masked Diffusion Models into One-step Generator
Yuanzhi Zhu, Xi WANG, Stéphane Lathuilière et al.
Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation
Yuxuan Wang, Xuanyu Yi, Haohan Weng et al.
MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild
Xi Fang, Jiankun Wang, Xiaochen Cai et al.
3D Mesh Editing using Masked LRMs
William Gao, Dilin Wang, Yuchen Fan et al.
AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
zijie wu, Chaohui Yu, Fan Wang et al.
Visual Test-time Scaling for GUI Agent Grounding
Tiange Luo, Lajanugen Logeswaran, Justin Johnson et al.
MagicColor: Multi-instance Sketch Colorization
yinhan Zhang, Yue Ma, Bingyuan Wang et al.
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
Minghe Gao, Xuqi Liu, Zhongqi Yue et al.
A Recipe for Generating 3D Worlds from a Single Image
Katja Schwarz, Denis Rozumny, Samuel Rota Bulò et al.
EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
Shengqi Dang, Yi He, Long Ling et al.
WildSAT: Learning Satellite Image Representations from Wildlife Observations
Rangel Daroya, Elijah Cole, Oisin Mac Aodha et al.
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
Yatai Ji, Jiacheng Zhang, Jie Wu et al.
TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
Felix Krause, Timy Phan, Ming Gui et al.
SuperDec: 3D Scene Decomposition with Superquadrics Primitives
Elisabetta Fedele, Boyang Sun, Francis Engelmann et al.
Spectral Image Tokenizer
Carlos Esteves, Mohammed Suhail, Ameesh Makadia
RoMo: Robust Motion Segmentation Improves Structure from Motion
Lily Goli, Sara Sabour, Mark Matthews et al.
RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis
yifei feng, Mx Yang, Shuhui Yang et al.
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence
Jie Feng, Shengyuan Wang, Tianhui Liu et al.
DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers
Hanling Zhang, Rundong Su, Zhihang Yuan et al.
Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling
Fengxiang Wang, Hongzhen Wang, Di Wang et al.
RANKCLIP: Ranking-Consistent Language-Image Pretraining
Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Shuyu Yang, Yaxiong Wang, Li Zhu et al.
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
Yuci Liang, Xinheng Lyu, Meidan Ding et al.
UAVScenes: A Multi-Modal Dataset for UAVs
Sijie Wang, Siqi Li, Yawei Zhang et al.
FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases
Shuai Tan, Bill Gong, Bin Ji et al.
GAS: Generative Avatar Synthesis from a Single Image
Yixing Lu, Junting Dong, YoungJoong Kwon et al.
X-Dancer: Expressive Music to Human Dance Video Generation
Zeyuan Chen, Hongyi Xu, Guoxian Song et al.
Perspective-Invariant 3D Object Detection
Alan Liang, Lingdong Kong, Dongyue Lu et al.
Multimodal Latent Diffusion Model for Complex Sewing Pattern Generation
Shengqi Liu, Yuhao Cheng, Zhuo Chen et al.
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.
LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization
Alessio Spagnoletti, Jean Prost, Andres Almansa et al.
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding, Hao Wu, Yifan Yang et al.
Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution
Du Chen, Liyi Chen, Zhengqiang ZHANG et al.
AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting
Xiaoyu Zhou, Jingqi Wang, Yongtao Wang et al.
Make Me Happier: Evoking Emotions Through Image Diffusion Models
Qing Lin, Jingfeng Zhang, YEW-SOON ONG et al.
SMARTIES: Spectrum-Aware Multi-Sensor Auto-Encoder for Remote Sensing Images
Gencer Sumbul, Chang Xu, Emanuele Dalsasso et al.
SILO: Solving Inverse Problems with Latent Operators
Ron Raphaeli, Sean Man, Michael Elad
SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates
Yijia Hong, Yuan-Chen Guo, Ran Yi et al.
Multimodal LLM Guided Exploration and Active Mapping using Fisher Information
Wen Jiang, BOSHU LEI, Katrina Ashton et al.
MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning
Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj et al.
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani, Sachit Menon, Ahmet Iscen et al.
ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones
Anurag Ghosh, Shen Zheng, Robert Tamburo et al.
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
Shijie Ma, Yuying Ge, Teng Wang et al.
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Tao Wang, Changxu Cheng, Lingfeng Wang et al.
TACO: Taming Diffusion for in-the-wild Video Amodal Completion
Ruijie Lu, Yixin Chen, Yu Liu et al.
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta et al.
Fine-Tuning Visual Autogressive Models for Subject-Driven Generation
Jiwoo Chung, Sangeek Hyun, Hyunjun Kim et al.
ModSkill: Physical Character Skill Modularization
Yiming Huang, Zhiyang Dou, Lingjie Liu
NeuralSVG: An Implicit Representation for Text-to-Vector Generation
Sagi Polaczek, Yuval Alaluf, Elad Richardson et al.
Deeply Supervised Flow-Based Generative Models
Inkyu Shin, Chenglin Yang, Liang-Chieh Chen
From One to More: Contextual Part Latents for 3D Generation
Shaocong Dong, Lihe Ding, Xiao Chen et al.
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng, Yijiang Li, Wanpeng Zhang et al.
GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering
Kai Ye, Chong Gao, Guanbin Li et al.
NullSwap: Proactive Identity Cloaking Against Deepfake Face Swapping
Tianyi Wang, Shuaicheng Niu, Harry Cheng et al.
Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation
Akshay Krishnan, Xinchen Yan, Vincent Casser et al.
GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis
Bo Liu, Ke Zou, Li-Ming Zhan et al.
Generating Physically Stable and Buildable Brick Structures from Text
Ava Pun, Kangle Deng, Ruixuan Liu et al.
SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions
Mengwei Xie, Shuang Zeng, Xinyuan Chang et al.
CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception
Jiaru Zhong, Jiahao Wang, Jiahui Xu et al.
X-Fusion: Introducing New Modality to Frozen Large Language Models
Sicheng Mo, Thao Nguyen, Xun Huang et al.
DynamicID: Zero-Shot Multi-ID Image Personalization with Flexible Facial Editability
Xirui Hu, Jiahao Wang, Hao chen et al.
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Tong Wei, Yijun Yang, Junliang Xing et al.
PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining
Ciyu Ruan, Ruishan Guo, Zihang GONG et al.
DAViD: Modeling Dynamic Affordance of 3D Objects Using Pre-trained Video Diffusion Models
Hyeonwoo Kim, Sangwon Baik, Hanbyul Joo
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
Ruofan Wang, Juncheng Li, Yixu Wang et al.
ZIM: Zero-Shot Image Matting for Anything
Beomyoung Kim, Chanyong Shin, Joonhyun Jeong et al.
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu, Yue Wu, Meng Chu et al.
ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models
Ke Niu, Haiyang Yu, Mengyang Zhao et al.
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
Zhixi Cai, Fucai Ke, Simindokht Jahangard et al.
ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds
Binbin Xiang, Maciej Wielgosz, Stefano Puliti et al.
DIVE: Taming DINO for Subject-Driven Video Editing
Yi Huang, Wei Xiong, He Zhang et al.
FlowR: Flowing from Sparse to Dense 3D Reconstructions
Tobias Fischer, Samuel Rota Bulò, Yung-Hsu Yang et al.
HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID
Yiyang Su, Yunping Shi, Feng Liu et al.
VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models
Taesung Kwon, Jong Ye
VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling
Hyojun Go, Byeongjun Park, Hyelin Nam et al.
Vision-Language Models Can't See the Obvious
YASSER ABDELAZIZ DAHOU DJILALI, Ngoc Huynh, Phúc Lê Khắc et al.
Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing
Hongyu Shen, Junfeng Ni, Weishuo Li et al.
BANet: Bilateral Aggregation Network for Mobile Stereo Matching
Gangwei Xu, Jiaxin Liu, Xianqi Wang et al.
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent
Xinyao Liao, Xianfang Zeng, Liao Wang et al.
Fix-CLIP: Dual-Branch Hierarchical Contrastive Learning via Synthetic Captions for Better Understanding of Long Text
Bingchao Wang, Zhiwei Ning, Jianyu Ding et al.
StreamGS: Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams
Yang LI, Jinglu Wang, Lei Chu et al.
MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
Huanjin Yao, Jiaxing Huang, Yawen Qiu et al.
MUNBa: Machine Unlearning via Nash Bargaining
Jing Wu, Mehrtash Harandi
Language Driven Occupancy Prediction
Zhu Yu, Bowen Pang, Lizhe Liu et al.
Beyond the Frame: Generating 360° Panoramic Videos from Perspective Videos
Rundong Luo, Matthew Wallingford, Ali Farhadi et al.
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
Rohit Gandikota, Zongze Wu, Richard Zhang et al.
Extrapolated Urban View Synthesis Benchmark
Xiangyu Han, Zhen Jia, Boyi Li et al.
GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography
Mengchen Zhang, Tong Wu, Jing Tan et al.
Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis
Jingjing Ren, Wenbo Li, Zhongdao Wang et al.
Dynamic Typography: Bringing Text to Life via Video Diffusion Prior
Zichen Liu, Yihao Meng, Hao Ouyang et al.
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations
Xiang Xu, Lingdong Kong, Song Wang et al.
SyncDiff: Synchronized Motion Diffusion for Multi-Body Human-Object Interaction Synthesis
Wenkun He, Yun Liu, Ruitao Liu et al.
RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors
Avinash Paliwal, xilong zhou, Wei Ye et al.
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Xinli Xu, Wenhang Ge, Dicong Qiu et al.
PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
Runze He, bo cheng, Yuhang Ma et al.
DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
Yiren Song, Xiaokang Liu, Mike Zheng Shou
Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning
Saemi Moon, Minjong Lee, Sangdon Park et al.
AgroBench: Vision-Language Model Benchmark in Agriculture
Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka et al.
LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
Donald Shenaj, Ondrej Bohdal, Mete Ozay et al.
Cycle Consistency as Reward: Learning Image-Text Alignment without Human Preferences
Hyojin Bahng, Caroline Chan, Fredo Durand et al.
ARIG: Autoregressive Interactive Head Generation for Real-time Conversations
Ying Guo, Xi Liu, Cheng Zhen et al.
DisCoRD: Discrete Tokens to Continuous Motion via Rectified Flow Decoding
Jungbin Cho, Junwan Kim, Jisoo Kim et al.
TruthPrInt: Mitigating Large Vision-Language Models Object Hallucination Via Latent Truthful-Guided Pre-Intervention
Jinhao Duan, Fei Kong, Hao Cheng et al.
Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning
Jingjing Jiang, Chao Ma, Xurui Song et al.
Unified Multimodal Understanding via Byte-Pair Visual Encoding
Wanpeng Zhang, Yicheng Feng, Hao Luo et al.
Training-Free Text-Guided Image Editing with Visual Autoregressive Model
Yufei Wang, Lanqing Guo, Zhihao Li et al.
GaussRender: Learning 3D Occupancy with Gaussian Rendering
Loick Chambon, Eloi Zablocki, Alexandre Boulch et al.
KinMo: Kinematic-aware Human Motion Understanding and Generation
Pengfei Zhang, Pinxin Liu, Pablo Garrido et al.
SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing
Yingying Zhang, Lixiang Ru, Kang Wu et al.
Rethinking Bimanual Robotic Manipulation: Learning with Decoupled Interaction Framework
Jian-Jian Jiang, Xiao-Ming Wu, Yi-Xiang He et al.
ZeroStereo: Zero-shot Stereo Matching from Single Images
Xianqi Wang, Hao Yang, Gangwei Xu et al.
StableCodec: Taming One-Step Diffusion for Extreme Image Compression
Tianyu Zhang, Xin Luo, Li Li et al.
Event-based Tiny Object Detection: A Benchmark Dataset and Baselines
Nuo Chen, Chao Xiao, Yimian Dai et al.
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image
Wonseok Roh, Hwanhee Jung, JongWook Kim et al.
Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer
Qingyu Shi, Jianzong Wu, Jinbin Bai et al.
SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images
Yu Sheng, Jiajun Deng, Xinran Zhang et al.
SynFER: Towards Boosting Facial Expression Recognition with Synthetic Data
Xilin He, Cheng Luo, Xiaole Xian et al.
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces
Ziming Yu, Pan Zhou, Sike Wang et al.
MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices
HAILONG YAN, Ao Li, Xiangtao Zhang et al.
Emulating Self-attention with Convolution for Efficient Image Super-Resolution
Dongheon Lee, Seokju Yun, Youngmin Ro
FlashDepth: Real-time Streaming Video Depth Estimation at 2K Resolution
Gene Chou, Wenqi Xian, Guandao Yang et al.
SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning
Zhi Chen, Zecheng Zhao, Jingcai Guo et al.
HUMOTO: A 4D Dataset of Mocap Human Object Interactions
Jiaxin Lu, Chun-Hao Huang, Uttaran Bhattacharya et al.
ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving
Yuhang Lu, Jiadong Tu, Yuexin Ma et al.
AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
Javier Tirado-Garín, Javier Civera
``Principal Components" Enable A New Language of Images
Xin Wen, Bingchen Zhao, Ismail Elezi et al.
GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting
Yusen XIE, Zhenmin Huang, Jin Wu et al.
CODA: Repurposing Continuous VAEs for Discrete Tokenization
Zeyu Liu, Zanlin Ni, Yeguo Hua et al.
NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
Yanrui Bin, Wenbo Hu, Haoyuan Wang et al.
VALLR: Visual ASR Language Model for Lip Reading
Marshall Thomas, Edward Fish, Richard Bowden
Federated Continual Instruction Tuning
Haiyang Guo, Fanhu Zeng, Fei Zhu et al.
GIViC: Generative Implicit Video Compression
Ge Gao, Siyue Teng, Tianhao Peng et al.
REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents
Rui Tian, Qi Dai, Jianmin Bao et al.
Textured 3D Regenerative Morphing with 3D Diffusion Prior
Songlin Yang, Yushi LAN, Honghua Chen et al.
RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes
Pou-Chun Kung, Skanda Harisha, Ram Vasudevan et al.
Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction
Luyao Tang, Kunze Huang, Yuxuan Yuan et al.
Dual-Process Image Generation
Grace Luo, Jonathan Granskog, Aleksander Holynski et al.
WildSeg3D: Segment Any 3D Objects in the Wild from 2D Images
Yansong Guo, Jie Hu, Yansong Qu et al.
YOLO-Count: Differentiable Object Counting for Text-to-Image Generation
Guanning Zeng, Xiang Zhang, Zirui Wang et al.
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li, Yanqing Liu, Haoqin Tu et al.
OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Huanpeng Chu, Wei Wu, Guanyu Feng et al.
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
Yunheng Li, Yuxuan Li, Quan-Sheng Zeng et al.
Cross-modal Ship Re-Identification via Optical and SAR Imagery: A Novel Dataset and Method
Han Wang, Shengyang Li, Jian Yang et al.
Always Skip Attention
Yiping Ji, Hemanth Saratchandran, Peyman Moghadam et al.
Open-Vocabulary Octree-Graph for 3D Scene Understanding
Zhigang Wang, Yifei Su, Chenhui Li et al.
Learned Image Compression with Hierarchical Progressive Context Modeling
Yuqi Li, Haotian Zhang, Li Li et al.
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning
Fucai Ke, Vijay Kumar b g, Xingjian Leng et al.
Where, What, Why: Towards Explainable Driver Attention Prediction
Yuchen Zhou, Jiayu Tang, Xiaoyan Xiao et al.
Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models
Yudong Jin, Sida Peng, Xuan Wang et al.
R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception
Jonas Mirlach, Lei Wan, Andreas Wiedholz et al.
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
Chuang Yu, Jinmiao Zhao, Yunpeng Liu et al.
Multimodal LLMs as Customized Reward Models for Text-to-Image Generation
Shijie Zhou, Ruiyi Zhang, Huaisheng Zhu et al.
CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization
Jan Ackermann, Jonas Kulhanek, Shengqu Cai et al.
Bringing RNNs Back to Efficient Open-Ended Video Understanding
Weili Xu, Enxin Song, Wenhao Chai et al.
CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.
Model Reveals What to Cache: Profiling-Based Feature Reuse for Video Diffusion Models
Xuran Ma, Yexin Liu, Yaofu LIU et al.
PhysSplat: Efficient Physics Simulation for 3D Scenes via MLLM-Guided Gaussian Splatting
Haoyu Zhao, Hao Wang, Xingyue Zhao et al.
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu, Jingwei Sun, Yueqian Lin et al.
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao, Isaac Chung, Imene Kerboua et al.
DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation
Runze Zhang, Guoguang Du, Xiaochuan Li et al.
Snakes and Ladders: Two Steps Up for VideoMamba
Hui Lu, Albert Ali Salah, Ronald Poppe
Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection
Ruiyang Zhang, Hu Zhang, Zhedong Zheng
GazeGaussian: High-Fidelity Gaze Redirection with 3D Gaussian Splatting
Xiaobao Wei, Peng Chen, Guangyu Li et al.
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng et al.
AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury, Hanan Gani, Nishit Anand et al.
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis
Letian Zhang, Quan Cui, Bingchen Zhao et al.
TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning
Siqi Luo, Haoran Yang, Yi Xin et al.
WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection
Haodong Zhu, Wenhao Dong, Linlin Yang et al.
Multi-turn Consistent Image Editing
Zijun Zhou, Yingying Deng, Xiangyu He et al.
GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR
Christophe Bolduc, Yannick Hold-Geoffroy, Jean-Francois Lalonde
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Rui Hu, Yuxuan Zhang, Lianghui Zhu et al.
PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
Yan Zhang, Yao Feng, Alpár Cseke et al.
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation
Zhuoyan Luo, Yinghao Wu, Tianheng Cheng et al.
Multi-identity Human Image Animation with Structural Video Diffusion
Zhenzhi Wang, Yixuan Li, yanhong zeng et al.
Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
Kaiyang Ji, Ye Shi, Zichen Jin et al.
Make Your Training Flexible: Towards Deployment-Efficient Video Models
Chenting Wang, Kunchang Li, Tianxiang Jiang et al.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Gengze Zhou, Yicong Hong, Zun Wang et al.
SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
Dimitrije Antić, Georgios Paschalidis, Shashank Tripathi et al.
Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures
Tim Seizinger, Florin-Alexandru Vasluianu, Marcos Conde et al.
Φ-GAN:Physics-Inspired GAN for Generating SAR Images Under Limited Data
Xidan Zhang, Yihan Zhuang, Qian Guo et al.
Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
Yuru Jia, Valerio Marsocci, Ziyang Gong et al.