Most Cited ICCV "multimodal pretraining" Papers
2,701 papers found • Page 3 of 14
Conference
CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab et al.
StableCodec: Taming One-Step Diffusion for Extreme Image Compression
Tianyu Zhang, Xin Luo, Li Li et al.
Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction
Luyao Tang, Kunze Huang, Yuxuan Yuan et al.
CL-Splats: Continual Learning of Gaussian Splatting with Local Optimization
Jan Ackermann, Jonas Kulhanek, Shengqu Cai et al.
MobileIE: An Extremely Lightweight and Effective ConvNet for Real-Time Image Enhancement on Mobile Devices
HAILONG YAN, Ao Li, Xiangtao Zhang et al.
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision
Chuang Yu, Jinmiao Zhao, Yunpeng Liu et al.
AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs
Sanjoy Chowdhury, Hanan Gani, Nishit Anand et al.
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces
Ziming Yu, Pan Zhou, Sike Wang et al.
Harnessing Uncertainty-aware Bounding Boxes for Unsupervised 3D Object Detection
Ruiyang Zhang, Hu Zhang, Zhedong Zheng
Event-based Tiny Object Detection: A Benchmark Dataset and Baselines
Nuo Chen, Chao Xiao, Yimian Dai et al.
AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
Javier Tirado-Garín, Javier Civera
FedVLA: Federated Vision-Language-Action Learning with Dual Gating Mixture-of-Experts for Robotic Manipulation
Cui Miao, Tao Chang, meihan wu et al.
Importance-Based Token Merging for Efficient Image and Video Generation
Haoyu Wu, Jingyi Xu, Hieu Le et al.
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
Jeong Hun Yeo, Minsu Kim, Chae Won Kim et al.
MultiverSeg: Scalable Interactive Segmentation of Biomedical Imaging Datasets with In-Context Guidance
Hallee Wong, Jose Javier Gonzalez Ortiz, John Guttag et al.
TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
Jonas Belouadi, Eddy Ilg, Margret Keuper et al.
LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
Shr-Ruei Tsai, Wei-Cheng Chang, Jie-Ying Lee et al.
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
Junhao Cheng, Yuying Ge, Yixiao Ge et al.
Distilling Parallel Gradients for Fast ODE Solvers of Diffusion Models
Beier Zhu, Ruoyu Wang, Tong Zhao et al.
MOSCATO: Predicting Multiple Object State Change Through Actions
Parnian Zameni, Yuhan Shen, Ehsan Elhamifar
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
Weili Zeng, Ziyuan Huang, Kaixiang Ji et al.
Learning Interpretable Queries for Explainable Image Classification with Information Pursuit
Stefan Kolek, Aditya Chattopadhyay, Kwan Ho Ryan Chan et al.
PhysRig: Differentiable Physics-Based Skinning and Rigging Framework for Realistic Articulated Object Modeling
Hao Zhang, Haolan Xu, Chun Feng et al.
DLF: Extreme Image Compression with Dual-generative Latent Fusion
Naifu Xue, Zhaoyang Jia, Jiahao Li et al.
OpenRSD: Towards Open-prompts for Object Detection in Remote Sensing Images
Ziyue Huang, Yongchao Feng, Ziqi Liu et al.
Fine-grained Spatiotemporal Grounding on Egocentric Videos
Shuo LIANG, Yiwu Zhong, Zi-Yuan Hu et al.
LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models
Yu Cheng, Fajie Yuan
Latent Diffusion Models with Masked AutoEncoders
Junho Lee, Jeongwoo Shin, Hyungwook Choi et al.
Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers
Zhengyao Lyu, Tianlin Pan, Chenyang Si et al.
Dense2MoE: Restructuring Diffusion Transformer to MoE for Efficient Text-to-Image Generation
Youwei Zheng, Yuxi Ren, Xin Xia et al.
BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes
Minkyun Seo, Hyungtae Lim, Kanghee Lee et al.
Benchmarking and Learning Multi-Dimensional Quality Evaluator for Text-to-3D Generation
Yujie Zhang, Bingyang Cui, Qi Yang et al.
egoPPG: Heart Rate Estimation from Eye-Tracking Cameras in Egocentric Systems to Benefit Downstream Vision Tasks
Björn Braun, Rayan Armani, Manuel Meier et al.
FreeMorph: Tuning-Free Generalized Image Morphing with Diffusion Model
Yukang Cao, Chenyang Si, Jinghao Wang et al.
Contrastive Test-Time Composition of Multiple LoRA Models for Image Generation
Tuna Meral, Enis Simsar, Federico Tombari et al.
GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments
Lin Zeng, Boming Zhao, Jiarui Hu et al.
DreamFuse: Adaptive Image Fusion with Diffusion Transformer
Junjia Huang, Pengxiang Yan, Jiyang Liu et al.
SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
Byeongjun Park, Hyojun Go, Hyelin Nam et al.
Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning
Qi Wang, Zhipeng Zhang, Baao Xie et al.
Hierarchical Cross-modal Prompt Learning for Vision-Language Models
Hao Zheng, Shunzhi Yang, Zhuoxin He et al.
Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
Divyansh Srivastava, Xiang Zhang, He Wen et al.
Edit360: 2D Image Edits to 3D Assets from Any Angle
Junchao Huang, Xinting Hu, Shaoshuai Shi et al.
Rectifying Magnitude Neglect in Linear Attention
Qihang Fan, Huaibo Huang, Yuang Ai et al.
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
Kaining Ying, Henghui Ding, Guangquan Jie et al.
SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization
Zhentao Tan, Ben Xue, Jian Jia et al.
SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding
Tianci Wen, Zhiang Liu, Yongchun Fang
Multi-turn Consistent Image Editing
Zijun Zhou, Yingying Deng, Xiangyu He et al.
Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration
Junyuan Deng, Wei Yin, Xiaoyang Guo et al.
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
Rui Hu, Yuxuan Zhang, Lianghui Zhu et al.
Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
Ao Ma, Jiasong Feng, Ke Cao et al.
DC-ControlNet: Decoupling Inter- and Intra-Element Conditions in Image Generation with Diffusion Models
hongji yang, Wencheng Han, Yucheng Zhou et al.
TokenUnify: Scaling Up Autoregressive Pretraining for Neuron Segmentation
Yinda Chen, Haoyuan Shi, Xiaoyu Liu et al.
X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation
jian ma, Qirong Peng, Xu Guo et al.
WaveMamba: Wavelet-Driven Mamba Fusion for RGB-Infrared Object Detection
Haodong Zhu, Wenhao Dong, Linlin Yang et al.
Adversarial Robust Memory-Based Continual Learner
Xiaoyue Mi, Fan Tang, Zonghan Yang et al.
Improving Multimodal Learning via Imbalanced Learning
Shicai Wei, Chunbo Luo, Yang Luo
PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model
Jinhua Zhang, Hualian Sheng, Sijia Cai et al.
Gaussian Splatting with Discretized SDF for Relightable Assets
Zuo-Liang Zhu, jian Yang, Beibei Wang
PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning
Yan Zhang, Yao Feng, Alpár Cseke et al.
HERO: Human Reaction Generation from Videos
Chengjun Yu, Wei Zhai, Yuhang Yang et al.
Adaptive Hyper-Graph Convolution Network for Skeleton-based Human Action Recognition with Virtual Connections
Youwei Zhou, Tianyang Xu, Cong Wu et al.
SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality
Sijie Li, Chen Chen, Jungong Han
Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?
Yuru Jia, Valerio Marsocci, Ziyang Gong et al.
CaO2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation
Haoxuan Wang, Zhenghao Zhao, Junyi Wu et al.
ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness
Boqian Li, Zeyu Cai, Michael Black et al.
GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR
Christophe Bolduc, Yannick Hold-Geoffroy, Jean-Francois Lalonde
DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Model
Junjia Huang, Pengxiang Yan, Jinhang Cai et al.
Φ-GAN:Physics-Inspired GAN for Generating SAR Images Under Limited Data
Xidan Zhang, Yihan Zhuang, Qian Guo et al.
Multi-identity Human Image Animation with Structural Video Diffusion
Zhenzhi Wang, Yixuan Li, yanhong zeng et al.
Towards Immersive Human-X Interaction: A Real-Time Framework for Physically Plausible Motion Synthesis
Kaiyang Ji, Ye Shi, Zichen Jin et al.
Frequency-Dynamic Attention Modulation For Dense Prediction
Linwei Chen, Lin Gu, Ying Fu
Low-Light Image Enhancement using Event-Based Illumination Estimation
Lei Sun, Yuhan Bao, Jiajun Zhai et al.
SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image
Dimitrije Antić, Georgios Paschalidis, Shashank Tripathi et al.
CoHD: A Counting-Aware Hierarchical Decoding Framework for Generalized Referring Expression Segmentation
Zhuoyan Luo, Yinghao Wu, Tianheng Cheng et al.
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
Gengze Zhou, Yicong Hong, Zun Wang et al.
Learning Normal Flow Directly From Events
Dehao Yuan, Levi Burner, Jiayi Wu et al.
DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization
Wenchuan Wang, Mengqi Huang, Yijing Tu et al.
Make Your Training Flexible: Towards Deployment-Efficient Video Models
Chenting Wang, Kunchang Li, Tianxiang Jiang et al.
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Wenqi Zhang, Hang Zhang, Xin Li et al.
Bokehlicious: Photorealistic Bokeh Rendering with Controllable Apertures
Tim Seizinger, Florin-Alexandru Vasluianu, Marcos Conde et al.
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
Peng-Hao Hsu, Ke Zhang, Fu-En Wang et al.
GM-MoE: Low-Light Enhancement with Gated-Mechanism Mixture-of-Experts
Minwen Liao, Hao Dong, Xinyi Wang et al.
NeurOp-Diff: Continuous Remote Sensing Image Super-Resolution via Neural Operator Diffusion
Zihao Xu, Yuzhi Tang, Bowen Xu et al.
Salvaging the Overlooked: Leveraging Class-Aware Contrastive Learning for Multi-Class Anomaly Detection
Lei Fan, Junjie Huang, Donglin Di et al.
CAVIS: Context-Aware Video Instance Segmentation
Seunghun Lee, Jiwan Seo, Kiljoon Han et al.
DH-FaceVid-1K: A Large-Scale High-Quality Dataset for Face Video Generation
Donglin Di, He Feng, Wenzhang SUN et al.
ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools
Shaofeng Yin, Ting Lei, Yang Liu
Generative Zoo
Tomasz Niewiadomski, Anastasios Yiannakidis, Hanz Cuevas Velasquez et al.
Humans as a Calibration Pattern: Dynamic 3D Scene Reconstruction from Unsynchronized and Uncalibrated Videos
Changwoon Choi, Jeongjun Kim, Geonho Cha et al.
TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction
Zewei Zhou, Zhihao Zhao, Tianhui Cai et al.
OuroMamba: A Data-Free Quantization Framework for Vision Mamba
Akshat Ramachandran, Mingyu Lee, Huan Xu et al.
CutS3D: Cutting Semantics in 3D for 2D Unsupervised Instance Segmentation
Leon Sick, Dominik Engel, Sebastian Hartwig et al.
RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
Dongming Wu, Yanping Fu, Saike Huang et al.
OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering
Shiyong Liu, Xiao Tang, Zhihao Li et al.
Learning to Generalize without Bias for Open-Vocabulary Action Recognition
Yating Yu, Congqi Cao, Yifan Zhang et al.
Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning
Tianyi Zhao, Boyang Liu, Yanglei Gao et al.
Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation
Jie Xu, Na Zhao, Gang Niu et al.
Learning to Unlearn while Retaining: Combating Gradient Conflicts in Machine Unlearning
Gaurav Patel, Qiang Qiu
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
Chenwei Lin, Hanjia Lyu, Xian Xu et al.
DeGauss: Dynamic-Static Decomposition with Gaussian Splatting for Distractor-free 3D Reconstruction
Rui Wang, Quentin Lohmeyer, Mirko Meboldt et al.
GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
Saarthak Kapse, Pushpak Pati, Srikar Yellapragada et al.
SAM4D: Segment Anything in Camera and LiDAR Streams
Jianyun Xu, Song Wang, Ziqian Ni et al.
Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing
Joonghyuk Shin, Alchan Hwang, Yujin Kim et al.
Hints of Prompt: Enhancing Visual Representation for Multimodal LLMs in Autonomous Driving
Hao Zhou, Zhanning Gao, Zhili Chen et al.
VideoLLaMB: Long Streaming Video Understanding with Recurrent Memory Bridges
Yuxuan Wang, Yiqi Song, Cihang Xie et al.
Sculpting Memory: Multi-Concept Forgetting in Diffusion Models via Dynamic Mask and Concept-Aware Optimization
Li, Yang Xiao, Jie Ji et al.
MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips
SHIBO WANG, Haonan He, Maria Parelli et al.
VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving
Ruifei Zhang, Wei Zhang, Xiao Tan et al.
A Token-level Text Image Foundation Model for Document Understanding
Tongkun Guan, Zining Wang, Pei Fu et al.
Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity
Sung Ju Lee, Nam Ik Cho
QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization
Yueh-Cheng Liu, Lukas Höllein, Matthias Nießner et al.
SP2T: Sparse Proxy Attention for Dual-stream Point Transformer
Jiaxu Wan, Hong Zhang, Ziqi He et al.
LightSwitch: Multi-view Relighting with Material-guided Diffusion
Yehonathan Litman, Fernando De la Torre, Shubham Tulsiani
SViM3D: Stable Video Material Diffusion for Single Image 3D Generation
Andreas Engelhardt, Mark Boss, Vikram Voleti et al.
TrustMark: Robust Watermarking and Watermark Removal for Arbitrary Resolution Images
Tu Bui, Shruti Agarwal, John Collomosse
Learning 3D Object Spatial Relationships from Pre-trained 2D Diffusion Models
Sangwon Baik, Hyeonwoo Kim, Hanbyul Joo
Balanced Image Stylization with Style Matching Score
Yuxin Jiang, Liming Jiang, Shuai Yang et al.
PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
Teng Zhou, Xiaoyu Zhang, Yongchuan Tang
NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes
Han-Hung Lee, Qinghong Han, Angel Chang
PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors
Kangan Qian, Jinyu Miao, Xinyu Jiao et al.
ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
Ruijie Zhu, Mulin Yu, Linning Xu et al.
DuoLoRA : Cycle-consistent and Rank-disentangled Content-Style Personalization
Aniket Roy, Shubhankar Borse, Shreya Kadambi et al.
SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies
Liang Han, Xu Zhang, Haichuan Song et al.
MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network
Jianfei Jiang, Qiankun Liu, Haochen Yu et al.
LongAnimation: Long Animation Generation with Dynamic Global-Local Memory
Nan Chen, Mengqi Huang, Yihao Meng et al.
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
Shadi Hamdan, Chonghao Sima, Zetong Yang et al.
ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery
Yanzhe Lyu, Kai Cheng, Kang Xin et al.
Towards Higher Effective Rank in Parameter-Efficient Fine-tuning using Khatri-Rao Product
Paul Albert, Frederic Zhang, Hemanth Saratchandran et al.
Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation
Tiange Xiang, Kai Li, Chengjiang Long et al.
CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems
Aniket Rege, Zinnia Nie, Unmesh Raskar et al.
StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting
Shakiba Kheradmand, Delio Vicini, George Kopanas et al.
CanonSwap: High-Fidelity and Consistent Video Face Swapping via Canonical Space Modulation
Xiangyang Luo, Ye Zhu, Yunfei Liu et al.
CoMPaSS: Enhancing Spatial Understanding in Text-to-Image Diffusion Models
Gaoyang Zhang, Bingtao Fu, Qingnan Fan et al.
Knowledge Distillation with Refined Logits
Wujie Sun, Defang Chen, Siwei Lyu et al.
DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model
Rui Yu, Xianghang Zhang, Runkai Zhao et al.
VGGSounder: Audio-Visual Evaluations for Foundation Models
Daniil Zverev, Thaddäus Wiedemer, Ameya Prabhu et al.
GS-ID: Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors
Kang DU, Zhihao Liang, Yulin Shen et al.
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
Wenwen Yu, Zhibo Yang, Yuliang Liu et al.
Auto-Regressively Generating Multi-View Consistent Images
JiaKui Hu, Yuxiao Yang, Jialun Liu et al.
GAP: Gaussianize Any Point Clouds with Text Guidance
Weiqi Zhang, Junsheng Zhou, Haotian Geng et al.
Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models
Zhen Zeng, Leijiang Gu, Xun Yang et al.
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang et al.
MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh
Shuangkang Fang, I-Chao Shen, Yufeng Wang et al.
iManip: Skill-Incremental Learning for Robotic Manipulation
Zexin Zheng, Jia-Feng Cai, Xiao-Ming Wu et al.
7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting
Zhongpai Gao, Benjamin Planche, Meng Zheng et al.
SIMS: Simulating Stylized Human-Scene Interactions with Retrieval-Augmented Script Generation
Wenjia Wang, Liang Pan, Zhiyang Dou et al.
Motion Synthesis with Sparse and Flexible Keyjoint Control
Inwoo Hwang, Jinseok Bae, Donggeun Lim et al.
I2V3D: Controllable Image-to-video Generation with 3D Guidance
Zhiyuan Zhang, Dongdong Chen, Jing Liao
HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis
Timo Teufel, xilong zhou, Umar Iqbal et al.
Dynamic Multimodal Prototype Learning in Vision-Language Models
Xingyu Zhu, Shuo Wang, Beier Zhu et al.
HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction
Sara Rojas Martinez, Matthieu Armando, Bernard Ghanem et al.
Precise Action-to-Video Generation Through Visual Action Prompts
Yuang Wang, Chao Wen, Haoyu Guo et al.
VertexRegen: Mesh Generation with Continuous Level of Detail
Xiang Zhang, Yawar Siddiqui, Armen Avetisyan et al.
Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation
Xincheng Shuai, Henghui Ding, Zhenyuan Qin et al.
Auto-Vocabulary Semantic Segmentation
Osman Ülger, Maksymilian Kulicki, Yuki Asano et al.
DC-AR: Efficient Masked Autoregressive Image Generation with Deep Compression Hybrid Tokenizer
Yecheng Wu, Han Cai, Junyu Chen et al.
DuCos: Duality Constrained Depth Super-Resolution via Foundation Model
Zhiqiang Yan, Zhengxue Wang, Haoye Dong et al.
Not All Frame Features Are Equal: Video-to-4D Generation via Decoupling Dynamic-Static Features
Liying Yang, Chen Liu, Zhenwei Zhu et al.
Synergistic Prompting for Robust Visual Recognition with Missing Modalities
Zhihui Zhang, Luanyuan Dai, Qika Lin et al.
EAMamba: Efficient All-Around Vision State Space Model for Image Restoration
Yu-Cheng Lin, Yu-Syuan Xu, Hao-Wei Chen et al.
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization
Sihan Yang, Runsen Xu, Chenhang Cui et al.
Neural Shell Texture Splatting: More Details and Fewer Primitives
Xin Zhang, Anpei Chen, Jincheng Xiong et al.
Video Motion Graphs
Haiyang Liu, Zhan Xu, Fating Hong et al.
TF-TI2I: Training-Free Text-and-Image-to-Image Generation via Multi-Modal Implicit-Context Learning In Text-to-Image Models
Teng-Fang Hsiao, Bo-Kai Ruan, Yi-Lun Wu et al.
GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning
Kelin Yu, Sheng Zhang, Harshit Soora et al.
Temporal Unlearnable Examples: Preventing Personal Video Data from Unauthorized Exploitation by Object Tracking
Qiangqiang Wu, Yi Yu, Chenqi Kong et al.
EgoM2P: Egocentric Multimodal Multitask Pretraining
Gen Li, Yutong Chen, Yiqian Wu et al.
UKBOB: One Billion MRI Labeled Masks for Generalizable 3D Medical Image Segmentation
Emmanuelle Bourigault, Amir Jamaludin, Abdullah Hamdi
SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning
Ziqi Wang, Chang Che, Qi Wang et al.
Self-Calibrated Variance-Stabilizing Transformations for Real-World Image Denoising
Sébastien Herbreteau, Michael Unser
UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation
Chaitanya Patel, Hiroki Nakamura, Yuta Kyuragi et al.
Beyond [cls]: Exploring the True Potential of Masked Image Modeling Representations
Marcin Przewięźlikowski, Randall Balestriero, Wojciech Jasiński et al.
Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
Jun Li, Jinpeng Wang, Chaolei Tan et al.
C4D: 4D Made from 3D through Dual Correspondences
Shizun Wang, Zhenxiang Jiang, Xingyi Yang et al.
VOVTrack: Exploring the Potentiality in Raw Videos for Open-Vocabulary Multi-Object Tracking
Zekun Qian, Ruize Han, Junhui Hou et al.
MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion
Zihan Wang, Jeff Tan, Tarasha Khurana et al.
MikuDance: Animating Character Art with Mixed Motion Dynamics
Jiaxu Zhang, Xianfang Zeng, Xin Chen et al.
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
Kwon Byung-Ki, Qi Dai, Lee Hyoseok et al.
How Can Objects Help Video-Language Understanding?
Zitian Tang, Shijie Wang, Junho Cho et al.
HairCUP: Hair Compositional Universal Prior for 3D Gaussian Avatars
Byungjun Kim, Shunsuke Saito, Giljoo Nam et al.
ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail
Chandan Yeshwanth, David Rozenberszki, Angela Dai
Exploiting Diffusion Prior for Task-driven Image Restoration
Jaeha Kim, Junghun Oh, Kyoung Mu Lee
Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis
Chen Zhao, Xuan Wang, Tong Zhang et al.
Jigsaw++: Imagining Complete Shape Priors for Object Reassembly
Jiaxin Lu, Gang Hua, Qixing Huang
Heavy Labels Out! Dataset Distillation with Label Space Lightening
Ruonan Yu, Songhua Liu, Zigeng Chen et al.
MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation
Fu Rong, Meng Lan, Qian Zhang et al.
Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
Mingxuan Wu, Huang Huang, Justin Kerr et al.
GT-Loc: Unifying When and Where in Images through a Joint Embedding Space
David G. Shatwell, Ishan Rajendrakumar Dave, Swetha Sirnam et al.
On the Generalization of Representation Uncertainty in Earth Observation
Spyros Kondylatos, Nikolaos Ioannis Bountos, Dimitrios Michail et al.
Joint Diffusion Models in Continual Learning
Paweł Skierś, Kamil Deja
ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling
Jinhyung Park, Javier Romero, Shunsuke Saito et al.
Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features
Chancharik Mitra, Brandon Huang, Tianning Chai et al.
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
Fangfu Liu, Hao Li, Jiawei Chi et al.
VoteSplat: Hough Voting Gaussian Splatting for 3D Scene Understanding
Minchao Jiang, Shunyu Jia, Jiaming Gu et al.
ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering
Kaisi Guan, Zhengfeng Lai, Yuchong Sun et al.
RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction
Yufeng Zhong, Chengjian Feng, Feng yan et al.
Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA
Zhixuan Li, Hyunse Yoon, Sanghoon Lee et al.
Joint Self-Supervised Video Alignment and Action Segmentation
Ali Shah Ali, Syed Ahmed Mahmood, Mubin Saeed et al.
ARGUS: Hallucination and Omission Evaluation in Video-LLMs
Ruchit Rawal, Reza Shirkavand, Heng Huang et al.
ResQ: A Novel Framework to Implement Residual Neural Networks on Analog Rydberg Atom Quantum Computers
Nicholas DiBrita, Jason Han, Tirthak Patel