Most Cited CVPR Oral "graph processing" Papers
5,589 papers found • Page 1 of 28
DETRs Beat YOLOs on Real-time Object Detection
Yian Zhao, Wenyu Lv, Shangliang Xu et al.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang et al.
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
Guanjun Wu, Taoran Yi, Jiemin Fang et al.
VBench: Comprehensive Benchmark Suite for Video Generative Models
Ziqi Huang, Yinan He, Jiashuo Yu et al.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li, Yali Wang, Yinan He et al.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.
LISA: Reasoning Segmentation via Large Language Model
Xin Lai, Zhuotao Tian, Yukang Chen et al.
Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
Ziyi Yang, Xinyu Gao, Wen Zhou et al.
VILA: On Pre-training for Visual Language Models
Ji Lin, Danny Yin, Wei Ping et al.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye et al.
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
Tao Lu, Mulin Yu, Linning Xu et al.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
VGGT: Visual Geometry Grounded Transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev et al.
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani et al.
One-step Diffusion with Distribution Matching Distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang et al.
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
David Charatan, Sizhe Lester Li, Andrea Tagliasacchi et al.
SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula et al.
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song, Wenhao Chai, Guanhong Wang et al.
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Sicong Leng, Hang Zhang, Guanzheng Chen et al.
Generative Multimodal Models are In-Context Learners
Quan Sun, Yufeng Cui, Xiaosong Zhang et al.
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
Bowen Wen, Wei Yang, Jan Kautz et al.
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu et al.
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
Jing Shi, Wei Xiong, Zhe Lin et al.
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Qidong Huang, Xiaoyi Dong, Pan Zhang et al.
GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting
Chi Yan, Delin Qu, Dong Wang et al.
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Shuhuai Ren, Linli Yao, Shicheng Li et al.
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin, Ryuichi Takanobu, Cai Zhang et al.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.
Compact 3D Gaussian Representation for Radiance Field
Joo Chan Lee, Daniel Rho, Xiangyu Sun et al.
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Tianyu Yu, Yuan Yao, Haoye Zhang et al.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.
Text-to-3D using Gaussian Splatting
Zilong Chen, Feng Wang, Yikai Wang et al.
Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields
Shijie Zhou, Haoran Chang, Sicheng Jiang et al.
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu, Saining Xie
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
Yiwen Chen, Zilong Chen, Chi Zhang et al.
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew et al.
Splatter Image: Ultra-Fast Single-View 3D Reconstruction
Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi
Video-P2P: Video Editing with Cross-attention Control
Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
Yujun Shi, Chuhui Xue, Jun Hao Liew et al.
SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
Yihua Huang, Yangtian Sun, Ziyi Yang et al.
ReconFusion: 3D Reconstruction with Diffusion Priors
Rundi Wu, Ben Mildenhall, Philipp Henzler et al.
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo et al.
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision
Lu Ling, Yichen Sheng, Zhi Tu et al.
DeepCache: Accelerating Diffusion Models for Free
Xinyin Ma, Gongfan Fang, Xinchao Wang
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution
Rongyuan Wu, Tao Yang, Lingchen Sun et al.
On Scaling Up a Multilingual Vision and Language Model
Xi Chen, Josip Djolonga, Piotr Padlewski et al.
OmniGen: Unified Image Generation
Shitao Xiao, Yueze Wang, Junjie Zhou et al.
VTimeLLM: Empower LLM to Grasp Video Moments
Bin Huang, Xin Wang, Hong Chen et al.
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
Taoran Yi, Jiemin Fang, Junjie Wang et al.
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
Yunyang Xiong, Balakrishnan Varadarajan, Lemeng Wu et al.
Emu Edit: Precise Image Editing via Recognition and Generation Tasks
Shelly Sheynin, Adam Polyak, Uriel Singer et al.
RoMa: Robust Dense Feature Matching
Johan Edstedt, Qiyu Sun, Georg Bökman et al.
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Yaofang Liu, Xiaodong Cun, Xuebo Liu et al.
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Xin Guo, Jiangwei Lao, Bo Dang et al.
Continuous 3D Perception Model with Persistent State
Qianqian Wang, Yifei Zhang, Aleksander Holynski et al.
DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
Jiahe Li, Jiawei Zhang, Xiao Bai et al.
Sequential Modeling Enables Scalable Learning for Large Vision Models
Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.
GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces
Yingwenqi Jiang, Jiadong Tu, Yuan Liu et al.
SinSR: Diffusion-Based Image Super-Resolution in a Single Step
Yufei Wang, Wenhan Yang, Xinyuan Chen et al.
Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
Jiwoo Chung, Sangeek Hyun, Jae-Pil Heo
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Qingqing Zhao, Yao Lu, Moo Jin Kim et al.
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
Le Xue, Ning Yu, Shu Zhang et al.
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Jian Han, Jinlai Liu, Yi Jiang et al.
MambaOut: Do We Really Need Mamba for Vision?
Weihao Yu, Xinchao Wang
Putting the Object Back into Video Object Segmentation
Ho Kei Cheng, Seoung Wug Oh, Brian Price et al.
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
Jeongho Kim, Gyojung Gu, Minho Park et al.
RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection
Ximiao Zhang, Min Xu, Xiuzhuang Zhou
Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
Jin-Chuan Shi, Miao Wang, Haobin Duan et al.
Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
Zhiqi Li, Zhiding Yu, Shiyi Lan et al.
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
Sherwin Bahmani, Ivan Skorokhodov, Victor Rong et al.
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra, Brandon Huang, Trevor Darrell et al.
BioCLIP: A Vision Foundation Model for the Tree of Life
Samuel Stevens, Jiaman Wu, Matthew Thompson et al.
SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting
Zhijing Shao, Zhaolong Wang, Zhuang Li et al.
GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
Junjie Wang, Jiemin Fang, Xiaopeng Zhang et al.
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Shu Zhang, Xinyi Yang, Yihao Feng et al.
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
Shunyuan Zheng, Boyao Zhou, Ruizhi Shao et al.
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Jingfeng Yao, Bin Yang, Xinggang Wang
Grounded Text-to-Image Synthesis with Attention Refocusing
Quynh Phung, Songwei Ge, Jia-Bin Huang
Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering
Zhiwen Yan, Weng Fei Low, Yu Chen et al.
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Roberto Henschel, Levon Khachatryan, Hayk Poghosyan et al.
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Mustikovela et al.
Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan, Wentong Li, Jian Liu et al.
Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians
Yuelang Xu, Benwang Chen, Zhe Li et al.
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Zheng Liu, Peitian Zhang et al.
Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed
Yifan Wang, Xingyi He, Sida Peng et al.
MMA-Diffusion: MultiModal Attack on Diffusion Models
Yijun Yang, Ruiyuan Gao, Xiaosen Wang et al.
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
Yuzhou Huang, Liangbin Xie, Xintao Wang et al.
Optimal Transport Aggregation for Visual Place Recognition
Sergio Izquierdo, Javier Civera
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Xuanchi Ren, Tianchang Shen, Jiahui Huang et al.
Navigation World Models
Amir Bar, Gaoyue Zhou, Danny Tran et al.
Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
Zhiyuan Yan, Yuhao Luo, Siwei Lyu et al.
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
Chunlong Xia, Xinliang Wang, Feng Lv et al.
Probing the 3D Awareness of Visual Foundation Models
Mohamed El Banani, Amit Raj, Kevis-kokitsi Maninis et al.
GART: Gaussian Articulated Template Models
Jiahui Lei, Yufu Wang, Georgios Pavlakos et al.
XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies
Xuanchi Ren, Jiahui Huang, Xiaohui Zeng et al.
VLP: Vision Language Planning for Autonomous Driving
Chenbin Pan, Burhan Yaman, Tommaso Nesti et al.
Relightable Gaussian Codec Avatars
Shunsuke Saito, Gabriel Schwartz, Tomas Simon et al.
GSVA: Generalized Segmentation via Multimodal Large Language Models
Zhuofan Xia, Dongchen Han, Yizeng Han et al.
NeuRAD: Neural Rendering for Autonomous Driving
Adam Tonderski, Carl Lindström, Georg Hess et al.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Kevin Qinghong Lin, Linjie Li, Difei Gao et al.
Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion
Xunpeng Yi, Han Xu, Hao Zhang et al.
Generalized Predictive Model for Autonomous Driving
Jiazhi Yang, Shenyuan Gao, Yihang Qiu et al.
MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds
Jiahui Lei, Yijia Weng, Adam W. Harley et al.
WonderWorld: Interactive 3D Scene Generation from a Single Image
Hong-Xing Yu, Haoyi Duan, Charles Herrmann et al.
Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection
Chengjie Wang, Wenbing Zhu, Bin-Bin Gao et al.
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
Liao Qu, Huichao Zhang, Yiheng Liu et al.
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
Tianwei Yin, Qiang Zhang, Richard Zhang et al.
OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning
Lingyi Hong, Shilin Yan, Renrui Zhang et al.
EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection
Xuanyu Zhang, Runyi Li, Jiwen Yu et al.
Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers
Jinxia Xie, Bineng Zhong, Zhiyi Mo et al.
VideoBooth: Diffusion-based Video Generation with Image Prompts
Yuming Jiang, Tianxing Wu, Shuai Yang et al.
Towards Learning a Generalist Model for Embodied Navigation
Duo Zheng, Shijia Huang, Lin Zhao et al.
InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning
Yan-Shuo Liang, Wu-Jun Li
Hallucination Augmented Contrastive Learning for Multimodal Large Language Model
Chaoya Jiang, Haiyang Xu, Mengfan Dong et al.
Human Gaussian Splatting: Real-time Rendering of Animatable Avatars
Arthur Moreau, Jifei Song, Helisa Dhamo et al.
One-dimensional Adapter to Rule Them All: Concepts, Diffusion Models and Erasing Applications
Mengyao Lyu, Yuhong Yang, Haiwen Hong et al.
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations
Fengyu Yang, Chao Feng, Ziyang Chen et al.
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Yazhou Xing, Yingqing He, Zeyue Tian et al.
VideoLLM-online: Online Video Large Language Model for Streaming Video
Joya Chen, Zhaoyang Lv, Shiwei Wu et al.
Can I Trust Your Answer? Visually Grounded Video Question Answering
Junbin Xiao, Angela Yao, Yicong Li et al.
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
Dewei Zhou, You Li, Fan Ma et al.
Efficient Test-Time Adaptation of Vision-Language Models
Adilbek Karmanov, Dayan Guan, Shijian Lu et al.
Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models
Xianfang Zeng, Xin Chen, Zhongqi Qi et al.
FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization
Jiahui Zhang, Fangneng Zhan, Muyu Xu et al.
SimDA: Simple Diffusion Adapter for Efficient Video Generation
Zhen Xing, Qi Dai, Han Hu et al.
OMG-Seg: Is One Model Good Enough For All Segmentation?
Xiangtai Li, Haobo Yuan, Wei Li et al.
GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence
Van Nguyen Nguyen, Thibault Groueix, Mathieu Salzmann et al.
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
Rundi Wu, Ruiqi Gao, Ben Poole et al.
PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection
Xiaofan Li, Zhizhong Zhang, Xin Tan et al.
RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding
Jihan Yang, Runyu Ding, Weipeng DENG et al.
PLGSLAM: Progressive Neural Scene Representation with Local-to-Global Bundle Adjustment
Tianchen Deng, Guole Shen, Tong Qin et al.
Zero-Reference Low-Light Enhancement via Physical Quadruple Priors
Wenjing Wang, Huan Yang, Jianlong Fu et al.
Stronger, Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
Zhixiang Wei, Lin Chen, Xiaoxiao Ma et al.
3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis
Zhicheng Lu, Xiang Guo, Le Hui et al.
FoundationStereo: Zero-Shot Stereo Matching
Bowen Wen, Matthew Trepte, Oluwaseun Joseph Aribido et al.
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Chuwei Luo, Yufan Shen, Zhaoqing Zhu et al.
SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design
Seokju Yun, Youngmin Ro
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Matt Deitke, Christopher Clark, Sangho Lee et al.
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
Rafail Fridman, Danah Yatim, Omer Bar-Tal et al.
HIPTrack: Visual Tracking with Historical Prompts
Wenrui Cai, Qingjie Liu, Yunhong Wang
Single-Model and Any-Modality for Video Object Tracking
Zongwei Wu, Jilai Zheng, Xiangxuan Ren et al.
GARField: Group Anything with Radiance Fields
Chung Min Kim, Mingxuan Wu, Justin Kerr et al.
Transformers without Normalization
Jiachen Zhu, Xinlei Chen, Kaiming He et al.
Self-correcting LLM-controlled Diffusion Models
Tsung-Han Wu, Long Lian, Joseph Gonzalez et al.
LLaVA-Critic: Learning to Evaluate Multimodal Models
Tianyi Xiong, Xiyao Wang, Dong Guo et al.
DEIM: DETR with Improved Matching for Fast Convergence
Shihua Huang, Zhichao Lu, Xiaodong Cun et al.
Generative Image Dynamics
Zhengqi Li, Richard Tucker, Noah Snavely et al.
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao et al.
FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
Shangzhan Zhang, Jianyuan Wang, Yinghao Xu et al.
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
Bin Xie, Jiale Cao, Jin Xie et al.
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Yuheng Ji, Huajie Tan, Jiayu Shi et al.
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
Chong Mou, Xintao Wang, Jiechong Song et al.
VidToMe: Video Token Merging for Zero-Shot Video Editing
Xirui Li, Chao Ma, Xiaokang Yang et al.
Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation
Sihan Liu, Yiwei Ma, Xiaoqing Zhang et al.
Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields
Leili Goli, Cody Reading, Silvia Sellán et al.
GES: Generalized Exponential Splatting for Efficient Radiance Field Rendering
Abdullah J Hamdi, Luke Melas-Kyriazi, Jinjie Mai et al.
Boosting Adversarial Transferability by Block Shuffle and Rotation
Kunyu Wang, Xuanran He, Wenxuan Wang et al.
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
Jonas Ricker, Denis Lukovnikov, Asja Fischer
DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models
Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood et al.
ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image
Kyle Sargent, Zizhang Li, Tanmay Shah et al.
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
Bingliang Zhang, Wenda Chu, Julius Berner et al.
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri et al.
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
Yifei Huang, Guo Chen, Jilan Xu et al.
Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation
Alexander Raistrick, Lingjie Mei, Karhan Kayan et al.
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
Guosheng Zhao, Chaojun Ni, Xiaofeng Wang et al.
Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining
Xiang Chen, Jinshan Pan, Jiangxin Dong
TUMTraf V2X Cooperative Perception Dataset
Walter Zimmer, Gerhard Arya Wardana, Suren Sritharan et al.
Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
Xiao Wang, Shiao Wang, Chuanming Tang et al.
Video ReCap: Recursive Captioning of Hour-Long Videos
Md Mohaiminul Islam, Vu Bao Ngan Ho, Xitong Yang et al.
MambaIRv2: Attentive State Space Restoration
Hang Guo, Yong Guo, Yaohua Zha et al.
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang et al.
Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution
Zhiyuan You, Xin Cai, Jinjin Gu et al.
IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection
Junbo Yin, Wenguan Wang, Runnan Chen et al.
InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization
Xiefan Guo, Jinlin Liu, Miaomiao Cui et al.
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
Zhenggang Tang, Yuchen Fan, Dilin Wang et al.
InstructVideo: Instructing Video Diffusion Models with Human Feedback
Hangjie Yuan, Shiwei Zhang, Xiang Wang et al.
NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis
Nilesh Kulkarni, Davis Rempe, Kyle Genova et al.
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.
HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
Yuheng Jiang, Zhehao Shen, Penghao Wang et al.
General Object Foundation Model for Images and Videos at Scale
Junfeng Wu, Yi Jiang, Qihao Liu et al.
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian et al.
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement
Xiuquan Hou, Meiqin Liu, Senlin Zhang et al.
Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery
Mubashir Noman, Muzammal Naseer, Hisham Cholakkal et al.
CLIP-KD: An Empirical Study of CLIP Model Distillation
Chuanguang Yang, Zhulin An, Libo Huang et al.
Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
Zan Wang, Yixin Chen, Baoxiong Jia et al.
FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding
Jun Xiang, Xuan Gao, Yudong Guo et al.
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
Yutong Feng, Biao Gong, Di Chen et al.
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling
Haiyang Liu, Zihao Zhu, Giorgio Becherini et al.
Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training
Xiaoyang Wu, Zhuotao Tian, Xin Wen et al.
CCEdit: Creative and Controllable Video Editing via Diffusion Models
Ruoyu Feng, Wenming Weng, Yanhui Wang et al.
Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary
Leheng Zhang, Yawei Li, Xingyu Zhou et al.
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
Yiran Qin, Enshen Zhou, Qichang Liu et al.
Learning Multi-Dimensional Human Preference for Text-to-Image Generation
Sixian Zhang, Bohan Wang, Junqiang Wu et al.
SAI3D: Segment Any Instance in 3D Scenes
Yingda Yin, Yuzheng Liu, Yang Xiao et al.
Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
Kim Youwang, Tae-Hyun Oh, Gerard Pons-Moll
Structure-Aware Sparse-View X-ray 3D Reconstruction
Yuanhao Cai, Jiahao Wang, Alan L. Yuille et al.
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.
Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis
Jiawen Li, Yuxuan Chen, Hongbo Chu et al.
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition
Feng Lu, Xiangyuan Lan, Lijun Zhang et al.