Most Cited CVPR "exploration-exploitation tradeoff" Papers
5,589 papers found • Page 2 of 28
Conference
VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis
Linshan Wu, Jia-Xin Zhuang, Hao Chen
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
Yiran Qin, Enshen Zhou, Qichang Liu et al.
Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary
Leheng Zhang, Yawei Li, Xingyu Zhou et al.
SAI3D: Segment Any Instance in 3D Scenes
Yingda Yin, Yuzheng Liu, Yang Xiao et al.
Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis
Jiawen Li, Yuxuan Chen, Hongbo Chu et al.
GROUNDHOG: Grounding Large Language Models to Holistic Segmentation
Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.
Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering
Kim Youwang, Tae-Hyun Oh, Gerard Pons-Moll
Structure-Aware Sparse-View X-ray 3D Reconstruction
Yuanhao Cai, Jiahao Wang, Alan L. Yuille et al.
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Yilun Zhao, Lujing Xie, Haowei Zhang et al.
Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology
Andrew Song, Richard J. Chen, Tong Ding et al.
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.
COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction
Qihang Ma, Xin Tan, Yanyun Qu et al.
Multimodal Autoregressive Pre-training of Large Vision Encoders
Enrico Fini, Mustafa Shukor, Xiujun Li et al.
LLaFS: When Large Language Models Meet Few-Shot Segmentation
Lanyun Zhu, Tianrun Chen, Deyi Ji et al.
ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion
Jiayu Yang, Ziang Cheng, Yunfei Duan et al.
MoCha-Stereo: Motif Channel Attention Network for Stereo Matching
Ziyang Chen, Wei Long, He Yao et al.
CoSeR: Bridging Image and Language for Cognitive Super-Resolution
Haoze Sun, Wenbo Li, Jianzhuang Liu et al.
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Evonne Ng, Javier Romero, Timur Bagautdinov et al.
Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing
Yafei Zhang, Shen Zhou, Huafeng Li
GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning
Ye Yuan, Xueting Li, Yangyi Huang et al.
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Xi Chen, Zhifei Zhang, He Zhang et al.
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
Runtao Liu, Haoyu Wu, Zheng Ziqiang et al.
Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection
Zhiwei Yang, Jing Liu, Peng Wu
MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception
Thien-Minh Nguyen, Shenghai Yuan, Thien Nguyen et al.
SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion
Hsuan-I Ho, Jie Song, Otmar Hilliges
OpenBias: Open-set Bias Detection in Text-to-Image Generative Models
Moreno D', Incà, Elia Peruzzo et al.
Free3D: Consistent Novel View Synthesis without 3D Representation
Chuanxia Zheng, Andrea Vedaldi
Optimizing Diffusion Noise Can Serve As Universal Motion Priors
Korrawe Karunratanakul, Konpat Preechakul, Emre Aksan et al.
OmniGlue: Generalizable Feature Matching with Foundation Model Guidance
Hanwen Jiang, Arjun Karpur, Bingyi Cao et al.
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Luo, Xue Yang, Wenhan Dou et al.
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang, Jihao Qiu, Lingxi Xie et al.
Towards Realistic Scene Generation with LiDAR Diffusion Models
Haoxi Ran, Vitor Guizilini, Yue Wang
BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP
Jiawang Bai, Kuofeng Gao, Shaobo Min et al.
One-Minute Video Generation with Test-Time Training
Jiarui Xu, Shihao Han, Karan Dalal et al.
Scaling Laws for Data Filtering— Data Curation cannot be Compute Agnostic
Sachin Goyal, Pratyush Maini, Zachary Lipton et al.
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
Kaiwen Zhang, Yifan Zhou, Xudong XU et al.
OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning
Haiyang Ying, Yixuan Yin, Jinzhi Zhang et al.
FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions
Zhen Liu, Hao Zhu, Qi Zhang et al.
Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
Shanshan Zhong, Zhongzhan Huang, Shanghua Gao et al.
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Duo Zheng, Shijia Huang, Liwei Wang
Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions
Oindrila Saha, Grant Horn, Subhransu Maji
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
Zhenglin Huang, Jinwei Hu, Yiwei He et al.
Open-Vocabulary Video Anomaly Detection
Peng Wu, Xuerong Zhou, Guansong Pang et al.
MonoCD: Monocular 3D Object Detection with Complementary Depths
Longfei Yan, Pei Yan, Shengzhou Xiong et al.
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
Shuting He, Henghui Ding
SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
Zhixuan Liang, Yao Mu, Hengbo Ma et al.
UniScene: Unified Occupancy-centric Driving Scene Generation
Bohan Li, Jiazhe Guo, Hongsi Liu et al.
NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild
Weining Ren, Zihan Zhu, Boyang Sun et al.
Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval
Jiamian Wang, Guohao Sun, Pichao Wang et al.
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
Jisu Nam, Heesu Kim, DongJae Lee et al.
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
guo, Tianwei Lin
Video Interpolation with Diffusion Models
Siddhant Jain, Daniel Watson, Aleksander Holynski et al.
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
Yuchao Gu, Yipin Zhou, Bichen Wu et al.
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
Ziqi Pang, Tianyuan Zhang, Fujun Luan et al.
IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation
Yizhi Song, Zhifei Zhang, Zhe Lin et al.
Source-Free Domain Adaptation with Frozen Multimodal Foundation Model
Song Tang, Wenxin Su, Mao Ye et al.
GPT4Point: A Unified Framework for Point-Language Understanding and Generation
Zhangyang Qi, Ye Fang, Zeyi Sun et al.
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
Junming Chen, Yunfei Liu, Jianan Wang et al.
Koala: Key Frame-Conditioned Long Video-LLM
Reuben Tan, Ximeng Sun, Ping Hu et al.
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
Mark Boss, Zixuan Huang, Aaryaman Vasishta et al.
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
Yongting Zhang, Lu Chen, Guodong Zheng et al.
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
Sebastian Koch, Narunas Vaskevicius, Mirco Colosi et al.
StableAnimator: High-Quality Identity-Preserving Human Image Animation
Shuyuan Tu, Zhen Xing, Xintong Han et al.
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models
Julio Silva-Rodríguez, Sina Hajimiri, Ismail Ben Ayed et al.
PEEKABOO: Interactive Video Generation via Masked-Diffusion
Yash Jain, Anshul Nasery, Vibhav Vineet et al.
DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction
Weiyi Lv, Yuhang Huang, NING Zhang et al.
HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention
Xiaolong Tang, Meina Kan, Shiguang Shan et al.
pix2gestalt: Amodal Segmentation by Synthesizing Wholes
Ege Ozguroglu, Ruoshi Liu, Dídac Surís et al.
ControlRoom3D: Room Generation using Semantic Proxy Rooms
Jonas Schult, Sam Tsai, Lukas Höllein et al.
Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding
Zhiheng Cheng, Qingyue Wei, Hongru Zhu et al.
DePT: Decoupled Prompt Tuning
Ji Zhang, Shihan Wu, Lianli Gao et al.
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
Ryan Burgert, Yuancheng Xu, Wenqi Xian et al.
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki et al.
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi, Lewei Yao, Jiahui Gao et al.
Point Cloud Pre-training with Diffusion Models
xiao zheng, Xiaoshui Huang, Guofeng Mei et al.
Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
Hongjie Wang, Bhishma Dedhia, Niraj Jha
Driving Everywhere with Large Language Model Policy Adaptation
Boyi Li, Yue Wang, Jiageng Mao et al.
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
Linyi Jin, Richard Tucker, Zhengqi Li et al.
Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction
Junuk Cha, Jihyeon Kim, Jae Shin Yoon et al.
Seamless Human Motion Composition with Blended Positional Encodings
German Barquero, Sergio Escalera, Cristina Palmero
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Tianyu Yu, Haoye Zhang, Qiming Li et al.
MUSt3R: Multi-view Network for Stereo 3D Reconstruction
Yohann Cabon, Lucas Stoffl, Leonid Antsfeld et al.
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
Edward LOO, Tianyu HUANG, Peng Li et al.
FedAS: Bridging Inconsistency in Personalized Federated Learning
Xiyuan Yang, Wenke Huang, Mang Ye
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
Yanzuo Lu, Manlin Zhang, Jinhua Ma et al.
Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction
Devikalyan Das, Christopher Wewer, Raza Yunus et al.
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Orr Zohar, Xiaohan Wang, Yann Dubois et al.
Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives
Ronghui Li, Yuxiang Zhang, Yachao Zhang et al.
Stable Flow: Vital Layers for Training-Free Image Editing
Omri Avrahami, Or Patashnik, Ohad Fried et al.
Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors
Nicolae Ristea, Florinel Croitoru, Radu Tudor Ionescu et al.
Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID
Wentao Tan, Changxing Ding, Jiayu Jiang et al.
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
Zhiwu Qing, Shiwei Zhang, Jiayu Wang et al.
Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
Lei Chen, Yuan Meng, Chen Tang et al.
Wonderland: Navigating 3D Scenes from a Single Image
Hanwen Liang, Junli Cao, Vidit Goel et al.
Rethinking Diffusion Model for Multi-Contrast MRI Super-Resolution
Guangyuan Li, Chen Rao, Juncheng Mo et al.
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Songhao Han, Wei Huang, Hairong Shi et al.
Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching
Shitong Shao, Zeyuan Yin, Muxin Zhou et al.
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
Inhwan Bae, Junoh Lee, Hae-Gon Jeon
Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance
Tomer Garber, Tom Tirer
GoalFlow: Goal-Driven Flow Matching for Multimodal Trajectories Generation in End-to-End Autonomous Driving
Zebin Xing, Xingyu Zhang, Yang Hu et al.
Text2Loc: 3D Point Cloud Localization from Natural Language
Yan Xia, Letian Shi, Zifeng Ding et al.
3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting
Qi Wu, Janick Martinez Esturo, Ashkan Mirzaei et al.
EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation
Rang Meng, Xingyu Zhang, Yuming Li et al.
SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation
Yamei Chen, Yan Di, Guangyao Zhai et al.
RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation
Peng Lu, Tao Jiang, Yining Li et al.
ReconDreamer: Crafting World Models for Driving Scene Reconstruction via Online Restoration
Chaojun Ni, Guosheng Zhao, Xiaofeng Wang et al.
MemFlow: Optical Flow Estimation and Prediction with Memory
Qiaole Dong, Yanwei Fu
WiLoR: End-to-end 3D Hand Localization and Reconstruction in-the-wild
Rolandos Alexandros Potamias, Jinglei Zhang, Jiankang Deng et al.
Text-Image Alignment for Diffusion-Based Perception
Neehar Kondapaneni, Markus Marks, Manuel Knott et al.
FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion
George Cazenavette, Avneesh Sud, Thomas Leung et al.
Task Singular Vectors: Reducing Task Interference in Model Merging
Antonio Andrea Gargiulo, Donato Crisostomi, Maria Sofia Bucarelli et al.
Multiple Object Tracking as ID Prediction
Ruopeng Gao, Ji Qi, Limin Wang
GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh
Jing Wen, Xiaoming Zhao, Jason Ren et al.
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Xiang Wang, Shiwei Zhang, Hangjie Yuan et al.
Goku: Flow Based Video Generative Foundation Models
Shoufa Chen, Chongjian GE, Yuqi Zhang et al.
Visual In-Context Prompting
Feng Li, Qing Jiang, Hao Zhang et al.
Dinomaly: The Less Is More Philosophy in Multi-Class Unsupervised Anomaly Detection
Jia Guo, Shuai Lu, Weihang Zhang et al.
SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World
Kiana Ehsani, Tanmay Gupta, Rose Hendrix et al.
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models
Shivangi Aneja, Justus Thies, Angela Dai et al.
AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond
Zixiang Zhou, Yu Wan, Baoyuan Wang
A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint
Xiaofeng Cong, Jie Gui, Jing Zhang et al.
PointOBB: Learning Oriented Object Detection via Single Point Supervision
Junwei Luo, Xue Yang, Yi Yu et al.
Enhancing Multimodal Cooperation via Sample-level Modality Valuation
Yake Wei, Ruoxuan Feng, Zihe Wang et al.
Accelerating Diffusion Sampling with Optimized Time Steps
Shuchen Xue, Zhaoqiang Liu, Fei Chen et al.
End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames
Shuming Liu, Chenlin Zhang, Chen Zhao et al.
MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
Yining Hong, Zishuo Zheng, Peihao Chen et al.
MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures
Zhangyang Xiong, Chenghong Li, Kenkun Liu et al.
Describing Differences in Image Sets with Natural Language
Lisa Dunlap, Yuhui Zhang, Xiaohan Wang et al.
CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing
Ajian Liu, Shuai Xue, Gan Jianwen et al.
Bilateral Propagation Network for Depth Completion
Jie Tang, Fei-Peng Tian, Boshi An et al.
Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification
kaijie ren, Lei Zhang
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
Zike Wu, Pan Zhou, YI Xuanyu et al.
DiffusionLight: Light Probes for Free by Painting a Chrome Ball
Pakkapon Phongthawee, Worameth Chinchuthakun, Nontaphat Sinsunthithet et al.
Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick, Guangxing Han, Rui Hou et al.
Discovering and Mitigating Visual Biases through Keyword Explanation
Younghyun Kim, Sangwoo Mo, Minkyu Kim et al.
Few-Shot Object Detection with Foundation Models
Guangxing Han, Ser-Nam Lim
Matching Anything by Segmenting Anything
Siyuan Li, Lei Ke, Martin Danelljan et al.
Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network
wenqiao Li, Xiaohao Xu, Yao Gu et al.
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification
Pingping Zhang, Yuhao Wang, Yang Liu et al.
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
Chaoyi Zhang, Kevin Lin, Zhengyuan Yang et al.
On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning?
Maxime Zanella, Ismail Ben Ayed
Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval
Yucheng Suo, Fan Ma, Linchao Zhu et al.
SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments
Shibo Zhao, Yuanjun Gao, Tianhao Wu et al.
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
Shuai Yang, Yifan Zhou, Ziwei Liu et al.
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale
Baorui Ma, Huachen Gao, Haoge Deng et al.
Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects
Yijia Weng, Bowen Wen, Jonathan Tremblay et al.
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Jitesh Jain, Jianwei Yang, Humphrey Shi
SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples
Phillip Howard, Avinash Madasu, Tiep Le et al.
DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari et al.
DAP: A Dynamic Adversarial Patch for Evading Person Detectors
Amira Guesmi, Ruitian Ding, Muhammad Abdullah Hanif et al.
Neural Markov Random Field for Stereo Matching
Tongfan Guan, Chen Wang, Yun-Hui Liu
MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation
Mi Yan, Jiazhao Zhang, Yan Zhu et al.
Generating Human Motion in 3D Scenes from Text Descriptions
Zhi Cen, Huaijin Pi, Sida Peng et al.
Language-driven All-in-one Adverse Weather Removal
Hao Yang, Liyuan Pan, Yan Yang et al.
MatFuse: Controllable Material Generation with Diffusion Models
Giuseppe Vecchio, Renato Sortino, Simone Palazzo et al.
OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers
Han Liang, Jiacheng Bao, Ruichi Zhang et al.
Mosaic-SDF for 3D Generative Models
Lior Yariv, Omri Puny, Oran Gafni et al.
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
Yu Zeng, Vishal M. Patel, Haochen Wang et al.
One-Prompt to Segment All Medical Images
Wu, Min Xu
SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection
JUNSU KIM, Hoseong Cho, Jihyeon Kim et al.
Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks
Yuhao Liu, Zhanghan Ke, Fang Liu et al.
Digital Life Project: Autonomous 3D Characters with Social Intelligence
Zhongang Cai, Jianping Jiang, Zhongfei Qing et al.
PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation
Zhenyu Li, Shariq Bhat, Peter Wonka
Grounded Question-Answering in Long Egocentric Videos
Shangzhe Di, Weidi Xie
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
Rui Chen, Jianfeng Zhang, Yixun Liang et al.
VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis
Enric Corona, Andrei Zanfir, Eduard Gabriel Bazavan et al.
M-LLM Based Video Frame Selection for Efficient Video Understanding
Kai Hu, Feng Gao, Xiaohan Nie et al.
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
Fan-Yun Sun, Weiyu Liu, Siyi Gu et al.
SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction
Conghao Wong, Beihao Xia, Ziqian Zou et al.
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Linke Ouyang, Yuan Qu, Hongbin Zhou et al.
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints
Mingjie Pan, Jiyao Zhang, Tianshu Wu et al.
Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping
Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti et al.
Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners
Keon Hee Park, Kyungwoo Song, Gyeong-Moon Park
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
Katrin Renz, Long Chen, Elahe Arani et al.
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation
Minghong Cai, Xiaodong Cun, Xiaoyu Li et al.
Point Segment and Count: A Generalized Framework for Object Counting
Zhizhong Huang, Mingliang Dai, Yi Zhang et al.
DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection
Lewei Yao, Renjie Pi, Jianhua Han et al.
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
Zhi Gao, Yuntao Du., Xintong Zhang et al.
Improving Image Restoration through Removing Degradations in Textual Representations
Jingbo Lin, Zhilu Zhang, Yuxiang Wei et al.
LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry
Weirong Chen, Le Chen, Rui Wang et al.
Pixel-level and Semantic-level Adjustable Super-resolution: A Dual-LoRA Approach
Lingchen Sun, Rongyuan Wu, Zhiyuan Ma et al.
GaussianWorld: Gaussian World Model for Streaming 3D Occupancy Prediction
Sicheng Zuo, Wenzhao Zheng, Yuanhui Huang et al.
Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles
Rui Song, Chenwei Liang, Hu Cao et al.
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions
Kai Chen, Yunhao Gou, Runhui Huang et al.
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation
Haotong Lin, Sida Peng, Jingxiao Chen et al.
Accurate Spatial Gene Expression Prediction by Integrating Multi-Resolution Features
Youngmin Chung, Ji Hun Ha, Kyeong Chan Im et al.
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
Lin Li, Haoyan Guan, Jianing Qiu et al.
Bridging Remote Sensors with Multisensor Geospatial Foundation Models
Boran Han, Shuai Zhang, Xingjian Shi et al.
S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data
Xuyang Li, Danfeng Hong, Jocelyn Chanussot
Towards Surveillance Video-and-Language Understanding: New Dataset Baselines and Challenges
Tongtong Yuan, Xuange Zhang, Kun Liu et al.
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
Jianjian Cao, Peng Ye, Shengze Li et al.
LightIt: Illumination Modeling and Control for Diffusion Models
Peter Kocsis, Kalyan Sunkavalli, Julien Philip et al.
Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision
Yi Yu, Xue Yang, Qingyun Li et al.
LEAD: Learning Decomposition for Source-free Universal Domain Adaptation
Sanqing Qu, Tianpei Zou, Lianghua He et al.
Posterior Distillation Sampling
Juil Koo, Chanho Park, Minhyuk Sung
MET3R: Measuring Multi-View Consistency in Generated Images
Mohammad Asim, Christopher Wewer, Thomas Wimmer et al.
DAVE - A Detect-and-Verify Paradigm for Low-Shot Counting
Jer Pelhan, Alan Lukezic, Vitjan Zavrtanik et al.
Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation
Zhipeng Du, Miaojing Shi, Jiankang Deng
Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring
Xin Gao, Tianheng Qiu, Xinyu Zhang et al.
4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations
Wenbo Wang, Hsuan-I Ho, Chen Guo et al.