🧬 Multimodal

Visual Grounding

Localizing objects from text descriptions

100 papers · 3,458 total citations
[Chart: papers over time, Feb '24 – Jan '26 — 668 papers]
Also includes: visual grounding, referring expression, grounded generation

Top Papers

#1

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024
550
citations
#2

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Haobin Duan et al.

CVPR 2024
171
citations
#3

SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery

Konstantin Klemmer, Esther Rolf, Caleb Robinson et al.

AAAI 2025
137
citations
#4

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Hao Zhang, Hongyang Li, Feng Li et al.

ECCV 2024 · arXiv:2312.02949
Keywords: grounded visual chat, large multimodal models, visual grounding, segmentation models (+3)
114
citations
#5

RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding

Jihan Yang, Runyu Ding, Weipeng DENG et al.

CVPR 2024
103
citations
#6

PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment

Tianchen Deng, Guole Shen, Tong Qin et al.

CVPR 2024
101
citations
#7

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

Weiyun Wang, Yiming Ren, Haowen Luo et al.

ECCV 2024
86
citations
#8

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.

CVPR 2024
75
citations
#9

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

CVPR 2024
74
citations
#10

Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction

Junuk Cha, Jihyeon Kim, Jae Shin Yoon et al.

CVPR 2024
58
citations
#11

Text2Loc: 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Zifeng Ding et al.

CVPR 2024
54
citations
#12

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.

NeurIPS 2025
53
citations
#13

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang et al.

ICLR 2025 · arXiv:2410.07087
Keywords: vision-language navigation, UAV navigation, trajectory generation, multimodal understanding (+4)
52
citations
#14

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan, Jaemin Cho, Elias Stengel-Eskin et al.

ECCV 2024
50
citations
#15

Discovering and Mitigating Visual Biases through Keyword Explanation

Younghyun Kim, Sangwoo Mo, Minkyu Kim et al.

CVPR 2024
50
citations
#16

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025
49
citations
#17

SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments

Shibo Zhao, Yuanjun Gao, Tianhao Wu et al.

CVPR 2024
49
citations
#18

HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios

HyunJun Jung, Shun-Cheng Wu, Patrick Ruhkamp et al.

CVPR 2024
42
citations
#19

Human-Object Interaction from Human-Level Instructions

Zhen Wu, Jiaman Li, Pei Xu et al.

ICCV 2025
36
citations
#20

DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

Jiuming Liu, Dong Zhuo, Zhiheng Feng et al.

ECCV 2024 · arXiv:2403.18274
Keywords: visual-LiDAR fusion, odometry estimation, structure alignment, local-to-global fusion (+3)
36
citations
#21

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Yun Li, Yiming Zhang, Tao Lin et al.

ICCV 2025
36
citations
#22

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Qianhui Wu, Kanzhi Cheng, Rui Yang et al.

NeurIPS 2025
36
citations
#23

Mono3DVG: 3D Visual Grounding in Monocular Images

Yangfan Zhan, Yuan Yuan, Zhitong Xiong

AAAI 2024 · arXiv:2312.08022
Keywords: 3D visual grounding, monocular images, geometry-text embeddings, depth prediction (+4)
35
citations
#24

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Zekun Qi, Wenyao Zhang, Yufei Ding et al.

NeurIPS 2025 · arXiv:2502.13143
Keywords: semantic orientation, 6-DoF manipulation, spatial reasoning, zero-shot prediction (+4)
33
citations
#25

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang, Siwei Wen, Zichen Wen et al.

ICCV 2025
32
citations
#26

OpenStreetView-5M: The Many Roads to Global Visual Geolocation

Guillaume Astruc, Nicolas Dufour, Ioannis Siglidis et al.

CVPR 2024
32
citations
#27

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

CVPR 2025 · arXiv:2503.06287
Keywords: visual grounding, attention heads, vision-language models, multimodal capabilities (+3)
31
citations
#28

Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

Junyan Ye, Zhutao Lv, Weijia Li et al.

ECCV 2024
31
citations
#29

Open-World Human-Object Interaction Detection via Multi-modal Prompts

Jie Yang, Bingliang Li, Ailing Zeng et al.

CVPR 2024
31
citations
#30

Visual Agentic AI for Spatial Reasoning with a Dynamic API

Damiano Marsili, Rohun Agrawal, Yisong Yue et al.

CVPR 2025
30
citations
#31

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Shehan Munasinghe, Hanan Gani, Wenqi Zhu et al.

CVPR 2025
30
citations
#32

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Jianing "Jed" Yang, Xuweiyi Chen, Nikhil Madaan et al.

CVPR 2025
30
citations
#33

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan, Yousong Zhu, Zhiyang Chen et al.

ECCV 2024
30
citations
#34

Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding

Hoang-Quan Nguyen, Thanh-Dat Truong, Xuan-Bac Nguyen et al.

CVPR 2024
29
citations
#35

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Ozan Unal, Christos Sakaridis, Suman Saha et al.

ECCV 2024 · arXiv:2309.04561
Keywords: 3D visual grounding, dense 3D grounding, instance segmentation, verbo-visual fusion (+4)
29
citations
#36

UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence

Ruihai Wu, Haoran Lu, Yiyan Wang et al.

CVPR 2024
29
citations
#37

DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding

Geng Li, Jinglin Xu, Yunzhen Zhao et al.

CVPR 2025
28
citations
#38

Improved baselines for vision-language pre-training

Jakob Verbeek, Enrico Fini, Michal Drozdzal et al.

ICLR 2024
26
citations
#39

CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu, Jintong Li, Yicheng Jiang et al.

CVPR 2025
25
citations
#40

CityNav: A Large-Scale Dataset for Real-World Aerial Navigation

Jungdae Lee, Taiki Miyanishi, Shuhei Kurita et al.

ICCV 2025
25
citations
#41

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.

ICCV 2025
24
citations
#42

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu, Xilin Wang, Yixuan Li et al.

ICCV 2025 · arXiv:2507.04047
Keywords: embodied scene understanding, 3D vision-language learning, visual grounding, active perception (+4)
24
citations
#43

MANUS: Markerless Grasp Capture using Articulated 3D Gaussians

Chandradeep Pokhariya, Ishaan Shah, Angela Xing et al.

CVPR 2024
23
citations
#44

Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang et al.

CVPR 2025
23
citations
#45

Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection

Le Yang, Ziwei Zheng, Boxu Chen et al.

CVPR 2025
22
citations
#46

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Razvan Pasca, Alexey Gavryushin, Muhammad Hamza et al.

CVPR 2024
22
citations
#47

GEARS: Local Geometry-aware Hand-object Interaction Synthesis

Keyang Zhou, Bharat Lal Bhatnagar, Jan Lenssen et al.

CVPR 2024
22
citations
#48

LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction

Er Jin, Qihui Feng, Yongli Mou et al.

AAAI 2025
20
citations
#49

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Minghang Zheng, Xinhao Cai, Qingchao Chen et al.

ECCV 2024 · arXiv:2408.16219
Keywords: video temporal grounding, vision-language models, large language models, zero-shot learning (+3)
20
citations
#50

Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination

Qingwang Zhang, Yingying Zhu

AAAI 2024
20
citations
#51

360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries

Huajian Huang, Changkun Liu, Yipeng Zhu et al.

CVPR 2024
20
citations
#52

You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain et al.

CVPR 2024
19
citations
#53

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Model

Yue Zhang, Zhiyang Xu, Ying Shen et al.

ICLR 2025 · arXiv:2410.03878
Keywords: 3D scene understanding, large language models, spatial reasoning, 3D visual representations (+3)
19
citations
#54

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.

ICCV 2025
19
citations
#55

ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Li Mi, Chang Xu, Javiera Castillo Navarro et al.

ECCV 2024 · arXiv:2403.13965
Keywords: cross-view geo-localization, ground view variations, orientation invariance, field-of-view resilience (+3)
19
citations
#56

Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes

Zhi Cai, Yingjie Gao, Yaoyan Zheng et al.

ECCV 2024
18
citations
#57

Dense Audio-Visual Event Localization Under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration

Ziheng Zhou, Jinxing Zhou, Wei Qian et al.

AAAI 2025
18
citations
#58

Zero-Shot Aerial Object Detection with Visual Description Regularization

Chenyu Lin, Zhengqing Zang, Chenwei Tang et al.

AAAI 2024 · arXiv:2402.18233
Keywords: zero-shot detection, aerial object detection, visual description regularization, semantic-visual correlation (+4)
18
citations
#59

EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

Issar Tzachor, Boaz Lerner, Matan Levy et al.

ICLR 2025
18
citations
#60

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

Jiangyong Huang, Baoxiong Jia, Yan Wang et al.

CVPR 2025
17
citations
#61

Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Tien Toan Nguyen, Minh Nhat Vu, Baoru Huang et al.

ECCV 2024 · arXiv:2407.13842
Keywords: 6-DoF grasp detection, language-driven robotics, point cloud processing, negative prompt guidance (+4)
17
citations
#62

LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation

Yuchen Su, Zhineng Chen, Zhiwen Shao et al.

AAAI 2024 · arXiv:2306.15142
Keywords: scene text detection, low-rank approximation, arbitrary-shaped text, shape representation (+4)
17
citations
#63

Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors

Haoxuanye Ji, Pengpeng Liang, Erkang Cheng

CVPR 2024
17
citations
#64

Motion Prior Knowledge Learning with Homogeneous Language Descriptions for Moving Infrared Small Target Detection

Shengjia Chen, Luping Ji, Weiwei Duan et al.

AAAI 2025
16
citations
#65

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye, Honglin Lin, Leyan Ou et al.

ICCV 2025
16
citations
#66

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

Eslam Abdelrahman, Mohamed Ayman Mohamed, Mahmoud Ahmed et al.

ICLR 2024
16
citations
#67

Grounded Object-Centric Learning

Avinash Kori, Francesco Locatello, Fabio De Sousa Ribeiro et al.

ICLR 2024
16
citations
#68

Global-Local Tree Search in VLMs for 3D Indoor Scene Generation

Wei Deng, Mengshi Qi, Huadong Ma

CVPR 2025
16
citations
#69

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Nicolas Dufour, Vicky Kalogeiton, David Picard et al.

CVPR 2025 · arXiv:2412.06781
Keywords: visual geolocation, generative geolocation, diffusion models, Riemannian flow matching (+3)
16
citations
#70

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Haifeng Huang, Xinyi Chen, Yilun Chen et al.

CVPR 2025 · arXiv:2504.21530
Keywords: robotic manipulation, grounding masks, vision-language models, intermediate representations (+3)
15
citations
#71

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Dongjin Kim, Sung Jin Um, Sangmin Lee et al.

CVPR 2024
15
citations
#72

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Hao Tang, Chen-Wei Xie, Haiyang Wang et al.

NeurIPS 2025 · arXiv:2503.01342
Keywords: fine-grained visual perception, open-ended language interface, instance segmentation, semantic segmentation (+4)
14
citations
#73

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Yifu Chen, Jingwen Chen, Yingwei Pan et al.

ECCV 2024
14
citations
#74

MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects

Lei Fan, Dongdong Fan, Zhiguang Hu et al.

CVPR 2025
14
citations
#75

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Yuncong Yang, Jiageng Liu, Zheyuan Zhang et al.

NeurIPS 2025
13
citations
#76

F3Loc: Fusion and Filtering for Floorplan Localization

Changan Chen, Rui Wang, Christoph Vogel et al.

CVPR 2024
13
citations
#77

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin, Xidong Peng, Peishan Cong et al.

ECCV 2024 · arXiv:2304.05645
Keywords: 3D visual grounding, large-scale dynamic scenes, multi-modal visual data, natural language descriptions (+3)
13
citations
#78

Where am I? Scene Retrieval with Language

Jiaqi Chen, Daniel Barath, Iro Armeni et al.

ECCV 2024 · arXiv:2404.14565
Keywords: scene retrieval, 3D scene graphs, natural language interfaces, embodied AI (+3)
13
citations
#79

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Jian Li, Jiedong Zhuang et al.

AAAI 2025
13
citations
#80

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Tingbing Yan, Wenzheng Zeng, Yang Xiao et al.

ECCV 2024
12
citations
#81

Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth

Zimin Xia, Yujiao Shi, Hongdong Li et al.

ECCV 2024 · arXiv:2406.00474
Keywords: cross-view localization, weakly supervised learning, knowledge self-distillation, pseudo ground truth (+3)
12
citations
#82

ChEX: Interactive Localization and Region Description in Chest X-rays

Philip Müller, Georgios Kaissis, Daniel Rueckert

ECCV 2024
12
citations
#83

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen et al.

ECCV 2024
12
citations
#84

Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Changwei Wang, Shunpeng Chen, Yukun Song et al.

AAAI 2025
12
citations
#85

LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes

Juliette Marrie, Romain Menegaux, Michael Arbel et al.

ICCV 2025
12
citations
#86

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Tanveer Hannan, Mohaiminul Islam, Thomas Seidl et al.

ECCV 2024
12
citations
#87

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

Zhenyang Liu, Yikai Wang, Sixiao Zheng et al.

CVPR 2025
12
citations
#88

Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy

Hong Zhang, Yixuan Lyu, Qian Yu et al.

ECCV 2024
11
citations
#89

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper

Xinyue Zhu, Binghao Huang, Yunzhu Li

NeurIPS 2025
11
citations
#90

OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

Anindya Mondal, Sauradip Nag, Xiatian Zhu et al.

AAAI 2025
11
citations
#91

ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

Zhenxing Zhang, Yaxiong Wang, Lechao Cheng et al.

CVPR 2025
10
citations
#92

Visual Test-time Scaling for GUI Agent Grounding

Tiange Luo, Lajanugen Logeswaran, Justin Johnson et al.

ICCV 2025
10
citations
#93

LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation

Pengwei Yin, Jingjing Wang, Guanzhong Zeng et al.

ECCV 2024
9
citations
#94

3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

Xiaoxu Xu, Yitian Yuan, Jinlong Li et al.

ECCV 2024
9
citations
#95

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu et al.

CVPR 2025
9
citations
#96

AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting

Xiaoyu Zhou, Jingqi Wang, Yongtao Wang et al.

ICCV 2025
9
citations
#97

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang, Ruohan Dong, Jiayi Ji et al.

ECCV 2024 · arXiv:2407.05352
Keywords: text-to-image diffusion, phrase-level grounding, panoptic narrative grounding, cross-attention mechanisms (+4)
9
citations
#98

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Chun Feng, Joy Hsu, Weiyu Liu et al.

CVPR 2024
9
citations
#99

Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning

Hai-Ming Xu, Qi Chen, Lei Wang et al.

AAAI 2025
9
citations
#100

O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

Muer Tie, Julong Wei, Zhengjun Wang et al.

ECCV 2024
9
citations