🧬 Multimodal

Visual Grounding

Localizing objects from text descriptions

190 papers (showing top 100) · 1,703 total citations
Mar '24 – Feb '26: 156 papers
Also includes: visual grounding, referring expression, grounded generation
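For orientation, visual grounding takes a free-form text description plus an image and returns the region (box or mask) the text refers to. Below is a minimal sketch of that task using the open-vocabulary detector OWL-ViT through the Hugging Face transformers API; it does not implement any paper on this list, the checkpoint name, example queries, and score threshold are illustrative, and the post-processing method name can differ across library versions.

import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Load a pretrained open-vocabulary detector (illustrative checkpoint).
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Any image and any free-form object descriptions can serve as queries.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
queries = [["a remote control", "a cat lying on a couch"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits to boxes in pixel coordinates and keep confident detections.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(queries[0][label.item()], [round(v, 1) for v in box.tolist()], round(score.item(), 2))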

Top Papers

#1

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024 · arXiv:2401.12168
550 citations
#2

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Haobin Duan et al.

CVPR 2024 · arXiv:2311.18482
171 citations
#3

RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding

Jihan Yang, Runyu Ding, Weipeng Deng et al.

CVPR 2024 · arXiv:2304.00962
103 citations
#4

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.

CVPR 2024 · arXiv:2402.16846
75 citations
#5

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

CVPR 2024 · arXiv:2312.00878
74 citations
#6

Text2Loc: 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Zifeng Ding et al.

CVPR 2024 · arXiv:2311.15977
54 citations
#7

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.

NeurIPS 2025
53 citations
#8

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan, Jaemin Cho, Elias Stengel-Eskin et al.

ECCV 2024
50 citations
#9

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025 · arXiv:2504.16072
49 citations
#10

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Qianhui Wu, Kanzhi Cheng, Rui Yang et al.

NeurIPS 2025
36 citations
#11

Mono3DVG: 3D Visual Grounding in Monocular Images

Yangfan Zhan, Yuan Yuan, Zhitong Xiong

AAAI 2024 · arXiv:2312.08022
3d visual grounding, monocular images, geometry-text embeddings, depth prediction (+4)
35 citations
#12

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang, Siwei Wen, Zichen Wen et al.

ICCV 2025
32 citations
#13

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Shehan Munasinghe, Hanan Gani, Wenqi Zhu et al.

CVPR 2025
30 citations
#14

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan, Yousong Zhu, Zhiyang Chen et al.

ECCV 2024 · arXiv:2311.14552
object localization, large vision language models, fine-grained object perception, language-prompted localization (+4)
30 citations
#15

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.

ICCV 2025 · arXiv:2411.19325
24 citations
#16

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu, Xilin Wang, Yixuan Li et al.

ICCV 2025 · arXiv:2507.04047
embodied scene understanding, 3d vision-language learning, visual grounding, active perception (+4)
23 citations
#17

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.

ICCV 2025 · arXiv:2504.15485
19 citations
#18

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

Jiangyong Huang, Baoxiong Jia, Yan Wang et al.

CVPR 2025 · arXiv:2503.22420
17 citations
#19

Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors

Haoxuanye Ji, Pengpeng Liang, Erkang Cheng

CVPR 2024 · arXiv:2403.06093
17 citations
#20

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

Eslam Abdelrahman, Mohamed Ayman Mohamed, Mahmoud Ahmed et al.

ICLR 2024 · arXiv:2310.06214
16 citations
#21

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye, Honglin Lin, Leyan Ou et al.

ICCV 2025 · arXiv:2412.17007
16 citations
#22

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Dongjin Kim, Sung Jin Um, Sangmin Lee et al.

CVPR 2024 · arXiv:2403.17420
15 citations
#23

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin, Xidong Peng, Peishan Cong et al.

ECCV 2024 · arXiv:2304.05645
3d visual grounding, large-scale dynamic scenes, multi-modal visual data, natural language descriptions (+3)
13 citations
#24

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Jian Li, Jiedong Zhuang et al.

AAAI 2025 · arXiv:2501.06710
13 citations
#25

Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Changwei Wang, Shunpeng Chen, Yukun Song et al.

AAAI 2025 · arXiv:2504.09881
12 citations
#26

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

Zhenyang Liu, Yikai Wang, Sixiao Zheng et al.

CVPR 2025
12 citations
#27

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen et al.

ECCV 2024 · arXiv:2407.05256
open-vocabulary 3d detection, vision-language models, hierarchical alignment, zero-shot discovery (+2)
12 citations
#28

ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

Zhenxing Zhang, Yaxiong Wang, Lechao Cheng et al.

CVPR 2025 · arXiv:2412.12718
10 citations
#29

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang, Ruohan Dong, Jiayi Ji et al.

ECCV 2024 · arXiv:2407.05352
text-to-image diffusion, phrase-level grounding, panoptic narrative grounding, cross-attention mechanisms (+4)
9 citations
#30

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu et al.

CVPR 2025 · arXiv:2411.14901
9 citations
#31

Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning

Hai-Ming Xu, Qi Chen, Lei Wang et al.

AAAI 2025 · arXiv:2412.10840
9 citations
#32

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Chun Feng, Joy Hsu, Weiyu Liu et al.

CVPR 2024 · arXiv:2404.19696
9 citations
#33

Object-Shot Enhanced Grounding Network for Egocentric Video

Yisen Feng, Haoyu Zhang, Meng Liu et al.

CVPR 2025 · arXiv:2505.04270
7 citations
#34

When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach

Tao Ma, Bing Bai, Haozhe Lin et al.

CVPR 2024
7 citations
#35

Learning to Produce Semi-dense Correspondences for Visual Localization

Khang Truong Giang, Soohwan Song, Sungho Jo

CVPR 2024 · arXiv:2402.08359
6 citations
#36

Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition

Kyle Buettner, Sina Malakouti, Xiang Li et al.

CVPR 2024 · arXiv:2401.01482
6 citations
#37

E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

Peijun Bao, Zihao Shao, Wenhan Yang et al.

ECCV 2024
spatio-temporal video grounding, zero-shot learning, multimodal modulation, visual-language models (+4)
6 citations
#38

AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring

Xinyi Wang, Na Zhao, Zhiyuan Han et al.

AAAI 2025 · arXiv:2501.09428
6 citations
#39

CityAnchor: City-scale 3D Visual Grounding with Multi-modality LLMs

Jinpeng Li, Haiping Wang, Jiabin Chen et al.

ICLR 2025
5 citations
#40

DiaLoc: An Iterative Approach to Embodied Dialog Localization

Chao Zhang, Mohan Li, Ignas Budvytis et al.

CVPR 2024 · arXiv:2403.06846
5 citations
#41

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Sung Jin Um, Dongjin Kim, Sangmin Lee et al.

CVPR 2025 · arXiv:2506.18557
sound source localization, audio-visual correspondence, multimodal large language models, object-aware contrastive alignment (+2)
5 citations
#42

Fine-grained Spatiotemporal Grounding on Egocentric Videos

Shuo Liang, Yiwu Zhong, Zi-Yuan Hu et al.

ICCV 2025 · arXiv:2508.00518
spatiotemporal video grounding, egocentric video understanding, pixel-level benchmark, automatic annotation pipeline (+4)
5 citations
#43

GAP: Gaussianize Any Point Clouds with Text Guidance

Weiqi Zhang, Junsheng Zhou, Haotian Geng et al.

ICCV 2025
4 citations
#44

Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding

Sai Wang, Yutian Lin, Yu Wu

CVPR 2024
4 citations
#45

Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Lingdong Kong, Dongyue Lu, Alan Liang et al.

NeurIPS 2025
4 citations
#46

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Ming Dai, Wenxuan Cheng, Jiedong Zhuang et al.

ICCV 2025 · arXiv:2509.04833
3 citations
#47

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Han Xiao, Yina Xie, Guanxin Tan et al.

CVPR 2025
3 citations
#48

Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding

Atharv Mahesh Mane, Dulanga Weerakoon, Vigneshwaran Subbaraju et al.

CVPR 2025 · arXiv:2504.09623
3 citations
#49

Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description

Mahmoud Ahmed, Junjie Fei, Jian Ding et al.

ICCV 2025 · arXiv:2405.18937
3 citations
#50

Multi-Scale Contrastive Learning for Video Temporal Grounding

Thong Thanh Nguyen, Yi Bin, Xiaobao Wu et al.

AAAI 2025 · arXiv:2412.07157
3 citations
#51

Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

Xiaoyang Liu, Boran Wen, Xinpeng Liu et al.

AAAI 2025 · arXiv:2412.19542
3 citations
#52

Re-Aligning Language to Visual Objects with an Agentic Workflow

Yuming Chen, Jiangyan Feng, Haodong Zhang et al.

ICLR 2025 · arXiv:2503.23508
3 citations
#53

Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation

Yifei Su, Dong An, Kehan Chen et al.

AAAI 2025
2 citations
#54

ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Mingda Jia, Liming Zhao, Ge Li et al.

AAAI 2025 · arXiv:2412.09050
2 citations
#55

Teaching VLMs to Localize Specific Objects from In-context Examples

Sivan Doveh, Nimrod Shabtay, Eli Schwartz et al.

ICCV 2025 · arXiv:2411.13317
2 citations
#56

GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Zijun Lin, Shuting He, Cheston Tan et al.

ICCV 2025 · arXiv:2506.21188
2 citations
#57

Multi-Task Domain Adaptation for Language Grounding with 3D Objects

Penglei Sun, Yaoxian Song, Xinglin Pan et al.

ECCV 2024 · arXiv:2407.02846
language grounding, 3d object understanding, domain adaptation, vision-language alignment (+4)
2 citations
#58

GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model

Yue Han, Jiangning Zhang, Junwei Zhu et al.

CVPR 2025
1 citation
#59

GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Shunsuke Yasuki, Taiki Miyanishi, Nakamasa Inoue et al.

ICCV 2025 · arXiv:2506.23352
3d language fields, city-scale 3d scenes, compositional visual reasoning, geographic vision apis (+4)
1 citation
#60

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Tommaso Galliena, Tommaso Apicella, Stefano Rosa et al.

ICCV 2025 · arXiv:2504.08531
1 citation
#61

PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson et al.

ICCV 2025
1 citation
#62

Visual Intention Grounding for Egocentric Assistants

Pengzhan Sun, Junbin Xiao, Tze Ho Elden Tse et al.

ICCV 2025
1 citation
#63

Visual Consensus Prompting for Co-Salient Object Detection

Jie Wang, Nana Yu, Zihao Zhang et al.

CVPR 2025
citations not collected
#64

Interpreting Object-level Foundation Models via Visual Precision Search

Ruoyu Chen, Siyuan Liang, Jingzhi Li et al.

CVPR 2025
citations not collected
#65

Locality-Aware Zero-Shot Human-Object Interaction Detection

Sanghyun Kim, Deunsol Jung, Minsu Cho

CVPR 2025 · arXiv:2505.19503
zero-shot learning, human-object interaction detection, vision-language models, clip representations (+3)
citations not collected
#66

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Rong Li, Shijie Li, Lingdong Kong et al.

CVPR 2025 · arXiv:2412.04383
3d visual grounding, zero-shot learning, vision-language models, 3d scene representation (+4)
citations not collected
#67

Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

Wenxuan Guo, Xiuwei Xu, Ziwei Wang et al.

CVPR 2025
citations not collected
#68

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

Dahyun Kang, Minsu Cho

ECCV 2024
citations not collected
#69

GOAL: Global-local Object Alignment Learning

Hyungyu Choi, Young Kyun Jang, Chanho Eom

CVPR 2025
citations not collected
#70

Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?

Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler et al.

CVPR 2025
citations not collected
#71

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Haifeng Huang, Xinyi Chen, Yilun Chen et al.

CVPR 2025 · arXiv:2504.21530
robotic manipulation, grounding masks, vision-language models, intermediate representations (+3)
citations not collected
#72

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

He Zhu, Quyu Kong, Kechun Xu et al.

CVPR 2025 · arXiv:2504.04744
3d affordance grounding, language-guided perception, vision-language models, partial-view observation (+4)
citations not collected
#73

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

Ruotian Peng, Haiying He, Yake Wei et al.

CVPR 2025
citations not collected
#74

ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction

Yuejiao Su, Yi Wang, Qiongyang Hu et al.

CVPR 2025 · arXiv:2504.01472
egocentric interaction perception, pixel grounding, multimodal large language models, scene understanding (+4)
citations not collected
#75

Language-Guided Salient Object Ranking

Fang Liu, Yuhao Liu, Ke Xu et al.

CVPR 2025
citations not collected
#76

VideoGEM: Training-free Action Grounding in Videos

Felix Vogel, Walid Bousselham, Anna Kukleva et al.

CVPR 2025 · arXiv:2503.20348
action grounding, vision-language models, zero-shot localization, spatial video grounding (+4)
citations not collected
#77

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

CVPR 2025 · arXiv:2503.06287
visual grounding, attention heads, vision-language models, multimodal capabilities (+3)
citations not collected
#78

Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes

Hyeonggon Ryu, Seongyu Kim, Joon Chung et al.

CVPR 2025
citations not collected
#79

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

Weitai Kang, Gaowen Liu, Mubarak Shah et al.

ECCV 2024 · arXiv:2407.03200
visual grounding, segmentation supervision, bounding box regression, multi-task encoder-decoder (+3)
citations not collected
#80

GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding

Yawen Shao, Wei Zhai, Yuhang Yang et al.

CVPR 2025
citations not collected
#81

Towards Open-Vocabulary Audio-Visual Event Localization

Jinxing Zhou, Dan Guo, Ruohao Guo et al.

CVPR 2025
citations not collected
#82

Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection

Marvin Burges, Philipe Dias, Dalton Lunga et al.

ICCV 2025
citations not collected
#83

VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding

Zhu Yihang, Jinhao Zhang, Yuxuan Wang et al.

ICCV 2025
citations not collected
#84

Visual Modality Prompt for Adapting Vision-Language Object Detectors

Heitor Rapela Medeiros, Atif Belal, Srikanth Muralidharan et al.

ICCV 2025 · arXiv:2412.00622
vision-language object detection, modality adaptation, visual prompt strategy, zero-shot capability (+4)
citations not collected
#85

AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Peizheng Li, Shuxiao Ding, You Zhou et al.

ICCV 2025
citations not collected
#86

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Yunqiu Xu, Linchao Zhu, Yi Yang

ICCV 2025
citations not collected
#87

LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Achint Soni, Meet Soni, Sirisha Rambhatla

ICCV 2025
citations not collected
#88

ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

Tianming Liang, Kun-Yu Lin, Chaolei Tan et al.

ICCV 2025
citations not collected
#89

Enhancing Zero-shot Object Counting via Text-guided Local Ranking and Number-evoked Global Attention

Shiwei Zhang, Qi Zhou, Wei Ke

ICCV 2025
citations not collected
#90

ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

Chandan Yeshwanth, David Rozenberszki, Angela Dai

ICCV 2025 · arXiv:2503.17044
citations not collected
#91

How Can Objects Help Video-Language Understanding?

Zitian Tang, Shijie Wang, Junho Cho et al.

ICCV 2025 · arXiv:2504.07454
video-language understanding, multimodal large language models, object-centric representation, video question answering (+2)
citations not collected
#92

Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding

Shuyi Ouyang, Ziwei Niu, Hongyi Wang et al.

ICCV 2025
citations not collected
#93

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

Shraman Pramanick, Effrosyni Mavroudi, Yale Song et al.

ICCV 2025 · arXiv:2510.17023
citations not collected
#94

NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals

Jiro Abe, Gaku Nakano, Kazumine Ogura

ICCV 2025
citations not collected
#95

LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment

Juelin Zhu, Shuaibang Peng, Long Wang et al.

ICCV 2025 · arXiv:2507.00659
aerial visual localization, low level-of-detail models, silhouette alignment, building segmentation (+4)
citations not collected
#96

Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance

Mingfang Zhang, Ryo Yonetani, Yifei Huang et al.

ICCV 2025
citations not collected
#97

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Siming Yan, Min Bai, Weifeng Chen et al.

ECCV 2024
citations not collected
#98

Object-centric Video Question Answering with Visual Grounding and Referring

Haochen Wang, Qirui Chen, Cilin Yan et al.

ICCV 2025
citations not collected
#99

DOGR: Towards Versatile Visual Document Grounding and Referring

Yinan Zhou, Yuxin Chen, Haokun Lin et al.

ICCV 2025
citations not collected
#100

Visual Textualization for Image Prompted Object Detection

Yongjian Wu, Yang Zhou, Jiya Saiyin et al.

ICCV 2025
citations not collected