🧬 Multimodal

Visual Grounding

Localizing objects from text descriptions

190 papers (showing top 100) · 1,703 total citations
Mar '24 – Feb '26: 156 papers
Also includes: visual grounding, referring expression, grounded generation
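For orientation, visual grounding takes a free-form text description plus an image and returns the region (box or mask) the text refers to. Below is a minimal sketch of that task using the open-vocabulary detector OWL-ViT through the Hugging Face transformers API; it does not implement any paper on this list, the checkpoint name, example queries, and score threshold are illustrative, and the post-processing method name can differ across library versions.

import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Load a pretrained open-vocabulary detector (illustrative checkpoint).
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Any image and any free-form object descriptions can serve as queries.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
queries = [["a remote control", "a cat lying on a couch"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits to boxes in pixel coordinates and keep confident detections.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(queries[0][label.item()], [round(v, 1) for v in box.tolist()], round(score.item(), 2))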

Top Papers

#1

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024 · arXiv:2401.12168
550 citations
#2

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Haobin Duan et al.

CVPR 2024 · arXiv:2311.18482
171 citations
#3

RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding

Jihan Yang, Runyu Ding, Weipeng Deng et al.

CVPR 2024 · arXiv:2304.00962
103 citations
#4

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.

CVPR 2024 · arXiv:2402.16846
75 citations
#5

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

CVPR 2024 · arXiv:2312.00878
74 citations
#6

Text2Loc: 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Zifeng Ding et al.

CVPR 2024 · arXiv:2311.15977
54 citations
#7

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.

NeurIPS 2025
53 citations
#8

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan, Jaemin Cho, Elias Stengel-Eskin et al.

ECCV 2024
50 citations
#9

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025 · arXiv:2504.16072
49 citations
#10

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Qianhui Wu, Kanzhi Cheng, Rui Yang et al.

NeurIPS 2025
36 citations
#11

Mono3DVG: 3D Visual Grounding in Monocular Images

Yangfan Zhan, Yuan Yuan, Zhitong Xiong

AAAI 2024 · arXiv:2312.08022
3d visual grounding, monocular images, geometry-text embeddings, depth prediction (+4)
35 citations
#12

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang, Siwei Wen, Zichen Wen et al.

ICCV 2025
32 citations
#13

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Shehan Munasinghe, Hanan Gani, Wenqi Zhu et al.

CVPR 2025
30 citations
#14

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan, Yousong Zhu, Zhiyang Chen et al.

ECCV 2024 · arXiv:2311.14552
object localization, large vision language models, fine-grained object perception, language-prompted localization (+4)
30 citations
#15

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.

ICCV 2025 · arXiv:2411.19325
24 citations
#16

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu, Xilin Wang, Yixuan Li et al.

ICCV 2025 · arXiv:2507.04047
embodied scene understanding, 3d vision-language learning, visual grounding, active perception (+4)
23 citations
#17

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.

ICCV 2025 · arXiv:2504.15485
19 citations
#18

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

Jiangyong Huang, Baoxiong Jia, Yan Wang et al.

CVPR 2025 · arXiv:2503.22420
17 citations
#19

Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors

Haoxuanye Ji, Pengpeng Liang, Erkang Cheng

CVPR 2024 · arXiv:2403.06093
17 citations
#20

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

Eslam Abdelrahman, Mohamed Ayman Mohamed, Mahmoud Ahmed et al.

ICLR 2024 · arXiv:2310.06214
16 citations
#21

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye, Honglin Lin, Leyan Ou et al.

ICCV 2025 · arXiv:2412.17007
16 citations
#22

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Dongjin Kim, Sung Jin Um, Sangmin Lee et al.

CVPR 2024 · arXiv:2403.17420
15 citations
#23

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin, Xidong Peng, Peishan Cong et al.

ECCV 2024 · arXiv:2304.05645
3d visual grounding, large-scale dynamic scenes, multi-modal visual data, natural language descriptions (+3)
13 citations
#24

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Jian Li, Jiedong Zhuang et al.

AAAI 2025 · arXiv:2501.06710
13 citations
#25

Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Changwei Wang, Shunpeng Chen, Yukun Song et al.

AAAI 2025 · arXiv:2504.09881
12 citations
#26

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

Zhenyang Liu, Yikai Wang, Sixiao Zheng et al.

CVPR 2025
12 citations
#27

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen et al.

ECCV 2024 · arXiv:2407.05256
open-vocabulary 3d detection, vision-language models, hierarchical alignment, zero-shot discovery (+2)
12 citations
#28

ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

Zhenxing Zhang, Yaxiong Wang, Lechao Cheng et al.

CVPR 2025 · arXiv:2412.12718
10 citations
#29

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang, Ruohan Dong, Jiayi Ji et al.

ECCV 2024 · arXiv:2407.05352
text-to-image diffusion, phrase-level grounding, panoptic narrative grounding, cross-attention mechanisms (+4)
9 citations
#30

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu et al.

CVPR 2025 · arXiv:2411.14901
9 citations
#31

Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning

Hai-Ming Xu, Qi Chen, Lei Wang et al.

AAAI 2025 · arXiv:2412.10840
9 citations
#32

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Chun Feng, Joy Hsu, Weiyu Liu et al.

CVPR 2024 · arXiv:2404.19696
9 citations
#33

Object-Shot Enhanced Grounding Network for Egocentric Video

Yisen Feng, Haoyu Zhang, Meng Liu et al.

CVPR 2025 · arXiv:2505.04270
7 citations
#34

When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach

Tao Ma, Bing Bai, Haozhe Lin et al.

CVPR 2024
7 citations
#35

Learning to Produce Semi-dense Correspondences for Visual Localization

Khang Truong Giang, Soohwan Song, Sungho Jo

CVPR 2024 · arXiv:2402.08359
6 citations
#36

Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition

Kyle Buettner, Sina Malakouti, Xiang Li et al.

CVPR 2024 · arXiv:2401.01482
6 citations
#37

E3M: Zero-Shot Spatio-Temporal Video Grounding with Expectation-Maximization Multimodal Modulation

Peijun Bao, Zihao Shao, Wenhan Yang et al.

ECCV 2024
spatio-temporal video grounding, zero-shot learning, multimodal modulation, visual-language models (+4)
6 citations
#38

AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring

Xinyi Wang, Na Zhao, Zhiyuan Han et al.

AAAI 2025 · arXiv:2501.09428
6 citations
#39

CityAnchor: City-scale 3D Visual Grounding with Multi-modality LLMs

Jinpeng Li, Haiping Wang, Jiabin Chen et al.

ICLR 2025
5 citations
#40

DiaLoc: An Iterative Approach to Embodied Dialog Localization

Chao Zhang, Mohan Li, Ignas Budvytis et al.

CVPR 2024 · arXiv:2403.06846
5 citations
#41

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Sung Jin Um, Dongjin Kim, Sangmin Lee et al.

CVPR 2025 · arXiv:2506.18557
sound source localization, audio-visual correspondence, multimodal large language models, object-aware contrastive alignment (+2)
5 citations
#42

Fine-grained Spatiotemporal Grounding on Egocentric Videos

Shuo Liang, Yiwu Zhong, Zi-Yuan Hu et al.

ICCV 2025 · arXiv:2508.00518
spatiotemporal video grounding, egocentric video understanding, pixel-level benchmark, automatic annotation pipeline (+4)
5 citations
#43

GAP: Gaussianize Any Point Clouds with Text Guidance

Weiqi Zhang, Junsheng Zhou, Haotian Geng et al.

ICCV 2025
4 citations
#44

Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding

Sai Wang, Yutian Lin, Yu Wu

CVPR 2024
4 citations
#45

Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras

Lingdong Kong, Dongyue Lu, Alan Liang et al.

NeurIPS 2025
4 citations
#46

PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Ming Dai, Wenxuan Cheng, Jiedong Zhuang et al.

ICCV 2025 · arXiv:2509.04833
3 citations
#47

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Han Xiao, Yina Xie, Guanxin Tan et al.

CVPR 2025
3 citations
#48

Ges3ViG: Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding

Atharv Mahesh Mane, Dulanga Weerakoon, Vigneshwaran Subbaraju et al.

CVPR 2025 · arXiv:2504.09623
3 citations
#49

Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description

Mahmoud Ahmed, Junjie Fei, Jian Ding et al.

ICCV 2025 · arXiv:2405.18937
3 citations
#50

Multi-Scale Contrastive Learning for Video Temporal Grounding

Thong Thanh Nguyen, Yi Bin, Xiaobao Wu et al.

AAAI 2025 · arXiv:2412.07157
3 citations
#51

Interacted Object Grounding in Spatio-Temporal Human-Object Interactions

Xiaoyang Liu, Boran Wen, Xinpeng Liu et al.

AAAI 2025 · arXiv:2412.19542
3 citations
#52

Re-Aligning Language to Visual Objects with an Agentic Workflow

Yuming Chen, Jiangyan Feng, Haodong Zhang et al.

ICLR 2025 · arXiv:2503.23508
3 citations
#53

Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation

Yifei Su, Dong An, Kehan Chen et al.

AAAI 2025
2 citations
#54

ContextHOI: Spatial Context Learning for Human-Object Interaction Detection

Mingda Jia, Liming Zhao, Ge Li et al.

AAAI 2025 · arXiv:2412.09050
2 citations
#55

Teaching VLMs to Localize Specific Objects from In-context Examples

Sivan Doveh, Nimrod Shabtay, Eli Schwartz et al.

ICCV 2025 · arXiv:2411.13317
2 citations
#56

GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Zijun Lin, Shuting He, Cheston Tan et al.

ICCV 2025 · arXiv:2506.21188
2 citations
#57

Multi-Task Domain Adaptation for Language Grounding with 3D Objects

Penglei Sun, Yaoxian Song, Xinglin Pan et al.

ECCV 2024 · arXiv:2407.02846
language grounding, 3d object understanding, domain adaptation, vision-language alignment (+4)
2 citations
#58

GroundingFace: Fine-grained Face Understanding via Pixel Grounding Multimodal Large Language Model

Yue Han, Jiangning Zhang, Junwei Zhu et al.

CVPR 2025
1 citation
#59

GeoProg3D: Compositional Visual Reasoning for City-Scale 3D Language Fields

Shunsuke Yasuki, Taiki Miyanishi, Nakamasa Inoue et al.

ICCV 2025 · arXiv:2506.23352
3d language fields, city-scale 3d scenes, compositional visual reasoning, geographic vision apis (+4)
1 citation
#60

Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions

Tommaso Galliena, Tommaso Apicella, Stefano Rosa et al.

ICCV 2025 · arXiv:2504.08531
1 citation
#61

PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Ahmed Abdelreheem, Filippo Aleotti, Jamie Watson et al.

ICCV 2025
1 citation
#62

Visual Intention Grounding for Egocentric Assistants

Pengzhan Sun, Junbin Xiao, Tze Ho Elden Tse et al.

ICCV 2025
1 citation
#63

Visual Consensus Prompting for Co-Salient Object Detection

Jie Wang, Nana Yu, Zihao Zhang et al.

CVPR 2025
citations not collected
#64

Interpreting Object-level Foundation Models via Visual Precision Search

Ruoyu Chen, Siyuan Liang, Jingzhi Li et al.

CVPR 2025
citations not collected
#65

Locality-Aware Zero-Shot Human-Object Interaction Detection

Sanghyun Kim, Deunsol Jung, Minsu Cho

CVPR 2025 · arXiv:2505.19503
zero-shot learning, human-object interaction detection, vision-language models, clip representations (+3)
citations not collected
#66

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

Rong Li, Shijie Li, Lingdong Kong et al.

CVPR 2025 · arXiv:2412.04383
3d visual grounding, zero-shot learning, vision-language models, 3d scene representation (+4)
citations not collected
#67

Text-guided Sparse Voxel Pruning for Efficient 3D Visual Grounding

Wenxuan Guo, Xiuwei Xu, Ziwei Wang et al.

CVPR 2025
citations not collected
#68

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

Dahyun Kang, Minsu Cho

ECCV 2024
citations not collected
#69

GOAL: Global-local Object Alignment Learning

Hyungyu Choi, Young Kyun Jang, Chanho Eom

CVPR 2025
citations not collected
#70

Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?

Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler et al.

CVPR 2025
citations not collected
#71

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Haifeng Huang, Xinyi Chen, Yilun Chen et al.

CVPR 2025 · arXiv:2504.21530
robotic manipulation, grounding masks, vision-language models, intermediate representations (+3)
citations not collected
#72

Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions

He Zhu, Quyu Kong, Kechun Xu et al.

CVPR 2025 · arXiv:2504.04744
3d affordance grounding, language-guided perception, vision-language models, partial-view observation (+4)
citations not collected
#73

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

Ruotian Peng, Haiying He, Yake Wei et al.

CVPR 2025
citations not collected
#74

ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction

Yuejiao Su, Yi Wang, Qiongyang Hu et al.

CVPR 2025 · arXiv:2504.01472
egocentric interaction perception, pixel grounding, multimodal large language models, scene understanding (+4)
citations not collected
#75

Language-Guided Salient Object Ranking

Fang Liu, Yuhao Liu, Ke Xu et al.

CVPR 2025
citations not collected
#76

VideoGEM: Training-free Action Grounding in Videos

Felix Vogel, Walid Bousselham, Anna Kukleva et al.

CVPR 2025 · arXiv:2503.20348
action grounding, vision-language models, zero-shot localization, spatial video grounding (+4)
citations not collected
#77

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

CVPR 2025 · arXiv:2503.06287
visual grounding, attention heads, vision-language models, multimodal capabilities (+3)
citations not collected
#78

Seeing Speech and Sound: Distinguishing and Locating Audio Sources in Visual Scenes

Hyeonggon Ryu, Seongyu Kim, Joon Chung et al.

CVPR 2025
citations not collected
#79

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

Weitai Kang, Gaowen Liu, Mubarak Shah et al.

ECCV 2024 · arXiv:2407.03200
visual grounding, segmentation supervision, bounding box regression, multi-task encoder-decoder (+3)
citations not collected
#80

GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding

Yawen Shao, Wei Zhai, Yuhang Yang et al.

CVPR 2025
citations not collected
#81

Towards Open-Vocabulary Audio-Visual Event Localization

Jinxing Zhou, Dan Guo, Ruohao Guo et al.

CVPR 2025
citations not collected
#82

Active Learning Meets Foundation Models: Fast Remote Sensing Data Annotation for Object Detection

Marvin Burges, Philipe Dias, Dalton Lunga et al.

ICCV 2025
citations not collected
#83

VGMamba: Attribute-to-Location Clue Reasoning for Quantity-Agnostic 3D Visual Grounding

Zhu Yihang, Jinhao Zhang, Yuxuan Wang et al.

ICCV 2025
citations not collected
#84

Visual Modality Prompt for Adapting Vision-Language Object Detectors

Heitor Rapela Medeiros, Atif Belal, Srikanth Muralidharan et al.

ICCV 2025 · arXiv:2412.00622
vision-language object detection, modality adaptation, visual prompt strategy, zero-shot capability (+4)
citations not collected
#85

AGO: Adaptive Grounding for Open World 3D Occupancy Prediction

Peizheng Li, Shuxiao Ding, You Zhou et al.

ICCV 2025
citations not collected
#86

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Yunqiu Xu, Linchao Zhu, Yi Yang

ICCV 2025
citations not collected
#87

LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing

Achint Soni, Meet Soni, Sirisha Rambhatla

ICCV 2025
citations not collected
#88

ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations

Tianming Liang, Kun-Yu Lin, Chaolei Tan et al.

ICCV 2025
citations not collected
#89

Enhancing Zero-shot Object Counting via Text-guided Local Ranking and Number-evoked Global Attention

Shiwei Zhang, Qi Zhou, Wei Ke

ICCV 2025
citations not collected
#90

ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail

Chandan Yeshwanth, David Rozenberszki, Angela Dai

ICCV 2025 · arXiv:2503.17044
citations not collected
#91

How Can Objects Help Video-Language Understanding?

Zitian Tang, Shijie Wang, Junho Cho et al.

ICCV 2025 · arXiv:2504.07454
video-language understanding, multimodal large language models, object-centric representation, video question answering (+2)
citations not collected
#92

Region-aware Anchoring Mechanism for Efficient Referring Visual Grounding

Shuyi Ouyang, Ziwei Niu, Hongyi Wang et al.

ICCV 2025
citations not collected
#93

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

Shraman Pramanick, Effrosyni Mavroudi, Yale Song et al.

ICCV 2025 · arXiv:2510.17023
citations not collected
#94

NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals

Jiro Abe, Gaku Nakano, Kazumine Ogura

ICCV 2025
citations not collected
#95

LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment

Juelin Zhu, Shuaibang Peng, Long Wang et al.

ICCV 2025 · arXiv:2507.00659
aerial visual localization, low level-of-detail models, silhouette alignment, building segmentation (+4)
citations not collected
#96

Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance

Mingfang Zhang, Ryo Yonetani, Yifei Huang et al.

ICCV 2025
citations not collected
#97

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

Siming Yan, Min Bai, Weifeng Chen et al.

ECCV 2024
citations not collected
#98

Object-centric Video Question Answering with Visual Grounding and Referring

Haochen Wang, Qirui Chen, Cilin Yan et al.

ICCV 2025
citations not collected
#99

DOGR: Towards Versatile Visual Document Grounding and Referring

Yinan Zhou, Yuxin Chen, Haokun Lin et al.

ICCV 2025
citations not collected
#100

Visual Textualization for Image Prompted Object Detection

Yongjian Wu, Yang Zhou, Jiya Saiyin et al.

ICCV 2025
citations not collected