🧬 Multimodal

Visual Grounding

Localizing objects from text descriptions

100 papers · 3,458 total citations
[Chart: papers over time, Feb '24 – Jan '26 — 668 papers]
Also includes: visual grounding, referring expression, grounded generation

Top Papers

#1

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani et al.

CVPR 2024
550
citations
#2

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Haobin Duan et al.

CVPR 2024
171
citations
#3

SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery

Konstantin Klemmer, Esther Rolf, Caleb Robinson et al.

AAAI 2025
137
citations
#4

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

Hao Zhang, Hongyang Li, Feng Li et al.

ECCV 2024 · arXiv:2312.02949
Keywords: grounded visual chat, large multimodal models, visual grounding, segmentation models (+3)
114
citations
#5

RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding

Jihan Yang, Runyu Ding, Weipeng DENG et al.

CVPR 2024
103
citations
#6

PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment

Tianchen Deng, Guole Shen, Tong Qin et al.

CVPR 2024
101
citations
#7

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

Weiyun Wang, Yiming Ren, Haowen Luo et al.

ECCV 2024
86
citations
#8

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Yichi Zhang, Ziqiao Ma, Xiaofeng Gao et al.

CVPR 2024
75
citations
#9

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Walid Bousselham, Felix Petersen, Vittorio Ferrari et al.

CVPR 2024
74
citations
#10

Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction

Junuk Cha, Jihyeon Kim, Jae Shin Yoon et al.

CVPR 2024
58
citations
#11

Text2Loc: 3D Point Cloud Localization from Natural Language

Yan Xia, Letian Shi, Zifeng Ding et al.

CVPR 2024
54
citations
#12

Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

Tianbao Xie, Jiaqi Deng, Xiaochuan Li et al.

NeurIPS 2025
53
citations
#13

Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology

Xiangyu Wang, Donglin Yang, Ziqin Wang et al.

ICLR 2025 · arXiv:2410.07087
Keywords: vision-language navigation, UAV navigation, trajectory generation, multimodal understanding (+4)
52
citations
#14

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training

David Wan, Jaemin Cho, Elias Stengel-Eskin et al.

ECCV 2024
50
citations
#15

Discovering and Mitigating Visual Biases through Keyword Explanation

Younghyun Kim, Sangwoo Mo, Minkyu Kim et al.

CVPR 2024
50
citations
#16

Describe Anything: Detailed Localized Image and Video Captioning

Long Lian, Yifan Ding, Yunhao Ge et al.

ICCV 2025
49
citations
#17

SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments

Shibo Zhao, Yuanjun Gao, Tianhao Wu et al.

CVPR 2024
49
citations
#18

HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios

HyunJun Jung, Shun-Cheng Wu, Patrick Ruhkamp et al.

CVPR 2024
42
citations
#19

Human-Object Interaction from Human-Level Instructions

Zhen Wu, Jiaman Li, Pei Xu et al.

ICCV 2025
36
citations
#20

DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

Jiuming Liu, Dong Zhuo, Zhiheng Feng et al.

ECCV 2024 · arXiv:2403.18274
Keywords: visual-LiDAR fusion, odometry estimation, structure alignment, local-to-global fusion (+3)
36
citations
#21

STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?

Yun Li, Yiming Zhang, Tao Lin et al.

ICCV 2025
36
citations
#22

GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

Qianhui Wu, Kanzhi Cheng, Rui Yang et al.

NeurIPS 2025
36
citations
#23

Mono3DVG: 3D Visual Grounding in Monocular Images

Yangfan Zhan, Yuan Yuan, Zhitong Xiong

AAAI 2024 · arXiv:2312.08022
Keywords: 3D visual grounding, monocular images, geometry-text embeddings, depth prediction (+4)
35
citations
#24

SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation

Zekun Qi, Wenyao Zhang, Yufei Ding et al.

NeurIPS 2025 · arXiv:2502.13143
Keywords: semantic orientation, 6-DoF manipulation, spatial reasoning, zero-shot prediction (+4)
33
citations
#25

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Hengrui Kang, Siwei Wen, Zichen Wen et al.

ICCV 2025
32
citations
#26

OpenStreetView-5M: The Many Roads to Global Visual Geolocation

Guillaume Astruc, Nicolas Dufour, Ioannis Siglidis et al.

CVPR 2024
32
citations
#27

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Seil Kang, Jinyeong Kim, Junhyeok Kim et al.

CVPR 2025 · arXiv:2503.06287
Keywords: visual grounding, attention heads, vision-language models, multimodal capabilities (+3)
31
citations
#28

Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

Junyan Ye, Zhutao Lv, Weijia Li et al.

ECCV 2024
31
citations
#29

Open-World Human-Object Interaction Detection via Multi-modal Prompts

Jie Yang, Bingliang Li, Ailing Zeng et al.

CVPR 2024
31
citations
#30

Visual Agentic AI for Spatial Reasoning with a Dynamic API

Damiano Marsili, Rohun Agrawal, Yisong Yue et al.

CVPR 2025
30
citations
#31

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Shehan Munasinghe, Hanan Gani, Wenqi Zhu et al.

CVPR 2025
30
citations
#32

3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination

Jianing "Jed" Yang, Xuweiyi Chen, Nikhil Madaan et al.

CVPR 2025
30
citations
#33

Griffon: Spelling out All Object Locations at Any Granularity with Large Language Models

Yufei Zhan, Yousong Zhu, Zhiyang Chen et al.

ECCV 2024
30
citations
#34

Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding

Hoang-Quan Nguyen, Thanh-Dat Truong, Xuan-Bac Nguyen et al.

CVPR 2024
29
citations
#35

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Ozan Unal, Christos Sakaridis, Suman Saha et al.

ECCV 2024 · arXiv:2309.04561
Keywords: 3D visual grounding, dense 3D grounding, instance segmentation, verbo-visual fusion (+4)
29
citations
#36

UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence

Ruihai Wu, Haoran Lu, Yiyan Wang et al.

CVPR 2024
29
citations
#37

DyFo: A Training-Free Dynamic Focus Visual Search for Enhancing LMMs in Fine-Grained Visual Understanding

Geng Li, Jinglin Xu, Yunzhen Zhao et al.

CVPR 2025
28
citations
#38

Improved baselines for vision-language pre-training

Jakob Verbeek, Enrico Fini, Michal Drozdzal et al.

ICLR 2024
26
citations
#39

CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu, Jintong Li, Yicheng Jiang et al.

CVPR 2025
25
citations
#40

CityNav: A Large-Scale Dataset for Real-World Aerial Navigation

Jungdae Lee, Taiki Miyanishi, Shuhei Kurita et al.

ICCV 2025
25
citations
#41

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Muhammad Danish, Muhammad Akhtar Munir, Syed Shah et al.

ICCV 2025
24
citations
#42

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Ziyu Zhu, Xilin Wang, Yixuan Li et al.

ICCV 2025 · arXiv:2507.04047
Keywords: embodied scene understanding, 3D vision-language learning, visual grounding, active perception (+4)
24
citations
#43

MANUS: Markerless Grasp Capture using Articulated 3D Gaussians

Chandradeep Pokhariya, Ishaan Shah, Angela Xing et al.

CVPR 2024
23
citations
#44

Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang et al.

CVPR 2025
23
citations
#45

Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection

Le Yang, Ziwei Zheng, Boxu Chen et al.

CVPR 2025
22
citations
#46

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Razvan Pasca, Alexey Gavryushin, Muhammad Hamza et al.

CVPR 2024
22
citations
#47

GEARS: Local Geometry-aware Hand-object Interaction Synthesis

Keyang Zhou, Bharat Lal Bhatnagar, Jan Lenssen et al.

CVPR 2024
22
citations
#48

LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction

Er Jin, Qihui Feng, Yongli Mou et al.

AAAI 2025
20
citations
#49

Training-free Video Temporal Grounding using Large-scale Pre-trained Models

Minghang Zheng, Xinhao Cai, Qingchao Chen et al.

ECCV 2024 · arXiv:2408.16219
Keywords: video temporal grounding, vision-language models, large language models, zero-shot learning (+3)
20
citations
#50

Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination

Qingwang Zhang, Yingying Zhu

AAAI 2024
20
citations
#51

360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries

Huajian Huang, Changkun Liu, Yipeng Zhu et al.

CVPR 2024
20
citations
#52

You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval

Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain et al.

CVPR 2024
19
citations
#53

SPARTUN3D: Situated Spatial Understanding of 3D World in Large Language Model

Yue Zhang, Zhiyang Xu, Ying Shen et al.

ICLR 2025 · arXiv:2410.03878
Keywords: 3D scene understanding, large language models, spatial reasoning, 3D visual representations (+3)
19
citations
#54

CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Jaemin Cho, Elias Stengel-Eskin et al.

ICCV 2025
19
citations
#55

ConGeo: Robust Cross-view Geo-localization across Ground View Variations

Li Mi, Chang Xu, Javiera Castillo Navarro et al.

ECCV 2024 · arXiv:2403.13965
Keywords: cross-view geo-localization, ground view variations, orientation invariance, field-of-view resilience (+3)
19
citations
#56

Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes

Zhi Cai, Yingjie Gao, Yaoyan Zheng et al.

ECCV 2024
18
citations
#57

Dense Audio-Visual Event Localization Under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration

Ziheng Zhou, Jinxing Zhou, Wei Qian et al.

AAAI 2025
18
citations
#58

Zero-Shot Aerial Object Detection with Visual Description Regularization

Chenyu Lin, Zhengqing Zang, Chenwei Tang et al.

AAAI 2024 · arXiv:2402.18233
Keywords: zero-shot detection, aerial object detection, visual description regularization, semantic-visual correlation (+4)
18
citations
#59

EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

Issar Tzachor, Boaz Lerner, Matan Levy et al.

ICLR 2025
18
citations
#60

Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

Jiangyong Huang, Baoxiong Jia, Yan Wang et al.

CVPR 2025
17
citations
#61

Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance

Tien Toan Nguyen, Minh Nhat Vu, Baoru Huang et al.

ECCV 2024 · arXiv:2407.13842
Keywords: 6-DoF grasp detection, language-driven robotics, point cloud processing, negative prompt guidance (+4)
17
citations
#62

LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation

Yuchen Su, Zhineng Chen, Zhiwen Shao et al.

AAAI 2024 · arXiv:2306.15142
Keywords: scene text detection, low-rank approximation, arbitrary-shaped text, shape representation (+4)
17
citations
#63

Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors

Haoxuanye Ji, Pengpeng Liang, Erkang Cheng

CVPR 2024
17
citations
#64

Motion Prior Knowledge Learning with Homogeneous Language Descriptions for Moving Infrared Small Target Detection

Shengjia Chen, Luping Ji, Weiwei Duan et al.

AAAI 2025
16
citations
#65

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Junyan Ye, Honglin Lin, Leyan Ou et al.

ICCV 2025
16
citations
#66

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding

Eslam Abdelrahman, Mohamed Ayman Mohamed, Mahmoud Ahmed et al.

ICLR 2024
16
citations
#67

Grounded Object-Centric Learning

Avinash Kori, Francesco Locatello, Fabio De Sousa Ribeiro et al.

ICLR 2024
16
citations
#68

Global-Local Tree Search in VLMs for 3D Indoor Scene Generation

Wei Deng, Mengshi Qi, Huadong Ma

CVPR 2025
16
citations
#69

Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation

Nicolas Dufour, Vicky Kalogeiton, David Picard et al.

CVPR 2025 · arXiv:2412.06781
Keywords: visual geolocation, generative geolocation, diffusion models, Riemannian flow matching (+3)
16
citations
#70

RoboGround: Robotic Manipulation with Grounded Vision-Language Priors

Haifeng Huang, Xinyi Chen, Yilun Chen et al.

CVPR 2025 · arXiv:2504.21530
Keywords: robotic manipulation, grounding masks, vision-language models, intermediate representations (+3)
15
citations
#71

Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge

Dongjin Kim, Sung Jin Um, Sangmin Lee et al.

CVPR 2024
15
citations
#72

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

Hao Tang, Chen-Wei Xie, Haiyang Wang et al.

NeurIPS 2025 · arXiv:2503.01342
Keywords: fine-grained visual perception, open-ended language interface, instance segmentation, semantic segmentation (+4)
14
citations
#73

Improving Text-guided Object Inpainting with Semantic Pre-inpainting

Yifu Chen, Jingwen Chen, Yingwei Pan et al.

ECCV 2024
14
citations
#74

MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection Dataset for Tiny Objects

Lei Fan, Dongdong Fan, Zhiguang Hu et al.

CVPR 2025
14
citations
#75

MindJourney: Test-Time Scaling with World Models for Spatial Reasoning

Yuncong Yang, Jiageng Liu, Zheyuan Zhang et al.

NeurIPS 2025
13
citations
#76

F3Loc: Fusion and Filtering for Floorplan Localization

Changan Chen, Rui Wang, Christoph Vogel et al.

CVPR 2024
13
citations
#77

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Zhenxiang Lin, Xidong Peng, Peishan Cong et al.

ECCV 2024 · arXiv:2304.05645
Keywords: 3D visual grounding, large-scale dynamic scenes, multi-modal visual data, natural language descriptions (+3)
13
citations
#78

Where am I? Scene Retrieval with Language

Jiaqi Chen, Daniel Barath, Iro Armeni et al.

ECCV 2024 · arXiv:2404.14565
Keywords: scene retrieval, 3D scene graphs, natural language interfaces, embodied AI (+3)
13
citations
#79

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Ming Dai, Jian Li, Jiedong Zhuang et al.

AAAI 2025
13
citations
#80

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Tingbing Yan, Wenzheng Zeng, Yang Xiao et al.

ECCV 2024
12
citations
#81

Adapting Fine-Grained Cross-View Localization to Areas without Fine Ground Truth

Zimin Xia, Yujiao Shi, Hongdong Li et al.

ECCV 2024 · arXiv:2406.00474
Keywords: cross-view localization, weakly supervised learning, knowledge self-distillation, pseudo ground truth (+3)
12
citations
#82

ChEX: Interactive Localization and Region Description in Chest X-rays

Philip Müller, Georgios Kaissis, Daniel Rueckert

ECCV 2024
12
citations
#83

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Pengkun Jiao, Na Zhao, Jingjing Chen et al.

ECCV 2024
12
citations
#84

Focus on Local: Finding Reliable Discriminative Regions for Visual Place Recognition

Changwei Wang, Shunpeng Chen, Yukun Song et al.

AAAI 2025
12
citations
#85

LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes

Juliette Marrie, Romain Menegaux, Michael Arbel et al.

ICCV 2025
12
citations
#86

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Tanveer Hannan, Mohaiminul Islam, Thomas Seidl et al.

ECCV 2024
12
citations
#87

ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

Zhenyang Liu, Yikai Wang, Sixiao Zheng et al.

CVPR 2025
12
citations
#88

Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy

Hong Zhang, Yixuan Lyu, Qian Yu et al.

ECCV 2024
11
citations
#89

Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper

Xinyue Zhu, Binghao Huang, Yunzhu Li

NeurIPS 2025
11
citations
#90

OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

Anindya Mondal, Sauradip Nag, Xiatian Zhu et al.

AAAI 2025
11
citations
#91

ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding

Zhenxing Zhang, Yaxiong Wang, Lechao Cheng et al.

CVPR 2025
10
citations
#92

Visual Test-time Scaling for GUI Agent Grounding

Tiange Luo, Lajanugen Logeswaran, Justin Johnson et al.

ICCV 2025
10
citations
#93

LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation

Pengwei Yin, Jingjing Wang, Guanzhong Zeng et al.

ECCV 2024
9
citations
#94

3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

Xiaoxu Xu, Yitian Yuan, Jinlong Li et al.

ECCV 2024
9
citations
#95

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu et al.

CVPR 2025
9
citations
#96

AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting

Xiaoyu Zhou, Jingqi Wang, Yongtao Wang et al.

ICCV 2025
9
citations
#97

Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model

Danni Yang, Ruohan Dong, Jiayi Ji et al.

ECCV 2024 · arXiv:2407.05352
Keywords: text-to-image diffusion, phrase-level grounding, panoptic narrative grounding, cross-attention mechanisms (+4)
9
citations
#98

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Chun Feng, Joy Hsu, Weiyu Liu et al.

CVPR 2024
9
citations
#99

Attention-Driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models Without Fine-Tuning

Hai-Ming Xu, Qi Chen, Lei Wang et al.

AAAI 2025
9
citations
#100

O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation

Muer Tie, Julong Wei, Zhengjun Wang et al.

ECCV 2024
9
citations