2025 "visual grounding" Papers
16 papers found
Acknowledging Focus Ambiguity in Visual Questions
Chongyan Chen, Yu-Yun Tseng, Zhuoheng Li et al.
ICCV 2025 (poster) · arXiv:2501.02201
2 citations
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.
ICCV 2025 (poster) · arXiv:2501.04003
71 citations
Controlling Multimodal LLMs via Reward-guided Decoding
Oscar Mañas, Pierluca D'Oro, Koustuv Sinha et al.
ICCV 2025 (poster) · arXiv:2508.11616
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
Jiansheng Li, Xingxuan Zhang, Hao Zou et al.
CVPR 2025 (highlight) · arXiv:2504.10158
1 citation
DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models
Kaishen Wang, Hengrui Gu, Meijun Gao et al.
ICLR 2025 (poster)
7 citations
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning
Ankan Deria, Adinath Dukre, Feilong Tang et al.
NeurIPS 2025 (oral) · arXiv:2506.15649
ESCA: Contextualizing Embodied Agents via Scene-Graph Generation
Jiani Huang, Amish Sethi, Matthew Kuo et al.
NeurIPS 2025 (oral) · arXiv:2510.15963
F-LMM: Grounding Frozen Large Multimodal Models
Size Wu, Sheng Jin, Wenwei Zhang et al.
CVPR 2025 (poster) · arXiv:2406.05821
21 citations
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang, Changle Zhou, Jiawei Kong et al.
NeurIPS 2025 (poster) · arXiv:2505.19678
6 citations
MLLMs Need 3D-Aware Representation Supervision for Scene Understanding
Xiaohu Huang, Jingjing Wu, Qunyi Xie et al.
NeurIPS 2025 (poster) · arXiv:2506.01946
17 citations
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
Ziyu Zhu, Xilin Wang, Yixuan Li et al.
ICCV 2025 (highlight) · arXiv:2507.04047
24 citations
PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training
Cong Chen, Mingyu Liu, Chenchen Jing et al.
ICLR 2025 (poster) · arXiv:2503.06486
25 citations
Seeing the Trees for the Forest: Rethinking Weakly-Supervised Medical Visual Grounding
Huy Ta, Duy Anh Huynh, Yutong Xie et al.
ICCV 2025 (highlight) · arXiv:2505.15123
2 citations
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
Divya Velayudhan, Abdelfatah Ahmed, Mohamad Alansari et al.
CVPR 2025 (highlight) · arXiv:2504.02823
2 citations
Visually Consistent Hierarchical Image Classification
Seulki Park, Youren Zhang, Stella Yu et al.
ICLR 2025 (poster) · arXiv:2406.11608
4 citations
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding
Seil Kang, Jinyeong Kim, Junhyeok Kim et al.
CVPR 2025 (highlight) · arXiv:2503.06287
31 citations