IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

8 citations

arXiv:2506.23329

#764 by citations in NeurIPS 2025, of 5858 papers

Top Authors


Hengyu Liu Chenxin Li Zhengxin Li Yipeng Wu Wuyang Li Zhiqin Yang Zhenyuan Zhang Yunlong Lin Sirui Han Brandon Feng

Abstract

Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility. Initial experiments on agentic inverse rendering powered by various state-of-the-art VLMs highlight current limitations, particularly in visual precision rather than basic tool usage. IR3D-Bench, including data and evaluation protocols, is released to facilitate systematic study and development of tool-using VLAs towards genuine scene understanding by creating.
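To make the idea of scoring a recreated scene against the original more concrete, here is a minimal sketch in Python. It is not IR3D-Bench's actual data format or metric suite: the `SceneObject` class, the three scoring functions, and the toy scene are all hypothetical, and objects are assumed to be matched by name. It only illustrates how geometric accuracy, spatial relations, and appearance attributes could each be scored separately once an agent has emitted a structured scene.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple
import math


@dataclass(frozen=True)
class SceneObject:
    # Hypothetical scene record; the benchmark's real schema may differ.
    name: str
    color: str
    position: Tuple[float, float, float]  # (x, y, z) in scene units


def mean_position_error(pred: List[SceneObject], truth: List[SceneObject]) -> float:
    """Mean Euclidean distance between predicted and true objects matched by name."""
    truth_by_name = {o.name: o for o in truth}
    dists = [
        math.dist(p.position, truth_by_name[p.name].position)
        for p in pred
        if p.name in truth_by_name
    ]
    return sum(dists) / len(dists) if dists else float("inf")


def _left_of(a: SceneObject, b: SceneObject) -> bool:
    return a.position[0] < b.position[0]


def relation_agreement(pred: List[SceneObject], truth: List[SceneObject]) -> float:
    """Fraction of object pairs whose left/right ordering matches the ground truth."""
    pred_by_name = {o.name: o for o in pred}
    pairs = [
        (a, b)
        for a, b in combinations(truth, 2)
        if a.name in pred_by_name and b.name in pred_by_name
    ]
    if not pairs:
        return 0.0
    hits = sum(
        _left_of(a, b) == _left_of(pred_by_name[a.name], pred_by_name[b.name])
        for a, b in pairs
    )
    return hits / len(pairs)


def color_accuracy(pred: List[SceneObject], truth: List[SceneObject]) -> float:
    """Fraction of ground-truth objects whose predicted color matches."""
    pred_by_name = {o.name: o for o in pred}
    if not truth:
        return 0.0
    hits = sum(
        o.name in pred_by_name and pred_by_name[o.name].color == o.color
        for o in truth
    )
    return hits / len(truth)


# Toy example: the predicted scene misplaces the cube slightly and gets one color wrong.
truth_scene = [
    SceneObject("cube", "red", (0.0, 0.0, 0.0)),
    SceneObject("sphere", "blue", (2.0, 0.0, 0.0)),
]
pred_scene = [
    SceneObject("cube", "red", (0.3, 0.4, 0.0)),
    SceneObject("sphere", "green", (2.0, 0.0, 0.0)),
]

scores = {
    "position_error": mean_position_error(pred_scene, truth_scene),
    "relation_agreement": relation_agreement(pred_scene, truth_scene),
    "color_accuracy": color_accuracy(pred_scene, truth_scene),
}
```

Keeping the three scores separate, rather than folding them into one number, mirrors the abstract's distinction between geometric, relational, and appearance errors: in the toy scene the layout ordering is right even though one color and one position are off.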

Citation History

[Chart: citations over time, Jan 25, 2026 to Feb 13, 2026; latest data point: 8 (+8) on Feb 13, 2026]