MetaFind: Scene-Aware 3D Asset Retrieval for Coherent Metaverse Scene Generation

1 citation

arXiv:2510.04057

Citations: #1443 in NeurIPS 2025 (of 5858 papers)

Authors

Zhenyu Pan, Yucheng Lu, Han Liu

Topics

3D asset retrieval, metaverse scene generation, tri-modal retrieval, scene-aware retrieval, equivariant layout encoder, spatial relationship modeling, compositional retrieval framework, scene coherence

Abstract

We present MetaFind, a scene-aware tri-modal compositional retrieval framework designed to enhance scene generation in the metaverse by retrieving 3D assets from large-scale repositories. MetaFind addresses two core challenges: (i) inconsistent asset retrieval that overlooks spatial, semantic, and stylistic constraints, and (ii) the absence of a standardized retrieval paradigm specifically tailored for 3D asset retrieval, as existing approaches mainly rely on general-purpose 3D shape representation models. Our key innovation is a flexible retrieval mechanism that supports arbitrary combinations of text, image, and 3D modalities as queries, enhancing spatial reasoning and style consistency by jointly modeling object-level features (including appearance) and scene-level layout structures. Methodologically, MetaFind introduces a plug-and-play equivariant layout encoder, ESSGNN, that captures spatial relationships and object appearance features, ensuring that retrieved 3D assets are contextually and stylistically coherent with the existing scene regardless of coordinate frame transformations. The framework supports iterative scene construction by continuously adapting retrieval results to the current state of the scene. Empirical evaluations show that MetaFind improves spatial and stylistic consistency over baseline methods across a range of retrieval tasks.
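Two ideas in the abstract can be illustrated with a small sketch: queries built from any subset of text, image, and 3D inputs, and layout features that do not change when the scene's coordinate frame changes. The code below is not the MetaFind implementation and does not reproduce ESSGNN. It assumes, purely for illustration, that all three modalities are already embedded in one shared vector space, fuses them by simple averaging, and uses pairwise object distances as a stand-in for a frame-independent layout feature. All function names (`compose_query`, `retrieve`, `pairwise_distances`, `rotate_z_and_shift`) are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]
Point = Tuple[float, float, float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def compose_query(
    text: Optional[Vector] = None,
    image: Optional[Vector] = None,
    shape: Optional[Vector] = None,
) -> List[float]:
    """Fuse whichever modality embeddings are present (assumed shared space)."""
    present = [l2_normalize(v) for v in (text, image, shape) if v is not None]
    if not present:
        raise ValueError("at least one modality is required")
    if len({len(v) for v in present}) != 1:
        raise ValueError("all embeddings must have the same dimension")
    # Illustrative fusion rule: average the unit vectors, then renormalize.
    mean = [sum(col) / len(present) for col in zip(*present)]
    return l2_normalize(mean)


def retrieve(query: Vector, assets: Dict[str, Vector], k: int = 1) -> List[str]:
    """Return the names of the k assets with the highest cosine similarity."""
    def score(item: Tuple[str, Vector]) -> float:
        return sum(q * a for q, a in zip(query, l2_normalize(item[1])))

    ranked = sorted(assets.items(), key=score, reverse=True)
    return [name for name, _ in ranked[:k]]


def pairwise_distances(points: Sequence[Point]) -> List[List[float]]:
    """Distances between object positions; unchanged by rotation or translation."""
    return [[math.dist(p, q) for q in points] for p in points]


def rotate_z_and_shift(
    points: Sequence[Point], angle: float, shift: Point
) -> List[Point]:
    """Apply a rigid change of coordinate frame (rotation about z, then shift)."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]
```

Calling `compose_query` with only `text`, or with `text` and `image`, or with all three arguments uses the same code path, which is the sense in which the query is compositional here. Comparing `pairwise_distances(points)` with `pairwise_distances(rotate_z_and_shift(points, angle, shift))` gives matching values up to floating-point error, a simple instance of the frame independence the abstract describes.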
