🧬Vision Recognition

Scene Understanding

Holistic understanding of visual scenes

100 papers, 3,472 total citations


Feb '24 to Jan '26: 413 papers

Top Conferences

CVPR: 52, ICCV: 16, ECCV: 15, AAAI: 9, ICLR: 4, NeurIPS: 4

Top Papers

#1

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.

Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting

Zeyu Yang, Hongye Yang, Zijie Pan et al.

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

Yihua Huang, Yangtian Sun, Ziyi Yang et al.

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang et al.

Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding

Jin-Chuan Shi, Miao Wang, Haobin Duan et al.

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Weiyun Wang, Min Shi, Qingyun Li et al.

OmniRe: Omni Urban Scene Reconstruction

Ziyu Chen, Jiawei Yang, Jiahui Huang et al.

PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment

Tianchen Deng, Guole Shen, Tong Qin et al.

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Zeqi Xiao, Tai Wang, Jingbo Wang et al.

Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

Shuting He, Henghui Ding

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Renjie Pi, Lewei Yao, Jiahui Gao et al.

Wonderland: Navigating 3D Scenes from a Single Image

Hanwen Liang, Junli Cao, Vidit Goel et al.

Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations

Yufeng Huang, Jiji Tang, Zhuo Chen et al.

AAAI 2024, arXiv:2305.06152

scene graph knowledge, multi-modal structured representations, vision-language pre-training, image-text matching, +3 more

49 citations

#14

YOLOE: Real-Time Seeing Anything

Ao Wang, Lihao Liu, Hui Chen et al.

Harmonizing Visual Representations for Unified Multimodal Understanding and Generation

Size Wu, Wenwei Zhang, Lumin Xu et al.

Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion

Bohan Li, Jiajun Deng, Wenyao Zhang et al.

ECCV 2024, arXiv:2407.02077

semantic scene completion, temporal context learning, camera-based 3d reconstruction, cross-frame affinity measurement, +4 more

31 citations

#17

RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

Haiming Zhang, Xu Yan, Dongfeng Bai et al.

AAAI 2024, arXiv:2312.11829

3d occupancy prediction, cross-modal knowledge distillation, multi-view images, volume rendering, +4 more

31 citations

#18

Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos

Weifeng Lin, Xinyu Wei, Ruichuan An et al.

NeurIPS 2025, arXiv:2506.05302

region-level visual understanding, object segmentation, large language models, semantic perceiver, +4 more

29 citations

#19

GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding

Hao Li, Dingwen Zhang, Yalun Dai et al.

VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors

Sungwon Hwang, Min-Jung Kim, Taewoong Kang et al.

ECCV 2024, arXiv:2407.02945

urban scene reconstruction, 3d gaussian splatting, extrapolated view synthesis, neural rendering, +4 more

28 citations

#21

HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding

Trong-Thuan Nguyen, Pha Nguyen, Khoa Luu

Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting

Ri-Zhao Qiu, Ge Yang, Weijia Zeng et al.

Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning

Xinshun Wang, Zhongbin Fang, Xia Li et al.

Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion

Linzhan Mou, Jun-Kun Chen, Yu-Xiong Wang

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Hongyan Zhi, Peihao Chen, Junyan Li et al.

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

ZIYU ZHU, Xilin Wang, Yixuan Li et al.

ICCV 2025, arXiv:2507.04047

embodied scene understanding, 3d vision-language learning, visual grounding, active perception, +4 more

24 citations

#27

360+x: A Panoptic Multi-modal Scene Understanding Dataset

Hao Chen, Yuqi Hou, Chenyuan Qu et al.

HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation

Ce Zhang, Simon Stepputtis, Joseph Campbell et al.

PanoContext-Former: Panoramic Total Scene Understanding with a Transformer

Yuan Dong, Chuan Fang, Liefeng Bo et al.

Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces

Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang et al.

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Razvan Pasca, Alexey Gavryushin, Muhammad Hamza et al.

DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance

Yuxuan Luo, Zhengkun Rong, Lizhen Wang et al.

ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

Guosheng Zhao, Xiaofeng Wang, Chaojun Ni et al.

Multi-Level Neural Scene Graphs for Dynamic Urban Environments

Tobias Fischer, Lorenzo Porzi, Samuel Rota Bulò et al.

HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras

Zhongyu Xia, ZhiWei Lin, Xinhao Wang et al.

ECCV 2024, arXiv:2404.02517

multi-view cameras, 3d object detection, bird's-eye-view segmentation, temporal feature integration, +4 more

19 citations

#36

Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation

Zhuoman Liu, Weicai Ye, Yan Luximon et al.

EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

Yuqi Wu, Wenzhao Zheng, Sicheng Zuo et al.

Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception

Lei Fan, Mingfu Liang, Yunxuan Li et al.

Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models

Matthew Kowal, Richard P. Wildes, Kosta Derpanis

Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments

Liyuan Zhu, Shengyu Huang, Konrad Schindler et al.

The Scene Language: Representing Scenes with Programs, Words, and Embeddings

Yunzhi Zhang, Zizhang Li, Matt Zhou et al.

ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing

Jun-Kun Chen, Samuel Rota Bulò, Norman Müller et al.

The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective

Wenqi Jia, Miao Liu, Hao Jiang et al.

Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Yingjie Chen, Yifang Men, Yuan Yao et al.

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat

MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities

Bizhu Wu, Jinheng Xie, Keming Shen et al.

Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy

Hong Zhang, Yixuan Lyu, Qian Yu et al.

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Xiaoyu Zhu, Hao Zhou, Pengfei Xing et al.

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

Linglin Jing, Ying Xue, Xu Yan et al.

AAAI 2024, arXiv:2312.07378

4d point cloud understanding, cross-modal knowledge transfer, temporal relationship mining, action recognition, +4 more

11 citations

#50

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

Yue Fan, Xiaojian Ma, Rongpeng Su et al.

Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data

Tuo FENG, Wenguan Wang, Ruijie Quan et al.

A Theory of Joint Light and Heat Transport for Lambertian Scenes

Mani Ramanagopal, Sriram Narayanan, Aswin C. Sankaranarayanan et al.

NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization

Danial Kamali, Elham J. Barezi, Parisa Kordjamshidi

MemoNav: Working Memory Model for Visual Navigation

Hongxin Li, Zeyu Wang, Xu Yang et al.

RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception

Xiaosu Zhu, Hualian Sheng, Sijia Cai et al.

RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting

Qi Wang, Ruijie Lu, Xudong XU et al.

Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views

Ziwei Zhao, Yuchen Wang, Chuhua Wang

UAVScenes: A Multi-Modal Dataset for UAVs

Sijie Wang, Siqi Li, Yawei Zhang et al.

DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes

Yiyuan Liang, Zhiying Yan, Liqun Chen et al.

One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception

Bohan Li, Yasheng Sun, Jingxin Dong et al.

AAAI 2024, arXiv:2306.12681

multi-view stereo, semantic scene completion, volumetric probability learning, diffusion models, +2 more

9 citations

#61

Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping

Hyeongjun Kwon, Jinhyun Jang, Jin Kim et al.

GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

Xianyu Chen, Ming Jiang, Qi Zhao

NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning

Zhixi Cai, Fucai Ke, Simindokht Jahangard et al.

Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation

Ziheng Zhang, Jianyang Gu, Arpita Chowdhury et al.

Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting

Anand Bhattad, Konpat Preechakul, Alexei Efros

NeurIPS 2025, arXiv:2503.21770

scene understanding, counterfactual inpainting, object dependencies, scene coherence, +3 more

8 citations

#66

FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding

Thanh-Dat Truong, Utsav Prabhu, Bhiksha Raj et al.

SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent

Yandan Yang, Baoxiong Jia, Shujie Zhang et al.

Towards Scene Graph Anticipation

Rohith Peddi, Saksham Singh, Saurabh et al.

Understanding Physical Dynamics with Counterfactual World Modeling

Rahul Mysore Venkatesh, Honglin Chen, Kevin Feigelis et al.

SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing

Yingying Zhang, Lixiang Ru, Kang Wu et al.

ICCV 2025, arXiv:2507.13812

remote sensing foundation model, multi-modal learning, self-supervised learning, transformer backbone, +4 more

7 citations

#71

SuperPrimitive: Scene Reconstruction at a Primitive Level

Kirill Mazur, Gwangbin Bae, Andrew J. Davison

Towards Generalizable Scene Change Detection

Jae-Woo KIM, Ue-Hwan Kim

DiffGrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model

Yonghao Zhang, Qiang He, Yanguang Wan et al.

Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding

Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.

CVPR 2025, arXiv:2503.16707

open-vocabulary 3d scene understanding, vision-language models, cross-modal aggregation, deterministic uncertainty estimation, +4 more

7 citations

#75

HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren et al.

Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Heeseung Yun, Ruohan Gao, Ishwarya Ananthabhotla et al.

"Principal Components" Enable A New Language of Images

Xin Wen, Bingchen Zhao, Ismail Elezi et al.

Functionality Understanding and Segmentation in 3D Scenes

Jaime Corsetti, Francesco Giuliari, Alice Fasoli et al.

Open-Vocabulary Octree-Graph for 3D Scene Understanding

Zhigang Wang, Yifei Su, Chenhui Li et al.

MOS: Modeling Object-Scene Associations in Generalized Category Discovery

Zhengyuan Peng, Jinpeng Ma, Zhimin Sun et al.

Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation

Yiyuan Pan, Yunzhe Xu, Zhe Liu et al.

Uncertain Multimodal Intention and Emotion Understanding in the Wild

Qu Yang, QingHongYa Shi, Tongxin Wang et al.

Object-aware Sound Source Localization via Audio-Visual Scene Understanding

Sung Jin Um, Dongjin Kim, Sangmin Lee et al.

CVPR 2025, arXiv:2506.18557

sound source localization, audio-visual correspondence, multimodal large language models, object-aware contrastive alignment, +2 more

5 citations

#84

A Fair Ranking and New Model for Panoptic Scene Graph Generation

Julian Lorenz, Alexander Pest, Daniel Kienzle et al.

Video Perception Models for 3D Scene Synthesis

Rui Huang, Guangyao Zhai, Zuria Bauer et al.

SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering

Byeongjun Park, Hyojun Go, Hyelin Nam et al.

Unleashing Network Potentials for Semantic Scene Completion

Fengyun Wang, Qianru Sun, Dong Zhang et al.

Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Divyansh Srivastava, Xiang Zhang, He Wen et al.

ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding

Guangda Ji, Silvan Weder, Francis Engelmann et al.

Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector

Xianren Zhang, Dongwon Lee, Suhang Wang

SocialGesture: Delving into Multi-person Gesture Understanding

Xu Cao, Pranav Virupaksha, Wenqi Jia et al.

One is Plenty: A Polymorphic Feature Interpreter for Immutable Heterogeneous Collaborative Perception

Yuchen Xia, Quan Yuan, Guiyang Luo et al.

Learned Scanpaths Aid Blind Panoramic Video Quality Assessment

Kanglong FAN, Wen Wen, Mu Li et al.

Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding

Sai Wang, Yutian Lin, Yu Wu

CASAGPT: Cuboid Arrangement and Scene Assembly for Interior Design

Weitao Feng, Hang Zhou, Jing Liao et al.

CVPR 2025, arXiv:2504.19478

indoor scene synthesis, cuboid primitives, 3d object arrangement, autoregressive scene generation, +3 more

4 citations

#96

Gated Fields: Learning Scene Reconstruction from Gated Videos

Andrea Ramazzina, Stefanie Walz, Pragyan Dahal et al.

Bootstraping Clustering of Gaussians for View-consistent 3D Scene Understanding

Wenbo Zhang, Lu Zhang, Ping Hu et al.

Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing

Yanjun Li, Zhaoyang Li, Honghui Chen et al.

ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting

Ruijie Zhu, Mulin Yu, Linning Xu et al.

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

Shengqiong Wu, Hao Fei, Jingkang Yang et al.

CVPR 2025

4 citations
