Scene Understanding
Holistic understanding of visual scenes
Top Papers
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting
Zeyu Yang, Hongye Yang, Zijie Pan et al.
SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
Yihua Huang, Yangtian Sun, Ziyi Yang et al.
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Xin Guo, Jiangwei Lao, Bo Dang et al.
Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
Jin-Chuan Shi, Miao Wang, Haobin Duan et al.
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang, Min Shi, Qingyun Li et al.
OmniRe: Omni Urban Scene Reconstruction
Ziyu Chen, Jiawei Yang, Jiahui Huang et al.
PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment
Tianchen Deng, Guole Shen, Tong Qin et al.
Unified Human-Scene Interaction via Prompted Chain-of-Contacts
Zeqi Xiao, Tai Wang, Jingbo Wang et al.
Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation
Shuting He, Henghui Ding
PerceptionGPT: Effectively Fusing Visual Perception into LLM
Renjie Pi, Lewei Yao, Jiahui Gao et al.
Wonderland: Navigating 3D Scenes from a Single Image
Hanwen Liang, Junli Cao, Vidit Goel et al.
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations
Yufeng Huang, Jiji Tang, Zhuo Chen et al.
YOLOE: Real-Time Seeing Anything
Ao Wang, Lihao Liu, Hui Chen et al.
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
Size Wu, Wenwei Zhang, Lumin Xu et al.
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion
Bohan Li, Jiajun Deng, Wenyao Zhang et al.
RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation
Haiming Zhang, Xu Yan, Dongfeng Bai et al.
Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos
Weifeng Lin, Xinyu Wei, Ruichuan An et al.
Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting
Ri-Zhao Qiu, Ge Yang, Weijia Zeng et al.
GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding
Hao Li, Dingwen Zhang, Yalun Dai et al.
VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors
Sungwon Hwang, Min-Jung Kim, Taewoong Kang et al.
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
Trong-Thuan Nguyen, Pha Nguyen, Khoa Luu
Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning
Xinshun Wang, Zhongbin Fang, Xia Li et al.
Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion
Linzhan Mou, Jun-Kun Chen, Yu-Xiong Wang
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences
Hongyan Zhi, Peihao Chen, Junyan Li et al.
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
Ziyu Zhu, Xilin Wang, Yixuan Li et al.
360+x: A Panoptic Multi-modal Scene Understanding Dataset
Hao Chen, Yuqi Hou, Chenyuan Qu et al.
HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation
Ce Zhang, Simon Stepputtis, Joseph Campbell et al.
PanoContext-Former: Panoramic Total Scene Understanding with a Transformer
Yuan Dong, Chuan Fang, Liefeng Bo et al.
Open-Vocabulary Functional 3D Scene Graphs for Real-World Indoor Spaces
Chenyangguang Zhang, Alexandros Delitzas, Fangjinhua Wang et al.
Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Razvan Pasca, Alexey Gavryushin, Muhammad Hamza et al.
DreamActor-M1: Holistic, Expressive and Robust Human Image Animation with Hybrid Guidance
Yuxuan Luo, Zhengkun Rong, Lizhen Wang et al.
ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation
Guosheng Zhao, Xiaofeng Wang, Chaojun Ni et al.
Multi-Level Neural Scene Graphs for Dynamic Urban Environments
Tobias Fischer, Lorenzo Porzi, Samuel Rota Bulò et al.
HENet: Hybrid Encoding for End-to-end Multi-task 3D Perception from Multi-view Cameras
Zhongyu Xia, ZhiWei Lin, Xinhao Wang et al.
Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation
Zhuoman Liu, Weicai Ye, Yan Luximon et al.
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
Yuqi Wu, Wenzhao Zheng, Sicheng Zuo et al.
Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception
Lei Fan, Mingfu Liang, Yunxuan Li et al.
Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models
Matthew Kowal, Richard P. Wildes, Kosta Derpanis
Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments
Liyuan Zhu, Shengyu Huang, Konrad Schindler et al.
The Scene Language: Representing Scenes with Programs, Words, and Embeddings
Yunzhi Zhang, Zizhang Li, Matt Zhou et al.
ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing
Jun-Kun Chen, Samuel Rota Bulò, Norman Müller et al.
The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective
Wenqi Jia, Miao Liu, Hao Jiang et al.
Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
Yingjie Chen, Yifang Men, Yuan Yao et al.
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat
MG-MotionLLM: A Unified Framework for Motion Comprehension and Generation across Multiple Granularities
Bizhu Wu, Jinheng Xie, Keming Shen et al.
Unlocking Attributes' Contribution to Successful Camouflage: A Combined Textual and Visual Analysis Strategy
Hong Zhang, Yixuan Lyu, Qian Yu et al.
Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models
Xiaoyu Zhu, Hao Zhou, Pengfei Xing et al.
X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer
Linglin Jing, Ying Xue, Xu Yan et al.
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
Yue Fan, Xiaojian Ma, Rongpeng Su et al.
Shape2Scene: 3D Scene Representation Learning Through Pre-training on Shape Data
Tuo Feng, Wenguan Wang, Ruijie Quan et al.
A Theory of Joint Light and Heat Transport for Lambertian Scenes
Mani Ramanagopal, Sriram Narayanan, Aswin C. Sankaranarayanan et al.
NeSyCoCo: A Neuro-Symbolic Concept Composer for Compositional Generalization
Danial Kamali, Elham J. Barezi, Parisa Kordjamshidi
MemoNav: Working Memory Model for Visual Navigation
Hongxin Li, Zeyu Wang, Xu Yang et al.
RoScenes: A Large-scale Multi-view 3D Dataset for Roadside Perception
Xiaosu Zhu, Hualian Sheng, Sijia Cai et al.
RoomTex: Texturing Compositional Indoor Scenes via Iterative Inpainting
Qi Wang, Ruijie Lu, Xudong Xu et al.
Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views
Ziwei Zhao, Yuchen Wang, Chuhua Wang
UAVScenes: A Multi-Modal Dataset for UAVs
Sijie Wang, Siqi Li, Yawei Zhang et al.
DriveEditor: A Unified 3D Information-Guided Framework for Controllable Object Editing in Driving Scenes
Yiyuan Liang, Zhiying Yan, Liqun Chen et al.
One at a Time: Progressive Multi-Step Volumetric Probability Learning for Reliable 3D Scene Perception
Bohan Li, Yasheng Sun, Jingxin Dong et al.
Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
Hyeongjun Kwon, Jinhyun Jang, Jin Kim et al.
GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths
Xianyu Chen, Ming Jiang, Qi Zhao
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
Zhixi Cai, Fucai Ke, Simindokht Jahangard et al.
Finer-CAM: Spotting the Difference Reveals Finer Details for Visual Explanation
Ziheng Zhang, Jianyang Gu, Arpita Chowdhury et al.
Visual Jenga: Discovering Object Dependencies via Counterfactual Inpainting
Anand Bhattad, Konpat Preechakul, Alexei Efros
FALCON: Fairness Learning via Contrastive Attention Approach to Continual Semantic Scene Understanding
Thanh-Dat Truong, Utsav Prabhu, Bhiksha Raj et al.
SceneWeaver: All-in-One 3D Scene Synthesis with an Extensible and Self-Reflective Agent
Yandan Yang, Baoxiong Jia, Shujie Zhang et al.
Towards Scene Graph Anticipation
Rohith Peddi, Saksham Singh, Saurabh et al.
Understanding Physical Dynamics with Counterfactual World Modeling
Rahul Mysore Venkatesh, Honglin Chen, Kevin Feigelis et al.
SkySense V2: A Unified Foundation Model for Multi-modal Remote Sensing
Yingying Zhang, Lixiang Ru, Kang Wu et al.
SuperPrimitive: Scene Reconstruction at a Primitive Level
Kirill Mazur, Gwangbin Bae, Andrew J. Davison
Towards Generalizable Scene Change Detection
Jae-Woo Kim, Ue-Hwan Kim
DiffGrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model
Yonghao Zhang, Qiang He, Yanguang Wan et al.
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.
HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren et al.
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos
Heeseung Yun, Ruohan Gao, Ishwarya Ananthabhotla et al.
"Principal Components" Enable A New Language of Images
Xin Wen, Bingchen Zhao, Ismail Elezi et al.
Functionality Understanding and Segmentation in 3D Scenes
Jaime Corsetti, Francesco Giuliari, Alice Fasoli et al.
Open-Vocabulary Octree-Graph for 3D Scene Understanding
Zhigang Wang, Yifei Su, Chenhui Li et al.
MOS: Modeling Object-Scene Associations in Generalized Category Discovery
Zhengyuan Peng, Jinpeng Ma, Zhimin Sun et al.
Planning from Imagination: Episodic Simulation and Episodic Memory for Vision-and-Language Navigation
Yiyuan Pan, Yunzhe Xu, Zhe Liu et al.
Uncertain Multimodal Intention and Emotion Understanding in the Wild
Qu Yang, QingHongYa Shi, Tongxin Wang et al.
Object-aware Sound Source Localization via Audio-Visual Scene Understanding
Sung Jin Um, Dongjin Kim, Sangmin Lee et al.
A Fair Ranking and New Model for Panoptic Scene Graph Generation
Julian Lorenz, Alexander Pest, Daniel Kienzle et al.
Video Perception Models for 3D Scene Synthesis
Rui Huang, Guangyao Zhai, Zuria Bauer et al.
SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering
Byeongjun Park, Hyojun Go, Hyelin Nam et al.
Unleashing Network Potentials for Semantic Scene Completion
Fengyun Wang, Qianru Sun, Dong Zhang et al.
Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers
Divyansh Srivastava, Xiang Zhang, He Wen et al.
ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding
Guangda Ji, Silvan Weder, Francis Engelmann et al.
Comprehensive Attribution: Inherently Explainable Vision Model with Feature Detector
Xianren Zhang, Dongwon Lee, Suhang Wang
SocialGesture: Delving into Multi-person Gesture Understanding
Xu Cao, Pranav Virupaksha, Wenqi Jia et al.
One is Plenty: A Polymorphic Feature Interpreter for Immutable Heterogeneous Collaborative Perception
Yuchen Xia, Quan Yuan, Guiyang Luo et al.
Learned Scanpaths Aid Blind Panoramic Video Quality Assessment
Kanglong Fan, Wen Wen, Mu Li et al.
Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding
Sai Wang, Yutian Lin, Yu Wu
CASAGPT: Cuboid Arrangement and Scene Assembly for Interior Design
Weitao Feng, Hang Zhou, Jing Liao et al.
Gated Fields: Learning Scene Reconstruction from Gated Videos
Andrea Ramazzina, Stefanie Walz, Pragyan Dahal et al.
Bootstrapping Clustering of Gaussians for View-consistent 3D Scene Understanding
Wenbo Zhang, Lu Zhang, Ping Hu et al.
Unbiased Video Scene Graph Generation via Visual and Semantic Dual Debiasing
Yanjun Li, Zhaoyang Li, Honghui Chen et al.
ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
Ruijie Zhu, Mulin Yu, Linning Xu et al.
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene
Shengqiong Wu, Hao Fei, Jingkang Yang et al.