Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection

8 citations

arXiv:2403.14270 PDF

Citations: #767 of 2387 papers in ECCV 2024

Authors

Tim Salzmann, Markus Ryll, Alex Bewley, Matthias Minderer

Topics

visual relationship detection, open-vocabulary detection, transformer-based encoder, attention mechanism, object detection, scene understanding, zero-shot performance, single-stage training

Abstract

Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide ablations, real-world qualitative examples, and analyses of zero-shot performance.
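The pair-selection idea in the abstract can be illustrated with a small sketch. The abstract does not give the scoring function, so everything here is an assumption made for illustration: the linear subject and object projections (`W_subj`, `W_obj`), the dot-product score, the top-k cutoff, and the function name `select_pairs` are all hypothetical and are not taken from the paper.

```python
import random


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]


def select_pairs(tokens, W_subj, W_obj, k):
    """Score every ordered pair of distinct tokens and keep the k best.

    Assumption: each object token is projected once as a potential subject
    and once as a potential object, and a pair's score is the dot product
    of the two projections. Returns (subject_index, object_index, score).
    """
    subj = [project(W_subj, t) for t in tokens]
    obj = [project(W_obj, t) for t in tokens]
    scored = [
        (dot(subj[i], obj[j]), i, j)
        for i in range(len(tokens))
        for j in range(len(tokens))
        if i != j  # a token is not paired with itself in this sketch
    ]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(i, j, score) for score, i, j in scored[:k]]


if __name__ == "__main__":
    random.seed(0)
    dim, n_tokens = 4, 5
    # Random stand-ins for encoder output tokens and learned projections.
    tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    W_subj = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    W_obj = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
    for i, j, score in select_pairs(tokens, W_subj, W_obj, k=3):
        print(f"subject token {i} -> object token {j}: score {score:.3f}")
```

Scoring all ordered pairs is quadratic in the number of tokens, so keeping only the top k pairs is what would keep any downstream relationship classification cheap in a design like this.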
