Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

0 Citations · Ranked #1301 of 2387 papers in ECCV 2024 · 3 Authors

Abstract

Vision-language (VL) models have proven highly effective across a variety of object detection tasks by leveraging weakly supervised image-text pairs from the web. However, these models exhibit a limited understanding of complex compositions of visual objects (e.g., attributes, shapes, and their relations), resulting in a significant performance drop on complex and diverse language queries. While conventional methods try to enhance VL models through hard-negative synthetic augmentation in the text domain, their effectiveness remains restricted without densely paired image-text augmentation. In this paper, we introduce a structured synthetic data generation approach to improve the compositional understanding of VL models for language-based object detection, which generates densely paired positive and negative triplets (object, text description, bounding box) in both the image and text domains. In addition, to train VL models effectively, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost over existing baselines: up to +5 AP on the OmniLabel benchmark and +6.9 AP on the D3 benchmark.
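
The abstract does not spell out the loss, but the core idea of contrasting each object region against its matching description and hard-negative descriptions can be sketched as an InfoNCE-style objective. The snippet below is a minimal illustration, not the authors' released code; the function name, tensor shapes, and temperature value are all assumptions for the sake of the example.

```python
# Minimal sketch (hypothetical, not the paper's implementation) of a
# contrastive loss over synthetic (object, description, box) triplets.
import torch
import torch.nn.functional as F

def compositional_contrastive_loss(region_emb, pos_text_emb, neg_text_emb,
                                   temperature=0.07):
    """InfoNCE-style loss: each region embedding (pooled from its bounding
    box) is pulled toward its positive description and pushed away from
    K hard-negative descriptions.

    region_emb:   (B, D)    embeddings of object regions
    pos_text_emb: (B, D)    embeddings of matching descriptions
    neg_text_emb: (B, K, D) embeddings of K hard negatives per region
    """
    region_emb = F.normalize(region_emb, dim=-1)
    pos_text_emb = F.normalize(pos_text_emb, dim=-1)
    neg_text_emb = F.normalize(neg_text_emb, dim=-1)

    # Cosine similarity to the matching description: (B, 1)
    pos_sim = (region_emb * pos_text_emb).sum(dim=-1, keepdim=True)
    # Cosine similarity to each hard negative: (B, K)
    neg_sim = torch.einsum("bd,bkd->bk", region_emb, neg_text_emb)

    # The positive sits at index 0 of each row of logits.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)
    return F.cross_entropy(logits, labels)
```

Because the paper generates negatives in both the image and text domains, a full formulation would presumably apply a symmetric term with negative regions per description as well; the sketch above shows only the text-negative direction.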

Citation History

0 citations recorded at each of 4 data points (Jan 25 to Jan 28, 2026).