Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

0 Citations · Ranked #1301 of 2387 papers in ECCV 2024 · 3 Authors

Abstract

Vision-language (VL) models have proven highly effective across a variety of object detection tasks by leveraging weakly supervised image-text pairs from the web. However, these models exhibit a limited understanding of complex compositions of visual objects (e.g., attributes, shapes, and their relations), resulting in a significant performance drop on complex and diverse language queries. While conventional methods try to enhance VL models through hard-negative synthetic augmentation in the text domain, their effectiveness remains restricted without densely paired image-text augmentation. In this paper, we introduce a structured synthetic data generation approach to improve the compositional understanding of VL models for language-based object detection, which generates densely paired positive and negative triplets (object, text description, bounding box) in both the image and text domains. In addition, to train VL models effectively, we propose a new compositional contrastive learning formulation that discovers semantics and structures in complex descriptions from synthetic triplets. As a result, VL models trained with our synthetic data generation exhibit a significant performance boost over existing baselines: up to +5 AP on the OmniLabel benchmark and +6.9 AP on the D3 benchmark.
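
The abstract does not spell out the loss, but the core idea of contrasting each object region against its matching description and hard-negative descriptions can be sketched as an InfoNCE-style objective. The snippet below is a minimal illustration, not the authors' released code; the function name, tensor shapes, and temperature value are all assumptions for the sake of the example.

```python
# Minimal sketch (hypothetical, not the paper's implementation) of a
# contrastive loss over synthetic (object, description, box) triplets.
import torch
import torch.nn.functional as F

def compositional_contrastive_loss(region_emb, pos_text_emb, neg_text_emb,
                                   temperature=0.07):
    """InfoNCE-style loss: each region embedding (pooled from its bounding
    box) is pulled toward its positive description and pushed away from
    K hard-negative descriptions.

    region_emb:   (B, D)    embeddings of object regions
    pos_text_emb: (B, D)    embeddings of matching descriptions
    neg_text_emb: (B, K, D) embeddings of K hard negatives per region
    """
    region_emb = F.normalize(region_emb, dim=-1)
    pos_text_emb = F.normalize(pos_text_emb, dim=-1)
    neg_text_emb = F.normalize(neg_text_emb, dim=-1)

    # Cosine similarity to the matching description: (B, 1)
    pos_sim = (region_emb * pos_text_emb).sum(dim=-1, keepdim=True)
    # Cosine similarity to each hard negative: (B, K)
    neg_sim = torch.einsum("bd,bkd->bk", region_emb, neg_text_emb)

    # The positive sits at index 0 of each row of logits.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long,
                         device=logits.device)
    return F.cross_entropy(logits, labels)
```

Because the paper generates negatives in both the image and text domains, a full formulation would presumably apply a symmetric term with negative regions per description as well; the sketch above shows only the text-negative direction.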

Citation History

0 citations recorded at each of 4 data points (Jan 25 to Jan 28, 2026).