OSKAR: Omnimodal Self-supervised Knowledge Abstraction and Representation

NeurIPS 2025

Abstract

We present OSKAR, the first multimodal foundation model based on bootstrapped latent feature prediction. Unlike generative or contrastive methods, it avoids memorizing unnecessary low-level details (e.g., pixels) and does not require negative pairs, large memory banks, or hand-crafted augmentations. We propose a novel pretraining strategy: given masked tokens from multiple modalities, predict a subset of the missing tokens per modality, supervised by momentum-updated uni-modal target encoders. This design uses model capacity efficiently for learning high-level representations while retaining modality-specific information. Further, we propose a scalable design that decouples compute cost from the number of modalities via a fixed representative token budget, applied to both input and target tokens, and introduces a parameter-efficient cross-attention predictor that grounds each prediction in the full multimodal context. We instantiate OSKAR on video, skeleton, and text modalities. Extensive experiments show that OSKAR's unified pretrained encoder outperforms specialized architectures of similar size on action recognition (RGB, skeleton, frozen, low-shot) and localization, video-text retrieval, and video question answering. Project website: https://multimodal-oskar.github.io
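
The sketch below is a rough illustration of the bootstrapped latent-prediction objective described in the abstract: a shared encoder processes visible multimodal tokens, a cross-attention predictor regresses latents for a subset of masked tokens per modality, and supervision comes from momentum-updated (EMA) uni-modal target encoders. All module names, interfaces, shapes, and the L2 loss are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an OSKAR-style masked latent-prediction step.
# Assumptions (not from the paper): module interfaces, tensor shapes, MSE loss.
import copy
import torch
import torch.nn.functional as F


class LatentPredictionPretrainer(torch.nn.Module):
    def __init__(self, multimodal_encoder, unimodal_encoders, predictor, momentum=0.996):
        super().__init__()
        self.encoder = multimodal_encoder      # shared online encoder over all modalities
        self.predictor = predictor             # cross-attention predictor (hypothetical interface)
        self.momentum = momentum
        # Momentum-updated uni-modal target encoders; frozen w.r.t. gradients.
        self.targets = torch.nn.ModuleDict(
            {m: copy.deepcopy(enc) for m, enc in unimodal_encoders.items()}
        )
        for target in self.targets.values():
            for p in target.parameters():
                p.requires_grad_(False)

    @torch.no_grad()
    def update_targets(self, online_unimodal_encoders):
        # EMA update of each uni-modal target encoder from its online counterpart.
        for m, online in online_unimodal_encoders.items():
            for pt, po in zip(self.targets[m].parameters(), online.parameters()):
                pt.mul_(self.momentum).add_(po, alpha=1.0 - self.momentum)

    def forward(self, visible_tokens, masked_token_ids, raw_modalities):
        # Encode the visible (unmasked) multimodal context once.
        context = self.encoder(visible_tokens)  # [B, N_ctx, D]
        loss = 0.0
        for m, ids in masked_token_ids.items():  # ids: [B, N_m] indices of masked tokens
            # Predict latents for a subset of missing tokens, grounded in the full context.
            pred = self.predictor(context, query_ids=ids, modality=m)  # [B, N_m, D]
            with torch.no_grad():
                # Target latents from the frozen uni-modal encoder of modality m.
                target = self.targets[m](raw_modalities[m])  # [B, N_total, D]
                target = target.gather(
                    1, ids.unsqueeze(-1).expand(-1, -1, target.size(-1))
                )
            loss = loss + F.mse_loss(pred, target)
        return loss / max(len(masked_token_ids), 1)
```

In this reading, predicting only a subset of masked tokens per modality under a fixed token budget is what keeps compute roughly constant as modalities are added, since both the context and the regression targets are capped in length.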
