DINO-Foresight: Looking into the Future with DINO

15 citations

arXiv:2412.11673

Citations

#216 of 5858 papers in NeurIPS 2025

Authors

Efstathios Karypidis, Ioannis Kakogeorgiou, Spyridon Gidaris, Nikos Komodakis

Topics

vision foundation models, semantic feature space, future dynamics prediction, masked feature transformer, self-supervised learning, scene understanding tasks

Abstract

Predicting future dynamics is crucial for applications like autonomous driving and robotics, where understanding the environment is key. Existing pixel-level methods are computationally expensive and often focus on irrelevant details. To address these challenges, we introduce DINO-Foresight, a novel framework that operates in the semantic feature space of pretrained Vision Foundation Models (VFMs). Our approach trains a masked feature transformer in a self-supervised manner to predict the evolution of VFM features over time. By forecasting these features, we can apply off-the-shelf, task-specific heads for various scene understanding tasks. In this framework, VFM features are treated as a latent space, to which different heads attach to perform specific tasks for future-frame analysis. Extensive experiments demonstrate the strong performance, robustness, and scalability of our framework. Project page and code at https://dino-foresight.github.io/.
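The data flow described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the frozen VFM encoder is replaced by random features, the masked feature transformer by a single untrained self-attention layer without positional encodings, and the task head by a random linear classifier. All names, shapes, the zero mask token, and the mean-squared-error loss are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 4 context frames + 1 future frame,
# 16 feature tokens per frame, 32-dimensional features, 19 classes.
T, N, D, C = 5, 16, 32, 19


def frozen_vfm_features(num_frames, num_tokens, dim):
    """Hypothetical stand-in for a frozen VFM encoder: random (T, N, D) features."""
    return rng.standard_normal((num_frames, num_tokens, dim))


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a (L, D) token sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


# 1. Extract features for every frame with the (stand-in) frozen encoder.
features = frozen_vfm_features(T, N, D)

# 2. Hide the future frame: replace all of its tokens with a mask token.
mask_token = np.zeros(D)  # would be a learned vector in a trained model
masked = features.copy()
masked[-1] = mask_token

# 3. Predict features for all positions from the masked sequence.
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))
tokens = masked.reshape(T * N, D)
predicted = self_attention(tokens, w_q, w_k, w_v).reshape(T, N, D)

# 4. Self-supervised objective: compare predictions with the true future
#    features, on the masked positions only (MSE is an assumed choice).
loss = np.mean((predicted[-1] - features[-1]) ** 2)

# 5. Attach a task head to the predicted future features
#    (here a hypothetical linear per-token classifier).
w_head = rng.standard_normal((D, C))
future_labels = (predicted[-1] @ w_head).argmax(axis=-1)
```

The point of the sketch is the separation of concerns: the predictor only ever sees and outputs features, so the head in step 5 can be swapped for any other head that consumes the same feature space without touching steps 1 to 4.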
