Ivan Laptev

Papers: 51

Total Citations: 935

Papers (51)

Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks

RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

Towards Reliable Identification of Diffusion-based Image Manipulations

ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models

NeurIPS 2025

PairDETR: Joint Detection and Association of Human Bodies and Faces

SUGAR: Pre-training 3D Visual Representations for Robotics

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

On Pairwise Costs for Network Flow Multi-Object Tracking

Instance-Level Video Segmentation From Object Tracks

Thin-Slicing for Pose: Learning to Understand Pose Without Explicit Pose Estimation

Learning From Synthetic Humans

Deep Metric Learning Beyond Binary Supervision

Cross-Task Weakly Supervised Learning From Instructional Videos

Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video

Learning Joint Reconstruction of Hands and Manipulated Objects

Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction

Action Modifiers: Learning From Adverbs in Instructional Videos

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

Learning Interactions and Relationships Between Movie Characters

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers

Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos

Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation

TubeDETR: Spatio-Temporal Video Grounding With Transformers

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

Context-Aware CNNs for Person Head Detection

Unsupervised Object Discovery and Tracking in Video Collections

P-CNN: Pose-Based CNN Features for Action Recognition

Weakly-Supervised Alignment of Video With Text

Joint Discovery of Object States and Manipulation Actions

Weakly-Supervised Learning of Visual Relations

Learning From Video and Text via Large-Scale Discriminative Clustering

Detecting Unseen Visual Relations Using Analogies

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Just Ask: Learning To Answer Questions From Millions of Narrated Videos

Segmenter: Transformer for Semantic Segmentation

Airbert: In-Domain Pretraining for Vision-and-Language Navigation

Learning Actionness via Long-range Temporal Order Verification

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

Unsupervised Learning From Narrated Instruction Videos

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

A flexible model for training action localization with varying levels of supervision

History Aware Multimodal Transformer for Vision-and-Language Navigation

XCiT: Cross-Covariance Image Transformers

Differentiable rendering with perturbed optimizers

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

VidChapters-7M: Video Chapters at Scale