Ivan Laptev

51 Papers
935 Total Citations

Papers (51)

Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks

CVPR 2015
922
citations

RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

CVPR 2025
12
citations

Towards Reliable Identification of Diffusion-based Image Manipulations

NeurIPS 2025
1
citations

ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

ICCV 2025
0
citations

DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models

NeurIPS 2025, arXiv
0
citations

PairDETR: Joint Detection and Association of Human Bodies and Faces

CVPR 2024
0
citations

SUGAR: Pre-training 3D Visual Representations for Robotics

CVPR 2024
0
citations

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

CVPR 2024
0
citations

On Pairwise Costs for Network Flow Multi-Object Tracking

CVPR 2015
0
citations

Instance-Level Video Segmentation From Object Tracks

CVPR 2016
0
citations

Thin-Slicing for Pose: Learning to Understand Pose Without Explicit Pose Estimation

CVPR 2016
0
citations

Learning From Synthetic Humans

CVPR 2017, arXiv
0
citations

Deep Metric Learning Beyond Binary Supervision

CVPR 2019
0
citations

Cross-Task Weakly Supervised Learning From Instructional Videos

CVPR 2019
0
citations

Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video

CVPR 2019
0
citations

Learning Joint Reconstruction of Hands and Manipulated Objects

CVPR 2019
0
citations

Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction

CVPR 2020, arXiv
0
citations

Action Modifiers: Learning From Adverbs in Instructional Videos

CVPR 2020
0
citations

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

CVPR 2020, arXiv
0
citations

Learning Interactions and Relationships Between Movie Characters

CVPR 2020, arXiv
0
citations

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers

CVPR 2021, arXiv
0
citations

Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos

CVPR 2022
0
citations

Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation

CVPR 2022, arXiv
0
citations

TubeDETR: Spatio-Temporal Video Grounding With Transformers

CVPR 2022, arXiv
0
citations

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

CVPR 2023, arXiv
0
citations

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

CVPR 2023, arXiv
0
citations

Context-Aware CNNs for Person Head Detection

ICCV 2015
0
citations

Unsupervised Object Discovery and Tracking in Video Collections

ICCV 2015
0
citations

P-CNN: Pose-Based CNN Features for Action Recognition

ICCV 2015
0
citations

Weakly-Supervised Alignment of Video With Text

ICCV 2015
0
citations

Joint Discovery of Object States and Manipulation Actions

ICCV 2017, arXiv
0
citations

Weakly-Supervised Learning of Visual Relations

ICCV 2017, arXiv
0
citations

Learning From Video and Text via Large-Scale Discriminative Clustering

ICCV 2017, arXiv
0
citations

Detecting Unseen Visual Relations Using Analogies

ICCV 2019
0
citations

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

ICCV 2019
0
citations

Just Ask: Learning To Answer Questions From Millions of Narrated Videos

ICCV 2021, arXiv
0
citations

Segmenter: Transformer for Semantic Segmentation

ICCV 2021, arXiv
0
citations

Airbert: In-Domain Pretraining for Vision-and-Language Navigation

ICCV 2021, arXiv
0
citations

Learning Actionness via Long-range Temporal Order Verification

ECCV 2020
0
citations

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

ECCV 2022
0
citations

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

ECCV 2022
0
citations

Unsupervised Learning From Narrated Instruction Videos

CVPR 2016
0
citations

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

CVPR 2025
0
citations

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

CVPR 2025
0
citations

A flexible model for training action localization with varying levels of supervision

NeurIPS 2018
0
citations

History Aware Multimodal Transformer for Vision-and-Language Navigation

NeurIPS 2021
0
citations

XCiT: Cross-Covariance Image Transformers

NeurIPS 2021
0
citations

Differentiable rendering with perturbed optimizers

NeurIPS 2021
0
citations

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

NeurIPS 2022
0
citations

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

NeurIPS 2022
0
citations

VidChapters-7M: Video Chapters at Scale

NeurIPS 2023
0
citations