Ivan Laptev
51
Papers
935
Total Citations
Papers (51)
Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks
CVPR 2015
922
citations
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation
CVPR 2025
12
citations
Towards Reliable Identification of Diffusion-based Image Manipulations
NeurIPS 2025
1
citations
ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
ICCV 2025
0
citations
DEFT: Decompositional Efficient Fine-Tuning for Text-to-Image Models
NeurIPS 2025 (arXiv)
0
citations
PairDETR: Joint Detection and Association of Human Bodies and Faces
CVPR 2024
0
citations
SUGAR: Pre-training 3D Visual Representations for Robotics
CVPR 2024
0
citations
GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos
CVPR 2024
0
citations
On Pairwise Costs for Network Flow Multi-Object Tracking
CVPR 2015
0
citations
Instance-Level Video Segmentation From Object Tracks
CVPR 2016
0
citations
Thin-Slicing for Pose: Learning to Understand Pose Without Explicit Pose Estimation
CVPR 2016
0
citations
Learning From Synthetic Humans
CVPR 2017 (arXiv)
0
citations
Deep Metric Learning Beyond Binary Supervision
CVPR 2019
0
citations
Cross-Task Weakly Supervised Learning From Instructional Videos
CVPR 2019
0
citations
Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video
CVPR 2019
0
citations
Learning Joint Reconstruction of Hands and Manipulated Objects
CVPR 2019
0
citations
Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction
CVPR 2020 (arXiv)
0
citations
Action Modifiers: Learning From Adverbs in Instructional Videos
CVPR 2020
0
citations
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
CVPR 2020 (arXiv)
0
citations
Learning Interactions and Relationships Between Movie Characters
CVPR 2020 (arXiv)
0
citations
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers
CVPR 2021 (arXiv)
0
citations
Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos
CVPR 2022
0
citations
Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation
CVPR 2022 (arXiv)
0
citations
TubeDETR: Spatio-Temporal Video Grounding With Transformers
CVPR 2022 (arXiv)
0
citations
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
CVPR 2023 (arXiv)
0
citations
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction
CVPR 2023 (arXiv)
0
citations
Context-Aware CNNs for Person Head Detection
ICCV 2015
0
citations
Unsupervised Object Discovery and Tracking in Video Collections
ICCV 2015
0
citations
P-CNN: Pose-Based CNN Features for Action Recognition
ICCV 2015
0
citations
Weakly-Supervised Alignment of Video With Text
ICCV 2015
0
citations
Joint Discovery of Object States and Manipulation Actions
ICCV 2017 (arXiv)
0
citations
Weakly-Supervised Learning of Visual Relations
ICCV 2017 (arXiv)
0
citations
Learning From Video and Text via Large-Scale Discriminative Clustering
ICCV 2017 (arXiv)
0
citations
Detecting Unseen Visual Relations Using Analogies
ICCV 2019
0
citations
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
ICCV 2019
0
citations
Just Ask: Learning To Answer Questions From Millions of Narrated Videos
ICCV 2021 (arXiv)
0
citations
Segmenter: Transformer for Semantic Segmentation
ICCV 2021 (arXiv)
0
citations
Airbert: In-Domain Pretraining for Vision-and-Language Navigation
ICCV 2021 (arXiv)
0
citations
Learning Actionness via Long-range Temporal Order Verification
ECCV 2020
0
citations
AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction
ECCV 2022
0
citations
Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
ECCV 2022
0
citations
Unsupervised Learning From Narrated Instruction Videos
CVPR 2016
0
citations
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
CVPR 2025
0
citations
All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages
CVPR 2025
0
citations
A Flexible Model for Training Action Localization With Varying Levels of Supervision
NeurIPS 2018
0
citations
History Aware Multimodal Transformer for Vision-and-Language Navigation
NeurIPS 2021
0
citations
XCiT: Cross-Covariance Image Transformers
NeurIPS 2021
0
citations
Differentiable Rendering With Perturbed Optimizers
NeurIPS 2021
0
citations
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
NeurIPS 2022
0
citations
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
NeurIPS 2022
0
citations
VidChapters-7M: Video Chapters at Scale
NeurIPS 2023
0
citations