Josef Sivic

Papers: 48
Total Citations: 1,144

Papers (48)

Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks

CVPR 2015
922
citations

Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions

ECCV 2020
192
citations

Learning to design protein-protein interactions with enhanced generalization

ICLR 2024
25
citations

Learning to engineer protein flexibility

ICLR 2025 (arXiv)
4
citations

Improving Personalized Search with Regularized Low-Rank Parameter Updates

CVPR 2025
1
citations

24/7 Place Recognition by View Synthesis

CVPR 2015
0
citations

On Pairwise Costs for Network Flow Multi-Object Tracking

CVPR 2015
0
citations

Unsupervised Learning From Narrated Instruction Videos

CVPR 2016
0
citations

NetVLAD: CNN Architecture for Weakly Supervised Place Recognition

CVPR 2016
0
citations

ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification

CVPR 2017 (arXiv)
0
citations

Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization?

CVPR 2017
0
citations

Convolutional Neural Network Architecture for Geometric Matching

CVPR 2017 (arXiv)
0
citations

End-to-End Weakly-Supervised Semantic Alignment

CVPR 2018 (arXiv)
0
citations

InLoc: Indoor Visual Localization With Dense Matching and View Synthesis

CVPR 2018 (arXiv)
0
citations

Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions

CVPR 2018 (arXiv)
0
citations

Cross-Task Weakly Supervised Learning From Instructional Videos

CVPR 2019
0
citations

D2-Net: A Trainable CNN for Joint Description and Detection of Local Features

CVPR 2019
0
citations

Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video

CVPR 2019
0
citations

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

CVPR 2020 (arXiv)
0
citations

Single-View Robot Pose and Joint Angle Estimation via Render & Compare

CVPR 2021 (arXiv)
0
citations

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers

CVPR 2021 (arXiv)
0
citations

Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos

CVPR 2022
0
citations

Focal Length and Object Pose Estimation via Render and Compare

CVPR 2022
0
citations

TubeDETR: Spatio-Temporal Video Grounding With Transformers

CVPR 2022 (arXiv)
0
citations

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

CVPR 2023 (arXiv)
0
citations

Meta-Personalizing Vision-Language Models To Find Named Instances in Video

CVPR 2023
0
citations

Language-Guided Music Recommendation for Video via Prompt Analogies

CVPR 2023
0
citations

Joint Discovery of Object States and Manipulation Actions

ICCV 2017 (arXiv)
0
citations

Weakly-Supervised Learning of Visual Relations

ICCV 2017 (arXiv)
0
citations

Learning From Video and Text via Large-Scale Discriminative Clustering

ICCV 2017 (arXiv)
0
citations

Localizing Moments in Video With Natural Language

ICCV 2017 (arXiv)
0
citations

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

CVPR 2025
0
citations

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

ICCV 2019
0
citations

Is This the Right Place? Geometric-Semantic Pose Verification for Indoor Visual Localization

ICCV 2019
0
citations

Just Ask: Learning To Answer Questions From Millions of Narrated Videos

ICCV 2021 (arXiv)
0
citations

Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

ICCV 2021 (arXiv)
0
citations

CosyPose: Consistent multi-view multi-object 6D pose estimation

ECCV 2020
0
citations

Learning Actionness via Long-range Temporal Order Verification

ECCV 2020
0
citations

Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation

ECCV 2022
0
citations

Detecting Unseen Visual Relations Using Analogies

ICCV 2019
0
citations

Discovering Divergent Representations between Text-to-Image Models

ICCV 2025
0
citations

Large-scale Pre-training for Grounded Video Caption Generation

ICCV 2025
0
citations

ResidualViT for Efficient Temporally Dense Video Encoding

ICCV 2025
0
citations

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

CVPR 2024
0
citations

Neighbourhood Consensus Networks

NeurIPS 2018
0
citations

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

NeurIPS 2022
0
citations

VidChapters-7M: Video Chapters at Scale

NeurIPS 2023
0
citations

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

NeurIPS 2023
0
citations