Josef Sivic

Papers: 48

Total Citations: 1,144

Papers (48)

Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks

Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions

Learning to design protein-protein interactions with enhanced generalization

Learning to engineer protein flexibility

Improving Personalized Search with Regularized Low-Rank Parameter Updates

24/7 Place Recognition by View Synthesis

On Pairwise Costs for Network Flow Multi-Object Tracking

Unsupervised Learning From Narrated Instruction Videos

NetVLAD: CNN Architecture for Weakly Supervised Place Recognition

ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification

Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization?

Convolutional Neural Network Architecture for Geometric Matching

End-to-End Weakly-Supervised Semantic Alignment

InLoc: Indoor Visual Localization With Dense Matching and View Synthesis

Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions

Cross-Task Weakly Supervised Learning From Instructional Videos

D2-Net: A Trainable CNN for Joint Description and Detection of Local Features

Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

Single-View Robot Pose and Joint Angle Estimation via Render & Compare

Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers

Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos

Focal Length and Object Pose Estimation via Render and Compare

TubeDETR: Spatio-Temporal Video Grounding With Transformers

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Meta-Personalizing Vision-Language Models To Find Named Instances in Video

Language-Guided Music Recommendation for Video via Prompt Analogies

Joint Discovery of Object States and Manipulation Actions

Weakly-Supervised Learning of Visual Relations

Learning From Video and Text via Large-Scale Discriminative Clustering

Localizing Moments in Video With Natural Language

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Is This the Right Place? Geometric-Semantic Pose Verification for Indoor Visual Localization

Just Ask: Learning To Answer Questions From Millions of Narrated Videos

Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions

CosyPose: Consistent multi-view multi-object 6D pose estimation

Learning Actionness via Long-range Temporal Order Verification

Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation

Detecting Unseen Visual Relations Using Analogies

Discovering Divergent Representations between Text-to-Image Models

Large-scale Pre-training for Grounded Video Caption Generation

ResidualViT for Efficient Temporally Dense Video Encoding

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos

Neighbourhood Consensus Networks

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

VidChapters-7M: Video Chapters at Scale

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images