Cordelia Schmid
112
Papers
685
Total Citations
Papers (112)
Unsupervised Object Discovery and Localization in the Wild: Part-Based Matching With Bottom-Up Region Proposals
CVPR 2015
289
citations
MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild
NeurIPS 2016 (arXiv)
281
citations
Learning Correlation Structures for Vision Transformers
CVPR 2024
25
citations
Graph convolutional networks for learning with few clean and many noisy labels
ECCV 2020
23
citations
DataDream: Few-shot Guided Dataset Generation
ECCV 2024
23
citations
Language-Guided Image Tokenization for Generation
CVPR 2025 (arXiv)
23
citations
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
CVPR 2025
7
citations
Consistency Guided Scene Flow Estimation
ECCV 2020
7
citations
A Generative Approach for Wikipedia-Scale Visual Entity Recognition
CVPR 2024
7
citations
End-to-End Spatio-Temporal Action Localisation with Video Transformers
CVPR 2024
0
citations
Dense Optical Tracking: Connecting the Dots
CVPR 2024
0
citations
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
CVPR 2024
0
citations
SUGAR: Pre-training 3D Visual Representations for Robotics
CVPR 2024
0
citations
Pixel-Aligned Language Model
CVPR 2024
0
citations
Time-, Memory- and Parameter-Efficient Visual Adaptation
CVPR 2024
0
citations
SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code
ICML 2024
0
citations
EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow
CVPR 2015
0
citations
Learning to Detect Motion Boundaries
CVPR 2015
0
citations
Proposal Flow
CVPR 2016
0
citations
Learning From Synthetic Humans
CVPR 2017 (arXiv)
0
citations
Learning Motion Patterns in Videos
CVPR 2017 (arXiv)
0
citations
LCR-Net: Localization-Classification-Regression for Human Pose
CVPR 2017
0
citations
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
CVPR 2025
0
citations
PoTion: Pose MoTion Representation for Action Recognition
CVPR 2018
0
citations
Actor and Observer: Joint Modeling of First and Third-Person Videos
CVPR 2018 (arXiv)
0
citations
Relational Action Forecasting
CVPR 2019
0
citations
MARS: Motion-Augmented RGB Stream for Action Recognition
CVPR 2019
0
citations
A Structured Model for Action Detection
CVPR 2019
0
citations
Learning Joint Reconstruction of Hands and Manipulated Objects
CVPR 2019
0
citations
Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction
CVPR 2020 (arXiv)
0
citations
Speech2Action: Cross-Modal Supervision for Action Recognition
CVPR 2020 (arXiv)
0
citations
VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation
CVPR 2020 (arXiv)
0
citations
HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps
CVPR 2021
0
citations
Look Before You Speak: Visually Contextualized Utterances
CVPR 2021 (arXiv)
0
citations
End-to-End Generative Pretraining for Multimodal Video Captioning
CVPR 2022 (arXiv)
0
citations
Multiview Transformers for Video Recognition
CVPR 2022 (arXiv)
0
citations
Learning With Neighbor Consistency for Noisy Labels
CVPR 2022 (arXiv)
0
citations
Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation
CVPR 2022 (arXiv)
0
citations
TubeDETR: Spatio-Temporal Video Grounding With Transformers
CVPR 2022 (arXiv)
0
citations
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
CVPR 2023 (arXiv)
0
citations
Improving Image Recognition by Retrieving From Web-Scale Image-Text Data
CVPR 2023 (arXiv)
0
citations
Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification
CVPR 2023 (arXiv)
0
citations
REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
CVPR 2023 (arXiv)
0
citations
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR
CVPR 2023 (arXiv)
0
citations
How Can Objects Help Action Recognition?
CVPR 2023
0
citations
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction
CVPR 2023 (arXiv)
0
citations
Local Convolutional Features With Unsupervised Training for Image Retrieval
ICCV 2015
0
citations
Online Object Tracking With Proposal Selection
ICCV 2015
0
citations
Learning to Track for Spatio-Temporal Action Localization
ICCV 2015
0
citations
Unsupervised Object Discovery and Tracking in Video Collections
ICCV 2015
0
citations
P-CNN: Pose-Based CNN Features for Action Recognition
ICCV 2015
0
citations
Weakly-Supervised Alignment of Video With Text
ICCV 2015
0
citations
Areas of Attention for Image Captioning
ICCV 2017 (arXiv)
0
citations
SCNet: Learning Semantic Correspondence
ICCV 2017 (arXiv)
0
citations
Incremental Learning of Object Detectors Without Catastrophic Forgetting
ICCV 2017 (arXiv)
0
citations
BlitzNet: A Real-Time Deep Network for Scene Understanding
ICCV 2017 (arXiv)
0
citations
Joint Learning of Object and Action Detectors
ICCV 2017
0
citations
Action Tubelet Detector for Spatio-Temporal Action Localization
ICCV 2017 (arXiv)
0
citations
Learning Video Object Segmentation With Visual Memory
ICCV 2017 (arXiv)
0
citations
Weakly-Supervised Learning of Visual Relations
ICCV 2017 (arXiv)
0
citations
Detecting Unseen Visual Relations Using Analogies
ICCV 2019
0
citations
Moulding Humans: Non-Parametric 3D Human Shape Estimation From Single Images
ICCV 2019
0
citations
Diversity With Cooperation: Ensemble Methods for Few-Shot Classification
ICCV 2019
0
citations
Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera
ICCV 2019
0
citations
VideoBERT: A Joint Model for Video and Language Representation Learning
ICCV 2019
0
citations
Composable Augmentation Encoding for Video Representation Learning
ICCV 2021 (arXiv)
0
citations
Just Ask: Learning To Answer Questions From Millions of Narrated Videos
ICCV 2021 (arXiv)
0
citations
Episodic Transformer for Vision-and-Language Navigation
ICCV 2021 (arXiv)
0
citations
ViViT: A Video Vision Transformer
ICCV 2021 (arXiv)
0
citations
Segmenter: Transformer for Semantic Segmentation
ICCV 2021 (arXiv)
0
citations
Learning Temporal Dynamics From Cycles in Narrated Video
ICCV 2021 (arXiv)
0
citations
Improving Robustness Against Common Corruptions With Frequency Biased Models
ICCV 2021 (arXiv)
0
citations
Unified Graph Structured Models for Video Understanding
ICCV 2021 (arXiv)
0
citations
Airbert: In-Domain Pretraining for Vision-and-Language Navigation
ICCV 2021 (arXiv)
0
citations
Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts
ICCV 2023 (arXiv)
0
citations
Verbs in Action: Improving Verb Understanding in Video-Language Models
ICCV 2023 (arXiv)
0
citations
WALDO: Future Video Synthesis Using Object Layer Decomposition and Parametric Flow Prediction
ICCV 2023 (arXiv)
0
citations
UnLoc: A Unified Framework for Video Localization Tasks
ICCV 2023 (arXiv)
0
citations
Audiovisual Masked Autoencoders
ICCV 2023 (arXiv)
0
citations
Multi-modal Transformer for Video Retrieval
ECCV 2020
0
citations
TAO: A Large-Scale Benchmark for Tracking Any Object
ECCV 2020
0
citations
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
ECCV 2020
0
citations
Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification
ECCV 2020
0
citations
Memory-Efficient Incremental Learning Through Feature Adaptation
ECCV 2020
0
citations
AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction
ECCV 2022
0
citations
Learning Audio-Video Modalities from Image Captions
ECCV 2022
0
citations
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
ECCV 2022
0
citations
Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
ECCV 2022
0
citations
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
CVPR 2018 (arXiv)
0
citations
Flexible Frame Selection for Efficient Video Reasoning
CVPR 2025
0
citations
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
CVPR 2025
0
citations
Visual Lexicon: Rich Image Features in Language Space
CVPR 2025
0
citations
MINERVA: Evaluating Complex Video Reasoning
ICCV 2025
0
citations
Large-scale Pre-training for Grounded Video Caption Generation
ICCV 2025
0
citations
HORT: Monocular Hand-held Objects Reconstruction with Transformers
ICCV 2025
0
citations
CoVR: Learning Composed Video Retrieval from Web Video Captions
AAAI 2024
0
citations
Streaming Dense Video Captioning
CVPR 2024
0
citations
Unsupervised Learning of Artistic Styles with Archetypal Style Analysis
NeurIPS 2018
0
citations
A flexible model for training action localization with varying levels of supervision
NeurIPS 2018
0
citations
Adaptive Density Estimation for Generative Models
NeurIPS 2019
0
citations
What Makes for Good Views for Contrastive Learning?
NeurIPS 2020
0
citations
History Aware Multimodal Transformer for Vision-and-Language Navigation
NeurIPS 2021
0
citations
CCVS: Context-aware Controllable Video Synthesis
NeurIPS 2021
0
citations
Attention Bottlenecks for Multimodal Fusion
NeurIPS 2021
0
citations
Large-Scale Unsupervised Object Discovery
NeurIPS 2021
0
citations
Differentiable rendering with perturbed optimizers
NeurIPS 2021
0
citations
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
NeurIPS 2022
0
citations
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
NeurIPS 2022
0
citations
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
NeurIPS 2023
0
citations
Does Visual Pretraining Help End-to-End Reasoning?
NeurIPS 2023
0
citations
VidChapters-7M: Video Chapters at Scale
NeurIPS 2023
0
citations
White-box vs Black-box: Bayes Optimal Strategies for Membership Inference
ICML 2019
0
citations