Cordelia Schmid

112
Papers
685
Total Citations

Papers (112)

Unsupervised Object Discovery and Localization in the Wild: Part-Based Matching With Bottom-Up Region Proposals

CVPR 2015
289
citations

MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

NeurIPS 2016 (arXiv)
281
citations

Learning Correlation Structures for Vision Transformers

CVPR 2024
25
citations

Graph convolutional networks for learning with few clean and many noisy labels

ECCV 2020
23
citations

DataDream: Few-shot Guided Dataset Generation

ECCV 2024
23
citations

Language-Guided Image Tokenization for Generation

CVPR 2025 (arXiv)
23
citations

FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

CVPR 2025
7
citations

Consistency Guided Scene Flow Estimation

ECCV 2020
7
citations

A Generative Approach for Wikipedia-Scale Visual Entity Recognition

CVPR 2024
7
citations

End-to-End Spatio-Temporal Action Localisation with Video Transformers

CVPR 2024
0
citations

Dense Optical Tracking: Connecting the Dots

CVPR 2024
0
citations

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

CVPR 2024
0
citations

SUGAR: Pre-training 3D Visual Representations for Robotics

CVPR 2024
0
citations

Pixel-Aligned Language Model

CVPR 2024
0
citations

Time-, Memory- and Parameter-Efficient Visual Adaptation

CVPR 2024
0
citations

SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code

ICML 2024
0
citations

EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow

CVPR 2015
0
citations

Learning to Detect Motion Boundaries

CVPR 2015
0
citations

Proposal Flow

CVPR 2016
0
citations

Learning From Synthetic Humans

CVPR 2017 (arXiv)
0
citations

Learning Motion Patterns in Videos

CVPR 2017 (arXiv)
0
citations

LCR-Net: Localization-Classification-Regression for Human Pose

CVPR 2017
0
citations

InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

CVPR 2025
0
citations

PoTion: Pose MoTion Representation for Action Recognition

CVPR 2018
0
citations

Actor and Observer: Joint Modeling of First and Third-Person Videos

CVPR 2018 (arXiv)
0
citations

Relational Action Forecasting

CVPR 2019
0
citations

MARS: Motion-Augmented RGB Stream for Action Recognition

CVPR 2019
0
citations

A Structured Model for Action Detection

CVPR 2019
0
citations

Learning Joint Reconstruction of Hands and Manipulated Objects

CVPR 2019
0
citations

Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction

CVPR 2020 (arXiv)
0
citations

Speech2Action: Cross-Modal Supervision for Action Recognition

CVPR 2020 (arXiv)
0
citations

VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation

CVPR 2020 (arXiv)
0
citations

HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps

CVPR 2021
0
citations

Look Before You Speak: Visually Contextualized Utterances

CVPR 2021 (arXiv)
0
citations

End-to-End Generative Pretraining for Multimodal Video Captioning

CVPR 2022 (arXiv)
0
citations

Multiview Transformers for Video Recognition

CVPR 2022 (arXiv)
0
citations

Learning With Neighbor Consistency for Noisy Labels

CVPR 2022 (arXiv)
0
citations

Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation

CVPR 2022 (arXiv)
0
citations

TubeDETR: Spatio-Temporal Video Grounding With Transformers

CVPR 2022 (arXiv)
0
citations

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

CVPR 2023 (arXiv)
0
citations

Improving Image Recognition by Retrieving From Web-Scale Image-Text Data

CVPR 2023 (arXiv)
0
citations

Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification

CVPR 2023 (arXiv)
0
citations

REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory

CVPR 2023 (arXiv)
0
citations

AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR

CVPR 2023 (arXiv)
0
citations

How Can Objects Help Action Recognition?

CVPR 2023
0
citations

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

CVPR 2023 (arXiv)
0
citations

Local Convolutional Features With Unsupervised Training for Image Retrieval

ICCV 2015
0
citations

Online Object Tracking With Proposal Selection

ICCV 2015
0
citations

Learning to Track for Spatio-Temporal Action Localization

ICCV 2015
0
citations

Unsupervised Object Discovery and Tracking in Video Collections

ICCV 2015
0
citations

P-CNN: Pose-Based CNN Features for Action Recognition

ICCV 2015
0
citations

Weakly-Supervised Alignment of Video With Text

ICCV 2015
0
citations

Areas of Attention for Image Captioning

ICCV 2017 (arXiv)
0
citations

SCNet: Learning Semantic Correspondence

ICCV 2017 (arXiv)
0
citations

Incremental Learning of Object Detectors Without Catastrophic Forgetting

ICCV 2017 (arXiv)
0
citations

BlitzNet: A Real-Time Deep Network for Scene Understanding

ICCV 2017 (arXiv)
0
citations

Joint Learning of Object and Action Detectors

ICCV 2017
0
citations

Action Tubelet Detector for Spatio-Temporal Action Localization

ICCV 2017 (arXiv)
0
citations

Learning Video Object Segmentation With Visual Memory

ICCV 2017 (arXiv)
0
citations

Weakly-Supervised Learning of Visual Relations

ICCV 2017 (arXiv)
0
citations

Detecting Unseen Visual Relations Using Analogies

ICCV 2019
0
citations

Moulding Humans: Non-Parametric 3D Human Shape Estimation From Single Images

ICCV 2019
0
citations

Diversity With Cooperation: Ensemble Methods for Few-Shot Classification

ICCV 2019
0
citations

Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera

ICCV 2019
0
citations

VideoBERT: A Joint Model for Video and Language Representation Learning

ICCV 2019
0
citations

Composable Augmentation Encoding for Video Representation Learning

ICCV 2021 (arXiv)
0
citations

Just Ask: Learning To Answer Questions From Millions of Narrated Videos

ICCV 2021 (arXiv)
0
citations

Episodic Transformer for Vision-and-Language Navigation

ICCV 2021 (arXiv)
0
citations

ViViT: A Video Vision Transformer

ICCV 2021 (arXiv)
0
citations

Segmenter: Transformer for Semantic Segmentation

ICCV 2021 (arXiv)
0
citations

Learning Temporal Dynamics From Cycles in Narrated Video

ICCV 2021 (arXiv)
0
citations

Improving Robustness Against Common Corruptions With Frequency Biased Models

ICCV 2021 (arXiv)
0
citations

Unified Graph Structured Models for Video Understanding

ICCV 2021 (arXiv)
0
citations

Airbert: In-Domain Pretraining for Vision-and-Language Navigation

ICCV 2021 (arXiv)
0
citations

Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts

ICCV 2023 (arXiv)
0
citations

Verbs in Action: Improving Verb Understanding in Video-Language Models

ICCV 2023 (arXiv)
0
citations

WALDO: Future Video Synthesis Using Object Layer Decomposition and Parametric Flow Prediction

ICCV 2023 (arXiv)
0
citations

UnLoc: A Unified Framework for Video Localization Tasks

ICCV 2023 (arXiv)
0
citations

Audiovisual Masked Autoencoders

ICCV 2023 (arXiv)
0
citations

Multi-modal Transformer for Video Retrieval

ECCV 2020
0
citations

TAO: A Large-Scale Benchmark for Tracking Any Object

ECCV 2020
0
citations

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

ECCV 2020
0
citations

Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification

ECCV 2020
0
citations

Memory-Efficient Incremental Learning Through Feature Adaptation

ECCV 2020
0
citations

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

ECCV 2022
0
citations

Learning Audio-Video Modalities from Image Captions

ECCV 2022
0
citations

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

ECCV 2022
0
citations

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

ECCV 2022
0
citations

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

CVPR 2018 (arXiv)
0
citations

Flexible Frame Selection for Efficient Video Reasoning

CVPR 2025
0
citations

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

CVPR 2025
0
citations

Visual Lexicon: Rich Image Features in Language Space

CVPR 2025
0
citations

MINERVA: Evaluating Complex Video Reasoning

ICCV 2025
0
citations

Large-scale Pre-training for Grounded Video Caption Generation

ICCV 2025
0
citations

HORT: Monocular Hand-held Objects Reconstruction with Transformers

ICCV 2025
0
citations

CoVR: Learning Composed Video Retrieval from Web Video Captions

AAAI 2024
0
citations

Streaming Dense Video Captioning

CVPR 2024
0
citations

Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

NeurIPS 2018
0
citations

A flexible model for training action localization with varying levels of supervision

NeurIPS 2018
0
citations

Adaptive Density Estimation for Generative Models

NeurIPS 2019
0
citations

What Makes for Good Views for Contrastive Learning?

NeurIPS 2020
0
citations

History Aware Multimodal Transformer for Vision-and-Language Navigation

NeurIPS 2021
0
citations

CCVS: Context-aware Controllable Video Synthesis

NeurIPS 2021
0
citations

Attention Bottlenecks for Multimodal Fusion

NeurIPS 2021
0
citations

Large-Scale Unsupervised Object Discovery

NeurIPS 2021
0
citations

Differentiable rendering with perturbed optimizers

NeurIPS 2021
0
citations

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

NeurIPS 2022
0
citations

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

NeurIPS 2022
0
citations

AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

NeurIPS 2023
0
citations

Does Visual Pretraining Help End-to-End Reasoning?

NeurIPS 2023
0
citations

VidChapters-7M: Video Chapters at Scale

NeurIPS 2023
0
citations

White-box vs Black-box: Bayes Optimal Strategies for Membership Inference

ICML 2019
0
citations