Cordelia Schmid

112 Papers

685 Total Citations

Papers (112)

Unsupervised Object Discovery and Localization in the Wild: Part-Based Matching With Bottom-Up Region Proposals

MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

NeurIPS 2016

Learning Correlation Structures for Vision Transformers

Graph convolutional networks for learning with few clean and many noisy labels

DataDream: Few-shot Guided Dataset Generation

Language-Guided Image Tokenization for Generation

FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

Consistency Guided Scene Flow Estimation

A Generative Approach for Wikipedia-Scale Visual Entity Recognition

End-to-End Spatio-Temporal Action Localisation with Video Transformers

Dense Optical Tracking: Connecting the Dots

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

SUGAR: Pre-training 3D Visual Representations for Robotics

Pixel-Aligned Language Model

Time-, Memory- and Parameter-Efficient Visual Adaptation

SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code

EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow

Learning to Detect Motion Boundaries

Proposal Flow

Learning From Synthetic Humans

Learning Motion Patterns in Videos

LCR-Net: Localization-Classification-Regression for Human Pose

InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

PoTion: Pose MoTion Representation for Action Recognition

Actor and Observer: Joint Modeling of First and Third-Person Videos

Relational Action Forecasting

MARS: Motion-Augmented RGB Stream for Action Recognition

A Structured Model for Action Detection

Learning Joint Reconstruction of Hands and Manipulated Objects

Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction

Speech2Action: Cross-Modal Supervision for Action Recognition

VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation

HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps

Look Before You Speak: Visually Contextualized Utterances

End-to-End Generative Pretraining for Multimodal Video Captioning

Multiview Transformers for Video Recognition

Learning With Neighbor Consistency for Noisy Labels

Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation

TubeDETR: Spatio-Temporal Video Grounding With Transformers

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

Improving Image Recognition by Retrieving From Web-Scale Image-Text Data

Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification

REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory

AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR

How Can Objects Help Action Recognition?

gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction

Local Convolutional Features With Unsupervised Training for Image Retrieval

Online Object Tracking With Proposal Selection

Learning to Track for Spatio-Temporal Action Localization

Unsupervised Object Discovery and Tracking in Video Collections

P-CNN: Pose-Based CNN Features for Action Recognition

Weakly-Supervised Alignment of Video With Text

Areas of Attention for Image Captioning

SCNet: Learning Semantic Correspondence

Incremental Learning of Object Detectors Without Catastrophic Forgetting

BlitzNet: A Real-Time Deep Network for Scene Understanding

Joint Learning of Object and Action Detectors

Action Tubelet Detector for Spatio-Temporal Action Localization

Learning Video Object Segmentation With Visual Memory

Weakly-Supervised Learning of Visual Relations

Detecting Unseen Visual Relations Using Analogies

Moulding Humans: Non-Parametric 3D Human Shape Estimation From Single Images

Diversity With Cooperation: Ensemble Methods for Few-Shot Classification

Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera

VideoBERT: A Joint Model for Video and Language Representation Learning

Composable Augmentation Encoding for Video Representation Learning

Just Ask: Learning To Answer Questions From Millions of Narrated Videos

Episodic Transformer for Vision-and-Language Navigation

ViViT: A Video Vision Transformer

Segmenter: Transformer for Semantic Segmentation

Learning Temporal Dynamics From Cycles in Narrated Video

Improving Robustness Against Common Corruptions With Frequency Biased Models

Unified Graph Structured Models for Video Understanding

Airbert: In-Domain Pretraining for Vision-and-Language Navigation

Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts

Verbs in Action: Improving Verb Understanding in Video-Language Models

WALDO: Future Video Synthesis Using Object Layer Decomposition and Parametric Flow Prediction

UnLoc: A Unified Framework for Video Localization Tasks

Audiovisual Masked Autoencoders

Multi-modal Transformer for Video Retrieval

TAO: A Large-Scale Benchmark for Tracking Any Object

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification

Memory-Efficient Incremental Learning Through Feature Adaptation

AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction

Learning Audio-Video Modalities from Image Captions

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

Flexible Frame Selection for Efficient Video Reasoning

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Visual Lexicon: Rich Image Features in Language Space

MINERVA: Evaluating Complex Video Reasoning

Large-scale Pre-training for Grounded Video Caption Generation

HORT: Monocular Hand-held Objects Reconstruction with Transformers

CoVR: Learning Composed Video Retrieval from Web Video Captions

Streaming Dense Video Captioning

Unsupervised Learning of Artistic Styles with Archetypal Style Analysis

A flexible model for training action localization with varying levels of supervision

Adaptive Density Estimation for Generative Models

What Makes for Good Views for Contrastive Learning?

History Aware Multimodal Transformer for Vision-and-Language Navigation

CCVS: Context-aware Controllable Video Synthesis

Attention Bottlenecks for Multimodal Fusion

Large-Scale Unsupervised Object Discovery

Differentiable rendering with perturbed optimizers

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

Does Visual Pretraining Help End-to-End Reasoning?

VidChapters-7M: Video Chapters at Scale

White-box vs Black-box: Bayes Optimal Strategies for Membership Inference