Cordelia Schmid

Papers: 21

Total Citations: 95

Papers (21)

Learning Correlation Structures for Vision Transformers

Language-Guided Image Tokenization for Generation

DataDream: Few-shot Guided Dataset Generation

Flexible Frame Selection for Efficient Video Reasoning

FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

A Generative Approach for Wikipedia-Scale Visual Entity Recognition

CoVR: Learning Composed Video Retrieval from Web Video Captions

Streaming Dense Video Captioning

End-to-End Spatio-Temporal Action Localisation with Video Transformers

Dense Optical Tracking: Connecting the Dots

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

SUGAR: Pre-training 3D Visual Representations for Robotics

Pixel-Aligned Language Model

Time-, Memory- and Parameter-Efficient Visual Adaptation

InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Visual Lexicon: Rich Image Features in Language Space

MINERVA: Evaluating Complex Video Reasoning

Large-scale Pre-training for Grounded Video Caption Generation

HORT: Monocular Hand-held Objects Reconstruction with Transformers