Weidi Xie

Papers: 43

Total Citations: 374

Papers (43)

Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval

Grounded Question-Answering in Long Egocentric Videos

AutoAD III: The Prequel – Back to the Pixels

Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation

Track-On: Transformer-based Online Point Tracking with Memory

Towards Universal Soccer Video Understanding

Multi-Sentence Grounding for Long-term Instructional Video

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

Learning Streaming Video Representation via Multitask Training

Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

Collaboration Helps Camera Overtake LiDAR in 3D Detection

OvarNet: Towards Open-Vocabulary Object Attribute Recognition

AutoAD: Movie Description in Context

Self-Supervised Video Object Segmentation by Motion Grouping

MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis

AutoAD II: The Sequel - Who, When, and What in Movie Audio Description

Joint-Relation Transformer for Multi-Person Motion Prediction

The Making and Breaking of Camouflage

LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant

Open-vocabulary Object Segmentation with Diffusion Models

Memory-augmented Dense Predictive Coding for Video Representation Learning

PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images

Prompting Visual-Language Models for Efficient Video Understanding

Towards Open-Vocabulary Video Instance Segmentation

Object-centric Video Question Answering with Visual Grounding and Referring

MRGen: Segmentation Data Engine For Underrepresented MRI Modalities

Retrieval-Augmented Egocentric Video Captioning

Amodal Ground Truth and Completion in the Wild

Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

MAST: A Memory-Augmented Self-Supervised Tracker

Localizing Visual Sounds the Hard Way

Temporal Alignment Networks for Long-Term Video

It's About Time: Analog Clock Reading in the Wild

Label, Verify, Correct: A Simple Few Shot Object Detection Method

Self-supervised Co-Training for Video Representation Learning

Associating Objects and Their Effects in Video through Coordination Games

Segmenting Moving Objects via an Object-Centric Layered Representation

ReCo: Retrieve and Co-segment for Zero-shot Transfer

Self-supervised Object-Centric Learning for Videos