Haoqi Fan

24 Papers · 15 Total Citations

Papers (24)

Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning
ICLR 2025 · 15 citations

Going Deeper into First-Person Activity Recognition
CVPR 2016 · 0 citations

Stacked Latent Attention for Multimodal Reasoning
CVPR 2018 · 0 citations

Long-Term Feature Banks for Detailed Video Understanding
CVPR 2019 · 0 citations

Momentum Contrast for Unsupervised Visual Representation Learning
CVPR 2020 (arXiv) · 0 citations

Beyond Short Clips: End-to-End Video-Level Learning With Collaborative Memories
CVPR 2021 (arXiv) · 0 citations

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning
CVPR 2021 (arXiv) · 0 citations

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
CVPR 2022 (arXiv) · 0 citations

Reversible Vision Transformers
CVPR 2022 · 0 citations

Unified Transformer Tracker for Object Tracking
CVPR 2022 (arXiv) · 0 citations

Masked Feature Prediction for Self-Supervised Visual Pre-Training
CVPR 2022 (arXiv) · 0 citations

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
CVPR 2022 (arXiv) · 0 citations

On the Importance of Asymmetry for Siamese Representation Learning
CVPR 2022 (arXiv) · 0 citations

Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
CVPR 2023 · 0 citations

Scaling Language-Image Pre-Training via Masking
CVPR 2023 (arXiv) · 0 citations

Order-Aware Generative Modeling Using the 3D-Craft Dataset
ICCV 2019 · 0 citations

Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks With Octave Convolution
ICCV 2019 · 0 citations

SlowFast Networks for Video Recognition
ICCV 2019 · 0 citations

Multiscale Vision Transformers
ICCV 2021 (arXiv) · 0 citations

HiT: Hierarchical Transformer With Momentum Contrast for Video-Text Retrieval
ICCV 2021 (arXiv) · 0 citations

Multiview Pseudo-Labeling for Semi-Supervised Learning From Video
ICCV 2021 (arXiv) · 0 citations

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining
ICCV 2023 (arXiv) · 0 citations

LLaVA-Critic: Learning to Evaluate Multimodal Models
CVPR 2025 · 0 citations

Diffusion Models as Masked Autoencoders
ICCV 2023 (arXiv) · 0 citations