Papers: 79
Total Citations: 3,608
h-index: 11

Papers (79)

VBench: Comprehensive Benchmark Suite for Video Generative Models

CVPR 2024
996
citations

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

CVPR 2024
864
citations

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

CVPR 2024
589
citations

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

ICLR 2024
408
citations

VideoMamba: State Space Model for Efficient Video Understanding

ECCV 2024
396
citations

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

CVPR 2024
84
citations

Multiple Object Tracking as ID Prediction

CVPR 2025
53
citations

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

ICLR 2025
39
citations

Sparse Global Matching for Video Frame Interpolation with Large Motion

CVPR 2024
27
citations

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

CVPR 2025
25
citations

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

CVPR 2025
19
citations

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

CVPR 2024
17
citations

Scalable Image Tokenization with Index Backpropagation Quantization

ICCV 2025
16
citations

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

CVPR 2024
12
citations

Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning

CVPR 2025
11
citations

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

ICLR 2025
11
citations

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

ICLR 2025
9
citations

Online Video Understanding: OVBench and VideoChat-Online

CVPR 2025
9
citations

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

ICCV 2025
8
citations

Contextual AD Narration with Interleaved Multimodal Sequence

CVPR 2025
7
citations

Make Your Training Flexible: Towards Deployment-Efficient Video Models

ICCV 2025
5
citations

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

NeurIPS 2025
3
citations

Structured Sparse R-CNN for Direct Scene Graph Generation

CVPR 2022
0
citations

Task-Specific Inconsistency Alignment for Domain Adaptive Object Detection

CVPR 2022
0
citations

MixFormer: End-to-End Tracking With Iterative Mixed Attention

CVPR 2022
0
citations

Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation

CVPR 2023
0
citations

VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking

CVPR 2023
0
citations

PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

CVPR 2023
0
citations

STMixer: A One-Stage Sparse Action Detector

CVPR 2023
0
citations

LinK: Linear Kernel for LiDAR-Based 3D Perception

CVPR 2023
0
citations

Temporal Action Detection With Structured Segment Networks

ICCV 2017
0
citations

LIP: Local Importance-Based Pooling

ICCV 2019
0
citations

TAM: Temporal Adaptive Module for Video Recognition

ICCV 2021
0
citations

Target Adaptive Context Aggregation for Video Scene Graph Generation

ICCV 2021
0
citations

Mutual Supervision for Dense Object Detection

ICCV 2021
0
citations

MGSampler: An Explainable Sampling Strategy for Video Action Recognition

ICCV 2021
0
citations

PyMAF: 3D Human Pose and Shape Regression With Pyramidal Mesh Alignment Feedback Loop

ICCV 2021
0
citations

Self Supervision to Distillation for Long-Tailed Visual Recognition

ICCV 2021
0
citations

MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions

ICCV 2021
0
citations

Memory-and-Anticipation Transformer for Online Action Understanding

ICCV 2023
0
citations

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

ICCV 2023
0
citations

MGMAE: Motion Guided Masking for Video Masked Autoencoding

ICCV 2023
0
citations

MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

ICCV 2023
0
citations

SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes

ICCV 2023
0
citations

Efficient Video Action Detection with Token Dropout and Context Refinement

ICCV 2023
0
citations

Deep Equilibrium Object Detection

ICCV 2023
0
citations

StageInteractor: Query-based Object Detector with Cross-stage Interaction

ICCV 2023
0
citations

SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

ICCV 2023
0
citations

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

ICCV 2023
0
citations

Actions as Moving Points

ECCV 2020
0
citations

Boundary-Aware Cascade Networks for Temporal Action Segmentation

ECCV 2020
0
citations

Context-Aware RCNN: A Baseline for Action Detection in Videos

ECCV 2020
0
citations

Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

ECCV 2022
0
citations

Relaxed Transformer Decoders for Direct Action Proposal Generation

ICCV 2021
0
citations

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

ICCV 2025
0
citations

MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

ICCV 2025
0
citations

Dual DETRs for Multi-Label Temporal Action Detection

CVPR 2024
0
citations

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

CVPR 2024
0
citations

SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

CVPR 2024
0
citations

Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors

CVPR 2015
0
citations

Actionness Estimation Using Hybrid Fully Convolutional Networks

CVPR 2016
0
citations

Real-Time Action Recognition With Enhanced Motion Vector CNNs

CVPR 2016
0
citations

Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos

CVPR 2017
0
citations

UntrimmedNets for Weakly Supervised Action Recognition and Detection

CVPR 2017
0
citations

Appearance-and-Relation Networks for Video Classification

CVPR 2018
0
citations

Learning Actor Relation Graphs for Group Activity Recognition

CVPR 2019
0
citations

Translate-to-Recognize Networks for RGB-D Scene Recognition

CVPR 2019
0
citations

TEA: Temporal Excitation and Aggregation for Action Recognition

CVPR 2020
0
citations

SketchyCOCO: Image Generation From Freehand Scene Sketches

CVPR 2020
0
citations

TDN: Temporal Difference Networks for Efficient Action Recognition

CVPR 2021
0
citations

CGA-Net: Category Guided Aggregation for Point Cloud Semantic Segmentation

CVPR 2021
0
citations

Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection

CVPR 2022
0
citations

OCSampler: Compressing Videos to One Clip With Single-Step Sampling

CVPR 2022
0
citations

Cross-Architecture Self-Supervised Video Representation Learning

CVPR 2022
0
citations

AdaMixer: A Fast-Converging Query-Based Object Detector

CVPR 2022
0
citations

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

NeurIPS 2022
0
citations

PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

NeurIPS 2022
0
citations

JourneyDB: A Benchmark for Generative Image Understanding

NeurIPS 2023
0
citations

MixFormerV2: Efficient Fully Transformer Tracking

NeurIPS 2023
0
citations