Limin Wang

Papers: 79

Total Citations: 3,608

h-index: 11

Papers (79)

VBench: Comprehensive Benchmark Suite for Video Generative Models

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

VideoMamba: State Space Model for Efficient Video Understanding

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

Multiple Object Tracking as ID Prediction

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Sparse Global Matching for Video Frame Interpolation with Large Motion

LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Scalable Image Tokenization with Index Backpropagation Quantization

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Tra-MoE: Learning Trajectory Prediction Model from Multiple Domains for Adaptive Policy Conditioning

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

Online Video Understanding: OVBench and VideoChat-Online

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Contextual AD Narration with Interleaved Multimodal Sequence

Make Your Training Flexible: Towards Deployment-Efficient Video Models

StreamForest: Efficient Online Video Understanding with Persistent Event Memory

Structured Sparse R-CNN for Direct Scene Graph Generation

Task-Specific Inconsistency Alignment for Domain Adaptive Object Detection

MixFormer: End-to-End Tracking With Iterative Mixed Attention

Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation

VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking

PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

STMixer: A One-Stage Sparse Action Detector

LinK: Linear Kernel for LiDAR-Based 3D Perception

Temporal Action Detection With Structured Segment Networks

LIP: Local Importance-Based Pooling

TAM: Temporal Adaptive Module for Video Recognition

Target Adaptive Context Aggregation for Video Scene Graph Generation

Mutual Supervision for Dense Object Detection

MGSampler: An Explainable Sampling Strategy for Video Action Recognition

PyMAF: 3D Human Pose and Shape Regression With Pyramidal Mesh Alignment Feedback Loop

Self Supervision to Distillation for Long-Tailed Visual Recognition

MultiSports: A Multi-Person Video Dataset of Spatio-Temporally Localized Sports Actions

Memory-and-Anticipation Transformer for Online Action Understanding

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

MGMAE: Motion Guided Masking for Video Masked Autoencoding

MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes

Efficient Video Action Detection with Token Dropout and Context Refinement

Deep Equilibrium Object Detection

StageInteractor: Query-based Object Detector with Cross-stage Interaction

SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

Actions as Moving Points

Boundary-Aware Cascade Networks for Temporal Action Segmentation

Context-Aware RCNN: A Baseline for Action Detection in Videos

Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

Relaxed Transformer Decoders for Direct Action Proposal Generation

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

Dual DETRs for Multi-Label Temporal Action Detection

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors

Actionness Estimation Using Hybrid Fully Convolutional Networks

Real-Time Action Recognition With Enhanced Motion Vector CNNs

Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos

UntrimmedNets for Weakly Supervised Action Recognition and Detection

Appearance-and-Relation Networks for Video Classification

Learning Actor Relation Graphs for Group Activity Recognition

Translate-to-Recognize Networks for RGB-D Scene Recognition

TEA: Temporal Excitation and Aggregation for Action Recognition

SketchyCOCO: Image Generation From Freehand Scene Sketches

TDN: Temporal Difference Networks for Efficient Action Recognition

CGA-Net: Category Guided Aggregation for Point Cloud Semantic Segmentation

Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection

OCSampler: Compressing Videos to One Clip With Single-Step Sampling

Cross-Architecture Self-Supervised Video Representation Learning

AdaMixer: A Fast-Converging Query-Based Object Detector

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

JourneyDB: A Benchmark for Generative Image Understanding

MixFormerV2: Efficient Fully Transformer Tracking