Yu Qiao
176
Papers
6,176
Total Citations
Papers (176)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
CVPR 2024
2,210
citations
VBench: Comprehensive Benchmark Suite for Video Generative Models
CVPR 2024
996
citations
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024
864
citations
VideoMamba: State Space Model for Efficient Video Understanding
ECCV 2024
396
citations
SinSR: Diffusion-Based Image Super-Resolution in a Single Step
CVPR 2024
214
citations
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
ICLR 2024
209
citations
Conditional Sequential Modulation for Efficient Global Image Retouching
ECCV 2020
143
citations
Generalized Predictive Model for Autonomous Driving
CVPR 2024
122
citations
VideoBooth: Diffusion-based Video Generation with Image Prompts
CVPR 2024
118
citations
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
ECCV 2024
86
citations
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
CVPR 2024
84
citations
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
CVPR 2024
76
citations
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
ICML 2025
72
citations
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
CVPR 2025
68
citations
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
ICCV 2025
58
citations
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
AAAI 2024
58
citations
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
ICCV 2025
52
citations
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
ICLR 2024
46
citations
Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision
CVPR 2024
43
citations
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
ICCV 2025
35
citations
REEF: Representation Encoding Fingerprints for Large Language Models
ICLR 2025
31
citations
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding
CVPR 2025
26
citations
An Intelligent Agentic System for Complex Image Restoration Problems
ICLR 2025
24
citations
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement
CVPR 2024
20
citations
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
CVPR 2025
18
citations
CO2: Efficient Distributed Training with Full Communication-Computation Overlap
ICLR 2024
15
citations
Asymmetric Masked Distillation for Pre-Training Small Foundation Models
CVPR 2024
12
citations
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
ICLR 2025
11
citations
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
ICLR 2025
9
citations
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
ICCV 2025
8
citations
OS-ATLAS: Foundation Action Model for Generalist GUI Agents
ICLR 2025
8
citations
Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence
ECCV 2024
8
citations
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
NeurIPS 2025
7
citations
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
CVPR 2024
7
citations
H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
AAAI 2025
6
citations
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
CVPR 2025
5
citations
Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification
AAAI 2024
4
citations
Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation
ECCV 2024
3
citations
Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars
ICCV 2025
2
citations
GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction
AAAI 2025
1
citations
Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
NeurIPS 2025
1
citations
PA3D: Pose-Action 3D Machine for Video Recognition
CVPR 2019
0
citations
P2SGrad: Refined Gradients for Optimizing Deep Face Models
CVPR 2019
0
citations
AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations
CVPR 2019
0
citations
Modulating Image Restoration With Continual Levels via Adaptive Feature Modification Layers
CVPR 2019
0
citations
COCAS: A Large-Scale Clothes Changing Person Dataset for Re-Identification
CVPR 2020
0
citations
SmallBigNet: Integrating Core and Contextual Views for Video Classification
CVPR 2020
0
citations
Adaptive Dilated Network With Self-Correction Supervision for Counting
CVPR 2020
0
citations
Suppressing Uncertainties for Large-Scale Facial Expression Recognition
CVPR 2020
0
citations
Fast Texture Synthesis via Pseudo Optimizer
CVPR 2020
0
citations
Attention-Guided Hierarchical Structure Aggregation for Image Matting
CVPR 2020
0
citations
Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification
CVPR 2021
0
citations
Temporal Context Aggregation Network for Temporal Action Proposal Refinement
CVPR 2021
0
citations
ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic
CVPR 2021
0
citations
Detecting Human-Object Interaction via Fabricated Compositional Learning
CVPR 2021
0
citations
Affordance Transfer Learning for Human-Object Interaction Detection
CVPR 2021
0
citations
Reflash Dropout in Image Super-Resolution
CVPR 2022
0
citations
Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition
CVPR 2022
0
citations
Cross Domain Object Detection by Target-Perceived Dual Branch Distillation
CVPR 2022
0
citations
PointCLIP: Point Cloud Understanding by CLIP
CVPR 2022
0
citations
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information
CVPR 2023
0
citations
CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP
CVPR 2023
0
citations
ResFormer: Scaling ViTs With Multi-Resolution Training
CVPR 2023
0
citations
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
CVPR 2023
0
citations
SCPNet: Semantic Scene Completion on Point Cloud
CVPR 2023
0
citations
VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking
CVPR 2023
0
citations
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision
CVPR 2023
0
citations
LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion
CVPR 2023
0
citations
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023
0
citations
Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders
CVPR 2023
0
citations
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
CVPR 2023
0
citations
Neural Transformation Fields for Arbitrary-Styled Font Generation
CVPR 2023
0
citations
Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection
CVPR 2023
0
citations
Siamese Image Modeling for Self-Supervised Vision Representation Learning
CVPR 2023
0
citations
Fine-Grained Audible Video Description
CVPR 2023
0
citations
Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection
CVPR 2023
0
citations
Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior
CVPR 2023
0
citations
Activating More Pixels in Image Super-Resolution Transformer
CVPR 2023
0
citations
Stare at What You See: Masked Image Modeling Without Reconstruction
CVPR 2023
0
citations
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions
CVPR 2023
0
citations
Planning-Oriented Autonomous Driving
CVPR 2023
0
citations
Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection
CVPR 2023
0
citations
Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions
CVPR 2023
0
citations
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency
CVPR 2023
0
citations
DegAE: A New Pretraining Paradigm for Low-Level Vision
CVPR 2023
0
citations
Single Shot Text Detector With Regional Attention
ICCV 2017
0
citations
Detecting Faces Using Inside Cascaded Contextual CNN
ICCV 2017
0
citations
RPAN: An End-To-End Recurrent Pose-Attention Network for Action Recognition in Videos
ICCV 2017
0
citations
Range Loss for Deep Face Recognition With Long-Tailed Training Data
ICCV 2017
0
citations
DF2Net: A Dense-Fine-Finer Network for Detailed 3D Face Reconstruction
ICCV 2019
0
citations
RankSRGAN: Generative Adversarial Networks With Ranker for Image Super-Resolution
ICCV 2019
0
citations
Dynamic Multi-Scale Filters for Semantic Segmentation
ICCV 2019
0
citations
A New Journey From SDRTV to HDRTV
ICCV 2021
0
citations
Tripartite Information Mining and Integration for Image Matting
ICCV 2021
0
citations
UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase
ICCV 2023
0
citations
Multi-view Spectral Polarization Propagation for Video Glass Segmentation
ICCV 2023
0
citations
UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding
ICCV 2023
0
citations
MGMAE: Motion Guided Masking for Video Masked Autoencoding
ICCV 2023
0
citations
DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds
ICCV 2023
0
citations
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
ICCV 2023
0
citations
DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
ICCV 2023
0
citations
Scaling Data Generation in Vision-and-Language Navigation
ICCV 2023
0
citations
Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning
ICCV 2023
0
citations
Rethinking Range View Representation for LiDAR Segmentation
ICCV 2023
0
citations
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
ICCV 2023
0
citations
HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation
ICCV 2023
0
citations
Visual Compositional Learning for Human-Object Interaction Detection
ECCV 2020
0
citations
Suppressing Mislabeled Data via Grouping and Self-Attention
ECCV 2020
0
citations
Interactive Multi-Dimension Modulation with Dynamic Controllable Residual Learning for Image Restoration
ECCV 2020
0
citations
Mining Inter-Video Proposal Relations for Video Object Detection
ECCV 2020
0
citations
Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition
ECCV 2020
0
citations
Learning to Predict Context-adaptive Convolution for Semantic Segmentation
ECCV 2020
0
citations
RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax
ECCV 2020
0
citations
BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
ECCV 2022
0
citations
Self-Slimmed Vision Transformer
ECCV 2022
0
citations
PalGAN: Image Colorization with Palette Generative Adversarial Networks
ECCV 2022
0
citations
Recurrent Bilinear Optimization for Binary Neural Networks
ECCV 2022
0
citations
VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition
ECCV 2022
0
citations
X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation
ECCV 2022
0
citations
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
ECCV 2022
0
citations
Frozen CLIP Models Are Efficient Video Learners
ECCV 2022
0
citations
Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification
ECCV 2022
0
citations
PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark
ECCV 2022
0
citations
Digging Into Uncertainty in Self-Supervised Multi-View Stereo
ICCV 2021
0
citations
All-Day Multi-Camera Multi-Target Tracking
CVPR 2025
0
citations
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
CVPR 2025
0
citations
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
CVPR 2025
0
citations
The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
CVPR 2025
0
citations
Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
ICCV 2025
0
citations
DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
ICCV 2025
0
citations
Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
AAAI 2025
0
citations
Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption
AAAI 2024
0
citations
Critic-Guided Decision Transformer for Offline Reinforcement Learning
AAAI 2024
0
citations
M-BEV: Masked BEV Perception for Robust Autonomous Driving
AAAI 2024
0
citations
ConditionVideo: Training-Free Condition-Guided Video Generation
AAAI 2024
0
citations
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model
AAAI 2024
0
citations
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
CVPR 2024
0
citations
OneLLM: One Framework to Align All Modalities with Language
CVPR 2024
0
citations
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024
0
citations
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
CVPR 2024
0
citations
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
CVPR 2024
0
citations
Point Transformer V3: Simpler Faster Stronger
CVPR 2024
0
citations
Vlogger: Make Your Dream A Vlog
CVPR 2024
0
citations
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
CVPR 2024
0
citations
ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring
CVPR 2024
0
citations
Language-aware Visual Semantic Distillation for Video Question Answering
CVPR 2024
0
citations
Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models
CVPR 2024
0
citations
DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation
CVPR 2024
0
citations
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
CVPR 2024
0
citations
Unifying Image Processing as Visual Prompting Question Answering
ICML 2024
0
citations
Position: Towards Implicit Prompt For Text-To-Image Models
ICML 2024
0
citations
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
ICML 2024
0
citations
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
ICML 2024
0
citations
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
ICML 2024
0
citations
Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors
CVPR 2015
0
citations
A Key Volume Mining Deep Framework for Action Recognition
CVPR 2016
0
citations
Actionness Estimation Using Hybrid Fully Convolutional Networks
CVPR 2016
0
citations
Real-Time Action Recognition With Enhanced Motion Vector CNNs
CVPR 2016
0
citations
Latent Factor Guided Convolutional Neural Networks for Age-Invariant Face Recognition
CVPR 2016
0
citations
An End-to-End TextSpotter With Explicit Alignment and Attention
CVPR 2018
0
citations
Temporal Hallucinating for Action Recognition With Few Still Images
CVPR 2018
0
citations
FOTS: Fast Oriented Text Spotting With a Unified Network
CVPR 2018
0
citations
MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition
CVPR 2019
0
citations
Adaptive Pyramid Context Network for Semantic Segmentation
CVPR 2019
0
citations
Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline
NeurIPS 2022
0
citations
Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training
NeurIPS 2022
0
citations
MCMAE: Masked Convolution Meets Masked Autoencoders
NeurIPS 2022
0
citations
Real-World Image Super-Resolution as Multi-Task Learning
NeurIPS 2023
0
citations
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
NeurIPS 2023
0
citations
Networks are Slacking Off: Understanding Generalization Problem in Image Deraining
NeurIPS 2023
0
citations
Foundation Model is Efficient Multimodal Multitask Model Selector
NeurIPS 2023
0
citations
Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
NeurIPS 2023
0
citations
TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation
NeurIPS 2023
0
citations
AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset
NeurIPS 2023
0
citations
JourneyDB: A Benchmark for Generative Image Understanding
NeurIPS 2023
0
citations
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
NeurIPS 2023
0
citations