Yu Qiao

176
Papers
6,176
Total Citations

Papers (176)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CVPR 2024
2,210
citations

VBench: Comprehensive Benchmark Suite for Video Generative Models

CVPR 2024
996
citations

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

CVPR 2024
864
citations

VideoMamba: State Space Model for Efficient Video Understanding

ECCV 2024
396
citations

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

CVPR 2024
214
citations

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

ICLR 2024
209
citations

Conditional Sequential Modulation for Efficient Global Image Retouching

ECCV 2020
143
citations

Generalized Predictive Model for Autonomous Driving

CVPR 2024
122
citations

VideoBooth: Diffusion-based Video Generation with Image Prompts

CVPR 2024
118
citations

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

ECCV 2024
86
citations

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

CVPR 2024
84
citations

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

CVPR 2024
76
citations

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

ICML 2025
72
citations

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

CVPR 2025
68
citations

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

ICCV 2025
58
citations

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

AAAI 2024
58
citations

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

ICCV 2025
52
citations

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

ICLR 2024
46
citations

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision

CVPR 2024
43
citations

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

ICCV 2025
35
citations

REEF: Representation Encoding Fingerprints for Large Language Models

ICLR 2025
31
citations

SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

CVPR 2025
26
citations

An Intelligent Agentic System for Complex Image Restoration Problems

ICLR 2025
24
citations

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

CVPR 2024
20
citations

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

CVPR 2025
18
citations

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

ICLR 2024
15
citations

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

CVPR 2024
12
citations

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

ICLR 2025
11
citations

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

ICLR 2025
9
citations

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

ICCV 2025
8
citations

OS-ATLAS: Foundation Action Model for Generalist GUI Agents

ICLR 2025
8
citations

Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

ECCV 2024
8
citations

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

NeurIPS 2025
7
citations

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

CVPR 2024
7
citations

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

AAAI 2025
6
citations

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

CVPR 2025
5
citations

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

AAAI 2024
4
citations

Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

ECCV 2024
3
citations

Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars

ICCV 2025
2
citations

GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction

AAAI 2025
1
citations

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

NeurIPS 2025
1
citations

PA3D: Pose-Action 3D Machine for Video Recognition

CVPR 2019
0
citations

P2SGrad: Refined Gradients for Optimizing Deep Face Models

CVPR 2019
0
citations

AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations

CVPR 2019
0
citations

Modulating Image Restoration With Continual Levels via Adaptive Feature Modification Layers

CVPR 2019
0
citations

COCAS: A Large-Scale Clothes Changing Person Dataset for Re-Identification

CVPR 2020
0
citations

SmallBigNet: Integrating Core and Contextual Views for Video Classification

CVPR 2020
0
citations

Adaptive Dilated Network With Self-Correction Supervision for Counting

CVPR 2020
0
citations

Suppressing Uncertainties for Large-Scale Facial Expression Recognition

CVPR 2020
0
citations

Fast Texture Synthesis via Pseudo Optimizer

CVPR 2020
0
citations

Attention-Guided Hierarchical Structure Aggregation for Image Matting

CVPR 2020
0
citations

Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification

CVPR 2021
0
citations

Temporal Context Aggregation Network for Temporal Action Proposal Refinement

CVPR 2021
0
citations

ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

CVPR 2021
0
citations

Detecting Human-Object Interaction via Fabricated Compositional Learning

CVPR 2021
0
citations

Affordance Transfer Learning for Human-Object Interaction Detection

CVPR 2021
0
citations

Reflash Dropout in Image Super-Resolution

CVPR 2022
0
citations

Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition

CVPR 2022
0
citations

Cross Domain Object Detection by Target-Perceived Dual Branch Distillation

CVPR 2022
0
citations

PointCLIP: Point Cloud Understanding by CLIP

CVPR 2022
0
citations

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information

CVPR 2023
0
citations

CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP

CVPR 2023
0
citations

ResFormer: Scaling ViTs With Multi-Resolution Training

CVPR 2023
0
citations

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners

CVPR 2023
0
citations

SCPNet: Semantic Scene Completion on Point Cloud

CVPR 2023
0
citations

VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking

CVPR 2023
0
citations

Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

CVPR 2023
0
citations

LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion

CVPR 2023
0
citations

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

CVPR 2023
0
citations

Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders

CVPR 2023
0
citations

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

CVPR 2023
0
citations

Neural Transformation Fields for Arbitrary-Styled Font Generation

CVPR 2023
0
citations

Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

CVPR 2023
0
citations

Siamese Image Modeling for Self-Supervised Vision Representation Learning

CVPR 2023
0
citations

Fine-Grained Audible Video Description

CVPR 2023
0
citations

Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection

CVPR 2023
0
citations

Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior

CVPR 2023
0
citations

Activating More Pixels in Image Super-Resolution Transformer

CVPR 2023
0
citations

Stare at What You See: Masked Image Modeling Without Reconstruction

CVPR 2023
0
citations

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

CVPR 2023
0
citations

Planning-Oriented Autonomous Driving

CVPR 2023
0
citations

Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection

CVPR 2023
0
citations

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions

CVPR 2023
0
citations

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency

CVPR 2023
0
citations

DegAE: A New Pretraining Paradigm for Low-Level Vision

CVPR 2023
0
citations

Single Shot Text Detector With Regional Attention

ICCV 2017
0
citations

Detecting Faces Using Inside Cascaded Contextual CNN

ICCV 2017
0
citations

RPAN: An End-To-End Recurrent Pose-Attention Network for Action Recognition in Videos

ICCV 2017
0
citations

Range Loss for Deep Face Recognition With Long-Tailed Training Data

ICCV 2017
0
citations

DF2Net: A Dense-Fine-Finer Network for Detailed 3D Face Reconstruction

ICCV 2019
0
citations

RankSRGAN: Generative Adversarial Networks With Ranker for Image Super-Resolution

ICCV 2019
0
citations

Dynamic Multi-Scale Filters for Semantic Segmentation

ICCV 2019
0
citations

A New Journey From SDRTV to HDRTV

ICCV 2021
0
citations

Tripartite Information Mining and Integration for Image Matting

ICCV 2021
0
citations

UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

ICCV 2023
0
citations

Multi-view Spectral Polarization Propagation for Video Glass Segmentation

ICCV 2023
0
citations

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

ICCV 2023
0
citations

MGMAE: Motion Guided Masking for Video Masked Autoencoding

ICCV 2023
0
citations

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

ICCV 2023
0
citations

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

ICCV 2023
0
citations

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

ICCV 2023
0
citations

Scaling Data Generation in Vision-and-Language Navigation

ICCV 2023
0
citations

Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

ICCV 2023
0
citations

Rethinking Range View Representation for LiDAR Segmentation

ICCV 2023
0
citations

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

ICCV 2023
0
citations

HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation

ICCV 2023
0
citations

Visual Compositional Learning for Human-Object Interaction Detection

ECCV 2020
0
citations

Suppressing Mislabeled Data via Grouping and Self-Attention

ECCV 2020
0
citations

Interactive Multi-Dimension Modulation with Dynamic Controllable Residual Learning for Image Restoration

ECCV 2020
0
citations

Mining Inter-Video Proposal Relations for Video Object Detection

ECCV 2020
0
citations

Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition

ECCV 2020
0
citations

Learning to Predict Context-adaptive Convolution for Semantic Segmentation

ECCV 2020
0
citations

RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax

ECCV 2020
0
citations

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

ECCV 2022
0
citations

Self-Slimmed Vision Transformer

ECCV 2022
0
citations

PalGAN: Image Colorization with Palette Generative Adversarial Networks

ECCV 2022
0
citations

Recurrent Bilinear Optimization for Binary Neural Networks

ECCV 2022
0
citations

VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

ECCV 2022
0
citations

X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation

ECCV 2022
0
citations

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

ECCV 2022
0
citations

Frozen CLIP Models Are Efficient Video Learners

ECCV 2022
0
citations

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification

ECCV 2022
0
citations

PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark

ECCV 2022
0
citations

Digging Into Uncertainty in Self-Supervised Multi-View Stereo

ICCV 2021
0
citations

All-Day Multi-Camera Multi-Target Tracking

CVPR 2025
0
citations

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

CVPR 2025
0
citations

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

CVPR 2025
0
citations

The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

CVPR 2025
0
citations

Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

ICCV 2025
0
citations

DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations

ICCV 2025
0
citations

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

AAAI 2025
0
citations

Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption

AAAI 2024
0
citations

Critic-Guided Decision Transformer for Offline Reinforcement Learning

AAAI 2024
0
citations

M-BEV: Masked BEV Perception for Robust Autonomous Driving

AAAI 2024
0
citations

ConditionVideo: Training-Free Condition-Guided Video Generation

AAAI 2024
0
citations

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

AAAI 2024
0
citations

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

CVPR 2024
0
citations

OneLLM: One Framework to Align All Modalities with Language

CVPR 2024
0
citations

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

CVPR 2024
0
citations

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

CVPR 2024
0
citations

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

CVPR 2024
0
citations

Point Transformer V3: Simpler Faster Stronger

CVPR 2024
0
citations

Vlogger: Make Your Dream A Vlog

CVPR 2024
0
citations

EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion

CVPR 2024
0
citations

ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

CVPR 2024
0
citations

Language-aware Visual Semantic Distillation for Video Question Answering

CVPR 2024
0
citations

Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models

CVPR 2024
0
citations

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

CVPR 2024
0
citations

LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

CVPR 2024
0
citations

Unifying Image Processing as Visual Prompting Question Answering

ICML 2024
0
citations

Position: Towards Implicit Prompt For Text-To-Image Models

ICML 2024
0
citations

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

ICML 2024
0
citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

ICML 2024
0
citations

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

ICML 2024
0
citations

Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors

CVPR 2015
0
citations

A Key Volume Mining Deep Framework for Action Recognition

CVPR 2016
0
citations

Actionness Estimation Using Hybrid Fully Convolutional Networks

CVPR 2016
0
citations

Real-Time Action Recognition With Enhanced Motion Vector CNNs

CVPR 2016
0
citations

Latent Factor Guided Convolutional Neural Networks for Age-Invariant Face Recognition

CVPR 2016
0
citations

An End-to-End TextSpotter With Explicit Alignment and Attention

CVPR 2018
0
citations

Temporal Hallucinating for Action Recognition With Few Still Images

CVPR 2018
0
citations

FOTS: Fast Oriented Text Spotting With a Unified Network

CVPR 2018
0
citations

MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition

CVPR 2019
0
citations

Adaptive Pyramid Context Network for Semantic Segmentation

CVPR 2019
0
citations

Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline

NeurIPS 2022
0
citations

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

NeurIPS 2022
0
citations

MCMAE: Masked Convolution Meets Masked Autoencoders

NeurIPS 2022
0
citations

Real-World Image Super-Resolution as Multi-Task Learning

NeurIPS 2023
0
citations

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

NeurIPS 2023
0
citations

Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

NeurIPS 2023
0
citations

Foundation Model is Efficient Multimodal Multitask Model Selector

NeurIPS 2023
0
citations

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

NeurIPS 2023
0
citations

TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

NeurIPS 2023
0
citations

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

NeurIPS 2023
0
citations

JourneyDB: A Benchmark for Generative Image Understanding

NeurIPS 2023
0
citations

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

NeurIPS 2023
0
citations