Yu Qiao

70
Papers
6,052
Total Citations

Papers (70)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CVPR 2024
2,210
citations

VBench: Comprehensive Benchmark Suite for Video Generative Models

CVPR 2024
996
citations

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

CVPR 2024
864
citations

VideoMamba: State Space Model for Efficient Video Understanding

ECCV 2024
396
citations

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

CVPR 2024
214
citations

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

ICLR 2024
209
citations

Generalized Predictive Model for Autonomous Driving

CVPR 2024
122
citations

VideoBooth: Diffusion-based Video Generation with Image Prompts

CVPR 2024
118
citations

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

ECCV 2024
86
citations

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

CVPR 2024
84
citations

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

CVPR 2024
76
citations

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

ICML 2025
72
citations

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

CVPR 2025
68
citations

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

ICCV 2025
58
citations

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

AAAI 2024 (arXiv)
58
citations

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

ICCV 2025
52
citations

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

ICLR 2024
46
citations

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision

CVPR 2024
43
citations

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

ICCV 2025
35
citations

REEF: Representation Encoding Fingerprints for Large Language Models

ICLR 2025
31
citations

SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

CVPR 2025
26
citations

An Intelligent Agentic System for Complex Image Restoration Problems

ICLR 2025
24
citations

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

CVPR 2024
20
citations

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

CVPR 2025 (arXiv)
19
citations

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

CVPR 2025
18
citations

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

ICLR 2024
15
citations

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

CVPR 2024
12
citations

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

ICLR 2025
11
citations

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

ICLR 2025
9
citations

OS-ATLAS: Foundation Action Model for Generalist GUI Agents

ICLR 2025
8
citations

Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

ECCV 2024 (arXiv)
8
citations

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

ICCV 2025
8
citations

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

CVPR 2024
7
citations

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

NeurIPS 2025
7
citations

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

AAAI 2025
6
citations

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

CVPR 2025
5
citations

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

AAAI 2024 (arXiv)
4
citations

Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

ECCV 2024
3
citations

Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars

ICCV 2025
2
citations

GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction

AAAI 2025
1
citations

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

NeurIPS 2025
1
citations

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

AAAI 2024
0
citations

Point Transformer V3: Simpler Faster Stronger

CVPR 2024
0
citations

ConditionVideo: Training-Free Condition-Guided Video Generation

AAAI 2024
0
citations

M-BEV: Masked BEV Perception for Robust Autonomous Driving

AAAI 2024 (arXiv)
0
citations

Critic-Guided Decision Transformer for Offline Reinforcement Learning

AAAI 2024
0
citations

Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption

AAAI 2024
0
citations

Vlogger: Make Your Dream A Vlog

CVPR 2024
0
citations

EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion

CVPR 2024
0
citations

ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

CVPR 2024
0
citations

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

AAAI 2025
0
citations

Language-aware Visual Semantic Distillation for Video Question Answering

CVPR 2024
0
citations

Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models

CVPR 2024
0
citations

DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations

ICCV 2025
0
citations

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

CVPR 2024
0
citations

Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

ICCV 2025
0
citations

LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

CVPR 2024
0
citations

The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

CVPR 2025
0
citations

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

CVPR 2025
0
citations

All-Day Multi-Camera Multi-Target Tracking

CVPR 2025
0
citations

Unifying Image Processing as Visual Prompting Question Answering

ICML 2024
0
citations

Position: Towards Implicit Prompt For Text-To-Image Models

ICML 2024
0
citations

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

ICML 2024
0
citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

ICML 2024
0
citations

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

ICML 2024
0
citations

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

CVPR 2024
0
citations

OneLLM: One Framework to Align All Modalities with Language

CVPR 2024
0
citations

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

CVPR 2024
0
citations

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

CVPR 2024
0
citations

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

CVPR 2024
0
citations