Yu Qiao
70 Papers
6,052 Total Citations
Papers (70)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
CVPR 2024
2,210
citations
VBench: Comprehensive Benchmark Suite for Video Generative Models
CVPR 2024
996
citations
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
CVPR 2024
864
citations
VideoMamba: State Space Model for Efficient Video Understanding
ECCV 2024
396
citations
SinSR: Diffusion-Based Image Super-Resolution in a Single Step
CVPR 2024
214
citations
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction
ICLR 2024
209
citations
Generalized Predictive Model for Autonomous Driving
CVPR 2024
122
citations
VideoBooth: Diffusion-based Video Generation with Image Prompts
CVPR 2024
118
citations
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
ECCV 2024
86
citations
EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World
CVPR 2024
84
citations
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
CVPR 2024
76
citations
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
ICML 2025
72
citations
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
CVPR 2025
68
citations
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
ICCV 2025
58
citations
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
AAAI 2024
58
citations
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
ICCV 2025
52
citations
BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation
ICLR 2024
46
citations
Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision
CVPR 2024
43
citations
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
ICCV 2025
35
citations
REEF: Representation Encoding Fingerprints for Large Language Models
ICLR 2025
31
citations
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding
CVPR 2025
26
citations
An Intelligent Agentic System for Complex Image Restoration Problems
ICLR 2025
24
citations
DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement
CVPR 2024
20
citations
Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment
CVPR 2025
19
citations
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
CVPR 2025
18
citations
CO2: Efficient Distributed Training with Full Communication-Computation Overlap
ICLR 2024
15
citations
Asymmetric Masked Distillation for Pre-Training Small Foundation Models
CVPR 2024
12
citations
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
ICLR 2025
11
citations
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
ICLR 2025
9
citations
OS-ATLAS: Foundation Action Model for Generalist GUI Agents
ICLR 2025
8
citations
Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence
ECCV 2024
8
citations
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
ICCV 2025
8
citations
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
CVPR 2024
7
citations
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
NeurIPS 2025
7
citations
H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
AAAI 2025
6
citations
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
CVPR 2025
5
citations
Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification
AAAI 2024
4
citations
Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation
ECCV 2024
3
citations
Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars
ICCV 2025
2
citations
GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction
AAAI 2025
1
citations
Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
NeurIPS 2025
1
citations
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model
AAAI 2024
0
citations
Point Transformer V3: Simpler Faster Stronger
CVPR 2024
0
citations
ConditionVideo: Training-Free Condition-Guided Video Generation
AAAI 2024
0
citations
M-BEV: Masked BEV Perception for Robust Autonomous Driving
AAAI 2024
0
citations
Critic-Guided Decision Transformer for Offline Reinforcement Learning
AAAI 2024
0
citations
Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption
AAAI 2024
0
citations
Vlogger: Make Your Dream A Vlog
CVPR 2024
0
citations
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
CVPR 2024
0
citations
ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring
CVPR 2024
0
citations
Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration
AAAI 2025
0
citations
Language-aware Visual Semantic Distillation for Video Question Answering
CVPR 2024
0
citations
Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models
CVPR 2024
0
citations
DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations
ICCV 2025
0
citations
DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation
CVPR 2024
0
citations
Dual-Expert Consistency Model for Efficient and High-Quality Video Generation
ICCV 2025
0
citations
LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction
CVPR 2024
0
citations
The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
CVPR 2025
0
citations
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
CVPR 2025
0
citations
All-Day Multi-Camera Multi-Target Tracking
CVPR 2025
0
citations
Unifying Image Processing as Visual Prompting Question Answering
ICML 2024
0
citations
Position: Towards Implicit Prompt For Text-To-Image Models
ICML 2024
0
citations
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
ICML 2024
0
citations
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
ICML 2024
0
citations
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
ICML 2024
0
citations
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
CVPR 2024
0
citations
OneLLM: One Framework to Align All Modalities with Language
CVPR 2024
0
citations
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024
0
citations
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
CVPR 2024
0
citations
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
CVPR 2024
0
citations