Yu Qiao

176 Papers

6,176 Total Citations

Papers (176)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

VBench: Comprehensive Benchmark Suite for Video Generative Models

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

VideoMamba: State Space Model for Efficient Video Understanding

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

Conditional Sequential Modulation for Efficient Global Image Retouching

Generalized Predictive Model for Autonomous Driving

VideoBooth: Diffusion-based Video Generation with Image Prompts

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

REEF: Representation Encoding Fingerprints for Large Language Models

SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding

An Intelligent Agentic System for Complex Image Restoration Problems

DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

OS-ATLAS: Foundation Action Model for Generalist GUI Agents

Within the Dynamic Context: Inertia-aware 3D Human Modeling with Pose Sequence

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

Mask as Supervision: Leveraging Unified Mask Information for Unsupervised 3D Pose Estimation

Towards Explicit Exoskeleton for the Reconstruction of Complicated 3D Human Avatars

GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

PA3D: Pose-Action 3D Machine for Video Recognition

P2SGrad: Refined Gradients for Optimizing Deep Face Models

AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations

Modulating Image Restoration With Continual Levels via Adaptive Feature Modification Layers

COCAS: A Large-Scale Clothes Changing Person Dataset for Re-Identification

SmallBigNet: Integrating Core and Contextual Views for Video Classification

Adaptive Dilated Network With Self-Correction Supervision for Counting

Suppressing Uncertainties for Large-Scale Facial Expression Recognition

Fast Texture Synthesis via Pseudo Optimizer

Attention-Guided Hierarchical Structure Aggregation for Image Matting

Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification

Temporal Context Aggregation Network for Temporal Action Proposal Refinement

ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

Detecting Human-Object Interaction via Fabricated Compositional Learning

Affordance Transfer Learning for Human-Object Interaction Detection

Reflash Dropout in Image Super-Resolution

Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition

Cross Domain Object Detection by Target-Perceived Dual Branch Distillation

PointCLIP: Point Cloud Understanding by CLIP

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information

CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP

ResFormer: Scaling ViTs With Multi-Resolution Training

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners

SCPNet: Semantic Scene Completion on Point Cloud

VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking

Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

Neural Transformation Fields for Arbitrary-Styled Font Generation

Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection

Siamese Image Modeling for Self-Supervised Vision Representation Learning

Fine-Grained Audible Video Description

Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection

Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior

Activating More Pixels in Image Super-Resolution Transformer

Stare at What You See: Masked Image Modeling Without Reconstruction

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

Planning-Oriented Autonomous Driving

Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency

DegAE: A New Pretraining Paradigm for Low-Level Vision

Single Shot Text Detector With Regional Attention

Detecting Faces Using Inside Cascaded Contextual CNN

RPAN: An End-To-End Recurrent Pose-Attention Network for Action Recognition in Videos

Range Loss for Deep Face Recognition With Long-Tailed Training Data

DF2Net: A Dense-Fine-Finer Network for Detailed 3D Face Reconstruction

RankSRGAN: Generative Adversarial Networks With Ranker for Image Super-Resolution

Dynamic Multi-Scale Filters for Semantic Segmentation

A New Journey From SDRTV to HDRTV

Tripartite Information Mining and Integration for Image Matting

UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

Multi-view Spectral Polarization Propagation for Video Glass Segmentation

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

MGMAE: Motion Guided Masking for Video Masked Autoencoding

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

Scaling Data Generation in Vision-and-Language Navigation

Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

Rethinking Range View Representation for LiDAR Segmentation

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation

Visual Compositional Learning for Human-Object Interaction Detection

Suppressing Mislabeled Data via Grouping and Self-Attention

Interactive Multi-Dimension Modulation with Dynamic Controllable Residual Learning for Image Restoration

Mining Inter-Video Proposal Relations for Video Object Detection

Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition

Learning to Predict Context-adaptive Convolution for Semantic Segmentation

RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Self-Slimmed Vision Transformer

PalGAN: Image Colorization with Palette Generative Adversarial Networks

Recurrent Bilinear Optimization for Binary Neural Networks

VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

Frozen CLIP Models Are Efficient Video Learners

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification

PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark

Digging Into Uncertainty in Self-Supervised Multi-View Stereo

All-Day Multi-Camera Multi-Target Tracking

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation

Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations

Muses: 3D-Controllable Image Generation via Multi-Modal Agent Collaboration

Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption

Critic-Guided Decision Transformer for Offline Reinforcement Learning

M-BEV: Masked BEV Perception for Robust Autonomous Driving

ConditionVideo: Training-Free Condition-Guided Video Generation

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

OneLLM: One Framework to Align All Modalities with Language

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Point Transformer V3: Simpler Faster Stronger

Vlogger: Make Your Dream A Vlog

EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion

ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

Language-aware Visual Semantic Distillation for Video Question Answering

Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction

Unifying Image Processing as Visual Prompting Question Answering

Position: Towards Implicit Prompt For Text-To-Image Models

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Action Recognition With Trajectory-Pooled Deep-Convolutional Descriptors

A Key Volume Mining Deep Framework for Action Recognition

Actionness Estimation Using Hybrid Fully Convolutional Networks

Real-Time Action Recognition With Enhanced Motion Vector CNNs

Latent Factor Guided Convolutional Neural Networks for Age-Invariant Face Recognition

An End-to-End TextSpotter With Explicit Alignment and Attention

Temporal Hallucinating for Action Recognition With Few Still Images

FOTS: Fast Oriented Text Spotting With a Unified Network

MetaCleaner: Learning to Hallucinate Clean Representations for Noisy-Labeled Visual Recognition

Adaptive Pyramid Context Network for Semantic Segmentation

Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

MCMAE: Masked Convolution Meets Masked Autoencoders

Real-World Image Super-Resolution as Multi-Task Learning

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Networks are Slacking Off: Understanding Generalization Problem in Image Deraining

Foundation Model is Efficient Multimodal Multitask Model Selector

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

JourneyDB: A Benchmark for Generative Image Understanding

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks