Xizhou Zhu

34 Papers

2,613 Total Citations

Papers (34)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

ControlLLM: Augment Language Models with Tools by Searching on Graphs

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

Unsupervised Object Detection With LIDAR Clues

Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks

AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks

Exploring the Equivalence of Siamese Self-Supervised Learning via a Unified Gradient Framework

Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

Siamese Image Modeling for Self-Supervised Vision Representation Learning

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

Planning-Oriented Autonomous Driving

Flow-Guided Feature Aggregation for Video Object Detection

An Empirical Study of Spatial Attention Mechanisms in Deep Networks

Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation

DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation

VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

LangBridge: Interpreting Image as a Combination of Language Embeddings

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Deep Feature Flow for Video Recognition

Towards High Performance Video Object Detection

Deformable ConvNets V2: More Deformable, Better Results

Searching Parameterized AP Loss for Object Detection

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks