Xizhou Zhu

13 Papers

2,613 Total Citations

Papers (13)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

ControlLLM: Augment Language Models with Tools by Searching on Graphs

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

LangBridge: Interpreting Image as a Combination of Language Embeddings

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications