Wenhai Wang

Papers: 40

Total Citations: 2,577

Papers (40)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

ControlLLM: Augment Language Models with Tools by Searching on Graphs

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Docopilot: Improving Multimodal Models for Document-Level Understanding

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

Planning-Oriented Autonomous Driving

Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions

DetCo: Unsupervised Contrastive Learning for Object Detection

FB-BEV: BEV Representation from Forward-Backward View Transformations

Scene Text Image Super-Resolution in the Wild

Segmenting Transparent Objects in the Wild

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

AVSegFormer: Audio-Visual Segmentation with Transformer

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

Selective Kernel Networks

Shape Robust Text Detection With Progressive Scale Expansion Network

PolarMask: Single Shot Instance Segmentation With Polar Representation

Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection

Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks