Wenhai Wang

40
Papers
2,577
Total Citations

Papers (40)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CVPR 2024
2,210
citations

Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation

ECCV 2020
138
citations

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

ECCV 2024
86
citations

ControlLLM: Augment Language Models with Tools by Searching on Graphs

ECCV 2024 (arXiv)
57
citations

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

ICCV 2025
52
citations

Docopilot: Improving Multimodal Models for Document-Level Understanding

CVPR 2025
14
citations

Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction

ICCV 2025
6
citations

MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost

ICML 2025
6
citations

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

CVPR 2025
5
citations

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

NeurIPS 2025
2
citations

Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings

NeurIPS 2025
1
citations

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

CVPR 2023 (arXiv)
0
citations

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

CVPR 2023 (arXiv)
0
citations

Planning-Oriented Autonomous Driving

CVPR 2023 (arXiv)
0
citations

Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network

ICCV 2019
0
citations

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions

ICCV 2021 (arXiv)
0
citations

DetCo: Unsupervised Contrastive Learning for Object Detection

ICCV 2021 (arXiv)
0
citations

FB-BEV: BEV Representation from Forward-Backward View Transformations

ICCV 2023
0
citations

Scene Text Image Super-Resolution in the Wild

ECCV 2020
0
citations

Segmenting Transparent Objects in the Wild

ECCV 2020
0
citations

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

ECCV 2020
0
citations

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

ECCV 2022
0
citations

VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

ECCV 2022
0
citations

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

CVPR 2025
0
citations

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

AAAI 2025
0
citations

Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting

AAAI 2025
0
citations

AVSegFormer: Audio-Visual Segmentation with Transformer

AAAI 2024 (arXiv)
0
citations

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

CVPR 2024
0
citations

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

ICML 2024
0
citations

Selective Kernel Networks

CVPR 2019
0
citations

Shape Robust Text Detection With Progressive Scale Expansion Network

CVPR 2019
0
citations

PolarMask: Single Shot Instance Segmentation With Polar Representation

CVPR 2020 (arXiv)
0
citations

Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection

CVPR 2021 (arXiv)
0
citations

Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers

CVPR 2022 (arXiv)
0
citations

Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection

NeurIPS 2020
0
citations

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

NeurIPS 2021
0
citations

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

NeurIPS 2022
0
citations

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

NeurIPS 2023
0
citations

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

NeurIPS 2023
0
citations

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

NeurIPS 2023
0
citations