Wenhai Wang
40
Papers
2,577
Total Citations
Papers (40)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
CVPR 2024
2,210
citations
Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation
ECCV 2020
138
citations
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
ECCV 2024
86
citations
ControlLLM: Augment Language Models with Tools by Searching on Graphs
ECCV 2024
57
citations
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
ICCV 2025
52
citations
Docopilot: Improving Multimodal Models for Document-Level Understanding
CVPR 2025
14
citations
Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction
ICCV 2025
6
citations
MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
ICML 2025
6
citations
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
CVPR 2025
5
citations
OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
NeurIPS 2025
2
citations
Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
NeurIPS 2025
1
citations
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023
0
citations
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions
CVPR 2023
0
citations
Planning-Oriented Autonomous Driving
CVPR 2023
0
citations
Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network
ICCV 2019
0
citations
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions
ICCV 2021
0
citations
DetCo: Unsupervised Contrastive Learning for Object Detection
ICCV 2021
0
citations
FB-BEV: BEV Representation from Forward-Backward View Transformations
ICCV 2023
0
citations
Scene Text Image Super-Resolution in the Wild
ECCV 2020
0
citations
Segmenting Transparent Objects in the Wild
ECCV 2020
0
citations
AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting
ECCV 2020
0
citations
BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
ECCV 2022
0
citations
VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition
ECCV 2022
0
citations
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
CVPR 2025
0
citations
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area
AAAI 2025
0
citations
Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting
AAAI 2025
0
citations
AVSegFormer: Audio-Visual Segmentation with Transformer
AAAI 2024
0
citations
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024
0
citations
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
ICML 2024
0
citations
Selective Kernel Networks
CVPR 2019
0
citations
Shape Robust Text Detection With Progressive Scale Expansion Network
CVPR 2019
0
citations
PolarMask: Single Shot Instance Segmentation With Polar Representation
CVPR 2020
0
citations
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection
CVPR 2021
0
citations
Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers
CVPR 2022
0
citations
Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection
NeurIPS 2020
0
citations
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
NeurIPS 2021
0
citations
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
NeurIPS 2022
0
citations
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought
NeurIPS 2023
0
citations
Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection
NeurIPS 2023
0
citations
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
NeurIPS 2023
0
citations