Xiang Bai

22
Papers
607
Total Citations

Papers (22)

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

CVPR 2024
384
citations

General Object Foundation Model for Images and Videos at Scale

CVPR 2024
79
citations

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

ICCV 2025 (arXiv)
62
citations

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

ICCV 2025
22
citations

SEED: A Simple and Effective 3D DETR in Point Clouds

ECCV 2024
19
citations

OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

ECCV 2024
12
citations

Bridging the Gap Between End-to-End and Two-Step Text Spotting

CVPR 2024
11
citations

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

ICCV 2025
10
citations

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

ICCV 2025 (arXiv)
4
citations

PlayerOne: Egocentric World Simulator

NeurIPS 2025
3
citations

Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

ICCV 2025
1
citation

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

ICCV 2025
0
citations

Training-free Geometric Image Editing on Diffusion Models

ICCV 2025 (arXiv)
0
citations

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

CVPR 2024
0
citations

SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting

CVPR 2025
0
citations

A Unified Image-Dense Annotation Generation Model for Underwater Scenes

CVPR 2025
0
citations

Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis

CVPR 2024
0
citations

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

ICCV 2025
0
citations

MINIMA: Modality Invariant Image Matching

CVPR 2025
0
citations

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

ICCV 2025
0
citations

Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method

ICCV 2025
0
citations

Multi-scenario Overlapping Text Segmentation with Depth Awareness

ICCV 2025
0
citations