Xiang Bai

Papers: 22

Total Citations: 607

Papers (22)

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

General Object Foundation Model for Images and Videos at Scale

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

SEED: A Simple and Effective 3D DETR in Point Clouds

OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

Bridging the Gap Between End-to-End and Two-Step Text Spotting

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

PlayerOne: Egocentric World Simulator

Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

Training-free Geometric Image Editing on Diffusion Models

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting

A Unified Image-Dense Annotation Generation Model for Underwater Scenes

Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

MINIMA: Modality Invariant Image Matching

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method

Multi-scenario Overlapping Text Segmentation with Depth Awareness