Xiang Bai

81 Papers

1,047 Total Citations

Papers (81)

EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

General Object Foundation Model for Images and Videos at Scale

ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

SEED: A Simple and Effective 3D DETR in Point Clouds

OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection

Bridging the Gap Between End-to-End and Two-Step Text Spotting

AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding

PlayerOne: Egocentric World Simulator

Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

DeepContour: A Deep Convolutional Feature Learned by Positive-Sharing Loss for Contour Detection

Object Skeleton Extraction in Natural Images by Fusing Scale-Associated Deep Side Outputs

Multi-Oriented Text Detection With Fully Convolutional Networks

Robust Scene Text Recognition With Automatic Rectification

GIFT: A Real-Time and Scalable 3D Shape Search Engine

Scalable Person Re-Identification on Supervised Smoothed Manifold

Detecting Oriented Text in Natural Images by Linking Segments

Multiple Instance Detection Network With Online Instance Classifier Refinement

Richer Convolutional Features for Edge Detection

Triplet-Center Loss for Multi-View 3D Object Retrieval

DOTA: A Large-Scale Dataset for Object Detection in Aerial Images

Rotation-Sensitive Regression for Oriented Scene Text Detection

Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation

Progressive Pose Attention Transfer for Person Image Generation

DeepFlux for Skeletons in the Wild

Super-BPD: Super Boundary-to-Pixel Direction for Fast Image Segmentation

Semantically Multi-Modal Image Synthesis

Improving OCR-Based Image Captioning by Incorporating Geometrical Relationship

Scene Text Retrieval via Joint Text Detection and Similarity Learning

Multi-Shot Temporal Event Localization: A Benchmark

MOST: A Multi-Oriented Scene Text Detector With Localization Refinement

Knowledge Mining With Scene Text for Fine-Grained Recognition

An Empirical Study of End-to-End Temporal Action Detection

Vision-Language Pre-Training for Boosting Scene Text Detectors

Few Could Be Better Than All: Feature Sampling and Grouping for Scene Text Detection

Syntax-Aware Network for Handwritten Mathematical Expression Recognition

InstMove: Instance Motion for Object-Centric Video Segmentation

Turning a CLIP Model Into a Scene Text Detector

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

Side Adapter Network for Open-Vocabulary Semantic Segmentation

SOOD: Towards Semi-Supervised Oriented Object Detection

CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model

Modeling Entities As Semantic Points for Visual Information Extraction in the Wild

Relaxed Multiple-Instance SVM With Application to Object Discovery

Ensemble Diffusion for Retrieval

Asymmetric Non-Local Neural Networks for Semantic Segmentation

View N-Gram Network for 3D Object Retrieval

MINIMA: Modality Invariant Image Matching

Symmetry-Constrained Rectification Network for Scene Text Recognition

End-to-End Semi-Supervised Object Detection With Soft Teacher

A Simple Vision Transformer for Weakly Semi-supervised 3D Object Detection

ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer

Intra-class Feature Variation Distillation for Semantic Segmentation

Scene Text Image Super-Resolution in the Wild

Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting

AutoSTR: Efficient Backbone Search for Scene Text Recognition

An End-to-End Transformer Model for Crowd Localization

GitNet: Geometric Prior-Based Transformation for Birds-Eye-View Segmentation

CCPL: Contrastive Coherence Preserving Loss for Versatile Style Transfer

When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition

Optimal Boxes: Boosting End-to-End Scene Text Recognition by Adjusting Annotated Bounding Boxes via Reinforcement Learning

Toward Understanding WordArt: Corner-Guided Transformer for Scene Text Recognition

SeqFormer: Sequential Transformer for Video Instance Segmentation

In Defense of Online Models for Video Instance Segmentation

A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-Language Model

Learn to Scale: Generating Multipolar Normalized Density Maps for Crowd Counting

A Unified Image-Dense Annotation Generation Model for Underwater Scenes

SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

ReCamMaster: Camera-Controlled Generative Rendering from A Single Video

Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method

Multi-scenario Overlapping Text Segmentation with Depth Awareness

Training-free Geometric Image Editing on Diffusion Models

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis

Symmetry-Based Text Line Detection in Natural Scenes

Bootstrap Your Object Detector via Mixed Training

Query-based Temporal Fusion with Explicit Motion for 3D Object Detection