Xiangtai Li

25
Papers
618
Total Citations

Papers (25)

OMG-Seg: Is One Model Good Enough For All Segmentation?

CVPR 2024
106
citations

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

ICLR 2024
104
citations

Point Cloud Mamba: Point Cloud Learning via State Space Model

AAAI 2025
81
citations

Towards Semantic Equivalence of Tokenization in Multimodal LLM

ICLR 2025
57
citations

RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation

CVPR 2024
53
citations

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

ICLR 2025
43
citations

PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning

AAAI 2025
30
citations

Towards Language-Driven Video Inpainting via Multimodal Large Language Models

CVPR 2024
30
citations

Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning

CVPR 2024
26
citations

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer

ICCV 2025
20
citations

Improving Video Segmentation via Dynamic Anchor Queries

ECCV 2024
19
citations

Explore In-Context Segmentation via Latent Diffusion Models

AAAI 2025
14
citations

DreamRelation: Bridging Customization and Relation Generation

CVPR 2025
10
citations

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

ICCV 2025
10
citations

Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer

ICCV 2025
6
citations

Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

CVPR 2025
4
citations

Conditional Panoramic Image Generation via Masked Autoregressive Modeling

NeurIPS 2025
4
citations

PointDGMamba: Domain Generalization of Point Cloud Classification via Generalized State Space Model

AAAI 2025
1
citation

DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation

CVPR 2025
0
citations

SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

CVPR 2025
0
citations

QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing

ICCV 2025
0
citations

Referring Image Editing: Object-level Image Editing via Referring Expressions

CVPR 2024
0
citations

BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model

CVPR 2024
0
citations

Unified Dense Prediction of Video Diffusion

CVPR 2025
0
citations

Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

CVPR 2025
0
citations