Xihui Liu

46 Papers
1,260 Total Citations

Papers (46)

WorldSimBench: Towards Video Generation Models as World Simulators

ICML 2025
806
citations

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

ICCV 2025
127
citations

Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training

CVPR 2024
77
citations

GameFactory: Creating New Games with Generative Interactive Videos

ICCV 2025
63
citations

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

NeurIPS 2025
60
citations

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

CVPR 2025
25
citations

GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

ICCV 2025
22
citations

DreamComposer: Controllable 3D Object Generation via Multi-View Conditions

CVPR 2024
19
citations

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

ICCV 2025
17
citations

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

ICCV 2025
17
citations

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

ICCV 2025
11
citations

MBQ: Modality-Balanced Quantization for Large Vision-Language Models

CVPR 2025
10
citations

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

NeurIPS 2025
6
citations

Object Detection in Videos With Tubelet Proposal Networks

CVPR 2017
0
citations

Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing

CVPR 2019
0
citations

HMAR: Efficient Hierarchical Masked Auto-Regressive Image Generation

CVPR 2025
0
citations

Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning

CVPR 2023
0
citations

GLeaD: Improving GANs With a Generator-Leading Task

CVPR 2023
0
citations

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

CVPR 2023
0
citations

Back to the Source: Diffusion-Driven Adaptation To Test-Time Corruption

CVPR 2023
0
citations

Learning Transferable Spatiotemporal Representations From Natural Script Knowledge

CVPR 2023
0
citations

HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis

ICCV 2017
0
citations

Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-Identification

ICCV 2017
0
citations

CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

ICCV 2019
0
citations

DDP: Diffusion Model for Dense Visual Prediction

ICCV 2023
0
citations

Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions

ECCV 2020
0
citations

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval

ECCV 2022
0
citations

Bridging Video-Text Retrieval With Multiple Choice Questions

CVPR 2022
0
citations

Parallelized Autoregressive Visual Generation

CVPR 2025
0
citations

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

CVPR 2025
0
citations

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

CVPR 2025
0
citations

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

ICCV 2025
0
citations

LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

ICCV 2025
0
citations

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

ICCV 2025
0
citations

DreamCube: RGB-D Panorama Generation via Multi-plane Synchronization

ICCV 2025
0
citations

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

CVPR 2024
0
citations

Point Transformer V3: Simpler Faster Stronger

CVPR 2024
0
citations

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

CVPR 2024
0
citations

UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

ICML 2025
0
citations

FiT: Flexible Vision Transformer for Diffusion Model

ICML 2024
0
citations

Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis

NeurIPS 2019
0
citations

Point Transformer V2: Grouped Vector Attention and Partition-based Pooling

NeurIPS 2022
0
citations

Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images

NeurIPS 2023
0
citations

CorresNeRF: Image Correspondence Priors for Neural Radiance Fields

NeurIPS 2023
0
citations

OV-PARTS: Towards Open-Vocabulary Part Segmentation

NeurIPS 2023
0
citations

T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation

NeurIPS 2023
0
citations