Zhen Li

Papers: 21

Total Citations: 171

Papers (21)

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

Topo2Seq: Enhanced Topology Reasoning via Topology Sequence Learning

Empowering Large Language Models with 3D Situation Awareness

SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

Unified Generation, Reconstruction, and Representation: Generalized Diffusion with Adaptive Latent Encoding-Decoding

DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering

K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

Consistency of Compositional Generalization Across Multiple Levels

CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues

WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding