Zhen Li

Papers: 57

Total Citations: 285

Papers (57)

Learning Semantic Relationships for Better Action Retrieval in Images

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

DV-3DLane: End-to-end Multi-modal 3D Lane Detection with Dual-view Representation

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

Topo2Seq: Enhanced Topology Reasoning via Topology Sequence Learning

Empowering Large Language Models with 3D Situation Awareness

AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models

SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving

Feedback Network for Image Super-Resolution

PointASNL: Robust Point Clouds Processing Using Nonlocal Neural Networks With Adaptive Sampling

Exemplar Normalization for Learning Deep Representation

Temporal Modulation Network for Controllable Space-Time Video Super-Resolution

Shallow Feature Matters for Weakly Supervised Object Localization

X-Trans2Cap: Cross-Modal Knowledge Transfer Using Transformer for 3D Dense Captioning

PhyIR: Physics-Based Inverse Rendering for Panoramic Indoor Images

Towards an End-to-End Framework for Flow-Guided Video Inpainting

Beyond 3D Siamese Tracking: A Motion-Centric Paradigm for 3D Single Object Tracking in Point Clouds

Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains

Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

BEV@DC: Bird's-Eye View Assisted Training for Depth Completion

AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation

Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes

DNF: Decouple and Feedback Network for Seeing in the Dark

Learning Transformation-Predictive Representations for Detection and Description of Local Features

High-Resolution Shape Completion Using Deep Neural Networks for Global Structure and Local Geometry Inference

Semi-Supervised Video Salient Object Detection Using Pseudo-Labels

Box-Aware Feature Enhancement for Single Object Tracking on Point Clouds

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds Through Instance Multi-Level Contextual Referring

SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training

SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

LATR: 3D Lane Detection from Monocular Images with Transformer

RankMatch: Fostering Confidence and Consistency in Learning with Noisy Labels

SRFormer: Permuted Self-Attention for Single Image Super-Resolution

Towards Content-Independent Multi-Reference Super-Resolution: Adaptive Pattern Matching and Feature Aggregation

2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

Weakly Supervised Object Localization through Inter-class Feature Similarity and Intra-Class Appearance Consistency

Free-Form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud

DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering

K-LoRA: Unlocking Training-Free Fusion of Any Subject and Style LoRAs

AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

Consistency of Compositional Generalization Across Multiple Levels

CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues

WeakPCSOD: Overcoming the Bias of Box Annotations for Weakly Supervised Point Cloud Salient Object Detection

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Unified Generation, Reconstruction, and Representation: Generalized Diffusion with Adaptive Latent Encoding-Decoding

Blockout: Dynamic Model Selection for Hierarchical Deep Networks

Deep Neural Nets with Interpolating Function as Output Activation

Divide and Contrast: Source-free Domain Adaptation via Adaptive Contrastive Learning

Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

Amazon-M2: A Multilingual Multi-locale Shopping Session Dataset for Recommendation and Text Generation