Wei Zhang

46 Papers

386 Total Citations

Papers (46)

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Latent Space Editing in Transformer-Based Flow Matching

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

Language-Driven Anchors for Zero-Shot Adversarial Robustness

Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection

Object Detection using Event Camera: A MoE Heat Conduction based Detector and A New Benchmark Dataset

HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

Gaussian Process Neural Additive Models

LaneGraph2Seq: Lane Topology Extraction with Language Model via Vertex-Edge Encoding and Connectivity Enhancement

GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

PBCAT: Patch-Based Composite Adversarial Training against Physically Realizable Attacks on Object Detection

As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss

Less Attention is More: Prompt Transformer for Generalized Category Discovery

EasyCraft: A Robust and Efficient Framework for Automatic Avatar Crafting

Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment

SleepSMC: Ubiquitous Sleep Staging via Supervised Multimodal Coordination

Context Guided Transformer Entropy Modeling for Video Compression

Learning Implicit Features with Flow-Infused Transformations for Realistic Virtual Try-On

Symbolic Cognitive Diagnosis via Hybrid Optimization for Intelligent Education Systems

Decoupled Motion Expression Video Segmentation

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

AdaDrive: Self-Adaptive Slow-Fast System for Language-Grounded Autonomous Driving

VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving

LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation

General Compression Framework for Efficient Transformer Object Tracking

Efficient Event Camera Data Pretraining with Adaptive Prompt Fusion

PerReactor: Offline Personalised Multiple Appropriate Facial Reaction Generation

In2NeCT: Inter-class and Intra-class Neural Collapse Tuning for Semantic Segmentation of Imbalanced Remote Sensing Images

Coherency Improved Explainable Recommendation via Large Language Model

STAIR: Manipulating Collaborative and Multimodal Information for E-Commerce Recommendation

CGMGM: A Cross-Gaussian Mixture Generative Model for Few-Shot Semantic Segmentation

SFOD: Spiking Fusion Object Detector

EVS-assisted Joint Deblurring Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning

Event-based Visible and Infrared Fusion via Multi-task Collaboration

Holistic Autonomous Driving Understanding by Bird’s-Eye-View Injected Multi-Modal Large Models

HetSSNet: Spatial-Spectral Heterogeneous Graph Learning Network for Panchromatic and Multispectral Images Fusion

ESNet: Evolution and Succession Network for High-Resolution Salient Object Detection

Interpreting and Improving Large Language Models in Arithmetic Calculation