Ming-Hsuan Yang

45
Papers
1,144
Total Citations

Papers (45)

Language Model Beats Diffusion - Tokenizer is key to visual generation

ICLR 2024
525
citations

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

CVPR 2024
341
citations

VidToMe: Video Token Merging for Zero-Shot Video Editing

CVPR 2024
89
citations

Exploiting Diffusion Prior for Generalizable Dense Prediction

CVPR 2024
42
citations

Multi-subject Open-set Personalization in Video Generation

CVPR 2025 (arXiv)
40
citations

Calibrated Multi-Preference Optimization for Aligning Diffusion Models

CVPR 2025
24
citations

Efficient Visual State Space Model for Image Deblurring

CVPR 2025
23
citations

CSL: Class-Agnostic Structure-Constrained Learning for Segmentation including the Unseen

AAAI 2024 (arXiv)
15
citations

AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting

ICCV 2025
9
citations

OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

NeurIPS 2025
8
citations

Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

CVPR 2025
8
citations

Cropper: Vision-Language Model for Image Cropping through In-Context Learning

CVPR 2025
5
citations

Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

CVPR 2024
4
citations

MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh

ICCV 2025
3
citations

HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

NeurIPS 2025
3
citations

Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model

ICCV 2025
2
citations

Toward Material-Agnostic System Identification from Videos

ICCV 2025
1
citation

CompleteMe: Reference-based Human Image Completion

ICCV 2025
1
citation

From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition

ICCV 2025
1
citation

GLaMM: Pixel Grounding Large Multimodal Model

CVPR 2024
0
citations

Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring

CVPR 2024
0
citations

UniGS: Unified Representation for Image Generation and Segmentation

CVPR 2024
0
citations

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

CVPR 2024
0
citations

VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception

ICML 2024
0
citations

GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting

ICML 2024
0
citations

VideoPoet: A Large Language Model for Zero-Shot Video Generation

ICML 2024
0
citations

VideoPrism: A Foundational Visual Encoder for Video Understanding

ICML 2024
0
citations

DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

CVPR 2025
0
citations

UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior

CVPR 2025
0
citations

Move-in-2D: 2D-Conditioned Human Motion Generation

CVPR 2025
0
citations

Unified Dense Prediction of Video Diffusion

CVPR 2025
0
citations

Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing

ICCV 2025
0
citations

FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads

ICCV 2025
0
citations

Efficient Concertormer for Image Deblurring and Beyond

ICCV 2025 (arXiv)
0
citations

QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing

ICCV 2025
0
citations

Controllable 3D Outdoor Scene Generation via Scene Graphs

ICCV 2025
0
citations

Generating Synthetic Data for Unsupervised Federated Learning of Cross-Modal Retrieval

AAAI 2025
0
citations

BEV-MAE: Bird’s Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios

AAAI 2024
0
citations

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes

CVPR 2024
0
citations

No More Ambiguity in 360° Room Layout via Bi-Layout Estimation

CVPR 2024
0
citations

Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

CVPR 2024
0
citations

RTracker: Recoverable Tracking via PN Tree Structured Memory

CVPR 2024
0
citations

Text-Driven Image Editing via Learnable Regions

CVPR 2024
0
citations

VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

CVPR 2024
0
citations

Weakly Supervised Video Individual Counting

CVPR 2024
0
citations