Ming-Hsuan Yang

199 Papers

2,852 Total Citations

Papers (199)

Universal Style Transfer via Feature Transforms

NeurIPS 2017

Language Model Beats Diffusion - Tokenizer is key to visual generation

Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers

Learning Affinity via Spatial Propagation Networks

NeurIPS 2017

Every Pixel Matters: Center-aware Feature Alignment for Domain Adaptive Object Detector

VidToMe: Video Token Merging for Zero-Shot Video Editing

RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval

Exploiting Diffusion Prior for Generalizable Dense Prediction

Multi-subject Open-set Personalization in Video Generation

Calibrated Multi-Preference Optimization for Aligning Diffusion Models

Efficient Visual State Space Model for Image Deblurring

Controllable Image Synthesis via SegVAE

CSL: Class-Agnostic Structure-Constrained Learning for Segmentation including the Unseen

AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting

OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation

Cropper: Vision-Language Model for Image Cropping through In-Context Learning

Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh

HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

Learning Deblurring Texture Prior from Unpaired Data with Diffusion Model

CompleteMe: Reference-based Human Image Completion

From Prompt to Progression: Taming Video Diffusion Models for Seamless Attribute Transition

Toward Material-Agnostic System Identification from Videos

GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting

VideoPoet: A Large Language Model for Zero-Shot Video Generation

VideoPrism: A Foundational Visual Encoder for Video Understanding

Structural Sparse Tracking

Adaptive Region Pooling for Object Detection

PatchCut: Data-Driven Object Segmentation via Local Shape Transfer

Salient Object Detection via Bootstrap Learning

JOTS: Joint Online Tracking and Segmentation

Deep Networks for Saliency Detection via Local Estimation and Global Search

Multi-Objective Convolutional Learning for Face Labeling

Multi-Instance Object Segmentation With Occlusion Handling

Long-Term Correlation Tracking

Object Contour Detection With a Fully Convolutional Encoder-Decoder Network

Soft-Segmentation Guided Object Motion Deblurring

Online Multi-Object Tracking via Structural Constraint Event Aggregation

Blind Image Deblurring Using Dark Channel Prior

A Comparative Study for Single Image Blind Deblurring

Image Deblurring Using Smartphone Inertial Sensors

Robust Kernel Estimation With Outliers Handling for Image Deblurring

Weakly Supervised Object Localization With Progressive Domain Adaptation

Video Segmentation via Object Flow

Object Tracking via Dual Linear Structured SVM and Explicit Feature Map

Hedged Deep Tracking

Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution

Deep Image Harmonization

Learning Fully Convolutional Networks for Iterative Non-Blind Deconvolution

Generative Face Completion

Diversified Texture Synthesis With Feed-Forward Networks

Multi-Task Correlation Particle Filter for Robust Object Tracking

Correlation Tracking via Joint Discrimination and Reliability Learning

Learning Superpixels With Segmentation-Aware Affinity Loss

Dynamic Scene Deblurring Using Spatially Variant Recurrent Neural Networks

SPLATNet: Sparse Lattice Networks for Point Cloud Processing

Learning Dual Convolutional Neural Networks for Low-Level Vision

PiCANet: Learning Pixel-Wise Contextual Attention for Saliency Detection

Gated Fusion Network for Single Image Dehazing

Learning to Localize Sound Source in Visual Scenes

Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking

Learning a Discriminative Prior for Blind Image Deblurring

Fast and Accurate Online Video Object Segmentation via Tracking Parts

Learning to Adapt Structured Output Space for Semantic Segmentation

Weakly Supervised Coupled Networks for Visual Sentiment Analysis

Deep Semantic Face Deblurring

Learning Spatial-Aware Regressions for Visual Tracking

VITAL: VIsual Tracking via Adversarial Learning

Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation

SCOPS: Self-Supervised Co-Part Segmentation

Target-Aware Deep Tracking

Mode Seeking Generative Adversarial Networks for Diverse Image Synthesis

Im2Pencil: Controllable Pencil Illustration From Photographs

Spatially Variant Linear Representation Models for Joint Filtering

CrDoCo: Pixel-Level Domain Transfer With Cross-Domain Consistency

Depth-Aware Video Frame Interpolation

Learning Linear Transformations for Fast Image and Video Style Transfer

Inserting Videos Into Videos

Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments

Single-Image HDR Reconstruction by Learning to Reverse the Camera Pipeline

Composing Good Shots by Exploiting Mutual Relations

CycleISP: Real Image Restoration via Improved Data Synthesis

Multi-Scale Boosted Dehazing Network With Dense Feature Fusion

Collaborative Distillation for Ultra-Resolution Universal Style Transfer

Rethinking Class-Balanced Methods for Long-Tailed Visual Recognition From a Domain Adaptation Perspective

Weakly-Supervised Semantic Segmentation via Sub-Category Exploration

Learning to See Through Obstructions

ReMix: Towards Image-to-Image Translation With Limited Data

Regularizing Generative Adversarial Networks Under Limited Data

Decoupled Dynamic Filter Networks

Spatiotemporal Contrastive Video Representation Learning

Multi-Stage Progressive Image Restoration

Contextualized Spatio-Temporal Contrastive Learning With Self-Supervision

Video Frame Interpolation Transformer

Burst Image Restoration and Enhancement

Restormer: Efficient Transformer for High-Resolution Image Restoration

Hierarchical Modular Network for Video Captioning

InOut: Diverse Image Outpainting via GAN Inversion

Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection

Burstormer: Burst Image Restoration and Enhancement Transformer

Self-Supervised Super-Plane for Neural 3D Reconstruction

MAGVIT: Masked Generative Video Transformer

Improving Zero-Shot Generalization and Robustness of Multi-Modal Models

Learning To Dub Movies via Hierarchical Prosody Models

Self-Supervised AutoFlow

Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery From Sparse Image Ensemble

What Makes an Object Memorable?

Fast and Accurate Head Pose Estimation via Random Projection Forests

Hierarchical Convolutional Features for Visual Tracking

Learning to Super-Resolve Blurry Face and Text Images

Unsupervised Representation Learning by Sorting Sequences

SegFlow: Joint Learning for Video Object Segmentation and Optical Flow

Learning Discriminative Data Fitting Functions for Blind Image Deblurring

Video Deblurring via Semantic Segmentation and Pixel-Wise Non-Linear Kernel

Blind Image Deblurring With Outlier Handling

CREST: Convolutional Residual Learning for Visual Tracking

Scene Parsing With Global Context Embedding

Unsupervised Domain Adaptation for Face Recognition in Unlabeled Videos

Referring Expression Generation and Comprehension via Attributes

The Road To Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation

Learning To Stylize Novel Views

COMISR: Compression-Informed Video Super-Resolution

Hybrid Neural Fusion for Full-Frame Video Stabilization

Discovering 3D Parts From Image Collections

Benchmarking Ultra-High-Definition Image Super-Resolution

Video Matting via Consistency-Regularized Graph Neural Networks

D2-Net: Weakly-Supervised Action Localization via Discriminative Embeddings and Denoised Activations

Unified Visual Relationship Detection with Vision and Language Models

SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image

Delving into Motion-Aware Matching for Monocular 3D Object Tracking

Self-regulating Prompts: Foundational Model Adaptation without Forgetting

MiniROAD: Minimal RNN Framework for Online Action Detection

Generative Multiplane Neural Radiance for 3D-Aware Image Generation

SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

InfiniCity: Infinite-Scale City Synthesis

CiteTracker: Correlating Image and Text for Visual Tracking

High Quality Entity Segmentation

Counting Crowds in Bad Weather

Neural Design Network: Graphic Layout Generation with Constraints

Learnable Cost Volume Using the Cayley Representation

Video Object Detection via Object-level Temporal Aggregation

Self-supervised Single-view 3D Reconstruction via Semantic Consistency

Modeling Artistic Workflows for Image Generation and Editing

Adversarial Training with Bi-directional Likelihood Regularization for Visual Classification

Learning Enriched Features for Real Image Restoration and Enhancement

Learning Visibility for Robust Dense Human Body Estimation

Autoregressive 3D Shape Generation via Canonical Mapping

Class-Agnostic Object Detection with Multi-modal Transformer

Adaptive Transformers for Robust Few-Shot Cross-Domain Face Anti-Spoofing

Scraping Textures from Natural Images for Synthesis and Editing

Learning Discriminative Shrinkage Deep Networks for Image Deconvolution

CA-SSL: Class-Agnostic Semi-Supervised Learning for Detection and Segmentation

V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer

Weakly-supervised Disentangling with Recurrent Transformations for 3D View Synthesis

Semi-Supervised Learning for Optical Flow with Generative Adversarial Networks

CLR: Channel-wise Lightweight Reprogramming for Continual Learning

DynamicScaler: Seamless and Scalable Video Generation for Panoramic Scenes

UniRestore: Unified Perceptual and Task-Oriented Image Restoration Model Using Diffusion Prior

Move-in-2D: 2D-Conditioned Human Motion Generation

Unified Dense Prediction of Video Diffusion

Frequency Domain-Based Diffusion Model for Unpaired Image Dehazing

FaceLift: Learning Generalizable Single Image 3D Face Reconstruction from Synthetic Heads

Efficient Concertormer for Image Deblurring and Beyond

QK-Edit: Revisiting Attention-based Injection in MM-DiT for Image and Video Editing

Controllable 3D Outdoor Scene Generation via Scene Graphs

Generating Synthetic Data for Unsupervised Federated Learning of Cross-Modal Retrieval

BEV-MAE: Bird’s Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving Scenarios

DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes

No More Ambiguity in 360° Room Layout via Bi-Layout Estimation

Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence

RTracker: Recoverable Tracking via PN Tree Structured Memory

Text-Driven Image Editing via Learnable Regions

VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Weakly Supervised Video Individual Counting

GLaMM: Pixel Grounding Large Multimodal Model

Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring

UniGS: Unified Representation for Image Generation and Segmentation

PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection

VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception

Deep Non-Blind Deconvolution via Generalized Low-Rank Approximation

Deep Attentive Tracking via Reciprocative Learning

Context-aware Synthesis and Placement of Object Instances

Joint-task Self-supervised Learning for Temporal Correspondence

Dancing to Music

Quadratic Video Interpolation

Online Adaptation for Consistent Mesh Reconstruction in the Wild

Learning 3D Dense Correspondence via Canonical Point Autoencoder

Exploring Cross-Video and Cross-Modality Signals for Weakly-Supervised Audio-Visual Video Parsing

Intriguing Properties of Vision Transformers

End-to-end Multi-modal Video Temporal Grounding

LASSIE: Learning Articulated Shapes from Sparse Image Ensemble via 3D Part Discovery

AIMS: All-Inclusive Multi-Level Segmentation for Anything

Video Timeline Modeling For News Story Understanding

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections

Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs

Module-wise Adaptive Distillation for Multimodality Foundation Models