Dongdong Chen

Papers: 70

Total Citations: 92

Papers (70)

OmniViD: A Generative Framework for Universal Video Understanding

Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation

VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

SmartEraser: Remove Anything from Images using Masked-Region Guidance

FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

UNICL-SAM: Uncertainty-Driven In-Context Segmentation with Part Prototype Discovery

Olympus: A Universal Task Router for Computer Vision Tasks

Exploring Invariance in Images through One-way Wave Equations

Bringing Old Photos Back to Life

Robust Superpixel-Guided Attentional Adversarial Attack

Dynamic Convolution: Attention Over Convolution Kernels

Self-Robust 3D Point Recognition via Gather-Vector Guidance

Density-Aware Graph for Deep Semi-Supervised Visual Recognition

Unsupervised Pre-Training for Person Re-Identification

Diverse Semantic Image Synthesis via Probability Distribution Modeling

Dynamic Head: Unifying Object Detection Heads With Attentions

Improved Image Matting via Real-Time User Clicks and Uncertainty Estimation

Multi-Attentional Deepfake Detection

Mobile-Former: Bridging MobileNet and Transformer

CLIP-NeRF: Text-and-Image Driven Manipulation of Neural Radiance Fields

CSWin Transformer: A General Vision Transformer Backbone With Cross-Shaped Windows

Reduce Information Loss in Transformers for Pluralistic Image Inpainting

Large-Scale Pre-Training for Person Re-Identification With Noisy Labels

BEVT: BERT Pretraining of Video Transformers

Shape-Invariant 3D Adversarial Point Clouds

HairCLIP: Design Your Hair by Text and Reference Image

Bringing Old Films Back to Life

Robust Equivariant Imaging: A Fully Unsupervised Framework for Learning To Image From Noisy and Partial Measurements

General Facial Representation Learning in a Visual-Linguistic Manner

Vector Quantized Diffusion Model for Text-to-Image Synthesis

Protecting Celebrities From DeepFake With Identity Consistency Transformer

Diversity-Aware Meta Visual Prompting

Look Before You Match: Instance Understanding Matters in Video Object Segmentation

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning

MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

Streaming Video Model

Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding

Coherent Online Video Style Transfer

Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once

Learning With Noisy Labels for Robust Point Cloud Segmentation

High-Fidelity Pluralistic Image Completion With Transformers

Equivariant Imaging: Learning Beyond the Range Space

MicroNet: Improving Image Recognition With Extremely Low FLOPs

Improve Unsupervised Pretraining for Few-Label Transfer

Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting

HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending

AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control

Dynamic ReLU

DA-NAS: Data Adapted Pruning for Efficient Neural Architecture Search

Deep Decomposition Learning for Inverse Imaging Problems

Should All Proposals Be Treated Equally in Object Detection?

Bootstrapped Masked Autoencoders for Vision BERT Pretraining

LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud Based Deep Networks

Show and Segment: Universal Medical Image Segmentation via In-Context Learning

I2V3D: Controllable Image-to-video Generation with 3D Guidance

Equivariant Multi-Modality Image Fusion

Towards More Unified In-context Visual Understanding

Image Fusion via Vision-Language Model

StyleBank: An Explicit Representation for Neural Image Style Transfer

Stereoscopic Neural Style Transfer

Transductive Zero-Shot Learning with Visual Structure Constraint

GreedyFool: Distortion-Aware Sparse Adversarial Attack

Passport-aware Normalization for Deep Model Protection

Stronger NAS with Weaker Predictors

Unsupervised Learning From Incomplete Measurements for Inverse Problems

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection