Ying Shan

106 Papers

2,552 Total Citations

Papers (106)

T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

ST-LLM: Large Language Models Are Effective Temporal Learners

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Taming Rectified Flow for Inversion and Editing

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Image Conductor: Precision Control for Interactive Video Synthesis

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

Programmable Motion Generation for Open-Set Motion Control Tasks

Scalable Image Tokenization with Index Backpropagation Quantization

ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning (NeurIPS 2025)

Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

ViT-Lens: Towards Omni-modal Representations

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

Open-Book Video Captioning With Retrieve-Copy-Generate Network

Towards Real-World Blind Face Restoration With Generative Facial Prior

Bridging Video-Text Retrieval With Multiple Choice Questions

Object-Aware Video-Language Pre-Training for Retrieval

BTS: A Bi-Lingual Benchmark for Text Segmentation in the Wild

Temporally Efficient Vision Transformer for Video Instance Segmentation

UMT: Unified Multi-Modal Transformers for Joint Video Moment Retrieval and Highlight Detection

Accelerating Vision-Language Pretraining With Free Language Modeling

3D GAN Inversion With Facial Symmetry Prior

Generating Human Motion From Textual Descriptions With Discrete Representations

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks

Improved Test-Time Adaptation for Domain Generalization

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions

High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors

All in One: Exploring Unified Video-Language Pre-Training

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields

LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation

OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer

Learning Anchor Transformations for 3D Garment Animation

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

RILS: Masked Visual Reconstruction in Language Semantic Space

SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes

Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry

Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models

Instances As Queries

Crossover Learning for Fast Online Video Instance Segmentation

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Order-Prompted Tag Sequence Generation for Video Tagging

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution

Exploring Model Transferability through the Lens of Potential Energy

Fast Video Object Segmentation using the Global Context Module

Metric Learning Based Interactive Modulation for Real-World Super-Resolution

VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder

Mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training

Not All Models Are Equal: Predicting Model Transferability in a Self-Challenging Fisher Space

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval

Towards Vivid and Diverse Image Colorization With Generative Color Prior

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

VisionMath: Vision-Form Mathematical Problem-Solving

FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

Mamba-3VL: Taming State Space Model for 3D Vision Language Learning

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields

SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

GS-IR: 3D Gaussian Splatting for Inverse Rendering

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

YOLO-World: Real-Time Open-Vocabulary Object Detection

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

SEED-Bench: Benchmarking Multimodal Large Language Models

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

Detecting Interactions from Neural Networks via Topological Analysis

Finding Discriminative Filters for Specific Degradations in Blind Super-Resolution

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes

PanoGRF: Generalizable Spherical Radiance Fields for Wide-baseline Panoramas

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

CL-NeRF: Continual Learning of Neural Radiance Fields for Evolving Scene Representation

Exploiting Contextual Objects and Relations for 3D Visual Grounding

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Inserting Anybody in Diffusion Models via Celeb Basis