Ying Shan

106 Papers
2,552 Total Citations

Papers (106)

T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion

AAAI 2024 (arXiv)
1,423
citations

EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

CVPR 2024
237
citations

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

CVPR 2024
139
citations

ST-LLM: Large Language Models Are Effective Temporal Learners

ECCV 2024
124
citations

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

ICLR 2024
110
citations

Taming Rectified Flow for Inversion and Editing

ICML 2025
110
citations

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

CVPR 2024
89
citations

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

ECCV 2024
50
citations

Image Conductor: Precision Control for Interactive Video Synthesis

AAAI 2025
46
citations

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

CVPR 2025
44
citations

TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models

ICCV 2025
35
citations

Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh

CVPR 2025 (arXiv)
23
citations

GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

ICCV 2025
19
citations

Programmable Motion Generation for Open-Set Motion Control Tasks

CVPR 2024
16
citations

Scalable Image Tokenization with Index Backpropagation Quantization

ICCV 2025
16
citations

ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

CVPR 2024
15
citations

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

CVPR 2024
11
citations

SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views

AAAI 2024 (arXiv)
10
citations

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

ICCV 2025
9
citations

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

CVPR 2025
7
citations

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

ICML 2025
6
citations

NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images

CVPR 2025
6
citations

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning

NeurIPS 2025 (arXiv)
4
citations

Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion

CVPR 2025 (arXiv)
3
citations

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

CVPR 2024
0
citations

ViT-Lens: Towards Omni-modal Representations

CVPR 2024
0
citations

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

CVPR 2024
0
citations

Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

CVPR 2021 (arXiv)
0
citations

Open-Book Video Captioning With Retrieve-Copy-Generate Network

CVPR 2021 (arXiv)
0
citations

Towards Real-World Blind Face Restoration With Generative Facial Prior

CVPR 2021 (arXiv)
0
citations

Bridging Video-Text Retrieval With Multiple Choice Questions

CVPR 2022 (arXiv)
0
citations

Object-Aware Video-Language Pre-Training for Retrieval

CVPR 2022 (arXiv)
0
citations

BTS: A Bi-Lingual Benchmark for Text Segmentation in the Wild

CVPR 2022
0
citations

Temporally Efficient Vision Transformer for Video Instance Segmentation

CVPR 2022 (arXiv)
0
citations

UMT: Unified Multi-Modal Transformers for Joint Video Moment Retrieval and Highlight Detection

CVPR 2022 (arXiv)
0
citations

Accelerating Vision-Language Pretraining With Free Language Modeling

CVPR 2023 (arXiv)
0
citations

3D GAN Inversion With Facial Symmetry Prior

CVPR 2023 (arXiv)
0
citations

Generating Human Motion From Textual Descriptions With Discrete Representations

CVPR 2023 (arXiv)
0
citations

DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

CVPR 2023 (arXiv)
0
citations

DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks

CVPR 2023 (arXiv)
0
citations

Improved Test-Time Adaptation for Domain Generalization

CVPR 2023 (arXiv)
0
citations

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions

CVPR 2023
0
citations

High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors

CVPR 2023 (arXiv)
0
citations

All in One: Exploring Unified Video-Language Pre-Training

CVPR 2023 (arXiv)
0
citations

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

CVPR 2023 (arXiv)
0
citations

Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields

CVPR 2023 (arXiv)
0
citations

LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation

CVPR 2023 (arXiv)
0
citations

OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer

CVPR 2023 (arXiv)
0
citations

Learning Anchor Transformations for 3D Garment Animation

CVPR 2023 (arXiv)
0
citations

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

CVPR 2023
0
citations

RILS: Masked Visual Reconstruction in Language Semantic Space

CVPR 2023 (arXiv)
0
citations

SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes

CVPR 2023 (arXiv)
0
citations

Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry

CVPR 2023 (arXiv)
0
citations

Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models

CVPR 2023 (arXiv)
0
citations

Instances As Queries

ICCV 2021
0
citations

Crossover Learning for Fast Online Video Instance Segmentation

ICCV 2021 (arXiv)
0
citations

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

ICCV 2023 (arXiv)
0
citations

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

ICCV 2023
0
citations

Order-Prompted Tag Sequence Generation for Video Tagging

ICCV 2023
0
citations

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

ICCV 2023 (arXiv)
0
citations

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

ICCV 2023 (arXiv)
0
citations

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

ICCV 2023 (arXiv)
0
citations

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

ICCV 2023 (arXiv)
0
citations

OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution

ICCV 2023 (arXiv)
0
citations

Exploring Model Transferability through the Lens of Potential Energy

ICCV 2023 (arXiv)
0
citations

Fast Video Object Segmentation using the Global Context Module

ECCV 2020
0
citations

Metric Learning Based Interactive Modulation for Real-World Super-Resolution

ECCV 2022
0
citations

VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder

ECCV 2022
0
citations

Mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training

ECCV 2022
0
citations

Not All Models Are Equal: Predicting Model Transferability in a Self-Challenging Fisher Space

ECCV 2022
0
citations

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval

ECCV 2022
0
citations

Towards Vivid and Diverse Image Colorization With Generative Color Prior

ICCV 2021 (arXiv)
0
citations

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

CVPR 2025
0
citations

DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation

CVPR 2025
0
citations

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

ICCV 2025
0
citations

VisionMath: Vision-Form Mathematical Problem-Solving

ICCV 2025
0
citations

FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

ICCV 2025
0
citations

Mamba-3VL: Taming State Space Model for 3D Vision Language Learning

ICCV 2025
0
citations

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

ICCV 2025
0
citations

DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation

ICCV 2025
0
citations

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

AAAI 2025
0
citations

A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields

AAAI 2024
0
citations

SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images

AAAI 2024
0
citations

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

CVPR 2024
0
citations

GS-IR: 3D Gaussian Splatting for Inverse Rendering

CVPR 2024
0
citations

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

CVPR 2024
0
citations

YOLO-World: Real-Time Open-Vocabulary Object Detection

CVPR 2024
0
citations

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

CVPR 2024
0
citations

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

CVPR 2024
0
citations

HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

CVPR 2024
0
citations

SEED-Bench: Benchmarking Multimodal Large Language Models

CVPR 2024
0
citations

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs

CVPR 2024
0
citations

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition

CVPR 2024
0
citations

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

CVPR 2024
0
citations

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

CVPR 2024
0
citations

Detecting Interactions from Neural Networks via Topological Analysis

NeurIPS 2020
0
citations

Finding Discriminative Filters for Specific Degradations in Blind Super-Resolution

NeurIPS 2021
0
citations

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

NeurIPS 2022
0
citations

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes

NeurIPS 2022
0
citations

PanoGRF: Generalizable Spherical Radiance Fields for Wide-baseline Panoramas

NeurIPS 2023
0
citations

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

NeurIPS 2023
0
citations

CL-NeRF: Continual Learning of Neural Radiance Fields for Evolving Scene Representation

NeurIPS 2023
0
citations

Exploiting Contextual Objects and Relations for 3D Visual Grounding

NeurIPS 2023
0
citations

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

NeurIPS 2023
0
citations

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

NeurIPS 2023
0
citations

Inserting Anybody in Diffusion Models via Celeb Basis

NeurIPS 2023
0
citations