Licheng Yu
26
Papers
77
Total Citations
Papers (26)
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
CVPR 2024
63
citations
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
CVPR 2025 (arXiv)
7
citations
ROICtrl: Boosting Instance Control for Visual Generation
CVPR 2025
7
citations
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
CVPR 2024
0
citations
Layout-Agnostic Scene Text Image Synthesis with Diffusion Models
CVPR 2024
0
citations
Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis
CVPR 2024
0
citations
A Joint Speaker-Listener-Reinforcer Model for Referring Expressions
CVPR 2017 (arXiv)
0
citations
MAttNet: Modular Attention Network for Referring Expression Comprehension
CVPR 2018 (arXiv)
0
citations
Multi-Target Embodied Question Answering
CVPR 2019
0
citations
BachGAN: High-Resolution Image Synthesis From Salient Object Layout
CVPR 2020 (arXiv)
0
citations
Violin: A Large-Scale Dataset for Video-and-Language Inference
CVPR 2020 (arXiv)
0
citations
Connecting What To Say With Where To Look by Modeling Human Attention Traces
CVPR 2021 (arXiv)
0
citations
Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment
CVPR 2022 (arXiv)
0
citations
Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations
CVPR 2023 (arXiv)
0
citations
Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation
CVPR 2023 (arXiv)
0
citations
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
CVPR 2023
0
citations
Visual Madlibs: Fill in the Blank Description Generation and Question Answering
ICCV 2015
0
citations
CiT: Curation in Training for Effective Vision-Language Data
ICCV 2023 (arXiv)
0
citations
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
ECCV 2020
0
citations
TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
ECCV 2020
0
citations
UNITER: UNiversal Image-TExt Representation Learning
ECCV 2020
0
citations
FashionViL: Fashion-Focused Vision-and-Language Representation Learning
ECCV 2022
0
citations
Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction
CVPR 2025
0
citations
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
ECCV 2022
0
citations
Apollo: An Exploration of Video Understanding in Large Multimodal Models
CVPR 2025
0
citations
AVID: Any-Length Video Inpainting with Diffusion Model
CVPR 2024
0
citations