Licheng Yu

26 Papers

77 Total Citations

Papers (26)

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

ROICtrl: Boosting Instance Control for Visual Generation

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Layout-Agnostic Scene Text Image Synthesis with Diffusion Models

Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

MAttNet: Modular Attention Network for Referring Expression Comprehension

Multi-Target Embodied Question Answering

BachGAN: High-Resolution Image Synthesis From Salient Object Layout

Violin: A Large-Scale Dataset for Video-and-Language Inference

Connecting What To Say With Where To Look by Modeling Human Attention Traces

Unsupervised Vision-and-Language Pre-Training via Retrieval-Based Multi-Granular Alignment

Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations

Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation

FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks

Visual Madlibs: Fill in the Blank Description Generation and Question Answering

CiT: Curation in Training for Effective Vision-Language Data

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval

UNITER: UNiversal Image-TExt Representation Learning

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

Apollo: An Exploration of Video Understanding in Large Multimodal Models

AVID: Any-Length Video Inpainting with Diffusion Model