Difei Gao

14 Papers

250 Total Citations

Papers (14)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

VideoLLM-online: Online Video Large Language Model for Streaming Video

AssistGUI: Task-Oriented PC Graphical User Interface Automation

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Affordance Grounding From Demonstration Video To Target Image

Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments

Learning to Learn: How to Continuously Teach Humans and Machines

UniVTG: Towards Unified Video-Language Temporal Grounding

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval

AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant

MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering

Factorized Learning for Temporally Grounded Video-Language Models

ViT-Lens: Towards Omni-modal Representations

Egocentric Video-Language Pretraining