Difei Gao
14
Papers
250
Total Citations
Papers (14)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
123
citations
VideoLLM-online: Online Video Large Language Model for Streaming Video
CVPR 2024
109
citations
AssistGUI: Task-Oriented PC Graphical User Interface Automation
CVPR 2024
18
citations
Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text
CVPR 2020 (arXiv)
0
citations
Affordance Grounding From Demonstration Video To Target Image
CVPR 2023 (arXiv)
0
citations
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments
ICCV 2021
0
citations
Learning to Learn: How to Continuously Teach Humans and Machines
ICCV 2023 (arXiv)
0
citations
UniVTG: Towards Unified Video-Language Temporal Grounding
ICCV 2023 (arXiv)
0
citations
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
ECCV 2022
0
citations
AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant
ECCV 2022
0
citations
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering
CVPR 2023 (arXiv)
0
citations
Factorized Learning for Temporally Grounded Video-Language Models
ICCV 2025
0
citations
ViT-Lens: Towards Omni-modal Representations
CVPR 2024
0
citations
Egocentric Video-Language Pretraining
NeurIPS 2022
0
citations