Zhengyuan Yang

17
Papers
316
Total Citations

Papers (17)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

CVPR 2025
123
citations

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

CVPR 2024
49
citations

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

ICML 2025
44
citations

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

ICLR 2025 (arXiv)
34
citations

SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation

AAAI 2024 (arXiv)
23
citations

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

ICLR 2025 (arXiv)
17
citations

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

ICLR 2025
14
citations

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

NeurIPS 2025
12
citations

LiVOS: Light Video Object Segmentation with Gated Linear Matching

CVPR 2025
0
citations

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

ICML 2024
0
citations

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

ICCV 2025
0
citations

SITE: towards Spatial Intelligence Thorough Evaluation

ICCV 2025
0
citations

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

ICCV 2025
0
citations

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

CVPR 2024
0
citations

Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning

CVPR 2024
0
citations

DisCo: Disentangled Control for Realistic Human Dance Generation

CVPR 2024
0
citations

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

ICML 2024
0
citations