Zhengyuan Yang
17 papers, 316 total citations
Papers (17)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
123 citations
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
CVPR 2024
49 citations
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
ICML 2025
44 citations
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
ICLR 2025 (arXiv)
34 citations
SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation
AAAI 2024 (arXiv)
23 citations
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
ICLR 2025 (arXiv)
17 citations
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
ICLR 2025
14 citations
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
NeurIPS 2025
12 citations
LiVOS: Light Video Object Segmentation with Gated Linear Matching
CVPR 2025
0 citations
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
ICML 2024
0 citations
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
ICCV 2025
0 citations
SITE: towards Spatial Intelligence Thorough Evaluation
ICCV 2025
0 citations
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
ICCV 2025
0 citations
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
CVPR 2024
0 citations
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
CVPR 2024
0 citations
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024
0 citations
StrokeNUWA—Tokenizing Strokes for Vector Graphic Synthesis
ICML 2024
0 citations