Chaoyou Fu


Papers: 10

Total Citations: 2,450

h-index: 17

Papers (10)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

NeurIPS 2025, arXiv

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

VITA-Audio: Fast Interleaved Audio-Text Token Generation for Efficient Large Speech-Language Model

Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Aligning and Prompting Everything All at Once for Universal Visual Perception