Chaoyou Fu

Google Scholar OpenReview

18 Papers

2,433 Total Citations

17 h-index

Papers (18)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

NeurIPS 2025 arXiv

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

Rethinking Image Cropping: Exploring Diverse Compositions From Global Views

CM-NAS: Cross-Modality Neural Architecture Search for Visible-Infrared Person Re-Identification

InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption

Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Aligning and Prompting Everything All at Once for Universal Visual Perception

Cross-Spectral Face Hallucination via Disentangling Independent Factors

Information Bottleneck Disentanglement for Identity Swapping

Pareidolia Face Reenactment

Dual Variational Generation for Low Shot Heterogeneous Face Recognition

AOT: Appearance Optimal Transport Based Identity Swapping for Forgery Detection

Multi-modal Queried Object Detection in the Wild

CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes