Xing Sun
33
Papers
2,248
Total Citations
1
h-index
1
Affiliations
Tencent
Papers (33)
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
NeurIPS 2025
1,227
citations
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
CVPR 2025
858
citations
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
ICML 2025
103
citations
Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models
CVPR 2024
37
citations
SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space
AAAI 2024 (arXiv)
13
citations
Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation
AAAI 2024 (arXiv)
10
citations
Aligning and Prompting Everything All at Once for Universal Visual Perception
CVPR 2024
0
citations
DS-VLM: Diffusion Supervision Vision Language Model
ICML 2025
0
citations
Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training
CVPR 2019
0
citations
Filter Grafting for Deep Neural Networks
CVPR 2020 (arXiv)
0
citations
Temporal Modulation Network for Controllable Space-Time Video Super-Resolution
CVPR 2021 (arXiv)
0
citations
Removing the Background by Adding the Background: Towards Background Robust Self-Supervised Video Representation Learning
CVPR 2021 (arXiv)
0
citations
DIFNet: Boosting Visual Information Flow for Image Captioning
CVPR 2022
0
citations
Training-Free Transformer Architecture Search
CVPR 2022 (arXiv)
0
citations
Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval With Partial Query
ICCV 2021
0
citations
PR-Net: Preference Reasoning for Personalized Video Highlight Detection
ICCV 2021
0
citations
Learning Canonical View Representation for 3D Shape Recognition With Arbitrary Views
ICCV 2021 (arXiv)
0
citations
Learning To Know Where To See: A Visibility-Aware Approach for Occluded Person Re-Identification
ICCV 2021
0
citations
Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration
ICCV 2023 (arXiv)
0
citations
D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
ICCV 2023 (arXiv)
0
citations
Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval
ICCV 2023
0
citations
Do Not Disturb Me: Person Re-identification Under the Interference of Other Pedestrians
ECCV 2020
0
citations
Efficient Decoder-Free Object Detection with Transformers
ECCV 2022
0
citations
DisCo: Remedying Self-Supervised Learning on Lightweight Models with Distilled Contrastive Learning
ECCV 2022
0
citations
PAC-Net: Highlight Your Video via History Preference Modeling
ECCV 2022
0
citations
Learning 3D Shape Feature for Texture-Insensitive Person Re-Identification
CVPR 2021
0
citations
Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
ICLR 2025
0
citations
Probability-Density-aware Semi-supervised Learning
AAAI 2025
0
citations
Visual Hallucination Elevates Speech Recognition
AAAI 2024
0
citations
A General and Efficient Training for Transformer via Token Expansion
CVPR 2024
0
citations
HRVDA: High-Resolution Visual Document Assistant
CVPR 2024
0
citations
Pruning Filter in Filter
NeurIPS 2020
0
citations
CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes
NeurIPS 2023
0
citations