Xing Sun

Papers: 33

Total Citations: 2,248

h-index: 1

Affiliations (1)

Tencent

Papers (33)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space

Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation

Aligning and Prompting Everything All at Once for Universal Visual Perception

DS-VLM: Diffusion Supervision Vision Language Model

Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training

Filter Grafting for Deep Neural Networks

Temporal Modulation Network for Controllable Space-Time Video Super-Resolution

Removing the Background by Adding the Background: Towards Background Robust Self-Supervised Video Representation Learning

DIFNet: Boosting Visual Information Flow for Image Captioning

Training-Free Transformer Architecture Search

Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval With Partial Query

PR-Net: Preference Reasoning for Personalized Video Highlight Detection

Learning Canonical View Representation for 3D Shape Recognition With Arbitrary Views

Learning To Know Where To See: A Visibility-Aware Approach for Occluded Person Re-Identification

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval

Do Not Disturb Me: Person Re-identification Under the Interference of Other Pedestrians

Efficient Decoder-Free Object Detection with Transformers

DisCo: Remedying Self-Supervised Learning on Lightweight Models with Distilled Contrastive Learning

PAC-Net: Highlight Your Video via History Preference Modeling

Learning 3D Shape Feature for Texture-Insensitive Person Re-Identification

Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Probability-Density-aware Semi-supervised Learning

Visual Hallucination Elevates Speech Recognition

A General and Efficient Training for Transformer via Token Expansion

HRVDA: High-Resolution Visual Document Assistant

Pruning Filter in Filter

CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes