33
Papers
2,248
Total Citations
1
h-index
1
Affiliations

Affiliations

Tencent

Papers (33)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

NeurIPS 2025
1,227
citations

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

CVPR 2025
858
citations

Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM

ICML 2025
103
citations

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

CVPR 2024
37
citations

SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space

AAAI 2024 (arXiv)
13
citations

Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation

AAAI 2024 (arXiv)
10
citations

Aligning and Prompting Everything All at Once for Universal Visual Perception

CVPR 2024
0
citations

DS-VLM: Diffusion Supervision Vision Language Model

ICML 2025
0
citations

Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training

CVPR 2019
0
citations

Filter Grafting for Deep Neural Networks

CVPR 2020 (arXiv)
0
citations

Temporal Modulation Network for Controllable Space-Time Video Super-Resolution

CVPR 2021 (arXiv)
0
citations

Removing the Background by Adding the Background: Towards Background Robust Self-Supervised Video Representation Learning

CVPR 2021 (arXiv)
0
citations

DIFNet: Boosting Visual Information Flow for Image Captioning

CVPR 2022
0
citations

Training-Free Transformer Architecture Search

CVPR 2022 (arXiv)
0
citations

Ask&Confirm: Active Detail Enriching for Cross-Modal Retrieval With Partial Query

ICCV 2021
0
citations

PR-Net: Preference Reasoning for Personalized Video Highlight Detection

ICCV 2021
0
citations

Learning Canonical View Representation for 3D Shape Recognition With Arbitrary Views

ICCV 2021 (arXiv)
0
citations

Learning To Know Where To See: A Visibility-Aware Approach for Occluded Person Re-Identification

ICCV 2021
0
citations

Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration

ICCV 2023 (arXiv)
0
citations

D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

ICCV 2023 (arXiv)
0
citations

Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval

ICCV 2023
0
citations

Do Not Disturb Me: Person Re-identification Under the Interference of Other Pedestrians

ECCV 2020
0
citations

Efficient Decoder-Free Object Detection with Transformers

ECCV 2022
0
citations

DisCo: Remedying Self-Supervised Learning on Lightweight Models with Distilled Contrastive Learning

ECCV 2022
0
citations

PAC-Net: Highlight Your Video via History Preference Modeling

ECCV 2022
0
citations

Learning 3D Shape Feature for Texture-Insensitive Person Re-Identification

CVPR 2021
0
citations

Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

ICLR 2025
0
citations

Probability-Density-aware Semi-supervised Learning

AAAI 2025
0
citations

Visual Hallucination Elevates Speech Recognition

AAAI 2024
0
citations

A General and Efficient Training for Transformer via Token Expansion

CVPR 2024
0
citations

HRVDA: High-Resolution Visual Document Assistant

CVPR 2024
0
citations

Pruning Filter in Filter

NeurIPS 2020
0
citations

CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes

NeurIPS 2023
0
citations