Xin Li

Papers: 52

Total Citations: 2,216

Affiliations: 1

Affiliations

Tencent Youtu Lab

Papers (52)

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

PolaFormer: Polarity-aware Linear Attention for Vision Transformers

Multi-Space Alignments Towards Universal LiDAR Segmentation

Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling

Commonsense Prototype for Outdoor Unsupervised 3D Object Detection

USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models

MobileInst: Video Instance Segmentation on the Mobile

Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation

V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection

CADDreamer: CAD Object Generation from Single-view Images

Inverse Weight-Balancing for Deep Long-Tailed Learning

TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components

Symbolic Neural Ordinary Differential Equations

RaSS: Improving Denoising Diffusion Samplers with Reinforced Active Sampling Scheduler

MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks

Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models

Learning Latent Dynamic Robust Representations for World Models

A Unified Adaptive Testing System Enabled by Hierarchical Structure Search

ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy

Parameterized Blur Kernel Prior Learning for Local Motion Deblurring

Gain from Neighbors: Boosting Model Robustness in the Wild via Adversarial Perturbations Toward Neighboring Classes

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Motal: Unsupervised 3D Object Detection by Modality and Task-specific Knowledge Transfer

Controllable 3D Outdoor Scene Generation via Scene Graphs

ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

Multi-Perspective Consolidation Enhanced Cognitive Diagnosis via Conditional Diffusion Model

Training-Free Image Manipulation Localization Using Diffusion Models

Automated Creation of Reusable and Diverse Toolsets for Enhancing LLM Reasoning

Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection

Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision

Improving GNN Calibration with Discriminative Ability: Insights and Strategies

Pushing the Limit of Fine-Tuning for Few-Shot Learning: Where Feature Reusing Meets Cross-Scale Attention

SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking

Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?

SeD: Semantic-Aware Discriminator for Image Super-Resolution

RTracker: Recoverable Tracking via PN Tree Structured Memory

KVQ: Kwai Video Quality Assessment for Short-form Videos

HRVDA: High-Resolution Visual Document Assistant

HINTED: Hard Instance Enhanced Detector with Mixed-Density Feature Fusion for Sparsely-Supervised 3D Object Detection

From Fourier to Neural ODEs: Flow Matching for Modeling Complex Systems