Xin Li

124 Papers

2,328 Total Citations

1 Affiliation

Affiliations

Tencent Youtu Lab

Papers (124)

Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

Cascade Graph Neural Networks for RGB-D Salient Object Detection

DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models

PolaFormer: Polarity-aware Linear Attention for Vision Transformers

Multi-Space Alignments Towards Universal LiDAR Segmentation

Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling

Commonsense Prototype for Outdoor Unsupervised 3D Object Detection

USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation

LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

Grab What You Need: Rethinking Complex Table Structure Recognition with Flexible Components Deliberation

V2X-R: Cooperative LiDAR-4D Radar Fusion with Denoising Diffusion for 3D Object Detection

MobileInst: Video Instance Segmentation on the Mobile

Surf-D: Generating High-Quality Surfaces of Arbitrary Topologies Using Diffusion Models

CADDreamer: CAD Object Generation from Single-view Images

Inverse Weight-Balancing for Deep Long-Tailed Learning

MetaCARD: Meta-Reinforcement Learning with Task Uncertainty Feedback via Decoupled Context-Aware Reward and Dynamics Components

TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

Symbolic Neural Ordinary Differential Equations

MetaAT: Active Testing for Label-Efficient Evaluation of Dense Recognition Tasks

RaSS: Improving Denoising Diffusion Samplers with Reinforced Active Sampling Scheduler

Learning Latent Dynamic Robust Representations for World Models

A Unified Adaptive Testing System Enabled by Hierarchical Structure Search

Simplified Mirror-Based Camera Pose Computation via Rotation Averaging

Object-Aware Dense Semantic Correspondence

NM-Net: Mining Reliable Neighbors for Robust Feature Correspondences

Target-Aware Deep Tracking

RF-Net: An End-To-End Image Matching Network Based on Receptive Field

LO-Net: Deep Real-Time Lidar Odometry

Partial Order Pruning: For Best Speed/Accuracy Trade-Off in Neural Architecture Search

Probabilistic Model Distillation for Semantic Correspondence

Learning Semantic Person Image Generation by Region-Adaptive Normalization

Mutual Graph Learning for Camouflaged Object Detection

Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer

Multi-Object Tracking Meets Moving UAV

Learning Optical Flow With Kernel Patch Attention

Unsupervised Learning of Accurate Siamese Tracking

Towards Bidirectional Arbitrary Image Rescaling: Joint Optimization and Cycle Idempotence

DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition

NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition

Neural Collaborative Graph Machines for Table Structure Recognition

SCPNet: Semantic Scene Completion on Point Cloud

Learning Distortion Invariant Representation for Image Restoration From a Causality Perspective

LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion

Self-Supervised Non-Uniform Kernel Estimation With Flow-Based Motion Prior for Blind Image Deblurring

Micron-BERT: BERT-Based Facial Micro-Expression Recognition

Vector Quantization With Self-Attention for Quality-Independent Representation Learning

Virtual Sparse Convolution for Multimodal 3D Object Detection

Low-Rank Tensor Approximation With Laplacian Scale Mixture Modeling for Multiframe Image Denoising

3D Fragment Reassembly Using Integrated Template Guidance and Fracture-Region Matching

Semi-Supervised Zero-Shot Classification With Label Representation Learning

FoveaNet: Perspective-Aware Urban Scene Parsing

SBGAR: Semantics Based Group Activity Recognition

Video Scene Parsing With Predictive Feature Learning

Adversarial Examples Detection in Deep Networks With Convolutional Filter Statistics

BMN: Boundary-Matching Network for Temporal Action Proposal Generation

Semantics-Enhanced Adversarial Nets for Text-to-Image Synthesis

Paint Transformer: Feed Forward Neural Painting With Stroke Prediction

Saliency-Associated Object Tracking

AdaAttN: Revisit Attention Mechanism in Arbitrary Neural Style Transfer

Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection

CoIn: Contrastive Instance Feature Mining for Outdoor 3D Object Detection with Very Limited Annotations

UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

Surface Extraction from Neural Unsigned Distance Fields

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions

Low-Light Image Enhancement with Multi-Stage Residue Quantization and Brightness-Aware Attention

Batch-based Model Registration for Fast 3D Sherd Reconstruction

Fast Full-frame Video Stabilization with Iterative Optimization

LMR: A Large-Scale Multi-Reference Dataset for Reference-Based Super-Resolution

Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells

CiteTracker: Correlating Image and Text for Visual Tracking

LIRA: Lifelong Image Restoration from Unknown Blended Distortions

DDGCN: A Dynamic Directed Graph Convolutional Network for Action Recognition

Sparse-to-Dense Depth Completion Revisited: Sampling Strategy and Graph Construction

Learning Disentangled Feature Representation for Hybrid-distorted Image Restoration

Uncertainty Learning in Kernel Estimation for Multi-stage Blind Image Super-Resolution

Neural Color Operators for Sequential Image Retouching

RRSR: Reciprocal Reference-Based Image Super-Resolution with Progressive Feature Alignment and Selection

Self-Feature Distillation with Uncertainty Modeling for Degraded Image Recognition

Homogeneous Multi-modal Feature Fusion and Interaction for 3D Object Detection

Learning Parametric Sparse Models for Image Super-Resolution

GAFlow: Incorporating Gaussian Attention into Optical Flow

ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

Breaking the Memory Barrier of Contrastive Loss via Tile-Based Strategy

Parameterized Blur Kernel Prior Learning for Local Motion Deblurring

Gain from Neighbors: Boosting Model Robustness in the Wild via Adversarial Perturbations Toward Neighboring Classes

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Motal: Unsupervised 3D Object Detection by Modality and Task-specific Knowledge Transfer

Controllable 3D Outdoor Scene Generation via Scene Graphs

ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

CoStoDet-DDPM: Collaborative Training of Stochastic and Deterministic Models Improves Surgical Workflow Anticipation and Recognition

Multi-Perspective Consolidation Enhanced Cognitive Diagnosis via Conditional Diffusion Model

Training-Free Image Manipulation Localization Using Diffusion Models

Automated Creation of Reusable and Diverse Toolsets for Enhancing LLM Reasoning

Sunshine to Rainstorm: Cross-Weather Knowledge Distillation for Robust 3D Object Detection

Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision

Improving GNN Calibration with Discriminative Ability: Insights and Strategies

Pushing the Limit of Fine-Tuning for Few-Shot Learning: Where Feature Reusing Meets Cross-Scale Attention

SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking

Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis?

SeD: Semantic-Aware Discriminator for Image Super-Resolution

RTracker: Recoverable Tracking via PN Tree Structured Memory

KVQ: Kwai Video Quality Assessment for Short-form Videos

HRVDA: High-Resolution Visual Document Assistant

HINTED: Hard Instance Enhanced Detector with Mixed-Density Feature Fusion for Sparsely-Supervised 3D Object Detection

From Fourier to Neural ODEs: Flow Matching for Modeling Complex Systems

Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models

Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement

Uncertainty-Driven Loss for Single Image Super-Resolution

DeepReduce: A Sparse-tensor Communication Framework for Federated Deep Learning

Discrete Compositional Representations as an Abstraction for Goal Conditioned Reinforcement Learning

AttCAT: Explaining Transformers via Attentive Class Activation Tokens

UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models

A Bounded Ability Estimation for Computerized Adaptive Testing

GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph

Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning

GradOrth: A Simple yet Efficient Out-of-Distribution Detection with Orthogonal Projection of Gradients

From Cloze to Comprehension: Retrofitting Pre-trained Masked Language Models to Pre-trained Machine Reader