Ping Luo

149 Papers

4,598 Total Citations

Papers (149)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation

Generalized Predictive Model for Autonomous Driving

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

AnalogCoder: Analog Circuit Design via Training-Free Code Generation

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

Goku: Flow Based Video Generative Foundation Models

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

End-to-End Autonomous Driving Through V2X Cooperation

Webly Supervised Image Classification with Self-Contained Confidence

AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

Cached Transformers: Improving Transformers with Differentiable Memory Cache

UniFS: Universal Few-shot Instance Perception with Point Representations

NADER: Neural Architecture Design via Multi-Agent Collaboration

JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data

BOOD: Boundary-based Out-Of-Distribution Data Generation

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning

DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images

Learning a Reinforced Agent for Flexible Exposure Bracketing Selection

MaskGAN: Towards Diverse and Interactive Facial Image Manipulation

3D Human Mesh Regression With Dense Correspondence

Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content

Learning Depth-Guided Convolutions for Monocular 3D Object Detection

Online Knowledge Distillation via Collaborative Learning

Exemplar Normalization for Learning Deep Representation

PolarMask: Single Shot Instance Segmentation With Polar Representation

When Human Pose Estimation Meets Robustness: Adversarial Algorithms and Benchmarks

Disentangled Cycle Consistency for Highly-Realistic Virtual Try-On

Sparse R-CNN: End-to-End Object Detection With Learnable Proposals

Parser-Free Virtual Try-On via Distilling Appearance Flows

ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search

HR-NAS: Searching Efficient High-Resolution Neural Architectures With Lightweight Transformers

Bridging Video-Text Retrieval With Multiple Choice Questions

RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs

Language As Queries for Referring Video Object Segmentation

Not All Tokens Are Equal: Human-Centric Visual Analysis via Token Clustering Transformer

Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers

Scale-Equivalent Distillation for Semi-Supervised Object Detection

DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

Accelerating Vision-Language Pretraining With Free Language Modeling

Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention

Universal Instance Perception As Object Discovery and Retrieval

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting

Learning Transferable Spatiotemporal Representations From Natural Script Knowledge

EC2: Emergent Communication for Embodied Control

Real-Time Controllable Denoising for Image and Video

Policy Adaptation From Foundation Model Feedback

Dense Distinct Query for End-to-End Object Detection

Semantic Image Segmentation via Deep Parsing Network

Deep Learning Strong Parts for Pedestrian Detection

Learning Social Relation Traits From Face Images

From Facial Parts Responses to Face Detection: A Deep Learning Approach

Deep Learning Face Attributes in the Wild

Deep Dual Learning for Semantic Image Segmentation

Vision-Infused Deep Audio Inpainting

Switchable Whitening for Deep Representation Learning

CamNet: Coarse-to-Fine Retrieval for Camera Re-Localization

Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once

Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid

Deep Self-Learning From Noisy Labels

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions

DetCo: Unsupervised Contrastive Learning for Object Detection

Adversarial Robustness for Unsupervised Domain Adaptation

Watch Only Once: An End-to-End Video Action Detection Framework

Bringing Events Into Video Deblurring With Non-Consecutively Blurry Frames

STAR: A Structure-Aware Lightweight Transformer for Real-Time Image Enhancement

End-to-End Dense Video Captioning With Parallel Decoding

EGC: Image Generation and Classification via a Diffusion Energy-Based Model

MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation

Scene as Occupancy

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

Segment Every Reference Object in Spatial and Temporal Spaces

Beyond One-to-One: Rethinking the Referring Image Segmentation

Going Denser with Open-Vocabulary Part Segmentation

RIGID: Recurrent GAN Inversion and Editing of Real Face Videos

DDP: Diffusion Model for Dense Visual Prediction

Exploring Transformers for Open-world Instance Segmentation

DiffusionDet: Diffusion Model for Object Detection

Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation

Whole-Body Human Pose Estimation in the Wild

Segmenting Transparent Objects in the Wild

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction

PoseTrans: A Simple yet Effective Pose Transformation Augmentation for Human Pose Estimation

3D Interacting Hand Pose Estimation by Hand De-Occlusion and Removal

Pose for Everything: Towards Category-Agnostic Pose Estimation

Towards Grand Unification of Object Tracking

ByteTrack: Multi-Object Tracking by Associating Every Detection Box

DaViT: Dual Attention Vision Transformers

Not All Models Are Equal: Predicting Model Transferability in a Self-Challenging Fisher Space

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval

Differentiable Learning-to-Group Channels via Groupable Convolutional Neural Networks

DexHandDiff: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation

MangaNinja: Line Art Colorization with Precise Reference Following

CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation

Unsupervised Continual Domain Shift Learning with Multi-Prototype Modeling

RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

GenTron: Diffusion Transformers for Image and Video Generation

RegionGPT: Towards Region Understanding Vision Language Model

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

Mind the Boundary: Coreset Selection via Reconstructing the Decision Boundary

Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View

Position: Towards Implicit Prompt For Text-To-Image Models

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection

A Large-Scale Car Dataset for Fine-Grained Categorization and Verification

Pedestrian Detection Aided by Deep Learning Semantic Tasks

DeepFashion: Powering Robust Clothes Recognition and Retrieval With Rich Annotations

WIDER FACE: A Face Detection Benchmark

Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade

Learning Object Interactions and Descriptions for Semantic Image Segmentation

FaceID-GAN: Learning a Symmetry Three-Player GAN for Identity-Preserving Face Synthesis

SSN: Learning Sparse Switchable Normalization via SparsestMax

Kalman Normalization: Normalizing Internal Representations Across Network Layers

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

Model-Based Reinforcement Learning via Imagination with Derived Memory

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

Compressed Video Contrastive Learning

Rethinking the Pruning Criteria for Convolutional Neural Network

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Large-batch Optimization for Dense Visual Predictions: Training Faster R-CNN in 4.2 Minutes

MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning

DOMINO: Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

Rethinking Resolution in the Context of Efficient Video Recognition

OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

Foundation Model is Efficient Multimodal Multitask Model Selector

Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

Top-Ambiguity Samples Matter: Understanding Why Deep Ensemble Works in Selective Classification

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Learning Deep Architectures via Generalized Whitened Neural Networks

Differentiable Dynamic Normalization for Learning Deep Representation