Ping Luo

149
Papers
4,598
Total Citations

Papers (149)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CVPR 2024
2,210
citations

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

CVPR 2024
864
citations

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

ICLR 2024
408
citations

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

ICLR 2024
320
citations

Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation

ECCV 2020
138
citations

Generalized Predictive Model for Autonomous Driving

CVPR 2024
122
citations

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

ICCV 2025
96
citations

AnalogCoder: Analog Circuit Design via Training-Free Code Generation

AAAI 2025
79
citations

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

ICML 2025
72
citations

SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

CVPR 2024
64
citations

Goku: Flow Based Video Generative Foundation Models

CVPR 2025
53
citations

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

ICLR 2024
46
citations

End-to-End Autonomous Driving Through V2X Cooperation

AAAI 2025
44
citations

Webly Supervised Image Classification with Self-Contained Confidence

ECCV 2020
16
citations

AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks

AAAI 2025
14
citations

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

ICCV 2025
10
citations

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

CVPR 2025
10
citations

IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model

ICLR 2025
7
citations

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

CVPR 2024
7
citations

Cached Transformers: Improving Transformers with Differentiable Memory Cache

AAAI 2024
5
citations

UniFS: Universal Few-shot Instance Perception with Point Representations

ECCV 2024
3
citations

NADER: Neural Architecture Design via Multi-Agent Collaboration

CVPR 2025
3
citations

JiSAM: Alleviate Labeling Burden and Corner Case Problems in Autonomous Driving via Minimal Real-World Data

CVPR 2025
2
citations

BOOD: Boundary-based Out-Of-Distribution Data Generation

ICML 2025
2
citations

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

NeurIPS 2025
2
citations

DETree: DEtecting Human-AI Collaborative Texts via Tree-Structured Hierarchical Representation Learning

NeurIPS 2025
1
citations

DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images

CVPR 2019
0
citations

Learning a Reinforced Agent for Flexible Exposure Bracketing Selection

CVPR 2020
0
citations

MaskGAN: Towards Diverse and Interactive Facial Image Manipulation

CVPR 2020
0
citations

3D Human Mesh Regression With Dense Correspondence

CVPR 2020
0
citations

Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content

CVPR 2020
0
citations

Learning Depth-Guided Convolutions for Monocular 3D Object Detection

CVPR 2020
0
citations

Online Knowledge Distillation via Collaborative Learning

CVPR 2020
0
citations

Exemplar Normalization for Learning Deep Representation

CVPR 2020
0
citations

PolarMask: Single Shot Instance Segmentation With Polar Representation

CVPR 2020
0
citations

When Human Pose Estimation Meets Robustness: Adversarial Algorithms and Benchmarks

CVPR 2021
0
citations

Disentangled Cycle Consistency for Highly-Realistic Virtual Try-On

CVPR 2021
0
citations

Sparse R-CNN: End-to-End Object Detection With Learnable Proposals

CVPR 2021
0
citations

Parser-Free Virtual Try-On via Distilling Appearance Flows

CVPR 2021
0
citations

ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search

CVPR 2021
0
citations

HR-NAS: Searching Efficient High-Resolution Neural Architectures With Lightweight Transformers

CVPR 2021
0
citations

Bridging Video-Text Retrieval With Multiple Choice Questions

CVPR 2022
0
citations

RestoreFormer: High-Quality Blind Face Restoration From Undegraded Key-Value Pairs

CVPR 2022
0
citations

Language As Queries for Referring Video Object Segmentation

CVPR 2022
0
citations

Not All Tokens Are Equal: Human-Centric Visual Analysis via Token Clustering Transformer

CVPR 2022
0
citations

Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers

CVPR 2022
0
citations

Scale-Equivalent Distillation for Semi-Supervised Object Detection

CVPR 2022
0
citations

DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

CVPR 2022
0
citations

Accelerating Vision-Language Pretraining With Free Language Modeling

CVPR 2023
0
citations

Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention

CVPR 2023
0
citations

Universal Instance Perception As Object Discovery and Retrieval

CVPR 2023
0
citations

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

CVPR 2023
0
citations

V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting

CVPR 2023
0
citations

Learning Transferable Spatiotemporal Representations From Natural Script Knowledge

CVPR 2023
0
citations

EC2: Emergent Communication for Embodied Control

CVPR 2023
0
citations

Real-Time Controllable Denoising for Image and Video

CVPR 2023
0
citations

Policy Adaptation From Foundation Model Feedback

CVPR 2023
0
citations

Dense Distinct Query for End-to-End Object Detection

CVPR 2023
0
citations

Semantic Image Segmentation via Deep Parsing Network

ICCV 2015
0
citations

Deep Learning Strong Parts for Pedestrian Detection

ICCV 2015
0
citations

Learning Social Relation Traits From Face Images

ICCV 2015
0
citations

From Facial Parts Responses to Face Detection: A Deep Learning Approach

ICCV 2015
0
citations

Deep Learning Face Attributes in the Wild

ICCV 2015
0
citations

Deep Dual Learning for Semantic Image Segmentation

ICCV 2017
0
citations

Vision-Infused Deep Audio Inpainting

ICCV 2019
0
citations

Switchable Whitening for Deep Representation Learning

ICCV 2019
0
citations

CamNet: Coarse-to-Fine Retrieval for Camera Re-Localization

ICCV 2019
0
citations

Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once

ICCV 2019
0
citations

Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid

ICCV 2019
0
citations

Deep Self-Learning From Noisy Labels

ICCV 2019
0
citations

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions

ICCV 2021
0
citations

DetCo: Unsupervised Contrastive Learning for Object Detection

ICCV 2021
0
citations

Adversarial Robustness for Unsupervised Domain Adaptation

ICCV 2021
0
citations

Watch Only Once: An End-to-End Video Action Detection Framework

ICCV 2021
0
citations

Bringing Events Into Video Deblurring With Non-Consecutively Blurry Frames

ICCV 2021
0
citations

STAR: A Structure-Aware Lightweight Transformer for Real-Time Image Enhancement

ICCV 2021
0
citations

End-to-End Dense Video Captioning With Parallel Decoding

ICCV 2021
0
citations

EGC: Image Generation and Classification via a Diffusion Energy-Based Model

ICCV 2023
0
citations

MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation

ICCV 2023
0
citations

Scene as Occupancy

ICCV 2023
0
citations

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

ICCV 2023
0
citations

Segment Every Reference Object in Spatial and Temporal Spaces

ICCV 2023
0
citations

Beyond One-to-One: Rethinking the Referring Image Segmentation

ICCV 2023
0
citations

Going Denser with Open-Vocabulary Part Segmentation

ICCV 2023
0
citations

RIGID: Recurrent GAN Inversion and Editing of Real Face Videos

ICCV 2023
0
citations

DDP: Diffusion Model for Dense Visual Prediction

ICCV 2023
0
citations

Exploring Transformers for Open-world Instance Segmentation

ICCV 2023
0
citations

DiffusionDet: Diffusion Model for Object Detection

ICCV 2023
0
citations

Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation

ECCV 2020
0
citations

Whole-Body Human Pose Estimation in the Wild

ECCV 2020
0
citations

Segmenting Transparent Objects in the Wild

ECCV 2020
0
citations

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

ECCV 2020
0
citations

Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction

ECCV 2020
0
citations

PoseTrans: A Simple yet Effective Pose Transformation Augmentation for Human Pose Estimation

ECCV 2022
0
citations

3D Interacting Hand Pose Estimation by Hand De-Occlusion and Removal

ECCV 2022
0
citations

Pose for Everything: Towards Category-Agnostic Pose Estimation

ECCV 2022
0
citations

Towards Grand Unification of Object Tracking

ECCV 2022
0
citations

ByteTrack: Multi-Object Tracking by Associating Every Detection Box

ECCV 2022
0
citations

DaViT: Dual Attention Vision Transformers

ECCV 2022
0
citations

Not All Models Are Equal: Predicting Model Transferability in a Self-Challenging Fisher Space

ECCV 2022
0
citations

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval

ECCV 2022
0
citations

Differentiable Learning-to-Group Channels via Groupable Convolutional Neural Networks

ICCV 2019
0
citations

DexHandDiff: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation

CVPR 2025
0
citations

MangaNinja: Line Art Colorization with Precise Reference Following

CVPR 2025
0
citations

CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

CVPR 2025
0
citations

G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation

CVPR 2025
0
citations

Unsupervised Continual Domain Shift Learning with Multi-Prototype Modeling

CVPR 2025
0
citations

RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

CVPR 2025
0
citations

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

CVPR 2025
0
citations

LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

ICCV 2025
0
citations

GenTron: Diffusion Transformers for Image and Video Generation

CVPR 2024
0
citations

RegionGPT: Towards Region Understanding Vision Language Model

CVPR 2024
0
citations

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

CVPR 2024
0
citations

Mind the Boundary: Coreset Selection via Reconstructing the Decision Boundary

ICML 2024
0
citations

Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View

ICML 2024
0
citations

Position: Towards Implicit Prompt For Text-To-Image Models

ICML 2024
0
citations

RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

ICML 2024
0
citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

ICML 2024
0
citations

DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection

CVPR 2015
0
citations

A Large-Scale Car Dataset for Fine-Grained Categorization and Verification

CVPR 2015
0
citations

Pedestrian Detection Aided by Deep Learning Semantic Tasks

CVPR 2015
0
citations

DeepFashion: Powering Robust Clothes Recognition and Retrieval With Rich Annotations

CVPR 2016
0
citations

WIDER FACE: A Face Detection Benchmark

CVPR 2016
0
citations

Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade

CVPR 2017
0
citations

Learning Object Interactions and Descriptions for Semantic Image Segmentation

CVPR 2017
0
citations

FaceID-GAN: Learning a Symmetry Three-Player GAN for Identity-Preserving Face Synthesis

CVPR 2018
0
citations

SSN: Learning Sparse Switchable Normalization via SparsestMax

CVPR 2019
0
citations

Kalman Normalization: Normalizing Internal Representations Across Network Layers

NeurIPS 2018
0
citations

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

NeurIPS 2021
0
citations

Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

NeurIPS 2021
0
citations

Model-Based Reinforcement Learning via Imagination with Derived Memory

NeurIPS 2021
0
citations

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

NeurIPS 2021
0
citations

Compressed Video Contrastive Learning

NeurIPS 2021
0
citations

Rethinking the Pruning Criteria for Convolutional Neural Network

NeurIPS 2021
0
citations

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

NeurIPS 2022
0
citations

Large-batch Optimization for Dense Visual Predictions: Training Faster R-CNN in 4.2 Minutes

NeurIPS 2022
0
citations

MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning

NeurIPS 2022
0
citations

DOMINO: Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning

NeurIPS 2022
0
citations

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

NeurIPS 2022
0
citations

Rethinking Resolution in the Context of Efficient Video Recognition

NeurIPS 2022
0
citations

OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping

NeurIPS 2023
0
citations

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

NeurIPS 2023
0
citations

Foundation Model is Efficient Multimodal Multitask Model Selector

NeurIPS 2023
0
citations

Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

NeurIPS 2023
0
citations

Top-Ambiguity Samples Matter: Understanding Why Deep Ensemble Works in Selective Classification

NeurIPS 2023
0
citations

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

NeurIPS 2023
0
citations

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

NeurIPS 2023
0
citations

Learning Deep Architectures via Generalized Whitened Neural Networks

ICML 2017
0
citations

Differentiable Dynamic Normalization for Learning Deep Representation

ICML 2019
0
citations