Wanli Ouyang

155 Papers

1,488 Total Citations

Papers (155)

WorldSimBench: Towards Video Generation Models as World Simulators

Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation

Learning Deep Structured Multi-Scale Features using Attention-Gated CRFs for Contour Prediction

NeurIPS 2017, arXiv

Improving Video Generation with Human Feedback

CRF-CNN: Modeling Structured Information in Human Pose Estimation

NeurIPS 2016, arXiv

Point Cloud Pre-training with Diffusion Models

HiSplat: Hierarchical 3D Gaussian Splatting for Generalizable Sparse-View Reconstruction

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning

ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation

Semi-supervised 3D Object Detection with PatchTeacher and PillarMix

WeatherGFM: Learning a Weather Generalist Foundation Model via In-context Learning

ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

PostCast: Generalizable Postprocessing for Precipitation Nowcasting via Unsupervised Blurriness Modeling

Boosting Residual Networks with Group Knowledge

MOOSE-Chem2: Exploring LLM Limits in Fine-Grained Scientific Hypothesis Discovery via Hierarchical Search

scMRDR: A scalable and flexible framework for unpaired single-cell multi-omics data integration

Multi-Modal Latent Variables for Cross-Individual Primary Visual Cortex Modeling and Analysis

SynBrain: Enhancing Visual-to-fMRI Synthesis via Probabilistic Representation Learning

CMT: A Cascade MAR with Topology Predictor for Multimodal Conditional CAD Generation

GigaGS: 3D Gaussian Based Planar Representation for Large-Scene Surface Reconstruction

STCT: Sequentially Training Convolutional Networks for Visual Tracking

End-To-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

Structured Feature Learning for Pose Estimation

Object Detection in Videos With Tubelet Proposal Networks

ViP-CNN: Visual Phrase Guided Convolutional Neural Network

Multi-Context Attention for Human Pose Estimation

Multi-Scale Continuous CRFs as Sequential Deep Networks for Monocular Depth Estimation

Learning Cross-Modal Deep Representations for Robust Pedestrian Detection

Learning Spatial Regularization With Image-Level Supervisions for Multi-Label Image Classification

Quality Aware Network for Set to Set Recognition

Style Aggregated Network for Facial Landmark Detection

PAD-Net: Multi-Tasks Guided Prediction-and-Distillation Network for Simultaneous Depth Estimation and Scene Parsing

Mask-Guided Contrastive Attention Model for Person Re-Identification

Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition

Attention-Aware Compositional Network for Person Re-Identification

Collaborative and Adversarial Network for Unsupervised Domain Adaptation

Exploit the Unknown Gradually: One-Shot Video-Based Person Re-Identification by Stepwise Learning

3D Human Pose Estimation in the Wild by Adversarial Learning

Visual Question Generation as Dual Task of Visual Question Answering

Libra R-CNN: Towards Balanced Learning for Object Detection

GS3D: An Efficient 3D Object Detection Framework for Autonomous Driving

Box-Driven Class-Wise Region Masking and Filling Rate Guided Loss for Weakly Supervised Semantic Segmentation

Hybrid Task Cascade for Instance Segmentation

Multi-Person Articulated Tracking With Spatial and Temporal Embeddings

DVC: An End-To-End Deep Video Compression Framework

Improving Action Localization by Progressive Cross-Stream Cooperation

SR-LSTM: State Refinement for LSTM Towards Pedestrian Trajectory Prediction

Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition

Multi-Dimensional Pruning: A Unified Framework for Model Compression

3D Human Mesh Regression With Dense Correspondence

EcoNAS: Finding Proxies for Economical Neural Architecture Search

Improving One-Shot NAS by Suppressing the Posterior Fading

Equalization Loss for Long-Tailed Object Recognition

Mutual CRF-GNN for Few-Shot Learning

Inception Convolution With Efficient Dilation Search

Layerwise Optimization by Gradient Decomposition for Continual Learning

Delving Into Localization Errors for Monocular 3D Object Detection

ViPNAS: Efficient Video Pose Estimation via Neural Architecture Search

Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation

Accelerating Neural Network Optimization Through an Automated Control Theory Lens

Unsupervised Learning of Accurate Siamese Tracking

DR.VIC: Decomposition and Reasoning for Video Individual Counting

Not All Tokens Are Equal: Human-Centric Visual Analysis via Token Clustering Transformer

Revisiting the Transferability of Supervised Pretraining: An MLP Perspective

β-DARTS: Beta-Decay Regularization for Differentiable Architecture Search

GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds

PVT-SSD: Single-Stage 3D Object Detector With Point-Voxel Transformer

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition With Pre-Trained Vision-Language Models

Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator

UniHCP: A Unified Model for Human-Centric Perceptions

Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization

Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection

HumanBench: Towards General Human-Centric Perception With Projector Assisted Pretraining

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

Crossing the Gap: Domain Generalization for Image Captioning

Learning Deep Representation With Large-Scale Attributes

Visual Tracking With Fully Convolutional Networks

Multi-Task Recurrent Neural Network for Immediacy Prediction

Scene Graph Generation From Objects, Phrases and Region Captions

Learning Feature Pyramids for Human Pose Estimation

Chained Cascade Network for Object Detection

Online Multi-Object Tracking Using CNN-Based Single Object Tracker With Spatial-Temporal Attention Mechanism

Crowd Counting With Deep Structured Scale Integration Network

LAP-Net: Level-Aware Progressive Network for Image Dehazing

Structured Modeling of Joint Deep Feature and Prediction Refinement for Salient Object Detection

Unsupervised Collaborative Learning of Keyframe Detection and Visual Odometry Towards Monocular Deep SLAM

GradNet: Gradient-Guided Network for Visual Object Tracking

Online Hyper-Parameter Learning for Auto-Augmentation Strategy

Accurate Monocular 3D Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving

AM-LFS: AutoML for Loss Function Search

TRB: A Novel Triplet Representation for Understanding 2D Human Body

GLiT: Neural Architecture Search for Global and Local Image Transformer

BN-NAS: Neural Architecture Search With Batch Normalization

Leveraging Auxiliary Tasks With Affinity Learning for Weakly Supervised Semantic Segmentation

Geometry Uncertainty Projection Network for Monocular 3D Object Detection

Evolving Search Space for Neural Architecture Search

Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images

PyMAF: 3D Human Pose and Shape Regression With Pyramidal Mesh Alignment Feedback Loop

Once Quantization-Aware Training: High Performance Extremely Low-Bit Architecture Search

Ponder: Point Cloud Pre-training via Neural Rendering

CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-Training

NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space

STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning

Masked Motion Predictors are Strong 3D Action Representation Learners

Semi-Supervised Semantic Segmentation under Label Noise via Diverse Learning Groups

What Can Simple Arithmetic Operations Do for Temporal Modeling?

Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

Improving Deep Video Compression by Resolution-adaptive Flow Coding

Content Adaptive and Error Propagation Aware Deep Video Compression

Cheaper Pre-training Lunch: An Efficient Paradigm for Object Detection

Whole-Body Human Pose Estimation in the Wild

Rethinking Pseudo-LiDAR Representation

3D Interacting Hand Pose Estimation by Hand De-Occlusion and Removal

Pose for Everything: Towards Category-Agnostic Pose Estimation

Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking

Fast-MoCo: Boost Momentum-Based Contrastive Learning with Combinatorial Patches

Unifying Visual Contrastive Learning for Object Recognition from a Graph Perspective

Relative Contrastive Loss for Unsupervised Representation Learning

Domain Invariant Masked Autoencoders for Self-Supervised Learning from Multi-Domains

NSNet: Non-Saliency Suppression Sampler for Efficient Video Recognition

Aggregation With Feature Detection

Neuro-3D: Towards 3D Visual Decoding from EEG Signals

Satellite Observations Guided Diffusion Model for Accurate Meteorological States at Arbitrary Resolution

UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds

ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area

Frozen CLIP Transformer Is an Efficient Point Cloud Encoder

ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide Sequencing

UniPAD: A Universal Pre-training Paradigm for Autonomous Driving

Point Transformer V3: Simpler Faster Stronger

Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions

Taming Stable Diffusion for Text to 360 Panorama Image Generation

CasCast: Skillful High-resolution Precipitation Nowcasting via Cascaded Modelling

FiT: Flexible Vision Transformer for Diffusion Model

Towards a Self-contained Data-driven Global Weather Forecasting Framework

Saliency Detection by Multi-Context Deep Learning

DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection

Object Detection From Video Tubelets With Convolutional Neural Networks

Factors in Finetuning Deep Model for Object Detection With Long-Tail Distribution

Learning Deep Feature Representations With Domain Guided Dropout for Person Re-Identification

FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction

Improving Auto-Augment via Augmentation-Wise Weight Sharing

A Continuous Mapping For Augmentation Design

Stimulative Training of Residual Networks: A Social Psychology Perspective of Loafing

Unsupervised Object Detection Pretraining with Joint Object Priors Generation and Detector Learning

Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark

CluB: Cluster Meets BEV for LiDAR-Based 3D Object Detection

Learning to Parameterize Visual Attributes for Open-set Fine-grained Retrieval

Multi-Bias Non-linear Activation in Deep Neural Networks