Hongsheng Li

162 Papers
758 Total Citations

Papers (162)

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

ICLR 2024
196
citations

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

ICML 2025
88
citations

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

NeurIPS 2025
60
citations

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

ICCV 2025
52
citations

Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow

ICLR 2025
46
citations

SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction

CVPR 2024
38
citations

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

NeurIPS 2025
34
citations

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

CVPR 2025
34
citations

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

ICCV 2025
28
citations

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

ICLR 2025
26
citations

Mixture Compressor for Mixture-of-Experts LLMs Gains More

ICLR 2025
23
citations

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

CVPR 2025
20
citations

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

ICCV 2025
17
citations

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

CVPR 2025
15
citations

Docopilot: Improving Multimodal Models for Document-Level Understanding

CVPR 2025
14
citations

DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

ECCV 2024
12
citations

Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

ECCV 2024
10
citations

UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

NeurIPS 2025
8
citations

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation

ICLR 2025
8
citations

BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events

ECCV 2024
7
citations

Language Model Guided Interpretable Video Action Reasoning

CVPR 2024
7
citations

Delving Deep into Engagement Prediction of Short Videos

ECCV 2024
5
citations

One Leaf Reveals the Season: Occlusion-Based Contrastive Learning with Semantic-Aware Views for Efficient Visual Representation

ICML 2025
5
citations

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

CVPR 2025
3
citations

FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering

CVPR 2025
2
citations

End-To-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

CVPR 2016
0
citations

Structured Feature Learning for Pose Estimation

CVPR 2016
0
citations

Object Detection in Videos With Tubelet Proposal Networks

CVPR 2017
0
citations

Person Search With Natural Language Description

CVPR 2017
0
citations

Learning Spatial Regularization With Image-Level Supervisions for Multi-Label Image Classification

CVPR 2017
0
citations

Single View Stereo Matching

CVPR 2018
0
citations

Video Person Re-Identification With Competitive Snippet-Similarity Aggregation and Co-Attentive Snippet Embedding

CVPR 2018
0
citations

Deep Group-Shuffling Random Walk for Person Re-Identification

CVPR 2018
0
citations

3D Human Pose Estimation in the Wild by Adversarial Learning

CVPR 2018
0
citations

Eliminating Background-Bias for Robust Person Re-Identification

CVPR 2018
0
citations

End-to-End Deep Kronecker-Product Matching for Person Re-Identification

CVPR 2018
0
citations

Group Consistent Similarity Learning via Deep CRF for Person Re-Identification

CVPR 2018
0
citations

PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud

CVPR 2019
0
citations

Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing

CVPR 2019
0
citations

Group-Wise Correlation Stereo Network

CVPR 2019
0
citations

Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering

CVPR 2019
0
citations

Conditional Adversarial Generative Flow for Controllable Image Synthesis

CVPR 2019
0
citations

P2SGrad: Refined Gradients for Optimizing Deep Face Models

CVPR 2019
0
citations

AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations

CVPR 2019
0
citations

3D Sketch-Aware Semantic Scene Completion via Semi-Supervised Structure Prior

CVPR 2020
0
citations

Robust Superpixel-Guided Attentional Adversarial Attack

CVPR 2020
0
citations

StereoGAN: Bridging Synthetic-to-Real Domain Gap by Joint Optimization of Domain Translation and Stereo Matching

CVPR 2020
0
citations

PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection

CVPR 2020
0
citations

Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification

CVPR 2021
0
citations

Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation

CVPR 2021
0
citations

LiDAR-Based Panoptic Segmentation via Dynamic Shifting Network

CVPR 2021
0
citations

ST3D: Self-Training for Unsupervised Domain Adaptation on 3D Object Detection

CVPR 2021
0
citations

Inverting Generative Adversarial Renderer for Face Reconstruction

CVPR 2021
0
citations

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

CVPR 2021
0
citations

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network

CVPR 2021
0
citations

Semantic Scene Completion via Integrating Instances and Scene In-the-Loop

CVPR 2021
0
citations

VS-Net: Voting With Segmentation for Visual Localization

CVPR 2021
0
citations

Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks

CVPR 2022
0
citations

Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation

CVPR 2022
0
citations

IDR: Self-Supervised Image Denoising via Iterative Data Refinement

CVPR 2022
0
citations

RBGNet: Ray-Based Grouping for 3D Object Detection

CVPR 2022
0
citations

RNNPose: Recurrent 6-DoF Object Pose Refinement With Robust Correspondence Field Estimation and Pose Optimization

CVPR 2022
0
citations

AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks

CVPR 2022
0
citations

Learning a Structured Latent Space for Unsupervised Point Cloud Completion

CVPR 2022
0
citations

PointCLIP: Point Cloud Understanding by CLIP

CVPR 2022
0
citations

A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift

CVPR 2023
0
citations

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners

CVPR 2023
0
citations

Starting From Non-Parametric Networks for 3D Point Cloud Analysis

CVPR 2023
0
citations

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

CVPR 2023
0
citations

Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders

CVPR 2023
0
citations

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching

CVPR 2023
0
citations

FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation

CVPR 2023
0
citations

PATS: Patch Area Transportation With Subdivision for Local Feature Matching

CVPR 2023
0
citations

MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

CVPR 2023
0
citations

Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation

CVPR 2023
0
citations

ConQueR: Query Contrast Voxel-DETR for 3D Object Detection

CVPR 2023
0
citations

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

CVPR 2023
0
citations

Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels

CVPR 2023
0
citations

ReasonNet: End-to-End Driving With Temporal and Global Reasoning

CVPR 2023
0
citations

Pedestrian Travel Time Estimation in Crowded Scenes

ICCV 2015
0
citations

Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-Identification

ICCV 2017
0
citations

Learning Feature Pyramids for Human Pose Estimation

ICCV 2017
0
citations

Identity-Aware Textual-Visual Matching With Latent Co-Attention

ICCV 2017
0
citations

Learning Deep Neural Networks for Vehicle Re-ID With Visual-Spatio-Temporal Path Proposals

ICCV 2017
0
citations

Online Multi-Object Tracking Using CNN-Based Single Object Tracker With Spatial-Temporal Attention Mechanism

ICCV 2017
0
citations

StackGAN: Text to Photo-Realistic Image Synthesis With Stacked Generative Adversarial Networks

ICCV 2017
0
citations

Interpolated Convolutional Networks for 3D Point Cloud Understanding

ICCV 2019
0
citations

Depth Completion From Sparse LiDAR Data With Depth-Normal Constraints

ICCV 2019
0
citations

CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

ICCV 2019
0
citations

Multi-Modality Latent Interaction Network for Visual Question Answering

ICCV 2019
0
citations

Semi-Supervised Monocular 3D Face Reconstruction With End-to-End Shape-Preserved Domain Transfer

ICCV 2019
0
citations

Unsupervised Domain Adaptive 3D Detection With Multi-Level Consistency

ICCV 2021
0
citations

FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

ICCV 2021
0
citations

Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization

ICCV 2021
0
citations

Progressive Correspondence Pruning by Consensus Learning

ICCV 2021
0
citations

Rethinking Noise Synthesis and Modeling in Raw Denoising

ICCV 2021
0
citations

Let's Verify and Reinforce Image Generation Step by Step

CVPR 2025
0
citations

Encoder-Decoder With Multi-Level Attention for 3D Human Shape and Pose Estimation

ICCV 2021
0
citations

LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-Based 3D Detector

ICCV 2021
0
citations

Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

ICCV 2023
0
citations

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

ICCV 2023
0
citations

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

ICCV 2023
0
citations

TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

ICCV 2023
0
citations

NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space

ICCV 2023
0
citations

Omnidirectional Information Gathering for Knowledge Transfer-Based Audio-Visual Navigation

ICCV 2023
0
citations

Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection

ICCV 2023
0
citations

VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation

ICCV 2023
0
citations

Urban Radiance Field Representation with Deformable Neural Mesh Primitives

ICCV 2023
0
citations

GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

ICCV 2023
0
citations

Simulating Fluids in Real-World Still Images

ICCV 2023
0
citations

SparseMAE: Sparse Training Meets Masked Autoencoders

ICCV 2023
0
citations

Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction

ICCV 2023
0
citations

Self-supervising Fine-grained Region Similarities for Large-scale Image Localization

ECCV 2020
0
citations

Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions

ECCV 2020
0
citations

Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

ECCV 2020
0
citations

Learning to Predict Context-adaptive Convolution for Semantic Segmentation

ECCV 2020
0
citations

EfficientFCN: Holistically-guided Decoding for Semantic Segmentation

ECCV 2020
0
citations

RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax

ECCV 2020
0
citations

MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection

ECCV 2022
0
citations

EdgeViTs: Competing Light-Weight CNNs on Mobile Devices with Vision Transformers

ECCV 2022
0
citations

Towards Robust Face Recognition with Comprehensive Search

ECCV 2022
0
citations

FlowFormer: A Transformer Architecture for Optical Flow

ECCV 2022
0
citations

Learning Degradation Representations for Image Deblurring

ECCV 2022
0
citations

UniNet: Unified Architecture Search with Convolution, Transformer, and MLP

ECCV 2022
0
citations

TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers

ECCV 2022
0
citations

Frozen CLIP Models Are Efficient Video Learners

ECCV 2022
0
citations

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification

ECCV 2022
0
citations

Fast Convergence of DETR With Spatially Modulated Co-Attention

ICCV 2021
0
citations

FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

CVPR 2025
0
citations

GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields through Efficient Dense 3D Point Tracking

CVPR 2025
0
citations

DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation

CVPR 2025
0
citations

OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation

CVPR 2025
0
citations

CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

ICCV 2025
0
citations

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

ICCV 2025
0
citations

HPSv3: Towards Wide-Spectrum Human Preference Score

ICCV 2025
0
citations

ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis

ICCV 2025
0
citations

M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving

AAAI 2025
0
citations

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

AAAI 2025
0
citations

GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

AAAI 2025
0
citations

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

CVPR 2024
0
citations

GLID: Pre-training a Generalist Encoder-Decoder Vision Model

CVPR 2024
0
citations

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

CVPR 2024
0
citations

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

CVPR 2024
0
citations

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

CVPR 2024
0
citations

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

ICML 2024
0
citations

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

ICML 2024
0
citations

Cross-Scene Crowd Counting via Deep Convolutional Neural Networks

CVPR 2015
0
citations

Saliency Detection by Multi-Context Deep Learning

CVPR 2015
0
citations

DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection

CVPR 2015
0
citations

Understanding Pedestrian Behaviors From Stationary Crowd Groups

CVPR 2015
0
citations

Object Detection From Video Tubelets With Convolutional Neural Networks

CVPR 2016
0
citations

Learning Deep Feature Representations With Domain Guided Dropout for Person Re-Identification

CVPR 2016
0
citations

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

NeurIPS 2022
0
citations

Controllable 3D Face Synthesis with Conditional Generative Occupancy Fields

NeurIPS 2022
0
citations

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

NeurIPS 2022
0
citations

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

NeurIPS 2022
0
citations

MCMAE: Masked Convolution Meets Masked Autoencoders

NeurIPS 2022
0
citations

LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios

NeurIPS 2023
0
citations

JourneyDB: A Benchmark for Generative Image Understanding

NeurIPS 2023
0
citations

A Unified Conditional Framework for Diffusion-based Image Restoration

NeurIPS 2023
0
citations

Context-PIPs: Persistent Independent Particles Demands Spatial Context Features

NeurIPS 2023
0
citations

UE4-NeRF: Neural Radiance Field for Real-Time Rendering of Large-Scale Scene

NeurIPS 2023
0
citations