Hongsheng Li
162
Papers
758
Total Citations
Papers (162)
Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification
ICLR 2024
196
citations
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
ICML 2025
88
citations
GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing
NeurIPS 2025
60
citations
Lumina-Image 2.0: A Unified and Efficient Image Generative Framework
ICCV 2025
52
citations
Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow
ICLR 2025
46
citations
SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction
CVPR 2024
38
citations
EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation
NeurIPS 2025
34
citations
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
CVPR 2025
34
citations
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
ICCV 2025
28
citations
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
ICLR 2025
26
citations
Mixture Compressor for Mixture-of-Experts LLMs Gains More
ICLR 2025
23
citations
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
CVPR 2025
20
citations
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
ICCV 2025
17
citations
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving
CVPR 2025
15
citations
Docopilot: Improving Multimodal Models for Document-Level Understanding
CVPR 2025
14
citations
DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition
ECCV 2024
12
citations
Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos
ECCV 2024
10
citations
UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning
NeurIPS 2025
8
citations
Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation
ICLR 2025
8
citations
BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events
ECCV 2024
7
citations
Language Model Guided Interpretable Video Action Reasoning
CVPR 2024
7
citations
Delving Deep into Engagement Prediction of Short Videos
ECCV 2024
5
citations
One Leaf Reveals the Season: Occlusion-Based Contrastive Learning with Semantic-Aware Views for Efficient Visual Representation
ICML 2025
5
citations
Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding
CVPR 2025
3
citations
FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering
CVPR 2025
2
citations
End-To-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation
CVPR 2016
0
citations
Structured Feature Learning for Pose Estimation
CVPR 2016
0
citations
Object Detection in Videos With Tubelet Proposal Networks
CVPR 2017
0
citations
Person Search With Natural Language Description
CVPR 2017
0
citations
Learning Spatial Regularization With Image-Level Supervisions for Multi-Label Image Classification
CVPR 2017
0
citations
Single View Stereo Matching
CVPR 2018
0
citations
Video Person Re-Identification With Competitive Snippet-Similarity Aggregation and Co-Attentive Snippet Embedding
CVPR 2018
0
citations
Deep Group-Shuffling Random Walk for Person Re-Identification
CVPR 2018
0
citations
3D Human Pose Estimation in the Wild by Adversarial Learning
CVPR 2018
0
citations
Eliminating Background-Bias for Robust Person Re-Identification
CVPR 2018
0
citations
End-to-End Deep Kronecker-Product Matching for Person Re-Identification
CVPR 2018
0
citations
Group Consistent Similarity Learning via Deep CRF for Person Re-Identification
CVPR 2018
0
citations
PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud
CVPR 2019
0
citations
Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing
CVPR 2019
0
citations
Group-Wise Correlation Stereo Network
CVPR 2019
0
citations
Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering
CVPR 2019
0
citations
Conditional Adversarial Generative Flow for Controllable Image Synthesis
CVPR 2019
0
citations
P2SGrad: Refined Gradients for Optimizing Deep Face Models
CVPR 2019
0
citations
AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations
CVPR 2019
0
citations
3D Sketch-Aware Semantic Scene Completion via Semi-Supervised Structure Prior
CVPR 2020
0
citations
Robust Superpixel-Guided Attentional Adversarial Attack
CVPR 2020
0
citations
StereoGAN: Bridging Synthetic-to-Real Domain Gap by Joint Optimization of Domain Translation and Stereo Matching
CVPR 2020
0
citations
PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection
CVPR 2020
0
citations
Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification
CVPR 2021
0
citations
Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation
CVPR 2021
0
citations
LiDAR-Based Panoptic Segmentation via Dynamic Shifting Network
CVPR 2021
0
citations
ST3D: Self-Training for Unsupervised Domain Adaptation on 3D Object Detection
CVPR 2021
0
citations
Inverting Generative Adversarial Renderer for Face Reconstruction
CVPR 2021
0
citations
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
CVPR 2021
0
citations
DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network
CVPR 2021
0
citations
Semantic Scene Completion via Integrating Instances and Scene In-the-Loop
CVPR 2021
0
citations
VS-Net: Voting With Segmentation for Visual Localization
CVPR 2021
0
citations
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks
CVPR 2022
0
citations
Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation
CVPR 2022
0
citations
IDR: Self-Supervised Image Denoising via Iterative Data Refinement
CVPR 2022
0
citations
RBGNet: Ray-Based Grouping for 3D Object Detection
CVPR 2022
0
citations
RNNPose: Recurrent 6-DoF Object Pose Refinement With Robust Correspondence Field Estimation and Pose Optimization
CVPR 2022
0
citations
AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks
CVPR 2022
0
citations
Learning a Structured Latent Space for Unsupervised Point Cloud Completion
CVPR 2022
0
citations
PointCLIP: Point Cloud Understanding by CLIP
CVPR 2022
0
citations
A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift
CVPR 2023
0
citations
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
CVPR 2023
0
citations
Starting From Non-Parametric Networks for 3D Point Cloud Analysis
CVPR 2023
0
citations
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023
0
citations
Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders
CVPR 2023
0
citations
CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching
CVPR 2023
0
citations
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation
CVPR 2023
0
citations
PATS: Patch Area Transportation With Subdivision for Local Feature Matching
CVPR 2023
0
citations
MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers
CVPR 2023
0
citations
Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation
CVPR 2023
0
citations
ConQueR: Query Contrast Voxel-DETR for 3D Object Detection
CVPR 2023
0
citations
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions
CVPR 2023
0
citations
Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels
CVPR 2023
0
citations
ReasonNet: End-to-End Driving With Temporal and Global Reasoning
CVPR 2023
0
citations
Pedestrian Travel Time Estimation in Crowded Scenes
ICCV 2015
0
citations
Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-Identification
ICCV 2017
0
citations
Learning Feature Pyramids for Human Pose Estimation
ICCV 2017
0
citations
Identity-Aware Textual-Visual Matching With Latent Co-Attention
ICCV 2017
0
citations
Learning Deep Neural Networks for Vehicle Re-ID With Visual-Spatio-Temporal Path Proposals
ICCV 2017
0
citations
Online Multi-Object Tracking Using CNN-Based Single Object Tracker With Spatial-Temporal Attention Mechanism
ICCV 2017
0
citations
StackGAN: Text to Photo-Realistic Image Synthesis With Stacked Generative Adversarial Networks
ICCV 2017
0
citations
Interpolated Convolutional Networks for 3D Point Cloud Understanding
ICCV 2019
0
citations
Depth Completion From Sparse LiDAR Data With Depth-Normal Constraints
ICCV 2019
0
citations
CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval
ICCV 2019
0
citations
Multi-Modality Latent Interaction Network for Visual Question Answering
ICCV 2019
0
citations
Semi-Supervised Monocular 3D Face Reconstruction With End-to-End Shape-Preserved Domain Transfer
ICCV 2019
0
citations
Unsupervised Domain Adaptive 3D Detection With Multi-Level Consistency
ICCV 2021
0
citations
FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting
ICCV 2021
0
citations
Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization
ICCV 2021
0
citations
Progressive Correspondence Pruning by Consensus Learning
ICCV 2021
0
citations
Rethinking Noise Synthesis and Modeling in Raw Denoising
ICCV 2021
0
citations
Let's Verify and Reinforce Image Generation Step by Step
CVPR 2025
0
citations
Encoder-Decoder With Multi-Level Attention for 3D Human Shape and Pose Estimation
ICCV 2021
0
citations
LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-Based 3D Detector
ICCV 2021
0
citations
Human Preference Score: Better Aligning Text-to-Image Models with Human Preference
ICCV 2023
0
citations
DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds
ICCV 2023
0
citations
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
ICCV 2023
0
citations
TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses
ICCV 2023
0
citations
NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space
ICCV 2023
0
citations
Omnidirectional Information Gathering for Knowledge Transfer-Based Audio-Visual Navigation
ICCV 2023
0
citations
Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection
ICCV 2023
0
citations
VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation
ICCV 2023
0
citations
Urban Radiance Field Representation with Deformable Neural Mesh Primitives
ICCV 2023
0
citations
GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding
ICCV 2023
0
citations
Simulating Fluids in Real-World Still Images
ICCV 2023
0
citations
SparseMAE: Sparse Training Meets Masked Autoencoders
ICCV 2023
0
citations
Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction
ICCV 2023
0
citations
Self-supervising Fine-grained Region Similarities for Large-scale Image Localization
ECCV 2020
0
citations
Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions
ECCV 2020
0
citations
Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation
ECCV 2020
0
citations
Learning to Predict Context-adaptive Convolution for Semantic Segmentation
ECCV 2020
0
citations
EfficientFCN: Holistically-guided Decoding for Semantic Segmentation
ECCV 2020
0
citations
RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax
ECCV 2020
0
citations
MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection
ECCV 2022
0
citations
EdgeViTs: Competing Light-Weight CNNs on Mobile Devices with Vision Transformers
ECCV 2022
0
citations
Towards Robust Face Recognition with Comprehensive Search
ECCV 2022
0
citations
FlowFormer: A Transformer Architecture for Optical Flow
ECCV 2022
0
citations
Learning Degradation Representations for Image Deblurring
ECCV 2022
0
citations
UniNet: Unified Architecture Search with Convolution, Transformer, and MLP
ECCV 2022
0
citations
TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers
ECCV 2022
0
citations
Frozen CLIP Models Are Efficient Video Learners
ECCV 2022
0
citations
Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification
ECCV 2022
0
citations
Fast Convergence of DETR With Spatially Modulated Co-Attention
ICCV 2021
0
citations
FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes
CVPR 2025
0
citations
GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields through Efficient Dense 3D Point Tracking
CVPR 2025
0
citations
DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation
CVPR 2025
0
citations
OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation
CVPR 2025
0
citations
CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models
ICCV 2025
0
citations
GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices
ICCV 2025
0
citations
HPSv3: Towards Wide-Spectrum Human Preference Score
ICCV 2025
0
citations
ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis
ICCV 2025
0
citations
M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving
AAAI 2025
0
citations
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding
AAAI 2025
0
citations
GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance
AAAI 2025
0
citations
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
CVPR 2024
0
citations
GLID: Pre-training a Generalist Encoder-Decoder Vision Model
CVPR 2024
0
citations
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024
0
citations
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
CVPR 2024
0
citations
DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation
CVPR 2024
0
citations
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
ICML 2024
0
citations
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
ICML 2024
0
citations
Cross-Scene Crowd Counting via Deep Convolutional Neural Networks
CVPR 2015
0
citations
Saliency Detection by Multi-Context Deep Learning
CVPR 2015
0
citations
DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection
CVPR 2015
0
citations
Understanding Pedestrian Behaviors From Stationary Crowd Groups
CVPR 2015
0
citations
Object Detection From Video Tubelets With Convolutional Neural Networks
CVPR 2016
0
citations
Learning Deep Feature Representations With Domain Guided Dropout for Person Re-Identification
CVPR 2016
0
citations
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
NeurIPS 2022
0
citations
Controllable 3D Face Synthesis with Conditional Generative Occupancy Fields
NeurIPS 2022
0
citations
ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
NeurIPS 2022
0
citations
Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training
NeurIPS 2022
0
citations
MCMAE: Masked Convolution Meets Masked Autoencoders
NeurIPS 2022
0
citations
LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios
NeurIPS 2023
0
citations
JourneyDB: A Benchmark for Generative Image Understanding
NeurIPS 2023
0
citations
A Unified Conditional Framework for Diffusion-based Image Restoration
NeurIPS 2023
0
citations
Context-PIPs: Persistent Independent Particles Demands Spatial Context Features
NeurIPS 2023
0
citations
UE4-NeRF: Neural Radiance Field for Real-Time Rendering of Large-Scale Scene
NeurIPS 2023
0
citations