Hongsheng Li

Papers: 162

Total Citations: 758

Papers (162)

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow

SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Mixture Compressor for Mixture-of-Experts LLMs Gains More

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

Docopilot: Improving Multimodal Models for Document-Level Understanding

DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation

BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events

Language Model Guided Interpretable Video Action Reasoning

Delving Deep into Engagement Prediction of Short Videos

One Leaf Reveals the Season: Occlusion-Based Contrastive Learning with Semantic-Aware Views for Efficient Visual Representation

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering

End-To-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

Structured Feature Learning for Pose Estimation

Object Detection in Videos With Tubelet Proposal Networks

Person Search With Natural Language Description

Learning Spatial Regularization With Image-Level Supervisions for Multi-Label Image Classification

Single View Stereo Matching

Video Person Re-Identification With Competitive Snippet-Similarity Aggregation and Co-Attentive Snippet Embedding

Deep Group-Shuffling Random Walk for Person Re-Identification

3D Human Pose Estimation in the Wild by Adversarial Learning

Eliminating Background-Bias for Robust Person Re-Identification

End-to-End Deep Kronecker-Product Matching for Person Re-Identification

Group Consistent Similarity Learning via Deep CRF for Person Re-Identification

PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud

Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing

Group-Wise Correlation Stereo Network

Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering

Conditional Adversarial Generative Flow for Controllable Image Synthesis

P2SGrad: Refined Gradients for Optimizing Deep Face Models

AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations

3D Sketch-Aware Semantic Scene Completion via Semi-Supervised Structure Prior

Robust Superpixel-Guided Attentional Adversarial Attack

StereoGAN: Bridging Synthetic-to-Real Domain Gap by Joint Optimization of Domain Translation and Stereo Matching

PV-RCNN: Point-Voxel Feature Set Abstraction for 3D Object Detection

Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification

Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation

LiDAR-Based Panoptic Segmentation via Dynamic Shifting Network

ST3D: Self-Training for Unsupervised Domain Adaptation on 3D Object Detection

Inverting Generative Adversarial Renderer for Face Reconstruction

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network

Semantic Scene Completion via Integrating Instances and Scene In-the-Loop

VS-Net: Voting With Segmentation for Visual Localization

Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks

Weakly Supervised Temporal Action Localization via Representative Snippet Knowledge Propagation

IDR: Self-Supervised Image Denoising via Iterative Data Refinement

RBGNet: Ray-Based Grouping for 3D Object Detection

RNNPose: Recurrent 6-DoF Object Pose Refinement With Robust Correspondence Field Estimation and Pose Optimization

AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks

Learning a Structured Latent Space for Unsupervised Point Cloud Completion

PointCLIP: Point Cloud Understanding by CLIP

A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners

Starting From Non-Parametric Networks for 3D Point Cloud Analysis

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching

FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation

PATS: Patch Area Transportation With Subdivision for Local Feature Matching

MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation

ConQueR: Query Contrast Voxel-DETR for 3D Object Detection

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels

ReasonNet: End-to-End Driving With Temporal and Global Reasoning

Pedestrian Travel Time Estimation in Crowded Scenes

Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-Identification

Learning Feature Pyramids for Human Pose Estimation

Identity-Aware Textual-Visual Matching With Latent Co-Attention

Learning Deep Neural Networks for Vehicle Re-ID With Visual-Spatio-Temporal Path Proposals

Online Multi-Object Tracking Using CNN-Based Single Object Tracker With Spatial-Temporal Attention Mechanism

StackGAN: Text to Photo-Realistic Image Synthesis With Stacked Generative Adversarial Networks

Interpolated Convolutional Networks for 3D Point Cloud Understanding

Depth Completion From Sparse LiDAR Data With Depth-Normal Constraints

CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

Multi-Modality Latent Interaction Network for Visual Question Answering

Semi-Supervised Monocular 3D Face Reconstruction With End-to-End Shape-Preserved Domain Transfer

Unsupervised Domain Adaptive 3D Detection With Multi-Level Consistency

FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting

Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization

Progressive Correspondence Pruning by Consensus Learning

Rethinking Noise Synthesis and Modeling in Raw Denoising

Let's Verify and Reinforce Image Generation Step by Step

Encoder-Decoder With Multi-Level Attention for 3D Human Shape and Pose Estimation

LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-Based 3D Detector

Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses

NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space

Omnidirectional Information Gathering for Knowledge Transfer-Based Audio-Visual Navigation

Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection

VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation

Urban Radiance Field Representation with Deformable Neural Mesh Primitives

GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

Simulating Fluids in Real-World Still Images

SparseMAE: Sparse Training Meets Masked Autoencoders

Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction

Self-supervising Fine-grained Region Similarities for Large-scale Image Localization

Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions

Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

Learning to Predict Context-adaptive Convolution for Semantic Segmentation

EfficientFCN: Holistically-guided Decoding for Semantic Segmentation

RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax

MPPNet: Multi-Frame Feature Intertwining with Proxy Points for 3D Temporal Object Detection

EdgeViTs: Competing Light-Weight CNNs on Mobile Devices with Vision Transformers

Towards Robust Face Recognition with Comprehensive Search

FlowFormer: A Transformer Architecture for Optical Flow

Learning Degradation Representations for Image Deblurring

UniNet: Unified Architecture Search with Convolution, Transformer, and MLP

TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers

Frozen CLIP Models Are Efficient Video Learners

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification

Fast Convergence of DETR With Spatially Modulated Co-Attention

FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields through Efficient Dense 3D Point Tracking

DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation

OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation

CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

HPSv3: Towards Wide-Spectrum Human Preference Score

ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis

M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

GLID: Pre-training a Generalist Encoder-Decoder Vision Model

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Cross-Scene Crowd Counting via Deep Convolutional Neural Networks

Saliency Detection by Multi-Context Deep Learning

DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection

Understanding Pedestrian Behaviors From Stationary Crowd Groups

Object Detection From Video Tubelets With Convolutional Neural Networks

Learning Deep Feature Representations With Domain Guided Dropout for Person Re-Identification

Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs

Controllable 3D Face Synthesis with Conditional Generative Occupancy Fields

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

MCMAE: Masked Convolution Meets Masked Autoencoders

LightZero: A Unified Benchmark for Monte Carlo Tree Search in General Sequential Decision Scenarios

JourneyDB: A Benchmark for Generative Image Understanding

A Unified Conditional Framework for Diffusion-based Image Restoration

Context-PIPs: Persistent Independent Particles Demands Spatial Context Features

UE4-NeRF: Neural Radiance Field for Real-Time Rendering of Large-Scale Scene