Hongsheng Li

44 Papers

758 Total Citations

Papers (44)

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow

SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

Mixture Compressor for Mixture-of-Experts LLMs Gains More

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

Docopilot: Improving Multimodal Models for Document-Level Understanding

DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation

Language Model Guided Interpretable Video Action Reasoning

BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events

Delving Deep into Engagement Prediction of Short Videos

One Leaf Reveals the Season: Occlusion-Based Contrastive Learning with Semantic-Aware Views for Efficient Visual Representation

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

FlexDrive: Toward Trajectory Flexibility in Driving Scene Gaussian Splatting Reconstruction and Rendering

ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis

HPSv3: Towards Wide-Spectrum Human Preference Score

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

CameraCtrl II: Dynamic Scene Exploration via Camera-controlled Video Diffusion Models

M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

GaussianPainter: Painting Point Cloud into 3D Gaussians with Normal Guidance

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

GLID: Pre-training a Generalist Encoder-Decoder Vision Model

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

LMDrive: Closed-Loop End-to-End Driving with Large Language Models

OPTICAL: Leveraging Optimal Transport for Contribution Allocation in Dataset Distillation

DirectTriGS: Triplane-based Gaussian Splatting Field Representation for 3D Generation

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

GS-DiT: Advancing Video Generation with Dynamic 3D Gaussian Fields through Efficient Dense 3D Point Tracking

Let's Verify and Reinforce Image Generation Step by Step

FreeSim: Toward Free-viewpoint Camera Simulation in Driving Scenes

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models