Peng Gao

49 Papers · 816 Total Citations

Papers (49)

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

ICLR 2024 · 320 citations

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

ICML 2025 · 88 citations

Learning Where to Focus for Efficient Video Object Detection

ECCV 2020 · 60 citations

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

AAAI 2024 · 58 citations

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

ICCV 2025 · 52 citations

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

ICLR 2024 · 46 citations

Digital Life Project: Autonomous 3D Characters with Social Intelligence

CVPR 2024 · 46 citations

EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation

NeurIPS 2025 · 34 citations

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

ICCV 2025 · 28 citations

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

CVPR 2024 · 27 citations

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

ICLR 2025 · 26 citations

VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning

ICCV 2025 · 20 citations

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation

ICLR 2025 · 8 citations

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

CVPR 2025 · 3 citations

Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering

CVPR 2019 · 0 citations

PointCLIP: Point Cloud Understanding by CLIP

CVPR 2022 · 0 citations

Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners

CVPR 2023 · 0 citations

Starting From Non-Parametric Networks for 3D Point Cloud Analysis

CVPR 2023 · 0 citations

Q-DETR: An Efficient Low-Bit Quantized Detection Transformer

CVPR 2023 · 0 citations

Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders

CVPR 2023 · 0 citations

Stare at What You See: Masked Image Modeling Without Reconstruction

CVPR 2023 · 0 citations

Multi-Modality Latent Interaction Network for Visual Question Answering

ICCV 2019 · 0 citations

Fast Convergence of DETR With Spatially Modulated Co-Attention

ICCV 2021 · 0 citations

Let's Verify and Reinforce Image Generation Step by Step

CVPR 2025 · 0 citations

PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning

ICCV 2023 · 0 citations

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement

ICCV 2023 · 0 citations

SparseMAE: Sparse Training Meets Masked Autoencoders

ICCV 2023 · 0 citations

IDa-Det: An Information Discrepancy-Aware Distillation for 1-Bit Detectors

ECCV 2022 · 0 citations

Recurrent Bilinear Optimization for Binary Neural Networks

ECCV 2022 · 0 citations

Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation

ECCV 2022 · 0 citations

Frozen CLIP Models Are Efficient Video Learners

ECCV 2022 · 0 citations

Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification

ECCV 2022 · 0 citations

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

ICCV 2023 · 0 citations

Spatial Preference Rewarding for MLLMs Spatial Understanding

ICCV 2025 · 0 citations

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

ICCV 2025 · 0 citations

FontAnimate: High Quality Few-shot Font Generation via Animating Font Transfer Process

ICCV 2025 · 0 citations

How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

ICCV 2025 · 0 citations

A Multi-Focus-Driven Multi-Branch Network for Robust Multimodal Sentiment Analysis

AAAI 2025 · 0 citations

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

AAAI 2025 · 0 citations

OneLLM: One Framework to Align All Modalities with Language

CVPR 2024 · 0 citations

Masked AutoDecoder is Effective Multi-Task Vision Generalist

CVPR 2024 · 0 citations

InstructSpeech: Following Speech Editing Instructions via Large Language Models

ICML 2024 · 0 citations

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

ICML 2024 · 0 citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

ICML 2024 · 0 citations

FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion

ICML 2024 · 0 citations

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

ICML 2024 · 0 citations

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

NeurIPS 2022 · 0 citations

Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer

NeurIPS 2022 · 0 citations

MCMAE: Masked Convolution Meets Masked Autoencoders

NeurIPS 2022 · 0 citations