Peng Gao
49 papers, 816 total citations

Papers (49)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. ICLR 2024, 320 citations.
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency. ICML 2025, 88 citations.
- Learning Where to Focus for Efficient Video Object Detection. ECCV 2020, 60 citations.
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation. AAAI 2024 (arXiv), 58 citations.
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework. ICCV 2025, 52 citations.
- BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation. ICLR 2024, 46 citations.
- Digital Life Project: Autonomous 3D Characters with Social Intelligence. CVPR 2024, 46 citations.
- EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation. NeurIPS 2025, 34 citations.
- From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning. ICCV 2025, 28 citations.
- No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation. CVPR 2024, 27 citations.
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions. ICLR 2025, 26 citations.
- VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning. ICCV 2025, 20 citations.
- Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation. ICLR 2025, 8 citations.
- Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding. CVPR 2025, 3 citations.
- Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering. CVPR 2019, 0 citations.
- PointCLIP: Point Cloud Understanding by CLIP. CVPR 2022 (arXiv), 0 citations.
- Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners. CVPR 2023 (arXiv), 0 citations.
- Starting From Non-Parametric Networks for 3D Point Cloud Analysis. CVPR 2023 (arXiv), 0 citations.
- Q-DETR: An Efficient Low-Bit Quantized Detection Transformer. CVPR 2023, 0 citations.
- Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders. CVPR 2023 (arXiv), 0 citations.
- Stare at What You See: Masked Image Modeling Without Reconstruction. CVPR 2023 (arXiv), 0 citations.
- Multi-Modality Latent Interaction Network for Visual Question Answering. ICCV 2019, 0 citations.
- Fast Convergence of DETR With Spatially Modulated Co-Attention. ICCV 2021, 0 citations.
- Let's Verify and Reinforce Image Generation Step by Step. CVPR 2025, 0 citations.
- PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning. ICCV 2023 (arXiv), 0 citations.
- Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement. ICCV 2023 (arXiv), 0 citations.
- SparseMAE: Sparse Training Meets Masked Autoencoders. ICCV 2023, 0 citations.
- IDa-Det: An Information Discrepancy-Aware Distillation for 1-Bit Detectors. ECCV 2022, 0 citations.
- Recurrent Bilinear Optimization for Binary Neural Networks. ECCV 2022, 0 citations.
- Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation. ECCV 2022, 0 citations.
- Frozen CLIP Models Are Efficient Video Learners. ECCV 2022, 0 citations.
- Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification. ECCV 2022, 0 citations.
- MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection. ICCV 2023 (arXiv), 0 citations.
- Spatial Preference Rewarding for MLLMs Spatial Understanding. ICCV 2025, 0 citations.
- TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction. ICCV 2025, 0 citations.
- FontAnimate: High Quality Few-shot Font Generation via Animating Font Transfer Process. ICCV 2025, 0 citations.
- How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation? ICCV 2025, 0 citations.
- A Multi-Focus-Driven Multi-Branch Network for Robust Multimodal Sentiment Analysis. AAAI 2025, 0 citations.
- LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding. AAAI 2025, 0 citations.
- OneLLM: One Framework to Align All Modalities with Language. CVPR 2024, 0 citations.
- Masked AutoDecoder is Effective Multi-Task Vision Generalist. CVPR 2024, 0 citations.
- InstructSpeech: Following Speech Editing Instructions via Large Language Models. ICML 2024, 0 citations.
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models. ICML 2024, 0 citations.
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. ICML 2024, 0 citations.
- FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion. ICML 2024, 0 citations.
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. ICML 2024, 0 citations.
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training. NeurIPS 2022, 0 citations.
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer. NeurIPS 2022, 0 citations.
- MCMAE: Masked Convolution Meets Masked Autoencoders. NeurIPS 2022, 0 citations.