Peng Gao
49 papers, 816 total citations

Papers (49)
- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. ICLR 2024, 320 citations.
- MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency. ICML 2025, 88 citations.
- Learning Where to Focus for Efficient Video Object Detection. ECCV 2020, 60 citations.
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation. AAAI 2024 (arXiv), 58 citations.
- Lumina-Image 2.0: A Unified and Efficient Image Generative Framework. ICCV 2025, 52 citations.
- BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation. ICLR 2024, 46 citations.
- Digital Life Project: Autonomous 3D Characters with Social Intelligence. CVPR 2024, 46 citations.
- EnerVerse: Envisioning Embodied Future Space for Robotics Manipulation. NeurIPS 2025, 34 citations.
- From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning. ICCV 2025, 28 citations.
- No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation. CVPR 2024, 27 citations.
- PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions. ICLR 2025, 26 citations.
- VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning. ICCV 2025, 20 citations.
- Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation. ICLR 2025, 8 citations.
- Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding. CVPR 2025, 3 citations.
- Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering. CVPR 2019, 0 citations.
- PointCLIP: Point Cloud Understanding by CLIP. CVPR 2022 (arXiv), 0 citations.
- Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners. CVPR 2023 (arXiv), 0 citations.
- Starting From Non-Parametric Networks for 3D Point Cloud Analysis. CVPR 2023 (arXiv), 0 citations.
- Q-DETR: An Efficient Low-Bit Quantized Detection Transformer. CVPR 2023, 0 citations.
- Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders. CVPR 2023 (arXiv), 0 citations.
- Stare at What You See: Masked Image Modeling Without Reconstruction. CVPR 2023 (arXiv), 0 citations.
- Multi-Modality Latent Interaction Network for Visual Question Answering. ICCV 2019, 0 citations.
- Fast Convergence of DETR With Spatially Modulated Co-Attention. ICCV 2021, 0 citations.
- Let's Verify and Reinforce Image Generation Step by Step. CVPR 2025, 0 citations.
- PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning. ICCV 2023 (arXiv), 0 citations.
- Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement. ICCV 2023 (arXiv), 0 citations.
- SparseMAE: Sparse Training Meets Masked Autoencoders. ICCV 2023, 0 citations.
- IDa-Det: An Information Discrepancy-Aware Distillation for 1-Bit Detectors. ECCV 2022, 0 citations.
- Recurrent Bilinear Optimization for Binary Neural Networks. ECCV 2022, 0 citations.
- Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation. ECCV 2022, 0 citations.
- Frozen CLIP Models Are Efficient Video Learners. ECCV 2022, 0 citations.
- Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification. ECCV 2022, 0 citations.
- MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection. ICCV 2023 (arXiv), 0 citations.
- Spatial Preference Rewarding for MLLMs Spatial Understanding. ICCV 2025, 0 citations.
- TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction. ICCV 2025, 0 citations.
- FontAnimate: High Quality Few-shot Font Generation via Animating Font Transfer Process. ICCV 2025, 0 citations.
- How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation? ICCV 2025, 0 citations.
- A Multi-Focus-Driven Multi-Branch Network for Robust Multimodal Sentiment Analysis. AAAI 2025, 0 citations.
- LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding. AAAI 2025, 0 citations.
- OneLLM: One Framework to Align All Modalities with Language. CVPR 2024, 0 citations.
- Masked AutoDecoder is Effective Multi-Task Vision Generalist. CVPR 2024, 0 citations.
- InstructSpeech: Following Speech Editing Instructions via Large Language Models. ICML 2024, 0 citations.
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models. ICML 2024, 0 citations.
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI. ICML 2024, 0 citations.
- FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion. ICML 2024, 0 citations.
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models. ICML 2024, 0 citations.
- Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training. NeurIPS 2022, 0 citations.
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer. NeurIPS 2022, 0 citations.
- MCMAE: Masked Convolution Meets Masked Autoencoders. NeurIPS 2022, 0 citations.