Kaipeng Zhang

Papers: 20

Total Citations: 541

Papers (20)

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation

Neighboring Autoregressive Modeling for Efficient Visual Generation

REPA Works Until It Doesn’t: Early-Stopped, Holistic Alignment Supercharges Diffusion Training

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification

OneLLM: One Framework to Align All Modalities with Language

ZipVL: Accelerating Vision-Language Models through Dynamic Token Sparsity

LiT: Delving into a Simple Linear Diffusion Transformer for Image Generation

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

Position: Towards Implicit Prompt For Text-To-Image Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

Detecting Faces Using Inside Cascaded Contextual CNN

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP without Training

Neural Routing by Memory

Foundation Model is Efficient Multimodal Multitask Model Selector