"model compression" Papers
58 papers found • Page 1 of 2
CASP: Compression of Large Multimodal Models Based on Attention Sparsity
Mohsen Gholami, Mohammad Akbari, Kevin Cannons et al.
Computation and Memory-Efficient Model Compression with Gradient Reweighting
Zhiwei Li, Yuesen Liao, Binrui Wu et al.
DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
Yongqi Huang, Peng Ye, Chenyu Huang et al.
Dynamical Low-Rank Compression of Neural Networks with Robustness under Adversarial Attacks
Steffen Schotthöfer, Lexie Yang, Stefan Schnake
EMLoC: Emulator-based Memory-efficient Fine-tuning with LoRA Correction
Hsi-Che Lin, Yu-Chu Yu, Kai-Po Chang et al.
FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation
Zhuguanyu Wu, Shihe Wang, Jiayi Zhang et al.
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee, Haebin Seong, Dong Bok Lee et al.
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
Yuxian Gu, Qinghao Hu, Haocheng Xi et al.
Layer as Puzzle Pieces: Compressing Large Language Models through Layer Concatenation
Fei Wang, Li Shen, Liang Ding et al.
LLaMaFlex: Many-in-one LLMs via Generalized Pruning and Weight Sharing
Ruisi Cai, Saurav Muralidharan, Hongxu Yin et al.
MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics
Bowei Guo, Shengkun Tang, Cong Zeng et al.
Optimal Brain Apoptosis
Mingyuan Sun, Zheng Fang, Jiaxu Wang et al.
Quantized Spike-driven Transformer
Xuerui Qiu, Malu Zhang, Jieyuan Zhang et al.
RSAVQ: Riemannian Sensitivity-Aware Vector Quantization for Large Language Models
Zukang Xu, Xing Hu, Qiang Wu et al.
SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
Rasoul Shafipour, David Harrison, Maxwell Horton et al.
SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking
Xingrun Xing, Boyan Gao, Zheng Liu et al.
Systematic Outliers in Large Language Models
Yongqi An, Xu Zhao, Tao Yu et al.
The Unreasonable Ineffectiveness of the Deeper Layers
Andrey Gromov, Kushal Tirumala, Hassan Shapourian et al.
TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks
Xiang Meng, Mehdi Makni, Rahul Mazumder
An Empirical Study of CLIP for Text-Based Person Search
Cao Min, Yang Bai, Ziyin Zeng et al.
Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
Yeonhong Park, Jake Hyun, SangLyul Cho et al.
Bayesian Knowledge Distillation: A Bayesian Perspective of Distillation with Uncertainty Quantification
Luyang Fang, Yongkai Chen, Wenxuan Zhong et al.
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
Wei Huang, Yangdong Liu, Haotong Qin et al.
BiPFT: Binary Pre-trained Foundation Transformer with Low-Rank Estimation of Binarization Residual Polynomials
Xingrun Xing, Li Du, Xinyuan Wang et al.
Building Variable-Sized Models via Learngene Pool
Boyu Shi, Shiyu Xia, Xu Yang et al.
CHAI: Clustered Head Attention for Efficient LLM Inference
Saurabh Agarwal, Bilge Acun, Basil Hosmer et al.
Compressing Large Language Models by Joint Sparsification and Quantization
Jinyang Guo, Jianyu Wu, Zining Wang et al.
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Junyuan Hong, Jinhao Duan, Chenhui Zhang et al.
DFD: Distilling the Feature Disparity Differently for Detectors
Kang Liu, Yingyi Zhang, Jingyun Zhang et al.
DistiLLM: Towards Streamlined Distillation for Large Language Models
Jongwoo Ko, Sungnyun Kim, Tianyi Chen et al.
Do Topological Characteristics Help in Knowledge Distillation?
Jungeun Kim, Junwon You, Dongjin Lee et al.
DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
Aditya Annavajjala, Alind Khare, Animesh Agrawal et al.
Enhancing Vision Transformer: Amplifying Non-Linearity in Feedforward Network Module
Yixing Xu, Chao Li, Dong Li et al.
Entropy Induced Pruning Framework for Convolutional Neural Networks
Yiheng Lu, Ziyu Guan, Yaming Yang et al.
EPSD: Early Pruning with Self-Distillation for Efficient Model Compression
Dong Chen, Ning Liu, Yichen Zhu et al.
ERQ: Error Reduction for Post-Training Quantization of Vision Transformers
Yunshan Zhong, Jiawei Hu, You Huang et al.
ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking
Wenshuo Li, Xinghao Chen, Han Shu et al.
Exploring Intrinsic Dimension for Vision-Language Model Pruning
Hanzhang Wang, Jiawen Zhang, Qingyuan Ma
Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
Zhewei Yao, Xiaoxia Wu, Cheng Li et al.
Extreme Compression of Large Language Models via Additive Quantization
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev et al.
Flextron: Many-in-One Flexible Large Language Model
Ruisi Cai, Saurav Muralidharan, Greg Heinrich et al.
Fluctuation-Based Adaptive Structured Pruning for Large Language Models
Yongqi An, Xu Zhao, Tao Yu et al.
FrameQuant: Flexible Low-Bit Quantization for Transformers
Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang et al.
Generative Model-Based Feature Knowledge Distillation for Action Recognition
Guiqin Wang, Peng Zhao, Yanjiang Shi et al.
Good Teachers Explain: Explanation-Enhanced Knowledge Distillation
Amin Parchami, Moritz Böhle, Sukrut Rao et al.
Junk DNA Hypothesis: Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs "Difficult" Downstream Tasks in LLMs
Lu Yin, Ajay Jaiswal, Shiwei Liu et al.
KernelWarehouse: Rethinking the Design of Dynamic Convolution
Chao Li, Anbang Yao
Lightweight Image Super-Resolution via Flexible Meta Pruning
Yulun Zhang, Kai Zhang, Luc Van Gool et al.
Localizing Task Information for Improved Model Merging and Compression
Ke Wang, Nikolaos Dimitriadis, Guillermo Ortiz-Jimenez et al.
OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models
Changhun Lee, Jungyu Jin, Taesu Kim et al.