Model Compression
Quantization
- Reducing the numeric precision of the weights (and often activations): FP16 -> FP8/INT8, or even 4-bit formats. Lower precision cuts the memory footprint roughly proportionally and can speed up inference on hardware with low-precision kernels.
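A minimal sketch of what INT8 quantization does to a weight tensor, assuming a PyTorch-style workflow; the symmetric per-tensor scheme and the function names are illustrative, not any particular library's API:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Map an FP tensor onto the symmetric INT8 grid [-127, 127]."""
    scale = w.abs().max() / 127.0  # one scale for the whole tensor (per-tensor scheme)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 4)  # stand-in for an FP16/FP32 weight matrix
q, scale = quantize_int8(w)
# Round-tripping shows the precision lost to the coarser grid.
print("max abs error:", (w - dequantize(q, scale)).abs().max().item())
```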
Pruning
- A lot of work has gone into model pruning: the observation that a large fraction of the weights in a trained model are redundant or contribute little, and can be removed entirely. This makes the model smaller and, with a sparsity-aware runtime, faster (see the sketch after the paper list below).
- Fine-tuning after pruning can recover much of the accuracy lost to the removed weights.
- SparseZoo (from Neural Magic) is a repository of pre-pruned, ready-to-use sparse models.
- Papers
- [[1803.03635] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/abs/1803.03635)
- [[2007.12223] The Lottery Ticket Hypothesis for Pre-trained BERT Networks](https://arxiv.org/abs/2007.12223)
- [[2211.03013] Robust Lottery Tickets for Pre-trained Language Models](https://arxiv.org/abs/2211.03013)
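As referenced above, a minimal sketch of one common approach, magnitude pruning, using PyTorch's built-in `torch.nn.utils.prune` utilities; the toy model and the 60% sparsity target are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Zero out the 60% of weights with the smallest magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)

# In practice you would fine-tune the masked model here to recover accuracy
# before making the sparsity permanent.

# `prune` keeps a reparameterized mask; `remove` bakes the zeros into the weight.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

sparsity = (model[0].weight == 0).float().mean().item()
print(f"layer-0 sparsity: {sparsity:.0%}")
```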
- OSS
- Commercial Offerings