Model Compression
Quantization
- Reducing the numeric precision of the weights (and often activations): FP16 -> FP8/INT8, or even 4-bit formats. Lower precision cuts the memory footprint roughly proportionally and can speed up inference on hardware with low-precision kernels.
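A minimal sketch of what INT8 quantization does to a weight tensor, assuming a PyTorch-style workflow; the symmetric per-tensor scheme and the function names are illustrative, not any particular library's API:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Map an FP tensor onto the symmetric INT8 grid [-127, 127]."""
    scale = w.abs().max() / 127.0  # one scale for the whole tensor (per-tensor scheme)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4, 4)  # stand-in for an FP16/FP32 weight matrix
q, scale = quantize_int8(w)
# Round-tripping shows the precision lost to the coarser grid.
print("max abs error:", (w - dequantize(q, scale)).abs().max().item())
```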
Pruning
- A lot of work has gone into model pruning: the observation that a large fraction of the weights in a trained model are redundant or contribute little, and can be removed entirely. This makes the model smaller and, with a sparsity-aware runtime, faster (see the sketch after the paper list below).
- Fine-tuning after pruning can recover much of the accuracy lost to the removed weights.
- SparseZoo (from Neural Magic) is a repository of pre-pruned, ready-to-use sparse models.
- Papers
- [[1803.03635] The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks](https://arxiv.org/abs/1803.03635)
- [[2007.12223] The Lottery Ticket Hypothesis for Pre-trained BERT Networks](https://arxiv.org/abs/2007.12223)
- [[2211.03013] Robust Lottery Tickets for Pre-trained Language Models](https://arxiv.org/abs/2211.03013)
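As referenced above, a minimal sketch of one common approach, magnitude pruning, using PyTorch's built-in `torch.nn.utils.prune` utilities; the toy model and the 60% sparsity target are illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Zero out the 60% of weights with the smallest magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.6)

# In practice you would fine-tune the masked model here to recover accuracy
# before making the sparsity permanent.

# `prune` keeps a reparameterized mask; `remove` bakes the zeros into the weight.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")

sparsity = (model[0].weight == 0).float().mean().item()
print(f"layer-0 sparsity: {sparsity:.0%}")
```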
- OSS
- Commercial Offerings