- This refers to the original Generative Pre-trained Transformer (GPT) model from the OpenAI paper. Most of the models below would also fall roughly under the GPT category.
- GPT models (like most of the models below) are built from transformer blocks.
- The big breakthrough from the paper was its two-step training regimen:
- First, generating a base model from unsupervised training on a large corpus of various text.
- Then, finetuning the base model for specific tasks in a later step.
- In both steps the model is trained with a next-word (language modeling) objective; during finetuning this is used as an auxiliary loss alongside the task-specific one.
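The next-word objective above can be sketched in a few lines. This is a minimal illustration of my own (not code from the paper) showing how a token sequence is turned into (context, target) training pairs:

```python
# A minimal sketch of next-word prediction data prep: each position's
# training target is simply the token that follows it.
def next_token_pairs(token_ids):
    """Return (context, target) pairs for next-token prediction."""
    return [(token_ids[:i + 1], token_ids[i + 1])
            for i in range(len(token_ids) - 1)]

# Toy token ids standing in for a short sentence.
tokens = [12, 7, 31, 4, 12, 9]
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

In a real model the context would be fed through the transformer and the target used for a cross-entropy loss; the pairing logic is the same.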
- BERT models differ from the earlier GPT models in that their attention layers attend to all tokens in the context, not just the tokens that came earlier. This bidirectional attention improves sentence-level comprehension.
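The attention difference can be made concrete with the masks themselves. A rough sketch (mine, not from either paper): GPT-style attention uses a causal mask, BERT-style attention lets every token see every other token.

```python
# mask[i][j] == 1 means token at position i may attend to position j.

def causal_mask(n):
    """GPT-style: each token sees only itself and earlier positions."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def full_mask(n):
    """BERT-style: every token sees every position, both directions."""
    return [[1] * n for _ in range(n)]

print(causal_mask(3))  # [[1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(full_mask(3))    # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```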
- BERT is trained on two objectives:
- Masked language modeling (a cloze task): some input tokens are hidden and the model predicts the original tokens.
- Next sentence prediction: given two sentences, predict whether the second actually follows the first.
- Notably, BERT is not autoregressive. While GPT models feed their own generated output back in when predicting each next word, BERT produces its predictions in a single pass from the input alone. That suits its masked-token and next-sentence objectives, but it means BERT is not built to generate long outputs.
- BERT tends to be good at classification, question answering, and sentiment analysis.
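The cloze (masked language modeling) objective above can be sketched as follows. This is a rough illustration under my own assumptions, not BERT's exact recipe (BERT also sometimes keeps or randomizes the chosen tokens); the mask id 103 is assumed from BERT's WordPiece vocabulary:

```python
import random

MASK_ID = 103  # assumed [MASK] token id, as in BERT's WordPiece vocab

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Hide ~mask_prob of tokens; keep the originals as targets."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(MASK_ID)  # model sees [MASK] here
            targets.append(tok)     # and must recover this token
        else:
            inputs.append(tok)
            targets.append(None)    # no loss at unmasked positions
    return inputs, targets
```

The model is then trained to predict the target token at each masked position, using the full bidirectional context.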
- BART model training takes a sentence, corrupts it (e.g., by masking spans, deleting tokens, or shuffling sentences), and trains the model to reconstruct the original.
- Unlike BERT, BART does use an autoregressive decoder, so it is better at generating longer outputs.
- It tends to outperform BERT on longer passages and on tasks such as summarization and text generation. It also tends to handle noisy text better, since it was trained specifically to recover from corrupted input.
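The corrupt-then-reconstruct setup can be sketched with one of BART's corruption schemes, text infilling. This is a toy version under my own assumptions (fixed span length, a literal `<mask>` string), not the paper's exact implementation:

```python
import random

def corrupt(tokens, span_len=2, seed=0):
    """Replace one contiguous span of tokens with a single <mask>.

    Returns (corrupted input, reconstruction target): the model must
    regenerate the original, uncorrupted sequence from the input.
    """
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span_len + 1)
    corrupted = tokens[:start] + ["<mask>"] + tokens[start + span_len:]
    return corrupted, tokens

src = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, target = corrupt(src)
```

Note that the mask hides how many tokens were removed, so the decoder must decide both what to generate and how much, which is part of why this objective transfers well to generation tasks.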