Huggingface Transformers
- source: https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt
Pipelines
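A minimal sketch of basic pipeline usage, assuming the `sentiment-analysis` task and an example sentence from the course:

```python
from transformers import pipeline

# With no model specified, the default checkpoint for the task is downloaded
classifier = pipeline("sentiment-analysis")

classifier("I've been waiting for a HuggingFace course my whole life.")
# [{'label': 'POSITIVE', 'score': 0.9598...}]
```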
- Full list of premade pipelines: https://huggingface.co/docs/transformers/main_classes/pipelines
- Pipelines take a `model` argument. Some models also specify a particular pipeline to use, and in that case you can omit the pipeline name. See the sketch after this list.
- Mask model (`fill-mask`) for filling in blanked-out words; see the sketch after this list.
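A sketch of both, assuming the checkpoints the course uses (`distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis and `distilroberta-base` for mask filling):

```python
from transformers import pipeline

# Passing an explicit model; the task name could be omitted here because the
# checkpoint declares which pipeline it belongs to
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Mask filling: predicts the most likely tokens for the <mask> slot
unmasker = pipeline("fill-mask", model="distilroberta-base")
unmasker("This course will teach you all about <mask> models.", top_k=2)
```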
Pipeline Implementation
- Each pipeline is just running the few steps needed to run a model. For example, a BERT sequence classification pipeline may do something like this:
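A minimal sketch of those steps, assuming the SST-2 DistilBERT checkpoint and the example sentences from the course:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Tokenize the input
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

# Run the model
outputs = model(**inputs)

# Softmax to convert from logits to probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
# predictions == tensor([[4.0195e-02, 9.5980e-01],
#                        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)

model.config.id2label
# {0: 'NEGATIVE', 1: 'POSITIVE'}
# So the first input is positive, second input is negative.
```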
Specific Model Types
- In addition to `AutoModel`, the `transformers` library provides classes for specific model types, such as `BertConfig` and `BertModel`. In most inference-only cases `AutoModel` is fine though.
- When not training a model from scratch, you will usually load a specific pretrained checkpoint with `from_pretrained` rather than instantiating from a bare config. See the sketch after this list.
- `from_pretrained` will also download the weights and related files, if needed.
- When you have performed additional training, you can use `model.save_pretrained(directory)` to save the config and the weights to disk.
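A short sketch of both paths, using `bert-base-uncased`; the save directory name is just a placeholder:

```python
from transformers import BertConfig, BertModel

# Building from a bare config gives a randomly-initialized model
config = BertConfig()
model = BertModel(config)

# Loading a pretrained checkpoint instead (downloads weights and config if needed)
model = BertModel.from_pretrained("bert-base-uncased")

# After further training, write the config + weights back to disk
model.save_pretrained("my-finetuned-bert")  # placeholder directory name
```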
Tokenizers
- Tokenizers can both go from raw text to token IDs with `tokenizer.tokenize` and `tokenizer.convert_tokens_to_ids`, and back from token IDs to words again with `tokenizer.decode`, which will both convert IDs to tokens and combine subword tokens into full words.
- The padding token ID can be retrieved from `tokenizer.pad_token_id`.
- Attention masks can be used to tell the model to ignore certain tokens. This usually means the positions of the padding tokens are `0` and everything else is `1`. See the sketch below.
- Depending on the tensor framework in use, you can ask the tokenizer for different types of tensors.
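For example, a sketch with an arbitrary sentence:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sentences = ["Using a Transformer network is simple"]

pt_batch = tokenizer(sentences, return_tensors="pt")  # PyTorch
tf_batch = tokenizer(sentences, return_tensors="tf")  # TensorFlow (requires TF installed)
np_batch = tokenizer(sentences, return_tensors="np")  # NumPy
```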
- Tokenizers support all the standard configuration options (see the sketch after this list):
  - `truncation=True` to truncate inputs longer than the model context
  - `max_length=16` to use a custom truncation length
  - `padding=True` to pad inputs to the same length
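A sketch combining those options (the sentences are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sentences = ["A short sentence", "A much longer sentence that might need to be truncated"]

batch = tokenizer(
    sentences,
    truncation=True,   # truncate inputs longer than the model context
    max_length=16,     # custom truncation length
    padding=True,      # pad inputs in the batch to the same length
    return_tensors="pt",
)
```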
Training
- To train a model, you tokenize your inputs and then add an additional `labels` property, which is a tensor with the expected answer for each one. See the sketch below.
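A minimal sketch of a single training step, assuming `bert-base-uncased` and two made-up examples both labeled `1`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# The answers for each of the above
batch["labels"] = torch.tensor([1, 1])

optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
# Single step, usually a whole training loop would go here.
optimizer.step()
```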
- Because `bert-base-uncased` is not originally set up for sequence classification, the library will discard the original model head and add a new randomly-weighted head for sequence classification.
- The `datasets` library loads a dataset with its training, validation, and test splits. `dataset.features` describes the feature names and types, including (when applicable) the descriptions of what each label number actually means.
- You can use `dataset.map` to tokenize while keeping all the data in the much more efficient Apache Arrow format. It also does multiprocessing and caches results.
- e.g. tokenizing sentence pairs, the two-sentence input format BERT uses for next-sentence prediction:
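A sketch, assuming the GLUE MRPC sentence-pair dataset the course uses:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets["train"].features
# Includes a ClassLabel(names=['not_equivalent', 'equivalent']) describing what the labels mean

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(example):
    # Tokenize the sentence pair; leave padding to the collator
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Using the collator to pad this way per batch is more efficient than padding
# everything to the max length across all items
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```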
- With that set up, you can start your training loop using the `Trainer` class, which handles all the batching, gradient descent, etc. See the sketch below.
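A sketch along the lines of the course's `Trainer` setup, reusing `tokenized_datasets`, `tokenizer`, and `data_collator` from the sketch above; `test-trainer` is a placeholder output directory:

```python
import numpy as np
import evaluate
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Can also pass `push_to_hub=True` to automatically push to Huggingface Hub when done
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

# A function to report metrics at the end of each `evaluation_strategy` from the TrainingArguments
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```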
- Full training loop example at https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt
- Once the model has finished training, the `Trainer` will let you run the model on new data with `trainer.predict`. See the sketch below.
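A sketch, again reusing the `trainer` and datasets from the sketches above:

```python
preds = trainer.predict(tokenized_datasets["validation"])
# { predictions: [predicted logits for each row], label_ids: [correct answers], metrics }
print(preds.predictions.shape, preds.metrics)
```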