Huggingface Transformers
- source: https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt
Pipelines
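A minimal sketch of basic pipeline usage, assuming the `sentiment-analysis` task and an example sentence from the course:

```python
from transformers import pipeline

# With no model specified, the default checkpoint for the task is downloaded
classifier = pipeline("sentiment-analysis")

classifier("I've been waiting for a HuggingFace course my whole life.")
# [{'label': 'POSITIVE', 'score': 0.9598...}]
```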
- Full list of premade pipelines: https://huggingface.co/docs/transformers/main_classes/pipelines
- Pipelines take a `model` argument. Some models also specify a particular pipeline to use, and in that case you can omit the pipeline name. See the sketch after this list.
- Mask model (`fill-mask`) for filling in blanked-out words; see the sketch after this list.
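A sketch of both, assuming the checkpoints the course uses (`distilbert-base-uncased-finetuned-sst-2-english` for sentiment analysis and `distilroberta-base` for mask filling):

```python
from transformers import pipeline

# Passing an explicit model; the task name could be omitted here because the
# checkpoint declares which pipeline it belongs to
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Mask filling: predicts the most likely tokens for the <mask> slot
unmasker = pipeline("fill-mask", model="distilroberta-base")
unmasker("This course will teach you all about <mask> models.", top_k=2)
```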
Pipeline Implementation
- Each pipeline is just running the few steps needed to run a model. For example, a BERT sequence classification pipeline may do something like this:
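A minimal sketch of those steps, assuming the SST-2 DistilBERT checkpoint and the example sentences from the course:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# Tokenize the input
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")

# Run the model
outputs = model(**inputs)

# Softmax to convert from logits to probabilities
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
# predictions == tensor([[4.0195e-02, 9.5980e-01],
#                        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)

model.config.id2label
# {0: 'NEGATIVE', 1: 'POSITIVE'}
# So the first input is positive, second input is negative.
```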
Specific Model Types
- In addition to `AutoModel`, the `transformers` library provides classes for specific model types, such as `BertConfig` and `BertModel`. In most inference-only cases `AutoModel` is fine though.
- When not training a model from scratch, you will usually load a specific pretrained checkpoint with `from_pretrained` rather than instantiating from a bare config. See the sketch after this list.
- `from_pretrained` will also download the weights and related files, if needed.
- When you have performed additional training, you can use `model.save_pretrained(directory)` to save the config and the weights to disk.
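A short sketch of both paths, using `bert-base-uncased`; the save directory name is just a placeholder:

```python
from transformers import BertConfig, BertModel

# Building from a bare config gives a randomly-initialized model
config = BertConfig()
model = BertModel(config)

# Loading a pretrained checkpoint instead (downloads weights and config if needed)
model = BertModel.from_pretrained("bert-base-uncased")

# After further training, write the config + weights back to disk
model.save_pretrained("my-finetuned-bert")  # placeholder directory name
```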
Tokenizers
- Tokenizers can both go from raw text to token IDs with `tokenizer.tokenize` and `tokenizer.convert_tokens_to_ids`, and back from token IDs to words again with `tokenizer.decode`, which will both convert IDs to tokens and combine subword tokens into full words.
- The padding token ID can be retrieved from `tokenizer.pad_token_id`.
- Attention masks can be used to tell the model to ignore certain tokens. This usually means the positions of the padding tokens are `0` and everything else is `1`. See the sketch below.
- Depending on the tensor framework in use, you can ask the tokenizer for different types of tensors.
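For example, a sketch with an arbitrary sentence:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sentences = ["Using a Transformer network is simple"]

pt_batch = tokenizer(sentences, return_tensors="pt")  # PyTorch
tf_batch = tokenizer(sentences, return_tensors="tf")  # TensorFlow (requires TF installed)
np_batch = tokenizer(sentences, return_tensors="np")  # NumPy
```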
- Tokenizers support all the standard configuration options (see the sketch after this list):
  - `truncation=True` to truncate inputs longer than the model context
  - `max_length=16` to use a custom truncation length
  - `padding=True` to pad inputs to the same length
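A sketch combining those options (the sentences are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sentences = ["A short sentence", "A much longer sentence that might need to be truncated"]

batch = tokenizer(
    sentences,
    truncation=True,   # truncate inputs longer than the model context
    max_length=16,     # custom truncation length
    padding=True,      # pad inputs in the batch to the same length
    return_tensors="pt",
)
```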
Training
- To train a model, you tokenize your inputs and then add an additional `labels` property, which is a tensor with the expected answer for each one. See the sketch below.
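A minimal sketch of a single training step, assuming `bert-base-uncased` and two made-up examples both labeled `1`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# The answers for each of the above
batch["labels"] = torch.tensor([1, 1])

optimizer = torch.optim.AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
# Single step, usually a whole training loop would go here.
optimizer.step()
```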
- Because `bert-base-uncased` is not originally set up for sequence classification, the library will discard the original model head and add a new randomly-weighted head for sequence classification.
- The `datasets` library loads a dataset with its training, validation, and test splits. `dataset.features` describes the feature names and types, including (when applicable) the descriptions of what each label number actually means.
- You can use `dataset.map` to tokenize while keeping all the data in the much more efficient Apache Arrow format. It also does multiprocessing and caches results.
- e.g. tokenizing sentence pairs, the two-sentence input format BERT uses for next-sentence prediction:
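A sketch, assuming the GLUE MRPC sentence-pair dataset the course uses:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets["train"].features
# Includes a ClassLabel(names=['not_equivalent', 'equivalent']) describing what the labels mean

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(example):
    # Tokenize the sentence pair; leave padding to the collator
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Using the collator to pad this way per batch is more efficient than padding
# everything to the max length across all items
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```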
- With that set up, you can start your training loop using the `Trainer` class, which handles all the batching, gradient descent, etc. See the sketch below.
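A sketch along the lines of the course's `Trainer` setup, reusing `tokenized_datasets`, `tokenizer`, and `data_collator` from the sketch above; `test-trainer` is a placeholder output directory:

```python
import numpy as np
import evaluate
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Can also pass `push_to_hub=True` to automatically push to Huggingface Hub when done
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")

# A function to report metrics at the end of each `evaluation_strategy` from the TrainingArguments
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```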
- Full training loop example at https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt
- Once the model has finished training, the `Trainer` will let you run the model on new data with `trainer.predict`. See the sketch below.
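A sketch, again reusing the `trainer` and datasets from the sketches above:

```python
preds = trainer.predict(tokenized_datasets["validation"])
# { predictions: [predicted logits for each row], label_ids: [correct answers], metrics }
print(preds.predictions.shape, preds.metrics)
```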