Huggingface Transformers

  • source: https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt
  • Pipelines

    • from transformers import pipeline
      
      classifier = pipeline("sentiment-analysis")
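      
      # Call it on one or more strings; each result is a dict with a label and a
      # confidence score, roughly like:
      classifier("I've been waiting for a HuggingFace course my whole life.")
      # [{'label': 'POSITIVE', 'score': 0.9598}]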
      
      
    • Full list of premade pipelines: https://huggingface.co/docs/transformers/main_classes/pipelines
    • Pipelines take a model argument to select a specific checkpoint. Some models also declare a default pipeline/task, in which case you can pass just the model and omit the pipeline name.
    • generator = pipeline("text-generation", model="distilgpt2")
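      
      # Optional arguments control generation, e.g. (values are illustrative):
      generator(
          "In this course, we will teach you how to",
          max_length=30,
          num_return_sequences=2,
      )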
      
    • The fill-mask pipeline fills in masked words:
    • from transformers import pipeline
      
      unmasker = pipeline("fill-mask")
      unmasker("This course will teach you all about <mask> models.", top_k=2)
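      # Returns one dict per suggestion with the filled-in sequence, a score, and
      # the predicted token, roughly like:
      # [{'sequence': 'This course will teach you all about mathematical models.',
      #   'score': ..., 'token': ..., 'token_str': ' mathematical'},
      #  {'sequence': 'This course will teach you all about computational models.', ...}]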
      
      
    • Pipeline Implementation

    • Each pipeline just wraps the few steps needed to run a model end to end. For example, a BERT sequence classification pipeline may do something like this:
      • from transformers import AutoTokenizer, AutoModelForSequenceClassification
        import torch
        
        text_inputs = [
            "I've been waiting for a HuggingFace course my whole life.",
            "I hate this so much!",
        ]
        checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
        
        # Tokenize the input
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        inputs = tokenizer(text_inputs, padding=True, truncation=True, return_tensors="pt")
        
        # Run the model
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
        outputs = model(**inputs)
        
        # Softmax to convert from logits to probabilities
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        # predictions == tensor([[4.0195e-02, 9.5980e-01],
        #        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward>)
        
        model.config.id2label
        # {0: 'NEGATIVE', 1: 'POSITIVE'}
        # So the first input is positive, second input is negative.
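        
        # Pick the argmax index per row and map it back to a label (a small
        # illustrative extra step):
        [model.config.id2label[i] for i in predictions.argmax(dim=-1).tolist()]
        # ['POSITIVE', 'NEGATIVE']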
        
  • Specific Model Types

    • In addition to AutoModel, the transformers library provides classes for specific model types, such as BertConfig and BertModel. For most inference-only cases, though, AutoModel is fine.
    • When not training a model from scratch, you will usually load a pretrained checkpoint (its config plus weights) with from_pretrained rather than building a config by hand.
    • from transformers import BertModel, BertTokenizer
      
      model = BertModel.from_pretrained("bert-base-cased")
      tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
      
    • from_pretrained will also download the weights and related files, if needed.
    • When you have performed additional training, you can use model.save_pretrained(directory) to save the config and the weights to disk.
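    • For example (a minimal sketch; the directory name is arbitrary):
    • model.save_pretrained("my-finetuned-bert")
      tokenizer.save_pretrained("my-finetuned-bert")
      
      # Later, reload the saved model and tokenizer from that directory
      model = BertModel.from_pretrained("my-finetuned-bert")
      tokenizer = BertTokenizer.from_pretrained("my-finetuned-bert")
      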
  • Tokenizers

    • Tokenizers can go from raw text to token IDs with tokenizer.tokenize and tokenizer.convert_tokens_to_ids, and back from token IDs to text with tokenizer.decode, which both converts IDs to tokens and merges subword tokens into full words.
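    • A sketch of that round trip (exact tokens and IDs depend on the tokenizer; this example uses bert-base-cased as in the course):
    • from transformers import AutoTokenizer
      
      tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
      
      tokens = tokenizer.tokenize("Using a Transformer network is simple")
      # ['Using', 'a', 'transform', '##er', 'network', 'is', 'simple']
      
      ids = tokenizer.convert_tokens_to_ids(tokens)
      
      tokenizer.decode(ids)
      # 'Using a Transformer network is simple'
      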
    • The padding token ID can be retrieved from tokenizer.pad_token_id.
    • Attention masks can be used to tell the model to ignore certain tokens. This usually matches to the locations of the padding tokens being 0 and everything else being 1.
    • ids = torch.tensor([...])             # padded token IDs
      attention_mask = torch.tensor([...])  # 1 = attend to this token, 0 = ignore (padding)
      
      outputs = model(ids, attention_mask=attention_mask)
      
    • Depending on the tensor framework in use, you can ask the tokenizer for different types of tensors.
    • inputs = tokenizer(texts, return_tensors="pt") # PyTorch
      inputs = tokenizer(texts, return_tensors="tf") # TensorFlow
      inputs = tokenizer(texts, return_tensors="np") # NumPy
      
    • Tokenizers support all the standard configuration options (combined in the example below)
      • truncation=True to truncate inputs longer than the model's context; max_length=16 to use a custom truncation length
      • padding=True to pad inputs to the same length
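      • Combined, this looks roughly like:
      • # texts is a list of input strings
        inputs = tokenizer(
            texts,
            padding=True,       # pad everything in the batch to the same length
            truncation=True,    # drop tokens beyond max_length
            max_length=16,
            return_tensors="pt",
        )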
  • Training

    • To train a model, you tokenize your inputs and then add an additional labels property which is a tensor with the expected answer for each one.
    • import torch
      from transformers import AdamW, AutoTokenizer, AutoModelForSequenceClassification
      
      checkpoint = "bert-base-uncased"
      tokenizer = AutoTokenizer.from_pretrained(checkpoint)
      model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
      
      sequences = ["string1", "string2", "etc"]
      batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
      
      # The answers for each of the above
      batch["labels"] = torch.tensor([1, 0, 1])
      
      # Single step, usually a whole training loop would go here.
      optimizer = AdamW(model.parameters())
      loss = model(**batch).loss
      loss.backward()
      optimizer.step()
      
    • Because bert-base-uncased is not originally set up for sequence classification, the library will discard the original model head and add a new randomly-weighted head for sequence classification.
    • The datasets library loads a dataset as a DatasetDict containing its splits (e.g. training, validation, and test sets).
    • dataset.features describes the feature names and types, including (when applicable) the descriptions of what each label number actually means
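    • For example, with the GLUE MRPC dataset used in the course (a rough sketch of the output):
    • from datasets import load_dataset
      
      dataset = load_dataset("glue", "mrpc")
      # DatasetDict with "train", "validation", and "test" splits
      
      dataset["train"].features
      # {'sentence1': Value(dtype='string'),
      #  'sentence2': Value(dtype='string'),
      #  'label': ClassLabel(names=['not_equivalent', 'equivalent']),
      #  'idx': Value(dtype='int32')}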
    • You can use dataset.map to tokenize while keeping all the data in the much more efficient Apache Arrow format. It also does multiprocessing and caches results.
      • e.g. tokenizing sentence pairs for a BERT sentence-pair classification task:
      • from transformers import DataCollatorWithPadding
        
        def tokenize(row):
            return tokenizer(row["sentence1"], row["sentence2"], truncation=True)
        
        tokenized = dataset.map(tokenize, batched=True)
        
        # Using the collator to pad this way per batch is more efficient than padding everything to the max length across all items
        collator = DataCollatorWithPadding(tokenizer=tokenizer)
        batch_size = 512
        
        # The collator only understands tokenizer outputs and labels, so drop the raw text/index columns first
        samples = tokenized["train"][:batch_size]
        samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
        samples = collator(samples)
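        
        # The collated batch is a dict of tensors, each padded to the longest
        # sequence in this batch:
        {k: v.shape for k, v in samples.items()}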
        
    • With that set up, you can start your training loop using the Trainer class, which handles all the batching, gradient descent, etc.
    • from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification
      import evaluate
      import numpy as np
      # Can also pass `push_to_hub=True` to automatically push to Huggingface Hub when done
      training_args = TrainingArguments("directory-to-save-to", evaluation_strategy="epoch")
      
      model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
      
      # A function to report metrics at the end of each `evaluation_strategy` from the TrainingArguments
      def compute_metrics(eval_preds):
          # Load the metric with the same arguments that loaded the dataset
          # (e.g. for the course's GLUE MRPC example)
          metric = evaluate.load("glue", "mrpc")
          logits, labels = eval_preds
          predictions = np.argmax(logits, axis=-1)
          return metric.compute(predictions=predictions, references=labels)
      
      trainer = Trainer(
          model,
          training_args,
          train_dataset=tokenized["train"],
          eval_dataset=tokenized["validation"],
          # This can be skipped if using a DataCollatorWithPadding since that's the default when omitted.
          data_collator=collator,
          tokenizer=tokenizer,
          compute_metrics=compute_metrics,
      )
      
      trainer.train()
      
    • Full training loop example at https://huggingface.co/learn/nlp-course/chapter3/4?fw=pt
    • Once training has finished, the Trainer can also run the model for inference.
    • predictions = trainer.predict(tokenized["validation"])
      # { predictions: [predicted logits for each row], label_ids: [correct answers], metrics }
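      
      # To turn the logits into final labels and score them (per the course, using
      # the same GLUE MRPC metric as above):
      metric = evaluate.load("glue", "mrpc")
      preds = np.argmax(predictions.predictions, axis=-1)
      metric.compute(predictions=preds, references=predictions.label_ids)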
      

Thanks for reading! If you have any questions or comments, please send me a note on Twitter.