# Modeling Baselines

Motivating the use of baselines for iterative modeling.
Goku Mohandas

## Intuition

Baselines are simple benchmarks that pave the way for iterative development:

• Rapid experimentation via hyperparameter tuning, thanks to low model complexity.
• Discovery of data issues, false assumptions, bugs in code, etc., since the model itself is not complex.
• Pareto's principle: we can achieve decent performance with minimal initial effort.

## Process

Here is the high level approach to establishing baselines:

1. Start with the simplest possible baseline to compare subsequent development with. This is often a random (chance) model.
2. Develop a rule-based approach (when possible) using IFTTT, auxiliary data, etc.
3. Slowly add complexity by addressing limitations and motivating representations and model architectures.
4. Weigh tradeoffs (performance, latency, size, etc.) between performant baselines.
5. Revisit and iterate on baselines as your dataset grows.

We can also baseline on the dataset. Instead of using a fixed dataset and iterating on the models, we can choose a good baseline and iterate on the dataset:

• remove or fix data samples (FP, FN)
• prepare and transform features
• expand or consolidate classes
• incorporate auxiliary datasets
• identify unique slices to boost

When choosing which model architecture(s) to proceed with, which tradeoffs are important to consider, and how can we prioritize them?

• performance: consider coarse-grained and fine-grained (ex. per-class) performance.
• latency: how quickly does your model respond for inference?
• size: how large is your model and can you support its storage?
• compute: how much will it cost ($, carbon footprint, etc.) to train your model?
• interpretability: does your model need to explain its predictions?
• bias checks: does your model pass key bias checks?
• time to develop: how long do you have to develop the first version?
• time to retrain: how long does it take to retrain your model? This is very important to consider if you need to retrain often.
• maintenance overhead: who and what will be required to maintain your model versions? The real work with ML begins after deploying v1; you can't just hand the model off to your site reliability team to maintain it like many teams do with traditional software.

## Application

Each application's baseline trajectory varies based on the task and motivations. For our application, we're going to motivate the need for slowly adding complexity from both the representation (ex. embeddings) and architecture (ex. CNNs) views, as well as address the limitations at each step of the way. If you're unfamiliar with any of the modeling concepts here, be sure to check out the Foundations lessons.

We'll first set up some functions that we'll be using across the different baseline experiments.
```python
from sklearn.metrics import precision_recall_fscore_support
import torch
```

```python
def set_seeds(seed=1234):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # multi-GPU
```

```python
def get_data_splits(df, train_size=0.7):
    """Generate balanced data splits."""
    # Get data
    X = df.text.to_numpy()
    y = df.tags

    # Binarize y
    label_encoder = LabelEncoder()
    label_encoder.fit(y)
    y = label_encoder.encode(y)

    # Split
    X_train, X_, y_train, y_ = iterative_train_test_split(
        X, y, train_size=train_size)
    X_val, X_test, y_val, y_test = iterative_train_test_split(
        X_, y_, train_size=0.5)

    return X_train, X_val, X_test, y_train, y_val, y_test, label_encoder
```

We'll define a Trainer object which we will use for training, validation and inference.

```python
class Trainer(object):
    def __init__(self, model, device, loss_fn=None,
                 optimizer=None, scheduler=None):
        # Set params
        self.model = model
        self.device = device
        self.loss_fn = loss_fn
        self.optimizer = optimizer
        self.scheduler = scheduler

    def train_step(self, dataloader):
        """Train step."""
        # Set model to train mode
        self.model.train()
        loss = 0.0

        # Iterate over train batches
        for i, batch in enumerate(dataloader):
            # Step
            batch = [item.to(self.device) for item in batch]  # Set device
            inputs, targets = batch[:-1], batch[-1]
            self.optimizer.zero_grad()  # Reset gradients
            z = self.model(inputs)  # Forward pass
            J = self.loss_fn(z, targets)  # Define loss
            J.backward()  # Backward pass
            self.optimizer.step()  # Update weights

            # Cumulative Metrics
            loss += (J.detach().item() - loss) / (i + 1)

        return loss

    def eval_step(self, dataloader):
        """Validation or test step."""
        # Set model to eval mode
        self.model.eval()
        loss = 0.0
        y_trues, y_probs = [], []

        # Iterate over val batches
        with torch.inference_mode():
            for i, batch in enumerate(dataloader):
                # Step
                batch = [item.to(self.device) for item in batch]  # Set device
                inputs, y_true = batch[:-1], batch[-1]
                z = self.model(inputs)  # Forward pass
                J = self.loss_fn(z, y_true).item()

                # Cumulative Metrics
                loss += (J - loss) / (i + 1)

                # Store outputs
                y_prob = torch.sigmoid(z).cpu().numpy()
                y_probs.extend(y_prob)
                y_trues.extend(y_true.cpu().numpy())

        return loss, np.vstack(y_trues), np.vstack(y_probs)

    def predict_step(self, dataloader):
        """Prediction step."""
        # Set model to eval mode
        self.model.eval()
        y_probs = []

        # Iterate over batches
        with torch.inference_mode():
            for i, batch in enumerate(dataloader):
                # Forward pass w/ inputs
                inputs, targets = batch[:-1], batch[-1]
                z = self.model(inputs)

                # Store outputs
                y_prob = torch.sigmoid(z).cpu().numpy()
                y_probs.extend(y_prob)

        return np.vstack(y_probs)

    def train(self, num_epochs, patience, train_dataloader, val_dataloader):
        best_val_loss = np.inf
        for epoch in range(num_epochs):
            # Steps
            train_loss = self.train_step(dataloader=train_dataloader)
            val_loss, _, _ = self.eval_step(dataloader=val_dataloader)
            self.scheduler.step(val_loss)

            # Early stopping
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_model = self.model
                _patience = patience  # reset _patience
            else:
                _patience -= 1
            if not _patience:  # 0
                print("Stopping early!")
                break

            # Logging
            print(
                f"Epoch: {epoch+1} | "
                f"train_loss: {train_loss:.5f}, "
                f"val_loss: {val_loss:.5f}, "
                f"lr: {self.optimizer.param_groups[0]['lr']:.2E}, "
                f"_patience: {_patience}"
            )
        return best_model
```

Note: our dataset is small, so we'll train using the whole dataset, but for larger datasets we should always test on a small subset (after shuffling when necessary) so we aren't wasting time on compute.
Here's how we can easily do this:

```python
# Shuffling since projects are chronologically organized
if shuffle:
    df = df.sample(frac=1).reset_index(drop=True)

# Subset
if num_samples:
    df = df[:num_samples]
```

## Random

motivation: we want to know what random (chance) performance looks like. All of our subsequent baselines should perform better than this.

```python
# Set seeds
set_seeds()
```

```python
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
print (f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print (f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print (f"X_test: {X_test.shape}, y_test: {y_test.shape}")
```

```
X_train: (1000,), y_train: (1000, 35)
X_val: (227,), y_val: (227, 35)
X_test: (217,), y_test: (217, 35)
```

```python
# Label encoder
print (label_encoder)
print (label_encoder.classes)
```

```
<LabelEncoder(num_classes=35)>
['attention', 'autoencoders', 'computer-vision', 'convolutional-neural-networks', 'data-augmentation', 'embeddings', 'flask', 'generative-adversarial-networks', 'graph-neural-networks', 'graphs', 'huggingface', 'image-classification', 'interpretability', 'keras', 'language-modeling', 'natural-language-processing', 'node-classification', 'object-detection', 'pretraining', 'production', 'pytorch', 'question-answering', 'regression', 'reinforcement-learning', 'representation-learning', 'scikit-learn', 'segmentation', 'self-supervised-learning', 'tensorflow', 'tensorflow-js', 'time-series', 'transfer-learning', 'transformers', 'unsupervised-learning', 'wandb']
```

```python
# Generate random predictions
y_pred = np.random.randint(low=0, high=2, size=(len(y_test), len(label_encoder.classes)))
print (y_pred.shape)
print (y_pred[0:5])
```

```
(217, 35)
[[0 0 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 1]
 [0 1 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 1]
 [1 1 0 1 0 0 0 1 1 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 1 0 1 1 0 0 1 0 0 1 1]
 [0 1 1 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 1 0 0 0 1 0 0 1 1 0]
 [0 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 1 1 0]]
```

```python
# Evaluate
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print (json.dumps(performance, indent=2))
```

```
{
  "precision": 0.12590604458654545,
  "recall": 0.5203426124197003,  ← as expected to be ~0.5
  "f1": 0.18469743862395557
}
```

We made the assumption that there is an equal probability for whether an input has a tag or not, but this isn't true. Let's use the train split to figure out what the true probability is.

```python
# Percentage of 1s (tag presence)
tag_p = np.sum(np.sum(y_train)) / (len(y_train) * len(label_encoder.classes))
print (tag_p)
```

```
0.06291428571428571
```

```python
# Generate weighted random predictions
y_pred = np.random.choice(
    np.arange(0, 2), size=(len(y_test), len(label_encoder.classes)),
    p=[1-tag_p, tag_p])
```

```python
# Validate percentage
np.sum(np.sum(y_pred)) / (len(y_pred) * len(label_encoder.classes))
```

```
0.06240947992100066
```

```python
# Evaluate
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print (json.dumps(performance, indent=2))
```

```
{
  "precision": 0.1121905967477629,
  "recall": 0.047109207708779445,
  "f1": 0.05309836327850377
}
```

limitations: we didn't use any of the signals from our inputs to affect our predictions, so nothing was learned.

## Rule-based

motivation: we want to use signals in our inputs (along with domain expertise and auxiliary data) to determine the labels.
```python
# Set seeds
set_seeds()
```

### Unstemmed

```python
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
```

```python
# Restrict to relevant tags
print (len(tags_dict))
tags_dict = {tag: tags_dict[tag] for tag in label_encoder.classes}
print (len(tags_dict))
```

```
400
35
```

```python
# Map aliases
aliases = {}
for tag, values in tags_dict.items():
    aliases[preprocess(tag)] = tag
    for alias in values["aliases"]:
        aliases[preprocess(alias)] = tag
aliases
```

```
{'ae': 'autoencoders',
 'attention': 'attention',
 'autoencoders': 'autoencoders',
 'cnn': 'convolutional-neural-networks',
 'computer vision': 'computer-vision',
 ...
 'unsupervised learning': 'unsupervised-learning',
 'vision': 'computer-vision',
 'wandb': 'wandb',
 'weights biases': 'wandb'}
```

```python
def get_classes(text, aliases, tags_dict):
    """If a token matches an alias, then add the
    corresponding tag class (and parent tags if any)."""
    classes = []
    for alias, tag in aliases.items():
        if alias in text:
            # Add tag
            classes.append(tag)
            # Add parent tags
            for parent in tags_dict[tag]["parents"]:
                classes.append(parent)
    return list(set(classes))
```

```python
# Sample
text = "This project extends gans for data augmentation specifically for object detection tasks."
get_classes(text=preprocess(text), aliases=aliases, tags_dict=tags_dict)
```

```
['object-detection', 'data-augmentation', 'generative-adversarial-networks', 'computer-vision']
```

```python
# Prediction
y_pred = []
for text in X_test:
    classes = get_classes(text, aliases, tags_dict)
    y_pred.append(classes)
```

```python
# Encode labels
y_pred = label_encoder.encode(y_pred)
```

```python
# Evaluate
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print (json.dumps(performance, indent=2))
```

```
{
  "precision": 0.8527917293434535,
  "recall": 0.38066760941576216,
  "f1": 0.48975323243320396,
  "num_samples": 480.0
}
```

```python
# Inspection
tag = "transformers"
print (json.dumps(performance["class"][tag], indent=2))
```

```
{
  "precision": 0.886542414851697,
  "recall": 0.430406852248394,
  "f1": 0.556927275918014
}
```

### Stemmed

We're looking for exact matches with the aliases, which isn't always perfect, for example:

```python
print (aliases[preprocess("gan")])
# print (aliases[preprocess('gans')])  # this won't find any match
print (aliases[preprocess("generative adversarial networks")])
# print (aliases[preprocess('generative adversarial network')])  # this won't find any match
```

```
generative-adversarial-networks
generative-adversarial-networks
```

We don't want to keep adding explicit rules, but we can use stemming to represent different forms of a word uniformly, for example:

```python
print (porter.stem("democracy"))
print (porter.stem("democracies"))
```

```
democraci
democraci
```

So let's now stem our aliases as well as the tokens in our input text and then look for matches.
```python
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
```

```python
# Map aliases
aliases = {}
for tag, values in tags_dict.items():
    aliases[preprocess(tag, stem=True)] = tag
    for alias in values["aliases"]:
        aliases[preprocess(alias, stem=True)] = tag
aliases
```

```
{'ae': 'autoencoders',
 'attent': 'attention',
 'autoencod': 'autoencoders',
 'cnn': 'convolutional-neural-networks',
 'comput vision': 'computer-vision',
 ...
 'vision': 'computer-vision',
 'wandb': 'wandb',
 'weight bias': 'wandb'}
```

```python
# Checks (we will write proper tests soon)
print (aliases[preprocess("gan", stem=True)])
print (aliases[preprocess("gans", stem=True)])
print (aliases[preprocess("generative adversarial network", stem=True)])
print (aliases[preprocess("generative adversarial networks", stem=True)])
```

```
generative-adversarial-networks
generative-adversarial-networks
generative-adversarial-networks
generative-adversarial-networks
```

We'll write proper tests for all of these functions when we move our code to Python scripts.

```python
# Sample
text = "This project extends gans for data augmentation specifically for object detection tasks."
get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)
```

```
['object-detection', 'data-augmentation', 'generative-adversarial-networks', 'computer-vision']
```

```python
# Prediction
y_pred = []
for text in X_test:
    classes = get_classes(text, aliases, tags_dict)
    y_pred.append(classes)
```

```python
# Encode labels
y_pred = label_encoder.encode(y_pred)
```

### Evaluation

We can look at overall and per-class performance on our test set. When comparing overall and per-class performance across different models, we should be aware of Simpson's paradox, where a model can perform better on every class subset but not overall.
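To make the paradox concrete, here's a toy numeric example (the counts are purely illustrative, not from our models) where one model has higher precision on every class yet lower micro-averaged precision, because the two models distribute their predictions across classes differently:

```python
# (true positives, total predictions) per class; illustrative counts only
model_a = {"cv": (9, 10), "nlp": (50, 100)}
model_b = {"cv": (80, 100), "nlp": (4, 10)}

def precision(tp, n):
    return tp / n

def micro_precision(model):
    """Aggregate TPs and predictions across classes before dividing."""
    tp = sum(tp for tp, _ in model.values())
    n = sum(n for _, n in model.values())
    return tp / n

# Per class, model A wins: 0.9 > 0.8 (cv) and 0.5 > 0.4 (nlp).
# Micro-averaged, model B wins overall: 84/110 ≈ 0.76 vs. 59/110 ≈ 0.54.
```

Model B makes most of its predictions on the class where it is strongest, so the aggregated numbers flip the per-class ordering.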
```python
# Evaluate
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print (json.dumps(performance, indent=2))
```

```
{
  "precision": 0.907266867724384,
  "recall": 0.485838779956427,
  "f1": 0.6120705676738784
}
```

Why did we achieve very high precision at the expense of low recall? Rule-based approaches can yield labels with high certainty when there is an absolute condition match (high precision), but they fail to generalize or learn implicit patterns to capture the rest of the cases (low recall).

### Inference

```python
# Infer
text = "Transfer learning with transformers for self-supervised learning"
print (preprocess(text, stem=True))
get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)
```

```
transfer learn transform self supervis learn
['self-supervised-learning', 'transfer-learning', 'transformers', 'natural-language-processing']
```

Now let's see what happens when we replace the word transformers with BERT. Sure, we could add this as an alias, but doing these kinds of ad-hoc updates can quickly add overhead. This is where it makes sense to learn from the data as opposed to creating explicit rules.

```python
# Infer
text = "Transfer learning with BERT for self-supervised learning"
print (preprocess(text, stem=True))
get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)
```

```
transfer learn bert self supervis learn
['self-supervised-learning', 'transfer-learning']
```

limitations: we failed to generalize or learn any implicit patterns to predict the labels because we treat the tokens in our input as isolated entities. We would ideally spend more time tuning our model since it's so simple and quick to train; this applies to all the other models we'll look at as well.
## Simple ML

motivation:

• representation: use term frequency-inverse document frequency (TF-IDF) to capture the significance of a token to a particular input with respect to all the inputs, as opposed to treating the words in our input text as isolated tokens.
• architecture: we want our model to meaningfully extract the encoded signal to predict the output labels.

So far we've treated the words in our input text as isolated tokens and we haven't really captured any meaning between tokens. Let's use TF-IDF to capture the significance of a token to a particular input with respect to all the inputs.

$$w_{i,j} = \text{tf}_{i,j} \cdot \log\left(\frac{N}{\text{df}_i}\right)$$

| Variable | Description |
| --- | --- |
| $w_{i,j}$ | tf-idf weight for term $i$ in document $j$ |
| $\text{tf}_{i,j}$ | # of times term $i$ appears in document $j$ |
| $N$ | total # of documents |
| $\text{df}_i$ | # of documents with term $i$ |
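To sanity-check the formula, we can compute one weight by hand on a toy corpus (the documents here are purely illustrative). Note that scikit-learn's TfidfVectorizer, which we use below, applies a smoothed variant of this formula plus L2 normalization, so its exact values will differ:

```python
import math

# Toy corpus of N documents (illustrative only)
docs = [
    "cnn for image classification",
    "transformers for nlp",
    "cnn feature extraction",
    "nlp with transformers",
]
N = len(docs)

def tfidf(term, doc):
    tf = doc.split().count(term)               # tf_{i,j}: count of term i in document j
    df = sum(term in d.split() for d in docs)  # df_i: # of documents containing term i
    return tf * math.log(N / df)

# "cnn" appears once in docs[0] and in 2 of the 4 documents:
# w = 1 * log(4/2) = log(2) ≈ 0.693
w = tfidf("cnn", docs[0])
```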

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
```

```python
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import precision_recall_curve
from sklearn.preprocessing import MultiLabelBinarizer
```

```python
# Set seeds
set_seeds()
```

```python
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
```

```python
# Tf-idf
vectorizer = TfidfVectorizer()
print (X_train[0])
X_train = vectorizer.fit_transform(X_train)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(X_test)
print (X_train.shape)
print (X_train[0])  # scipy.sparse.csr_matrix
```

```
albument fast imag augment librari easi use wrapper around librari
(1000, 2654)
  (0, 190)	0.34307733697679055
  (0, 2630)	0.3991510203964918
  (0, 2522)	0.14859192074955896
  (0, 728)	0.29210630687446
  (0, 1356)	0.4515371929370289
  (0, 217)	0.2870036535570893
  (0, 1157)	0.18851186612963625
  (0, 876)	0.31431481238098835
  (0, 118)	0.44156912440424356
```


```python
def fit_and_evaluate(model):
    """Fit and evaluate each model."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
    return {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
```

```python
# Models
performance = {}
performance["logistic-regression"] = fit_and_evaluate(OneVsRestClassifier(
    LogisticRegression(), n_jobs=1))
performance["k-nearest-neighbors"] = fit_and_evaluate(
    KNeighborsClassifier())
performance["random-forest"] = fit_and_evaluate(
    RandomForestClassifier(n_jobs=-1))
performance["gradient-boosting-machine"] = fit_and_evaluate(OneVsRestClassifier(
    GradientBoostingClassifier()))
performance["support-vector-machine"] = fit_and_evaluate(OneVsRestClassifier(
    LinearSVC(), n_jobs=-1))
print (json.dumps(performance, indent=2))
```

```
{
  "logistic-regression": {
    "precision": 0.633369022127052,
    "recall": 0.21841541755888652,
    "f1": 0.3064204603390899
  },
  "k-nearest-neighbors": {
    "precision": 0.7410281119097024,
    "recall": 0.47109207708779444,
    "f1": 0.5559182508714337
  },
  "random-forest": {
    "precision": 0.7722866712160075,
    "recall": 0.38329764453961457,
    "f1": 0.4852512297132596
  },
  "gradient-boosting-machine": {
    "precision": 0.8503271303309295,
    "recall": 0.6167023554603854,
    "f1": 0.7045318461336975
  },
  "support-vector-machine": {
    "precision": 0.8938397993500261,
    "recall": 0.5460385438972163,
    "f1": 0.6527334570244009
  }
}
```


limitations:

• representation: TF-IDF representations don't encapsulate much signal beyond frequency; we need more fine-grained token representations.
• architecture: we want to develop models that can use better-represented encodings in a more contextual manner.

## Distributed training

All the training we need to do for our application happens on one worker with one accelerator (GPU). However, we'll want to consider distributed training for very large models or when dealing with large datasets. Distributed training can involve:

• data parallelism: workers receive different slices of the larger dataset.
  • synchronous training uses AllReduce to aggregate gradients and update all of the workers' weights at the end of each batch.
  • asynchronous training uses a universal parameter server to update weights as each worker trains on its slice of data.
• model parallelism: all workers use the same dataset but the model is split amongst them (more difficult to implement than data parallelism because it's difficult to isolate and combine signal from backpropagation).

There are lots of options for applying distributed training, such as PyTorch's distributed package, Ray, Horovod, etc.
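To build intuition for the synchronous strategy, here's a minimal single-process simulation of data parallelism (a sketch only; a real setup would use e.g. PyTorch's DistributedDataParallel across actual workers): averaging the per-worker gradients, which is what AllReduce computes, reproduces the full-batch gradient:

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(8, 3), torch.randn(8, 1)
w = torch.zeros(3, 1, requires_grad=True)

def grad_on(X_slice, y_slice):
    """Gradient of the MSE loss on one worker's slice of the data."""
    loss = ((X_slice @ w - y_slice) ** 2).mean()
    return torch.autograd.grad(loss, w)[0]

# Data parallelism: two "workers" each compute gradients on half the batch
grads = [grad_on(X[:4], y[:4]), grad_on(X[4:], y[4:])]

# AllReduce: average the gradients across workers before the weight update
g_sync = torch.stack(grads).mean(dim=0)

# This matches the gradient computed on the full batch at once
g_full = grad_on(X, y)
```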

## Optimization

Distributed training strategies are great for when our data or models are too large to train on a single machine, but what about when our models are too large to deploy? The following model compression techniques are commonly used to make large models fit within existing infrastructure:

• Pruning: remove weights (unstructured) or entire channels (structured) to reduce the size of the network. The objective is to preserve the model’s performance while increasing its sparsity.
• Quantization: reduce the memory footprint of the weights by reducing their precision (ex. 32-bit to 8-bit). We may lose some precision, but it shouldn't affect performance too much.
• Distillation: train smaller networks to "mimic" larger networks by having them reproduce the larger network's layers' outputs.
Distilling the knowledge in a neural network [source]
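As a quick illustration of quantization, PyTorch supports post-training dynamic quantization out of the box (a sketch on a toy model, not our actual baseline; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A toy two-layer model (arbitrary sizes, not our baseline)
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 35))

# Dynamic quantization: linear weights are stored as int8 and
# activations are quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
y_fp32 = model(x)
y_int8 = quantized(x)
# Outputs stay close while weight storage drops roughly 4x (32-bit → 8-bit)
```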

## CNN w/ Embeddings

motivation:

• representation: we want to have more robust (split tokens to characters) and meaningful embedding-based representations for our input tokens.
• architecture: we want to process our encoded inputs using convolution (CNN) filters that can learn to analyze windows of embedded tokens to extract meaningful signal.

### Set up

We'll set up the task by setting seeds for reproducibility, creating our data splits and setting the device.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
```

```python
# Set seeds
set_seeds()
```

```python
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
X_test_raw = X_test  # use for later
```

```python
# Split DataFrames
train_df = pd.DataFrame({"text": X_train, "tags": label_encoder.decode(y_train)})
val_df = pd.DataFrame({"text": X_val, "tags": label_encoder.decode(y_val)})
test_df = pd.DataFrame({"text": X_test, "tags": label_encoder.decode(y_test)})
```

```python
# Set device
cuda = True
device = torch.device("cuda" if (
    torch.cuda.is_available() and cuda) else "cpu")
torch.set_default_tensor_type("torch.FloatTensor")
if device.type == "cuda":
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
print (device)
```

```
cuda
```


### Tokenizer

We're going to tokenize our input text as character tokens so we can be robust to spelling errors and learn to generalize across tags (ex. learning that RoBERTa, or any other future BERT-based architecture, warrants the same tag as BERT).

```python
class Tokenizer(object):
    def __init__(self, char_level, num_tokens=None,
                 pad_token="<PAD>", oov_token="<UNK>",
                 token_to_index=None):
        self.char_level = char_level
        self.separator = "" if self.char_level else " "
        if num_tokens:
            num_tokens -= 2  # pad + unk tokens
        self.num_tokens = num_tokens
        self.pad_token = pad_token
        self.oov_token = oov_token
        if not token_to_index:
            token_to_index = {pad_token: 0, oov_token: 1}
        self.token_to_index = token_to_index
        self.index_to_token = {v: k for k, v in self.token_to_index.items()}

    def __len__(self):
        return len(self.token_to_index)

    def __str__(self):
        return f"<Tokenizer(num_tokens={len(self)})>"

    def fit_on_texts(self, texts):
        if not self.char_level:
            texts = [text.split(" ") for text in texts]
        all_tokens = [token for text in texts for token in text]
        counts = Counter(all_tokens).most_common(self.num_tokens)
        self.min_token_freq = counts[-1][1]
        for token, count in counts:
            index = len(self)
            self.token_to_index[token] = index
            self.index_to_token[index] = token
        return self

    def texts_to_sequences(self, texts):
        sequences = []
        for text in texts:
            if not self.char_level:
                text = text.split(" ")
            sequence = []
            for token in text:
                sequence.append(self.token_to_index.get(
                    token, self.token_to_index[self.oov_token]))
            sequences.append(np.asarray(sequence))
        return sequences

    def sequences_to_texts(self, sequences):
        texts = []
        for sequence in sequences:
            text = []
            for index in sequence:
                text.append(self.index_to_token.get(index, self.oov_token))
            texts.append(self.separator.join([token for token in text]))
        return texts

    def save(self, fp):
        with open(fp, "w") as fp:
            contents = {
                "char_level": self.char_level,
                "oov_token": self.oov_token,
                "token_to_index": self.token_to_index
            }
            json.dump(contents, fp, indent=4, sort_keys=False)

    @classmethod
    def load(cls, fp):
        with open(fp, "r") as fp:
            kwargs = json.load(fp=fp)
        return cls(**kwargs)
```
```python
# Tokenize
char_level = True
tokenizer = Tokenizer(char_level=char_level)
tokenizer.fit_on_texts(texts=X_train)
vocab_size = len(tokenizer)
print (tokenizer)
```

```
<Tokenizer(num_tokens=39)>
```

```python
tokenizer.token_to_index
```

```
{' ': 2,
 '0': 30,
 '1': 31,
 '2': 26,
 ...
 '<UNK>': 1,
 ...
 'x': 25,
 'y': 21,
 'z': 27}
```

```python
# Convert texts to sequences of indices
X_train = np.array(tokenizer.texts_to_sequences(X_train))
X_val = np.array(tokenizer.texts_to_sequences(X_val))
X_test = np.array(tokenizer.texts_to_sequences(X_test))
preprocessed_text = tokenizer.sequences_to_texts([X_train[0]])[0]
print ("Text to indices:\n"
    f"  (preprocessed) → {preprocessed_text}\n"
    f"  (tokenized) → {X_train[0]}")
```

```
Text to indices:
  (preprocessed) → hugging face achieved 2x performance boost qa question answering distilbert node js
  (tokenized) → [18 17 15 15  4  5 15  2 19  7 12  3  2  7 12 18  4  3 22  3 14  2 26 25
  2 13  3  8 19 10  8 16  7  5 12  3  2 20 10 10  9  6  2 30  7  2 30 17
  3  9  6  4 10  5  2  7  5  9 23  3  8  4  5 15  2 14  4  9  6  4 11 20
  3  8  6  2  5 10 14  3  2 28  9]
```


### Data imbalance

We'll factor class weights into our objective function (binary cross entropy with logits) to help with class imbalance. There are many other techniques, such as oversampling from underrepresented classes, undersampling, etc., but we'll cover these in a separate lesson on data imbalance.

```python
# Class weights
counts = np.bincount([label_encoder.class_to_index[class_] for class_ in all_tags])
class_weights = {i: 1.0/count for i, count in enumerate(counts)}
print (f"class counts: {counts},\nclass weights: {class_weights}")
```

```
class counts: [120  41 388 106  41  75  34  73  51  78  64  51  55  93  51 429  33  69
  30  51 258  32  49  59  57  60  48  40 213  40  34  46 196  39  39],
class weights: {0: 0.008333333333333333, 1: 0.024390243902439025, 2: 0.002577319587628866, 3: 0.009433962264150943, 4: 0.024390243902439025, 5: 0.013333333333333334, 6: 0.029411764705882353, 7: 0.0136986301369863, 8: 0.0196078431372549, 9: 0.01282051282051282, 10: 0.015625, 11: 0.0196078431372549, 12: 0.01818181818181818, 13: 0.010752688172043012, 14: 0.0196078431372549, 15: 0.002331002331002331, 16: 0.030303030303030304, 17: 0.014492753623188406, 18: 0.03333333333333333, 19: 0.0196078431372549, 20: 0.003875968992248062, 21: 0.03125, 22: 0.02040816326530612, 23: 0.01694915254237288, 24: 0.017543859649122806, 25: 0.016666666666666666, 26: 0.020833333333333332, 27: 0.025, 28: 0.004694835680751174, 29: 0.025, 30: 0.029411764705882353, 31: 0.021739130434782608, 32: 0.00510204081632653, 33: 0.02564102564102564, 34: 0.02564102564102564}
```
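These weights can then be passed to the objective so that rare classes contribute more to the loss; here's a minimal sketch (with illustrative three-class counts rather than our actual 35) using the `weight` argument of `BCEWithLogitsLoss`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative counts for 3 classes (not our actual tag counts)
counts = torch.tensor([120., 41., 388.])
class_weights = 1.0 / counts  # inverse-frequency weights, as above

# The weight tensor rescales each class's term in the binary cross entropy
loss_fn = nn.BCEWithLogitsLoss(weight=class_weights)
logits = torch.tensor([[0.2, -1.3, 0.8]])
targets = torch.tensor([[1., 0., 1.]])
loss = loss_fn(logits, targets)

# Equivalent to weighting the unreduced per-class losses manually
manual = (class_weights * F.binary_cross_entropy_with_logits(
    logits, targets, reduction="none")).mean()
```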


### Datasets

We're going to place our data into a Dataset and use a DataLoader to efficiently create batches for training and evaluation.

```python
def pad_sequences(sequences, max_seq_len=0):
    """Pad sequences to max length in sequence."""
    max_seq_len = max(max_seq_len, max(len(sequence) for sequence in sequences))
    padded_sequences = np.zeros((len(sequences), max_seq_len))
    for i, sequence in enumerate(sequences):
        padded_sequences[i][:len(sequence)] = sequence
    return padded_sequences
```

```python
class CNNTextDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, max_filter_size):
        self.X = X
        self.y = y
        self.max_filter_size = max_filter_size

    def __len__(self):
        return len(self.y)

    def __str__(self):
        return f"<Dataset(N={len(self)})>"

    def __getitem__(self, index):
        X = self.X[index]
        y = self.y[index]
        return [X, y]

    def collate_fn(self, batch):
        """Processing on a batch."""
        # Get inputs
        batch = np.array(batch)
        X = batch[:, 0]
        y = batch[:, 1]

        # Pad inputs
        X = pad_sequences(sequences=X, max_seq_len=self.max_filter_size)

        # Cast
        X = torch.LongTensor(X.astype(np.int32))
        y = torch.FloatTensor(y.astype(np.int32))

        return X, y

    def create_dataloader(self, batch_size, shuffle=False, drop_last=False):
        return torch.utils.data.DataLoader(
            dataset=self, batch_size=batch_size, collate_fn=self.collate_fn,
            shuffle=shuffle, drop_last=drop_last, pin_memory=True)
```

```python
# Create datasets
filter_sizes = list(range(1, 11))
train_dataset = CNNTextDataset(
    X=X_train, y=y_train, max_filter_size=max(filter_sizes))
val_dataset = CNNTextDataset(
    X=X_val, y=y_val, max_filter_size=max(filter_sizes))
test_dataset = CNNTextDataset(
    X=X_test, y=y_test, max_filter_size=max(filter_sizes))
print ("Data splits:\n"
    f"  Train dataset:{train_dataset.__str__()}\n"
    f"  Val dataset: {val_dataset.__str__()}\n"
    f"  Test dataset: {test_dataset.__str__()}\n"
    "Sample point:\n"
    f"  X: {train_dataset[0][0]}\n"
    f"  y: {train_dataset[0][1]}")
```

```
Data splits:
  Train dataset: <Dataset(N=1000)>
  Val dataset: <Dataset(N=227)>
  Test dataset: <Dataset(N=217)>
Sample point:
  X: [ 7 11 20 17 16  3  5  6  7  6  4 10  5  9  2 19  7  9  6  2  4 16  7 14
  3  2  7 17 14 16  3  5  6  7  6  4 10  5  2 11  4 20  8  7  8 21  2  3
  7  9 21  2 17  9  3  2 23  8  7 13 13  3  8  2  7  8 10 17  5 15  2 11
  4 20  8  7  8  4  3  9]
  y: [0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
```

```python
# Create dataloaders
batch_size = 64
train_dataloader = train_dataset.create_dataloader(
    batch_size=batch_size)
val_dataloader = val_dataset.create_dataloader(
    batch_size=batch_size)
test_dataloader = test_dataset.create_dataloader(
    batch_size=batch_size)
batch_X, batch_y = next(iter(train_dataloader))
print ("Sample batch:\n"
    f"  X: {list(batch_X.size())}\n"
    f"  y: {list(batch_y.size())}")
```
```
Sample batch:
  X: [64, 186]
  y: [64, 35]
```


### Model

We'll be using a convolutional neural network on top of our embedded tokens to extract meaningful spatial signal. This time, we'll use many filter widths to act as n-gram feature extractors. If you're not familiar with CNNs, be sure to check out the CNN lesson where we walk through every component of the architecture.

Let's visualize the model's forward pass.

1. We'll first tokenize our inputs (batch_size, max_seq_len).
2. Then we'll embed our tokenized inputs (batch_size, max_seq_len, embedding_dim).
3. We'll apply convolution via filters (filter_size, embedding_dim, num_filters). Our filters act as character-level n-gram detectors: since our filter sizes range from 1 to 10, they behave as uni-gram through 10-gram feature extractors, respectively.
4. We'll apply 1D global max pooling which will extract the most relevant information from the feature maps for making the decision.
5. We feed the pool outputs to a fully-connected (FC) layer (with dropout).
6. We use one more FC layer with softmax to derive class probabilities.

```python
# Arguments
embedding_dim = 128
num_filters = 128
hidden_dim = 128
dropout_p = 0.5
```
```python
class CNN(nn.Module):
    def __init__(self, embedding_dim, vocab_size, num_filters,
                 filter_sizes, hidden_dim, dropout_p, num_classes,
                 padding_idx=0):
        super(CNN, self).__init__()

        # Initialize embeddings
        self.embeddings = nn.Embedding(
            embedding_dim=embedding_dim, num_embeddings=vocab_size,
            padding_idx=padding_idx)

        # Conv weights
        self.filter_sizes = filter_sizes
        self.conv = nn.ModuleList(
            [nn.Conv1d(in_channels=embedding_dim,
                       out_channels=num_filters,
                       kernel_size=f) for f in filter_sizes])

        # FC weights
        self.dropout = nn.Dropout(dropout_p)
        self.fc1 = nn.Linear(num_filters*len(filter_sizes), hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, inputs, channel_first=False):

        # Embed
        x_in, = inputs
        x_in = self.embeddings(x_in)
        if not channel_first:
            x_in = x_in.transpose(1, 2)  # (N, channels, sequence length)

        z = []
        max_seq_len = x_in.shape[2]
        for i, f in enumerate(self.filter_sizes):
            # SAME padding
            padding_left = int(
                (self.conv[i].stride[0]*(max_seq_len-1) - max_seq_len + self.filter_sizes[i])/2)
            padding_right = int(math.ceil(
                (self.conv[i].stride[0]*(max_seq_len-1) - max_seq_len + self.filter_sizes[i])/2))

            # Conv
            _z = self.conv[i](F.pad(x_in, (padding_left, padding_right)))

            # Pool
            _z = F.max_pool1d(_z, _z.size(2)).squeeze(2)
            z.append(_z)

        # Concat conv outputs
        z = torch.cat(z, 1)

        # FC layers
        z = self.fc1(z)
        z = self.dropout(z)
        z = self.fc2(z)
        return z
```

• VALID: no padding, the filters only use the "valid" values in the input. If the filter cannot reach all the input values (filters go left to right), the extra values on the right are dropped.
• SAME: adds padding evenly to the right (preferred) and left sides of the input so that all values in the input are processed.
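A quick way to see the difference between the two, assuming stride 1 and the standard 1D convolution output-width formula (the function name is our own):

```python
def conv1d_out_width(W, F, S=1, P=0):
    """Output width of a 1D convolution: floor((W - F + 2P) / S) + 1."""
    return (W - F + 2 * P) // S + 1

# VALID (no padding): the output shrinks by F - 1
print(conv1d_out_width(10, 3))        # 8
# SAME (with total padding 2P = 2 for F = 3, S = 1): output width matches input
print(conv1d_out_width(10, 3, P=1))   # 10
```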

We add SAME padding so that the convolutional outputs have the same width as our inputs. The amount of padding can be determined from the convolution's output-width equation: we want the output width to equal the input width $W$, so we solve for $P$:

$\frac{W-F+2P}{S} + 1 = W$
$P = \frac{S(W-1) - W + F}{2}$

If $P$ is not a whole number, we round up (using math.ceil) and place the extra padding on the right side.
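As a sanity check, here's a minimal sketch of this computation (the helper name is our own), including the uneven case where the extra column of padding goes on the right:

```python
import math

def same_padding(W, F, S=1):
    """Left/right padding so a 1D conv output width equals the input width W.

    Solves (W - F + 2P)/S + 1 = W  =>  P = (S*(W-1) - W + F) / 2.
    If P is fractional, the extra padding goes on the right.
    """
    p = (S * (W - 1) - W + F) / 2
    return math.floor(p), math.ceil(p)

# Odd filter size: symmetric padding
print(same_padding(8, 3))   # (1, 1)
# Even filter size: asymmetric padding, extra on the right
print(same_padding(8, 4))   # (1, 2)
```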

```python
# Initialize model
model = CNN(
    embedding_dim=embedding_dim, vocab_size=vocab_size,
    num_filters=num_filters, filter_sizes=filter_sizes,
    hidden_dim=hidden_dim, dropout_p=dropout_p, num_classes=num_classes)
model = model.to(device)
print (model.named_parameters)
```
```
<bound method Module.named_parameters of CNN(
  (conv): ModuleList(
    (0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
    (1): Conv1d(128, 128, kernel_size=(2,), stride=(1,))
    (2): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
    (3): Conv1d(128, 128, kernel_size=(4,), stride=(1,))
    (4): Conv1d(128, 128, kernel_size=(5,), stride=(1,))
    (5): Conv1d(128, 128, kernel_size=(6,), stride=(1,))
    (6): Conv1d(128, 128, kernel_size=(7,), stride=(1,))
    (7): Conv1d(128, 128, kernel_size=(8,), stride=(1,))
    (8): Conv1d(128, 128, kernel_size=(9,), stride=(1,))
    (9): Conv1d(128, 128, kernel_size=(10,), stride=(1,))
  )
  (dropout): Dropout(p=0.5, inplace=False)
  (fc1): Linear(in_features=1280, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=35, bias=True)
)>
```


### Training

```python
# Arguments
lr = 2e-4
num_epochs = 100
patience = 10
```
```python
# Define loss
class_weights_tensor = torch.Tensor(np.array(list(class_weights.values())))
loss_fn = nn.BCEWithLogitsLoss(weight=class_weights_tensor)
```
```python
# Define optimizer & scheduler
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)
```
```python
# Trainer module
trainer = Trainer(
    model=model, device=device, loss_fn=loss_fn,
    optimizer=optimizer, scheduler=scheduler)
```
```python
# Train
best_model = trainer.train(
    num_epochs, patience, train_dataloader, val_dataloader)
```

```
Epoch: 1 | train_loss: 0.00624, val_loss: 0.00285, lr: 2.00E-04, _patience: 10
Epoch: 2 | train_loss: 0.00401, val_loss: 0.00283, lr: 2.00E-04, _patience: 10
Epoch: 3 | train_loss: 0.00362, val_loss: 0.00266, lr: 2.00E-04, _patience: 10
Epoch: 4 | train_loss: 0.00332, val_loss: 0.00263, lr: 2.00E-04, _patience: 10
...
Epoch: 49 | train_loss: 0.00061, val_loss: 0.00149, lr: 2.00E-05, _patience: 4
Epoch: 50 | train_loss: 0.00055, val_loss: 0.00159, lr: 2.00E-05, _patience: 3
Epoch: 51 | train_loss: 0.00056, val_loss: 0.00152, lr: 2.00E-05, _patience: 2
Epoch: 52 | train_loss: 0.00057, val_loss: 0.00156, lr: 2.00E-05, _patience: 1
Stopping early!
```


### Evaluation

```python
from pathlib import Path
from sklearn.metrics import precision_recall_curve
```
```python
# Threshold-PR curve
train_loss, y_true, y_prob = trainer.eval_step(dataloader=train_dataloader)
precisions, recalls, thresholds = precision_recall_curve(y_true.ravel(), y_prob.ravel())
plt.plot(thresholds, precisions[:-1], "r--", label="Precision")
plt.plot(thresholds, recalls[:-1], "b-", label="Recall")
plt.ylabel("Performance")
plt.xlabel("Threshold")
plt.legend(loc="best")
```

```python
# Determining the best threshold
def find_best_threshold(y_true, y_prob):
    """Find the best threshold for maximum F1."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
    # Small epsilon guards against 0/0 when precision and recall are both zero
    f1s = (2 * precisions * recalls) / (precisions + recalls + 1e-10)
    return thresholds[np.argmax(f1s)]
```
```python
# Best threshold for f1
threshold = find_best_threshold(y_true.ravel(), y_prob.ravel())
threshold
```
0.23890994


How can we do better?

How can we improve on our process of identifying and using the appropriate threshold?

• Plot PR curves for all classes (not just overall) to ensure a certain global threshold doesn't deliver very poor performance for any particular class
• Determine different thresholds for different classes and use them during inference
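A minimal sketch of the second idea, assuming a multi-label setup with `y_true` and `y_prob` as (N, num_classes) arrays (a simple grid search stands in here for running sklearn's `precision_recall_curve` per column):

```python
import numpy as np

def per_class_thresholds(y_true, y_prob, candidates=np.linspace(0.05, 0.95, 19)):
    """For each class, pick the candidate threshold that maximizes that class's F1.

    y_true: (N, C) binary matrix, y_prob: (N, C) predicted probabilities.
    """
    num_classes = y_true.shape[1]
    thresholds = np.zeros(num_classes)
    for c in range(num_classes):
        best_f1, best_t = -1.0, 0.5
        for t in candidates:
            y_pred = (y_prob[:, c] >= t).astype(int)
            tp = int(((y_pred == 1) & (y_true[:, c] == 1)).sum())
            fp = int(((y_pred == 1) & (y_true[:, c] == 0)).sum())
            fn = int(((y_pred == 0) & (y_true[:, c] == 1)).sum())
            denom = 2 * tp + fp + fn
            f1 = 2 * tp / denom if denom else 0.0
            if f1 > best_f1:
                best_f1, best_t = f1, t
        thresholds[c] = best_t
    return thresholds
```

At inference, each class's column of `y_prob` would then be compared against its own threshold instead of a single global one.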

```python
# Determine predictions using threshold
test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob])
```
```python
# Evaluate
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print (json.dumps(performance, indent=2))
```

```
{
  "precision": 0.795787334399384,
  "recall": 0.5944206008583691,
  "f1": 0.6612833723992106
}
```

```python
# Save artifacts
dir = Path("cnn")
dir.mkdir(parents=True, exist_ok=True)
tokenizer.save(fp=Path(dir, "tokenzier.json"))
label_encoder.save(fp=Path(dir, "label_encoder.json"))
torch.save(best_model.state_dict(), Path(dir, "model.pt"))
with open(Path(dir, "performance.json"), "w") as fp:
    json.dump(performance, indent=2, sort_keys=False, fp=fp)
```

### Inference

```python
# Load artifacts
device = torch.device("cpu")
tokenizer = Tokenizer.load(fp=Path(dir, "tokenzier.json"))
label_encoder = LabelEncoder.load(fp=Path(dir, "label_encoder.json"))
model = CNN(
    embedding_dim=embedding_dim, vocab_size=vocab_size,
    num_filters=num_filters, filter_sizes=filter_sizes,
    hidden_dim=hidden_dim, dropout_p=dropout_p, num_classes=num_classes)
model.load_state_dict(torch.load(Path(dir, "model.pt"), map_location=device))
model.to(device)
```
```
CNN(
  (conv): ModuleList(
    (0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
    (1): Conv1d(128, 128, kernel_size=(2,), stride=(1,))
    (2): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
    (3): Conv1d(128, 128, kernel_size=(4,), stride=(1,))
    (4): Conv1d(128, 128, kernel_size=(5,), stride=(1,))
    (5): Conv1d(128, 128, kernel_size=(6,), stride=(1,))
    (6): Conv1d(128, 128, kernel_size=(7,), stride=(1,))
    (7): Conv1d(128, 128, kernel_size=(8,), stride=(1,))
    (8): Conv1d(128, 128, kernel_size=(9,), stride=(1,))
    (9): Conv1d(128, 128, kernel_size=(10,), stride=(1,))
  )
  (dropout): Dropout(p=0.5, inplace=False)
  (fc1): Linear(in_features=1280, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=35, bias=True)
)
```


```python
# Initialize trainer
trainer = Trainer(model=model, device=device)
```
```python
# Dataloader
text = "Transfer learning with BERT for self-supervised learning"
X = np.array(tokenizer.texts_to_sequences([preprocess(text)]))
y_filler = label_encoder.encode([np.array([label_encoder.classes[0]]*len(X))])
dataset = CNNTextDataset(
    X=X, y=y_filler, max_filter_size=max(filter_sizes))
dataloader = dataset.create_dataloader(
    batch_size=batch_size)
```
```python
# Inference
y_prob = trainer.predict_step(dataloader)
y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob])
label_encoder.decode(y_pred)
```

```
[['natural-language-processing',
  'self-supervised-learning',
  'transfer-learning',
  'transformers']]
```


This approach still has limitations:

• representation: embeddings are not contextual.
• architecture: extracting signal from encoded inputs is limited by filter widths.

Since we're dealing with simple architectures and fast training times, this is a good opportunity to explore hyperparameter tuning and k-fold cross validation.
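As a sketch of the k-fold idea (a minimal stand-in for sklearn's `KFold`; in practice we'd retrain the baseline on each split and report the mean and spread of f1):

```python
import numpy as np

def kfold_indices(n, k=5, seed=1234):
    """Yield (train_idx, val_idx) pairs for k-fold cross validation.

    Every sample appears in exactly one validation fold across the k splits.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n)
    folds = np.array_split(indices, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx
```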

We're going to proceed with the CNN-on-embeddings approach and optimize it, since its performance is quite similar to the contextualized embeddings via transformers approach but at a much lower cost.

```python
# Performance
with open(Path("cnn", "performance.json"), "r") as fp:
    cnn_performance = json.load(fp)
print (f'CNN: f1 = {cnn_performance["f1"]}')
```
CNN: f1 = 0.6612833723992106


This was just one run on one split so you'll want to experiment with k-fold cross validation to properly reach any conclusions about performance. Also make sure you take the time to tune these baselines since their training periods are quite fast (we can achieve f1 of 0.7 with just a bit of tuning for both CNN / Transformers). We'll cover optimization in a few lessons so you can replicate the process here on your own time. We should also benchmark on other important metrics as we iterate, not just precision and recall.

```python
# Size
print (f'CNN: {Path("cnn", "model.pt").stat().st_size/1000000:.1f} MB')
```
CNN: 4.3 MB


We'll consider other tradeoffs such as maintenance overhead, behavioral test performances, etc. as we develop.

Interpretability was not one of our requirements, but note that we could have tweaked the model outputs to deliver it. For example, since we used SAME padding for our CNN, we can use the activation scores to extract influential n-grams.

## Resources

To cite this lesson, please use:

```bibtex
@article{madewithml,
    author       = {Goku Mohandas},
    title        = {Baselines - Made With ML},
    howpublished = {\url{https://madewithml.com/}},
    year         = {2021}
}
```