# Modeling Baselines

Goku Mohandas

Motivating the use of baselines for iterative modeling.
Repository · Notebook


## Intuition

Baselines are simple benchmarks which pave the way for iterative development:

• Rapid experimentation via hyperparameter tuning thanks to low model complexity.
• Discovery of data issues, false assumptions, bugs in code, etc. since the model itself is not complex.
• Pareto's principle: we can achieve decent performance with minimal initial effort.

## Process

Here is the high level approach to establishing baselines:

1. Start with the simplest possible baseline to compare subsequent development with. This is often a random (chance) model.
2. Develop a rule-based approach (when possible) using IFTTT-style (if this, then that) logic, auxiliary data, etc.
3. Slowly add complexity by addressing limitations and motivating representations and model architectures.
4. Weigh tradeoffs (performance, latency, size, etc.) between performant baselines.
5. Revisit and iterate on baselines as your dataset grows.

Note

You can also baseline on your dataset. Instead of using a fixed dataset and iterating on the models, choose a good baseline model and iterate on the dataset (a small sketch follows this list):

• remove or fix data samples (FP, FN)
• prepare and transform features
• expand or consolidate classes
• incorporate auxiliary datasets
• identify unique slices to improve / upsample
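For example, here's a minimal sketch of a few of these dataset-iteration steps, assuming a pandas DataFrame `df` with `text` and `tags` columns like ours (the tag names and indices below are placeholders, not recommendations):

```python
import pandas as pd

# Sketch only: iterate on the dataset while keeping a fixed baseline model.

# Consolidate classes (placeholder example: fold a niche tag into a broader one)
df["tags"] = df["tags"].apply(
    lambda tags: ["parent-tag" if tag == "niche-tag" else tag for tag in tags])

# Remove samples flagged as labeling errors during FP/FN inspection
mislabeled_indices = [12, 47]  # placeholder indices from error analysis
df = df.drop(index=mislabeled_indices).reset_index(drop=True)

# Upsample a unique slice we want to improve on
slice_mask = df["text"].str.contains("time series", case=False)
df = pd.concat([df, df[slice_mask]]).reset_index(drop=True)
```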

When choosing what model architecture(s) to proceed with, there are a few important aspects to consider:

• performance: consider overall and fine-grained (ex. per-class) performance.
• latency: how quickly does your model respond during inference?
• size: how large is your model and can you support its storage?
• compute: how much will it cost ($, carbon footprint, etc.) to train your model?
• interpretability: does your model need to explain its predictions?
• bias checks: does your model pass key bias checks?
• time to develop: how long do you have to develop the first version?
• time to retrain: how long does it take to retrain your model? This is very important to consider if you need to retrain often.
• maintenance overhead: who and what will be required to maintain your model versions? The real work with ML begins after deploying v1; you can't just hand the model off to your site reliability team to maintain it the way many teams do with traditional software.

## Application

Each application's baseline trajectory varies based on the task and motivations. For our application, we're going to follow this path: random → rule-based → simple ML → CNN w/ embeddings → RNN w/ embeddings → Transformers w/ contextual embeddings.

We'll motivate the need for slowly adding complexity from both the representation (ex. embeddings) and architecture (ex. CNNs) views, as well as address the limitations at each step of the way.

Note

If you're unfamiliar with any of the concepts here, be sure to check out GokuMohandas/madewithml (🔥 among the top ML repos on GitHub).

We'll first set up some functions that we'll be using across the different baseline experiments.

```python
from sklearn.metrics import precision_recall_fscore_support
import torch
```

```python
def set_seeds(seed=1234):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # multi-GPU
```

```python
def get_data_splits(df, train_size=0.7):
    """Generate balanced train/val/test data splits."""
    # Get data
    X = df.text.to_numpy()
    y = df.tags

    # Binarize y
    label_encoder = LabelEncoder()
    label_encoder.fit(y)
    y = label_encoder.encode(y)

    # Split (train first, then equal val/test from the remainder)
    X_train, X_, y_train, y_ = iterative_train_test_split(
        X, y, train_size=train_size)
    X_val, X_test, y_val, y_test = iterative_train_test_split(
        X_, y_, train_size=0.5)

    return X_train, X_val, X_test, y_train, y_val, y_test, label_encoder
```

We'll define a Trainer object which we will use for training, validation and inference.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 class Trainer(object): def __init__(self, model, device, loss_fn=None, optimizer=None, scheduler=None): # Set params self.model = model self.device = device self.loss_fn = loss_fn self.optimizer = optimizer self.scheduler = scheduler def train_step(self, dataloader): """Train step.""" # Set model to train mode self.model.train() loss = 0.0 # Iterate over train batches for i, batch in enumerate(dataloader): # Step batch = [item.to(self.device) for item in batch] # Set device inputs, targets = batch[:-1], batch[-1] self.optimizer.zero_grad() # Reset gradients z = self.model(inputs) # Forward pass J = self.loss_fn(z, targets) # Define loss J.backward() # Backward pass self.optimizer.step() # Update weights # Cumulative Metrics loss += (J.detach().item() - loss) / (i + 1) return loss def eval_step(self, dataloader): """Validation or test step.""" # Set model to eval mode self.model.eval() loss = 0.0 y_trues, y_probs = [], [] # Iterate over val batches with torch.no_grad(): for i, batch in enumerate(dataloader): # Step batch = [item.to(self.device) for item in batch] # Set device inputs, y_true = batch[:-1], batch[-1] z = self.model(inputs) # Forward pass J = self.loss_fn(z, y_true).item() # Cumulative Metrics loss += (J - loss) / (i + 1) # Store outputs y_prob = torch.sigmoid(z).cpu().numpy() y_probs.extend(y_prob) y_trues.extend(y_true.cpu().numpy()) return loss, np.vstack(y_trues), np.vstack(y_probs) def predict_step(self, dataloader): """Prediction step.""" # Set model to eval mode self.model.eval() y_probs = [] # Iterate over val batches with torch.no_grad(): for i, batch in enumerate(dataloader): # Forward pass w/ inputs inputs, targets = batch[:-1], batch[-1] y_prob = self.model(inputs) # Store outputs y_probs.extend(y_prob) return np.vstack(y_probs) def train(self, num_epochs, patience, train_dataloader, val_dataloader): best_val_loss = np.inf for epoch in range(num_epochs): # Steps train_loss = self.train_step(dataloader=train_dataloader) val_loss, _, _ = self.eval_step(dataloader=val_dataloader) self.scheduler.step(val_loss) # Early stopping if val_loss < best_val_loss: best_val_loss = val_loss best_model = self.model _patience = patience # reset _patience else: _patience -= 1 if not _patience: # 0 print("Stopping early!") break # Logging print( f"Epoch: {epoch+1} | " f"train_loss: {train_loss:.5f}, " f"val_loss: {val_loss:.5f}, " f"lr: {self.optimizer.param_groups[0]['lr']:.2E}, " f"_patience: {_patience}" ) return best_model    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 def get_metrics(y_true, y_pred, classes): """Per-class performance metrics.""" # Performance performance = {"overall": {}, "class": {}} # Overall performance metrics = precision_recall_fscore_support(y_true, y_pred, average="weighted") performance["overall"]["precision"] = metrics[0] performance["overall"]["recall"] = metrics[1] performance["overall"]["f1"] = metrics[2] performance["overall"]["num_samples"] = np.float64(len(y_true)) # Per-class performance metrics = precision_recall_fscore_support(y_true, y_pred, average=None) for i in range(len(classes)): performance["class"][classes[i]] = { "precision": metrics[0][i], "recall": metrics[1][i], "f1": metrics[2][i], "num_samples": 
np.float64(metrics[3][i]), } return performance  Note Our dataset is small so we'll train using the whole dataset but for larger datasets, we should always test on a small subset (after shuffling when necessary) so we aren't wasting time on compute. Here's how you can easily do this:  1 2 3 4 5 6 7 # Shuffling since projects are chronologically organized if shuffle: df = df.sample(frac=1).reset_index(drop=True) # Subset if num_samples: df = df[:num_samples]  ## Random motivation: We want to know what random (chance) performance looks like. All of our efforts should be well above this.  1 2 # Set seeds set_seeds()   1 2 3 4 5 6 7 # Get data splits preprocessed_df = df.copy() preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True) X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df) print (f"X_train: {X_train.shape}, y_train: {y_train.shape}") print (f"X_val: {X_val.shape}, y_val: {y_val.shape}") print (f"X_test: {X_test.shape}, y_test: {y_test.shape}")  X_train: (1000,), y_train: (1000, 35) X_val: (227,), y_val: (227, 35) X_test: (217,), y_test: (217, 35)   1 2 3 # Label encoder print (label_encoder) print (label_encoder.classes)   ['attention', 'autoencoders', 'computer-vision', 'convolutional-neural-networks', 'data-augmentation', 'embeddings', 'flask', 'generative-adversarial-networks', 'graph-neural-networks', 'graphs', 'huggingface', 'image-classification', 'interpretability', 'keras', 'language-modeling', 'natural-language-processing', 'node-classification', 'object-detection', 'pretraining', 'production', 'pytorch', 'question-answering', 'regression', 'reinforcement-learning', 'representation-learning', 'scikit-learn', 'segmentation', 'self-supervised-learning', 'tensorflow', 'tensorflow-js', 'time-series', 'transfer-learning', 'transformers', 'unsupervised-learning', 'wandb']   1 2 3 4 # Generate random predictions y_pred = np.random.randint(low=0, high=2, size=(len(y_test), len(label_encoder.classes))) print (y_pred.shape) print (y_pred[0:5])  (217, 35) [[0 0 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 1] [0 1 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 1] [1 1 0 1 0 0 0 1 1 1 0 0 1 1 1 0 0 1 0 0 1 1 1 1 1 0 1 1 0 0 1 0 0 1 1] [0 1 1 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 0 0 1 1 1 0 1 1 0 0 0 1 0 0 1 1 0] [0 0 1 1 1 0 1 1 0 1 0 1 1 0 1 1 1 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 1 1 0]]   1 2 3 4 # Evaluate performance = get_metrics( y_true=y_test, y_pred=y_pred, classes=label_encoder.classes) print (json.dumps(performance['overall'], indent=2))  { "precision": 0.0662941602216066, "recall": 0.5065299488415251, "f1": 0.10819194263879019, "num_samples": 480.0 }  We made the assumption that there is an equal probability for whether an input has a tag or not but this isn't true. Let's use the train split to figure out what the true probability is.  
1 2 3 # Percentage of 1s (tag presence) tag_p = np.sum(np.sum(y_train)) / (len(y_train) * len(label_encoder.classes)) print (tag_p)  0.06291428571428571   1 2 3 4 # Generate weighted random predictions y_pred = np.random.choice( np.arange(0, 2), size=(len(y_test), len(label_encoder.classes)), p=[1-tag_p, tag_p])   1 2 # Validate percentage np.sum(np.sum(y_pred)) / (len(y_pred) * len(label_encoder.classes))  0.06240947992100066   1 2 3 4 # Evaluate performance = get_metrics( y_true=y_test, y_pred=y_pred, classes=label_encoder.classes) print (json.dumps(performance['overall'], indent=2))  { "precision": 0.060484184552507536, "recall": 0.053727634571230636, "f1": 0.048704498064854516, "num_samples": 480.0 }  limitations: we didn't use the tokens in our input to affect our predictions so nothing was learned. ## Rule-based motivation: we want to use signals in our inputs (along with domain expertise and auxiliary data) to determine the labels.  1 2 # Set seeds set_seeds()  ### Unstemmed  1 2 3 4 # Get data splits preprocessed_df = df.copy() preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True) X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)   1 2 3 4 # Restrict to relevant tags print (len(tags_dict)) tags_dict = {tag: tags_dict[tag] for tag in label_encoder.classes} print (len(tags_dict))  400 35   1 2 3 4 5 6 7 # Map aliases aliases = {} for tag, values in tags_dict.items(): aliases[preprocess(tag)] = tag for alias in values['aliases']: aliases[preprocess(alias)] = tag aliases  {'ae': 'autoencoders', 'attention': 'attention', 'autoencoders': 'autoencoders', 'cnn': 'convolutional-neural-networks', 'computer vision': 'computer-vision', ... 'unsupervised learning': 'unsupervised-learning', 'vision': 'computer-vision', 'wandb': 'wandb', 'weights biases': 'wandb'}    1 2 3 4 5 6 7 8 9 10 11 def get_classes(text, aliases, tags_dict): """If a token matches an alias, then add the corresponding tag class (and parent tags if any).""" classes = [] for alias, tag in aliases.items(): if alias in text: classes.append(tag) for parent in tags_dict[tag]["parents"]: classes.append(parent) return list(set(classes))   1 2 3 # Sample text = "This project extends gans for data augmentation specifically for object detection tasks." get_classes(text=preprocess(text), aliases=aliases, tags_dict=tags_dict)  ['object-detection', 'data-augmentation', 'generative-adversarial-networks', 'computer-vision']   1 2 3 4 5 # Prediction y_pred = [] for text in X_test: classes = get_classes(text, aliases, tags_dict) y_pred.append(classes)   1 2 # Encode labels y_pred = label_encoder.encode(y_pred)   1 2 3 # Evaluate performance = get_metrics(y_true=y_test, y_pred=y_pred, classes=label_encoder.classes) print (json.dumps(performance['overall'], indent=4))  { "precision": 0.8527917293434535, "recall": 0.38066760941576216, "f1": 0.48975323243320396, "num_samples": 480.0 }   1 2 3 # Inspection tag = "transformers" print (json.dumps(performance["class"][tag], indent=2))  { "precision": 1.0, "recall": 0.32, "f1": 0.48484848484848486, "num_samples": 25.0 }  ### Stemmed Before we do a more involved analysis, let's see if we can do better. 
We're looking for exact matches with the aliases which isn't always perfect, for example:  1 2 3 4 print (aliases[preprocess('gan')]) # print (aliases[preprocess('gans')]) # this won't find any match print (aliases[preprocess('generative adversarial networks')]) # print (aliases[preprocess('generative adversarial network')]) # this won't find any match  generative-adversarial-networks generative-adversarial-networks  We don't want to keep adding explicit rules but we can use stemming to represent different forms of a word uniformly, for example:  1 2 print (porter.stem("democracy")) print (porter.stem("democracies"))  democraci democraci  So let's now stem our aliases as well as the tokens in our input text and then look for matches.  1 2 3 4 # Get data splits preprocessed_df = df.copy() preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True) X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)   1 2 3 4 5 6 7 # Map aliases aliases = {} for tag, values in tags_dict.items(): aliases[preprocess(tag, stem=True)] = tag for alias in values['aliases']: aliases[preprocess(alias, stem=True)] = tag aliases  {'ae': 'autoencoders', 'attent': 'attention', 'autoencod': 'autoencoders', 'cnn': 'convolutional-neural-networks', 'comput vision': 'computer-vision', ... 'vision': 'computer-vision', 'wandb': 'wandb', 'weight bias': 'wandb'}   1 2 3 4 5 # Checks (we will write proper tests soon) print (aliases[preprocess('gan', stem=True)]) print (aliases[preprocess('gans', stem=True)]) print (aliases[preprocess('generative adversarial network', stem=True)]) print (aliases[preprocess('generative adversarial networks', stem=True)])  generative-adversarial-networks generative-adversarial-networks generative-adversarial-networks generative-adversarial-networks  We'll write proper tests for all of these functions when we move our code to Python scripts.  1 2 3 # Sample text = "This project extends gans for data augmentation specifically for object detection tasks." get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)  ['object-detection', 'data-augmentation', 'generative-adversarial-networks', 'computer-vision']   1 2 3 4 5 # Prediction y_pred = [] for text in X_test: classes = get_classes(text, aliases, tags_dict) y_pred.append(classes)   1 2 # Encode labels y_pred = label_encoder.encode(y_pred)  ### Evaluation We can look at overall and per-class performance on our test set. Note When considering overall and per-class performance across different models, we should be aware of Simpson's paradox where a model can perform better on every class subset but not overall.  
1 2 3 # Evaluate performance = get_metrics(y_true=y_test, y_pred=y_pred, classes=label_encoder.classes) print (json.dumps(performance['overall'], indent=4))  { "precision": 0.8405837971552256, "recall": 0.48656350456551384, "f1": 0.5794244643481148, "num_samples": 473.0 }   1 2 3 # Inspection tag = "transformers" print (json.dumps(performance["class"][tag], indent=2))  { "precision": 0.9285714285714286, "recall": 0.48148148148148145, "f1": 0.6341463414634146, "num_samples": 27.0 }    1 2 3 4 5 6 7 8 9 10 11 12 # TP, FP, FN samples index = label_encoder.class_to_index[tag] tp, fp, fn = [], [], [] for i in range(len(y_test)): true = y_test[i][index] pred = y_pred[i][index] if true and pred: tp.append(i) elif not true and pred: fp.append(i) elif true and not pred: fn.append(i)   1 2 3 print (tp) print (fp) print (fn)  [1, 14, 15, 28, 46, 54, 94, 160, 165, 169, 190, 194, 199] [49] [4, 18, 61, 63, 72, 75, 89, 99, 137, 141, 142, 163, 174, 206]   1 2 3 4 index = tp[0] print (X_test[index]) print (f"true: {label_encoder.decode([y_test[index]])[0]}") print (f"pred: {label_encoder.decode([y_pred[index]])[0]}\n")  insight project insight design creat nlp servic code base front end gui streamlit backend server fastapi usag transform true: ['attention', 'huggingface', 'natural-language-processing', 'pytorch', 'transfer-learning', 'transformers'] pred: ['natural-language-processing', 'transformers']   1 2 3 # Sorted tags sorted_tags_by_f1 = OrderedDict(sorted( performance['class'].items(), key=lambda tag: tag[1]['f1'], reverse=True))    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 @widgets.interact(tag=list(sorted_tags_by_f1.keys())) def display_tag_analysis(tag='transformers'): # Performance print (json.dumps(performance["class"][tag], indent=2)) # TP, FP, FN samples index = label_encoder.class_to_index[tag] tp, fp, fn = [], [], [] for i in range(len(y_test)): true = y_test[i][index] pred = y_pred[i][index] if true and pred: tp.append(i) elif not true and pred: fp.append(i) elif true and not pred: fn.append(i) # Samples num_samples = 3 if len(tp): print ("\n=== True positives ===\n") for i in tp[:num_samples]: print (f" {X_test[i]}") print (f" true: {label_encoder.decode([y_test[i]])[0]}") print (f" pred: {label_encoder.decode([y_pred[i]])[0]}\n") if len(fp): print ("=== False positives ===\n") for i in fp[:num_samples]: print (f" {X_test[i]}") print (f" true: {label_encoder.decode([y_test[i]])[0]}") print (f" pred: {label_encoder.decode([y_pred[i]])[0]}\n") if len(fn): print ("=== False negatives ===\n") for i in fn[:num_samples]: print (f" {X_test[i]}") print (f" true: {label_encoder.decode([y_test[i]])[0]}") print (f" pred: {label_encoder.decode([y_pred[i]])[0]}\n")  This is the output for the transformers tag: { "precision": 0.9285714285714286, "recall": 0.48148148148148145, "f1": 0.6341463414634146, "num_samples": 27.0 } === True positives === insight project insight design creat nlp servic code base front end gui streamlit backend server fastapi usag transform true: ['attention', 'huggingface', 'natural-language-processing', 'pytorch', 'transfer-learning', 'transformers'] pred: ['natural-language-processing', 'transformers'] hyperparamet optim transform guid basic grid search optim fact hyperparamet choos signific impact final model perform true: ['natural-language-processing', 'transformers'] pred: ['natural-language-processing', 'transformers'] transform neural network architectur explain time explain transform work look easi explan 
exactli right
true: ['attention', 'natural-language-processing', 'transformers']
pred: ['natural-language-processing', 'transformers']

=== False positives ===

multi target albument mani imag mani mask bound box key point transform sync
true: ['computer-vision', 'data-augmentation']
pred: ['natural-language-processing', 'transformers']

=== False negatives ===

size fill blank multi mask fill roberta size fill blank condit text fill idea fill miss word sentenc probabl choic word
true: ['attention', 'huggingface', 'language-modeling', 'natural-language-processing', 'transformers']
pred: []

gpt3 work visual anim compil thread explain gpt3
true: ['natural-language-processing', 'transformers']
pred: []

tinybert tinybert 7 5x smaller 9 4x faster infer bert base achiev competit perform task natur languag understand
true: ['attention', 'natural-language-processing', 'transformers']
pred: []

Note

You can use false positives/negatives to discover potential errors in annotation. This can be especially useful when analyzing FPs/FNs from rule-based approaches.

Though we achieved decent precision, the recall is quite low. This is because rule-based approaches can yield labels with high certainty when there is an exact condition match, but they fail to generalize or learn implicit patterns.

### Inference

```python
# Infer
text = "Transfer learning with transformers for self-supervised learning"
print (preprocess(text, stem=True))
get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)
```

transfer learn transform self supervis learn
['self-supervised-learning', 'transfer-learning', 'transformers', 'natural-language-processing']

Now let's see what happens when we replace the word transformers with BERT. Sure, we could add BERT as an alias, but we can't keep adding explicit rules like this. This is where it makes sense to learn from the data as opposed to creating explicit rules.

```python
# Infer
text = "Transfer learning with BERT for self-supervised learning"
print (preprocess(text, stem=True))
get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)
```

transfer learn bert self supervis learn
['self-supervised-learning', 'transfer-learning']

limitations: we failed to generalize or learn any implicit patterns to predict the labels because we treat the tokens in our input as isolated entities.

Note

We would ideally spend more time tuning our model because it's so simple and quick to train. This applies to all the other models we'll look at as well.

## Simple ML

motivation:

• representation: use term frequency-inverse document frequency (TF-IDF) to capture the significance of a token to a particular input with respect to all the inputs, as opposed to treating the words in our input text as isolated tokens.
• architecture: we want our model to meaningfully extract the encoded signal to predict the output labels.

So far we've treated the words in our input text as isolated tokens and we haven't really captured any meaning between tokens. Let's use term frequency-inverse document frequency (TF-IDF) to capture the significance of a token to a particular input with respect to all the inputs.

$w_{i, j} = \text{tf}_{i, j} * \log(\frac{N}{\text{df}_i})$

| Variable | Description |
| --- | --- |
| $$w_{i, j}$$ | tf-idf weight for term $$i$$ in document $$j$$ |
| $$\text{tf}_{i, j}$$ | # of times term $$i$$ appears in document $$j$$ |
| $$N$$ | total # of documents |
| $$\text{df}_i$$ | # of documents with term $$i$$ |
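To make the formula concrete, here's a toy calculation (numbers made up, not from our dataset). Note that scikit-learn's TfidfVectorizer used below applies a smoothed idf and L2 normalization, so its values will differ from this raw form:

```python
import numpy as np

tf = 3      # term i appears 3 times in document j
N = 1000    # total number of documents
df_i = 10   # number of documents containing term i

w = tf * np.log(N / df_i)
print(w)  # ~13.82 → frequent-in-this-document but rare-overall terms get large weights
```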

 1 2 3 4 5 6 from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.linear_model import LogisticRegression from sklearn.multiclass import OneVsRestClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.svm import LinearSVC 
 1 2 3 4 5 from sklearn import metrics from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.metrics import accuracy_score, precision_score, recall_score from sklearn.metrics import precision_recall_curve from sklearn.preprocessing import MultiLabelBinarizer 
 1 2 # Set seeds set_seeds() 
 1 2 3 4 # Get data splits preprocessed_df = df.copy() preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True) X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df) 
 1 2 3 4 5 6 7 8 # Tf-idf vectorizer = TfidfVectorizer() print (X_train[0]) X_train = vectorizer.fit_transform(X_train) X_val = vectorizer.transform(X_val) X_test = vectorizer.transform(X_test) print (X_train.shape) print (X_train[0]) # scipy.sparse.csr_matrix 

albument fast imag augment librari easi use wrapper around librari
(1000, 2654)
(0, 190)  0.34307733697679055
(0, 2630) 0.3991510203964918
(0, 2522) 0.14859192074955896
(0, 728)  0.29210630687446
(0, 1356) 0.4515371929370289
(0, 217)  0.2870036535570893
(0, 1157) 0.18851186612963625
(0, 876)  0.31431481238098835
(0, 118)  0.44156912440424356


 1 2 3 4 5 6 7 def fit_and_evaluate(model): """Fit and evaluate each model.""" model.fit(X_train, y_train) y_pred = model.predict(X_test) performance = get_metrics( y_true=y_test, y_pred=y_pred, classes=list(label_encoder.classes)) return performance['overall'] 
  1 2 3 4 5 6 7 8 9 10 11 12 13 # Models performance = {} performance['logistic-regression'] = fit_and_evaluate(OneVsRestClassifier( LogisticRegression(), n_jobs=1)) performance['k-nearest-neighbors'] = fit_and_evaluate( KNeighborsClassifier()) performance['random-forest'] = fit_and_evaluate( RandomForestClassifier(n_jobs=-1)) performance['gradient-boosting-machine'] = fit_and_evaluate(OneVsRestClassifier( GradientBoostingClassifier())) performance['support-vector-machine'] = fit_and_evaluate(OneVsRestClassifier( LinearSVC(), n_jobs=-1)) print (json.dumps(performance, indent=2)) 

{
"logistic-regression": {
"precision": 0.3563624338624338,
"recall": 0.0858365150175495,
"f1": 0.13067443826527078,
"num_samples": 480.0
},
"k-nearest-neighbors": {
"precision": 0.6172562358276645,
"recall": 0.3213868500136974,
"f1": 0.400741288236766,
"num_samples": 480.0
},
"random-forest": {
"precision": 0.5851306333244963,
"recall": 0.21548369514995133,
"f1": 0.29582560665419344,
"num_samples": 480.0
},
"precision": 0.7104917071723794,
"recall": 0.5106819976684509,
"f1": 0.575225354377256,
"num_samples": 480.0
},
"support-vector-machine": {
"precision": 0.8059313061625735,
"recall": 0.40445445906037036,
"f1": 0.5164548230244397,
"num_samples": 480.0
}
}


limitations:

• representation: TF-IDF representations don't encapsulate much signal beyond term frequency, but we need more fine-grained token representations.
• architecture: we want to develop models that can use better represented encodings in a more contextual manner.

## CNN w/ Embeddings

motivation:

• representation: we want more robust (we split tokens into characters) and more meaningful embedding representations for our input tokens.
• architecture: we want to process our encoded inputs using convolution (CNN) filters that can learn to analyze windows of embedded tokens to extract meaningful signal.

### Set up

We'll set up the task by setting seeds for reproducibility, creating our data splits and setting the device.

 1 2 3 4 import math import torch import torch.nn as nn import torch.nn.functional as F 
 1 2 # Set seeds set_seeds() 
 1 2 3 4 5 # Get data splits preprocessed_df = df.copy() preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True) X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df) X_test_raw = X_test 
 1 2 3 4 5 6 7 8 # Set device cuda = True device = torch.device('cuda' if ( torch.cuda.is_available() and cuda) else 'cpu') torch.set_default_tensor_type('torch.FloatTensor') if device.type == 'cuda': torch.set_default_tensor_type('torch.cuda.FloatTensor') print (device) 

cuda


### Tokenizer

We're going to tokenize our input text as character tokens so we can be robust to spelling errors and learn to generalize across tags (ex. learning that RoBERTa, or any other future BERT-based architecture, warrants the same tag as BERT).

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 class Tokenizer(object): def __init__(self, char_level, num_tokens=None, pad_token='', oov_token='', token_to_index=None): self.char_level = char_level self.separator = '' if self.char_level else ' ' if num_tokens: num_tokens -= 2 # pad + unk tokens self.num_tokens = num_tokens self.pad_token = pad_token self.oov_token = oov_token if not token_to_index: token_to_index = {pad_token: 0, oov_token: 1} self.token_to_index = token_to_index self.index_to_token = {v: k for k, v in self.token_to_index.items()} def __len__(self): return len(self.token_to_index) def __str__(self): return f"" def fit_on_texts(self, texts): if not self.char_level: texts = [text.split(" ") for text in texts] all_tokens = [token for text in texts for token in text] counts = Counter(all_tokens).most_common(self.num_tokens) self.min_token_freq = counts[-1][1] for token, count in counts: index = len(self) self.token_to_index[token] = index self.index_to_token[index] = token return self def texts_to_sequences(self, texts): sequences = [] for text in texts: if not self.char_level: text = text.split(' ') sequence = [] for token in text: sequence.append(self.token_to_index.get( token, self.token_to_index[self.oov_token])) sequences.append(np.asarray(sequence)) return sequences def sequences_to_texts(self, sequences): texts = [] for sequence in sequences: text = [] for index in sequence: text.append(self.index_to_token.get(index, self.oov_token)) texts.append(self.separator.join([token for token in text])) return texts def save(self, fp): with open(fp, 'w') as fp: contents = { 'char_level': self.char_level, 'oov_token': self.oov_token, 'token_to_index': self.token_to_index } json.dump(contents, fp, indent=4, sort_keys=False) @classmethod def load(cls, fp): with open(fp, 'r') as fp: kwargs = json.load(fp=fp) return cls(**kwargs) 
 1 2 3 4 5 6 # Tokenize char_level = True tokenizer = Tokenizer(char_level=char_level) tokenizer.fit_on_texts(texts=X_train) vocab_size = len(tokenizer) print (tokenizer) 

<Tokenizer(num_tokens=39)>

 1 tokenizer.token_to_index 
{' ': 2,
'0': 30,
'1': 31,
'2': 26,
...
'<UNK>': 1,
...
'x': 25,
'y': 21,
'z': 27}

 1 2 3 4 5 6 7 8 # Convert texts to sequences of indices X_train = np.array(tokenizer.texts_to_sequences(X_train)) X_val = np.array(tokenizer.texts_to_sequences(X_val)) X_test = np.array(tokenizer.texts_to_sequences(X_test)) preprocessed_text = tokenizer.sequences_to_texts([X_train[0]])[0] print ("Text to indices:\n" f" (preprocessed) → {preprocessed_text}\n" f" (tokenized) → {X_train[0]}") 
Text to indices:
(preprocessed) → albumentations fast image augmentation library easy use wrapper around libraries
(tokenized) → [ 7 11 20 17 16  3  5  6  7  6  4 10  5  9  2 19  7  9  6  2  4 16  7 14
3  2  7 17 14 16  3  5  6  7  6  4 10  5  2 11  4 20  8  7  8 21  2  3
7  9 21  2 17  9  3  2 23  8  7 13 13  3  8  2  7  8 10 17  5 15  2 11
4 20  8  7  8  4  3  9]


### Data imbalance

We'll factor class weights into our objective function (binary cross entropy with logits) to help with class imbalance. There are many other techniques, such as oversampling underrepresented classes, undersampling, etc., but we'll cover these in a separate lesson on data imbalance.

 1 2 3 4 # Class weights counts = np.bincount([label_encoder.class_to_index[class_] for class_ in all_tags]) class_weights = {i: 1.0/count for i, count in enumerate(counts)} print (f"class counts: {counts},\nclass weights: {class_weights}") 
class counts: [120  41 388 106  41  75  34  73  51  78  64  51  55  93  51 429  33  69
30  51 258  32  49  59  57  60  48  40 213  40  34  46 196  39  39],
class weights: {0: 0.008333333333333333, 1: 0.024390243902439025, 2: 0.002577319587628866, 3: 0.009433962264150943, 4: 0.024390243902439025, 5: 0.013333333333333334, 6: 0.029411764705882353, 7: 0.0136986301369863, 8: 0.0196078431372549, 9: 0.01282051282051282, 10: 0.015625, 11: 0.0196078431372549, 12: 0.01818181818181818, 13: 0.010752688172043012, 14: 0.0196078431372549, 15: 0.002331002331002331, 16: 0.030303030303030304, 17: 0.014492753623188406, 18: 0.03333333333333333, 19: 0.0196078431372549, 20: 0.003875968992248062, 21: 0.03125, 22: 0.02040816326530612, 23: 0.01694915254237288, 24: 0.017543859649122806, 25: 0.016666666666666666, 26: 0.020833333333333332, 27: 0.025, 28: 0.004694835680751174, 29: 0.025, 30: 0.029411764705882353, 31: 0.021739130434782608, 32: 0.00510204081632653, 33: 0.02564102564102564, 34: 0.02564102564102564}


### Datasets

We're going to place our data into a Dataset and use a DataLoader to efficiently create batches for training and evaluation.

 1 2 3 4 5 6 7 def pad_sequences(sequences, max_seq_len=0): """Pad sequences to max length in sequence.""" max_seq_len = max(max_seq_len, max(len(sequence) for sequence in sequences)) padded_sequences = np.zeros((len(sequences), max_seq_len)) for i, sequence in enumerate(sequences): padded_sequences[i][:len(sequence)] = sequence return padded_sequences 
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 class CNNTextDataset(torch.utils.data.Dataset): def __init__(self, X, y, max_filter_size): self.X = X self.y = y self.max_filter_size = max_filter_size def __len__(self): return len(self.y) def __str__(self): return f"" def __getitem__(self, index): X = self.X[index] y = self.y[index] return [X, y] def collate_fn(self, batch): """Processing on a batch.""" # Get inputs batch = np.array(batch, dtype=object) X = batch[:, 0] y = np.stack(batch[:, 1], axis=0) # Pad inputs X = pad_sequences(sequences=X, max_seq_len=self.max_filter_size) # Cast X = torch.LongTensor(X.astype(np.int32)) y = torch.FloatTensor(y.astype(np.int32)) return X, y def create_dataloader(self, batch_size, shuffle=False, drop_last=False): return torch.utils.data.DataLoader( dataset=self, batch_size=batch_size, collate_fn=self.collate_fn, shuffle=shuffle, drop_last=drop_last, pin_memory=True) 
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 # Create datasets filter_sizes = list(range(1, 11)) train_dataset = CNNTextDataset( X=X_train, y=y_train, max_filter_size=max(filter_sizes)) val_dataset = CNNTextDataset( X=X_val, y=y_val, max_filter_size=max(filter_sizes)) test_dataset = CNNTextDataset( X=X_test, y=y_test, max_filter_size=max(filter_sizes)) print ("Data splits:\n" f" Train dataset:{train_dataset.__str__()}\n" f" Val dataset: {val_dataset.__str__()}\n" f" Test dataset: {test_dataset.__str__()}\n" "Sample point:\n" f" X: {train_dataset[0][0]}\n" f" y: {train_dataset[0][1]}") 

Data splits:
Train dataset:
Val dataset:
Test dataset:
Sample point:
X: [ 7 11 20 17 16  3  5  6  7  6  4 10  5  9  2 19  7  9  6  2  4 16  7 14
3  2  7 17 14 16  3  5  6  7  6  4 10  5  2 11  4 20  8  7  8 21  2  3
7  9 21  2 17  9  3  2 23  8  7 13 13  3  8  2  7  8 10 17  5 15  2 11
4 20  8  7  8  4  3  9]
y: [0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

  1 2 3 4 5 6 7 8 9 10 11 12 # Create dataloaders batch_size = 64 train_dataloader = train_dataset.create_dataloader( batch_size=batch_size) val_dataloader = val_dataset.create_dataloader( batch_size=batch_size) test_dataloader = test_dataset.create_dataloader( batch_size=batch_size) batch_X, batch_y = next(iter(train_dataloader)) print ("Sample batch:\n" f" X: {list(batch_X.size())}\n" f" y: {list(batch_y.size())}") 
Sample batch:
X: [64, 186]
y: [64, 35]


### Model

We'll be using a convolutional neural network on top of our embedded tokens to extract meaningful spatial signal. This time, we'll be using many filter widths to act as n-gram feature extractors. If you're not familiar with CNNs, be sure to check out the CNN lesson where we walk through every component of the architecture.

Let's visualize the model's forward pass.

1. We'll first tokenize our inputs (batch_size, max_seq_len).
2. Then we'll embed our tokenized inputs (batch_size, max_seq_len, embedding_dim).
3. We'll apply convolution via filters (filter_size, embedding_dim, num_filters) on our embedded inputs. Our filters act as character-level n-gram detectors; we use filter sizes 1 through 10, so they act as uni-gram through 10-gram feature extractors, respectively.
4. We'll apply 1D global max pooling which will extract the most relevant information from the feature maps for making the decision.
5. We feed the pool outputs to a fully-connected (FC) layer (with dropout).
6. We use one more FC layer to output logits, from which a sigmoid gives per-class probabilities (this is a multi-label task, so we don't use a softmax).
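To make the shapes concrete, here's a small sketch of the embed → convolve → pool path with toy dimensions (these aren't our actual hyperparameters):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, max_seq_len, vocab_size, embedding_dim, num_filters = 4, 20, 39, 8, 16

x = torch.randint(0, vocab_size, (N, max_seq_len))            # (4, 20) tokenized inputs
emb = nn.Embedding(vocab_size, embedding_dim)(x)              # (4, 20, 8) embedded tokens
emb = emb.transpose(1, 2)                                     # (4, 8, 20) channels-first for Conv1d
conv = nn.Conv1d(embedding_dim, num_filters, kernel_size=3)   # character tri-gram detector
z = conv(emb)                                                 # (4, 16, 18) feature maps (no padding here)
z = F.max_pool1d(z, z.size(2)).squeeze(2)                     # (4, 16) most salient value per filter
print(z.shape)
```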

 1 2 3 4 5 # Arguments embedding_dim = 128 num_filters = 128 hidden_dim = 128 dropout_p = 0.5 
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 class CNN(nn.Module): def __init__(self, embedding_dim, vocab_size, num_filters, filter_sizes, hidden_dim, dropout_p, num_classes, padding_idx=0): super(CNN, self).__init__() # Initialize embeddings self.embeddings = nn.Embedding( embedding_dim=embedding_dim, num_embeddings=vocab_size, padding_idx=padding_idx) # Conv weights self.filter_sizes = filter_sizes self.conv = nn.ModuleList( [nn.Conv1d(in_channels=embedding_dim, out_channels=num_filters, kernel_size=f) for f in filter_sizes]) # FC weights self.dropout = nn.Dropout(dropout_p) self.fc1 = nn.Linear(num_filters*len(filter_sizes), hidden_dim) self.fc2 = nn.Linear(hidden_dim, num_classes) def forward(self, inputs, channel_first=False): # Embed x_in, = inputs x_in = self.embeddings(x_in) if not channel_first: x_in = x_in.transpose(1, 2) # (N, channels, sequence length) z = [] max_seq_len = x_in.shape[2] for i, f in enumerate(self.filter_sizes): # SAME padding padding_left = int( (self.conv[i].stride[0]*(max_seq_len-1) - max_seq_len + self.filter_sizes[i])/2) padding_right = int(math.ceil( (self.conv[i].stride[0]*(max_seq_len-1) - max_seq_len + self.filter_sizes[i])/2)) # Conv _z = self.conv[i](F.pad(x_in, (padding_left, padding_right))) # Pool _z = F.max_pool1d(_z, _z.size(2)).squeeze(2) z.append(_z) # Concat outputs z = torch.cat(z, 1) # FC z = self.fc1(z) z = self.dropout(z) z = self.fc2(z) return z 

• VALID: no padding, the filters only use the "valid" values in the input. If the filter cannot reach all the input values (filters go left to right), the extra values on the right are dropped.
• SAME: adds padding evenly to the right (preferred) and left sides of the input so that all values in the input are processed.

We add SAME padding so that the convolutional outputs have the same width as our inputs. The amount of padding can be determined from the convolution output-width equation: we want the output width to equal the input width W, so we solve for P:

$\frac{W-F+2P}{S} + 1 = W$
$P = \frac{S(W-1) - W + F}{2}$

If $$P$$ is not a whole number, we round up (using math.ceil) and place the extra padding on the right side.
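For example, with stride $$S=1$$, input width $$W=10$$ and filter size $$F=4$$, the formula gives $$P=1.5$$, so we pad 1 on the left and 2 on the right (mirroring the padding_left/padding_right computation in the model above):

```python
import math

S, W, F = 1, 10, 4  # stride, input width, filter size
padding = (S * (W - 1) - W + F) / 2
padding_left, padding_right = int(padding), int(math.ceil(padding))
print(padding_left, padding_right)  # 1 2 → padded width 13, output width (13 - 4)/1 + 1 = 10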

 1 2 3 4 5 6 7 8 # Initialize model model = CNN( embedding_dim=embedding_dim, vocab_size=vocab_size, num_filters=num_filters, filter_sizes=filter_sizes, hidden_dim=hidden_dim, dropout_p=dropout_p, num_classes=num_classes) model = model.to(device) print (model.named_parameters) 
bound method Module.named_parameters of CNN(
(conv): ModuleList(
(0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(1): Conv1d(128, 128, kernel_size=(2,), stride=(1,))
(2): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
(3): Conv1d(128, 128, kernel_size=(4,), stride=(1,))
(4): Conv1d(128, 128, kernel_size=(5,), stride=(1,))
(5): Conv1d(128, 128, kernel_size=(6,), stride=(1,))
(6): Conv1d(128, 128, kernel_size=(7,), stride=(1,))
(7): Conv1d(128, 128, kernel_size=(8,), stride=(1,))
(8): Conv1d(128, 128, kernel_size=(9,), stride=(1,))
(9): Conv1d(128, 128, kernel_size=(10,), stride=(1,))
)
(dropout): Dropout(p=0.5, inplace=False)
(fc1): Linear(in_features=1280, out_features=128, bias=True)
(fc2): Linear(in_features=128, out_features=35, bias=True)
)


### Training

 1 2 3 4 # Arguments lr = 2e-4 num_epochs = 200 patience = 10 
```python
# Define loss
class_weights_tensor = torch.Tensor(np.array(list(class_weights.values())))
loss_fn = nn.BCEWithLogitsLoss(weight=class_weights_tensor)
```
 1 2 3 4 # Define optimizer & scheduler optimizer = torch.optim.Adam(model.parameters(), lr=lr) scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau( optimizer, mode='min', factor=0.1, patience=5) 
 1 2 3 4 # Trainer module trainer = Trainer( model=model, device=device, loss_fn=loss_fn, optimizer=optimizer, scheduler=scheduler) 
 1 2 3 # Train best_model = trainer.train( num_epochs, patience, train_dataloader, val_dataloader) 

Epoch: 1 | train_loss: 0.00624, val_loss: 0.00285, lr: 2.00E-04, _patience: 10
Epoch: 2 | train_loss: 0.00401, val_loss: 0.00283, lr: 2.00E-04, _patience: 10
Epoch: 3 | train_loss: 0.00362, val_loss: 0.00266, lr: 2.00E-04, _patience: 10
Epoch: 4 | train_loss: 0.00332, val_loss: 0.00263, lr: 2.00E-04, _patience: 10
...
Epoch: 49 | train_loss: 0.00061, val_loss: 0.00149, lr: 2.00E-05, _patience: 4
Epoch: 50 | train_loss: 0.00055, val_loss: 0.00159, lr: 2.00E-05, _patience: 3
Epoch: 51 | train_loss: 0.00056, val_loss: 0.00152, lr: 2.00E-05, _patience: 2
Epoch: 52 | train_loss: 0.00057, val_loss: 0.00156, lr: 2.00E-05, _patience: 1
Stopping early!


### Evaluation

 1 2 from pathlib import Path from sklearn.metrics import precision_recall_curve 
 1 2 3 4 5 6 7 8 # Threshold-PR curve train_loss, y_true, y_prob = trainer.eval_step(dataloader=train_dataloader) precisions, recalls, thresholds = precision_recall_curve(y_true.ravel(), y_prob.ravel()) plt.plot(thresholds, precisions[:-1], "r--", label="Precision") plt.plot(thresholds, recalls[:-1], "b-", label="Recall") plt.ylabel("Performance") plt.xlabel("Threshold") plt.legend(loc='best') 

 1 2 3 4 5 6 # Determining the best threshold def find_best_threshold(y_true, y_prob): """Find the best threshold for maximum F1.""" precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob) f1s = (2 * precisions * recalls) / (precisions + recalls) return thresholds[np.argmax(f1s)] 
 1 2 3 # Best threshold for f1 threshold = find_best_threshold(y_true.ravel(), y_prob.ravel()) threshold 

0.23890994


 1 2 3 # Determine predictions using threshold test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader) y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob]) 
 1 2 3 4 # Evaluate performance = get_metrics( y_true=y_test, y_pred=y_pred, classes=label_encoder.classes) print (json.dumps(performance['overall'], indent=2)) 

{
"precision": 0.8134201912838206,
"recall": 0.5244273766053323,
"f1": 0.6134741297877828,
"num_samples": 480.0
}


We can do the same type of inspection as with the rule-based baseline. This is the output for the transformers tag:

{
"precision": 0.782608695652174,
"recall": 0.72,
"f1": 0.7499999999999999,
"num_samples": 25.0
}

=== True positives ===

insight project insight designed create nlp service code base front end gui streamlit backend server fastapi usage transformers
true: ['attention', 'huggingface', 'natural-language-processing', 'pytorch', 'transfer-learning', 'transformers']
pred: ['natural-language-processing', 'transformers']

transformer neural network architecture explained time explain transformers work looking easy explanation exactly right
true: ['attention', 'natural-language-processing', 'transformers']
pred: ['natural-language-processing', 'transformers']

multi task training hugging face transformers nlp recipe multi task training transformers trainer nlp datasets
true: ['huggingface', 'natural-language-processing', 'transformers']
pred: ['huggingface', 'natural-language-processing', 'transformers']

=== False positives ===

evaluation metrics language modeling article focus traditional intrinsic metrics extremely useful process training language model
true: ['language-modeling', 'natural-language-processing']
pred: ['language-modeling', 'natural-language-processing', 'transformers']

multi target albumentations many images many masks bounding boxes key points transform sync
true: ['computer-vision', 'data-augmentation']
pred: ['computer-vision', 'natural-language-processing', 'transformers']

lda2vec tools interpreting natural language lda2vec model tries mix best parts word2vec lda single framework
true: ['embeddings', 'interpretability', 'natural-language-processing']
pred: ['natural-language-processing', 'transformers']

=== False negatives ===

sized fill blank multi mask filling roberta sized fill blank conditional text filling idea filling missing words sentence probable choice words
true: ['attention', 'huggingface', 'language-modeling', 'natural-language-processing', 'transformers']
pred: ['natural-language-processing']

gpt3 works visualizations animations compilation threads explaining gpt3
true: ['natural-language-processing', 'transformers']
pred: ['interpretability', 'natural-language-processing']

multimodal meme classification uniter given state art results various image text related problems project aims finetuning uniter solve hateful memes challenge
true: ['attention', 'computer-vision', 'image-classification', 'natural-language-processing', 'transformers']
pred: ['computer-vision']

 1 2 3 4 5 6 7 8 # Save artifacts dir = Path("cnn") dir.mkdir(parents=True, exist_ok=True) tokenizer.save(fp=Path(dir, 'tokenzier.json')) label_encoder.save(fp=Path(dir, 'label_encoder.json')) torch.save(best_model.state_dict(), Path(dir, 'model.pt')) with open(Path(dir, 'performance.json'), "w") as fp: json.dump(performance, indent=2, sort_keys=False, fp=fp) 

### Inference

  1 2 3 4 5 6 7 8 9 10 11 # Load artifacts device = torch.device("cpu") tokenizer = Tokenizer.load(fp=Path(dir, 'tokenzier.json')) label_encoder = LabelEncoder.load(fp=Path(dir, 'label_encoder.json')) model = CNN( embedding_dim=embedding_dim, vocab_size=vocab_size, num_filters=num_filters, filter_sizes=filter_sizes, hidden_dim=hidden_dim, dropout_p=dropout_p, num_classes=num_classes) model.load_state_dict(torch.load(Path(dir, 'model.pt'), map_location=device)) model.to(device) 
CNN(
(conv): ModuleList(
(0): Conv1d(128, 128, kernel_size=(1,), stride=(1,))
(1): Conv1d(128, 128, kernel_size=(2,), stride=(1,))
(2): Conv1d(128, 128, kernel_size=(3,), stride=(1,))
(3): Conv1d(128, 128, kernel_size=(4,), stride=(1,))
(4): Conv1d(128, 128, kernel_size=(5,), stride=(1,))
(5): Conv1d(128, 128, kernel_size=(6,), stride=(1,))
(6): Conv1d(128, 128, kernel_size=(7,), stride=(1,))
(7): Conv1d(128, 128, kernel_size=(8,), stride=(1,))
(8): Conv1d(128, 128, kernel_size=(9,), stride=(1,))
(9): Conv1d(128, 128, kernel_size=(10,), stride=(1,))
)
(dropout): Dropout(p=0.5, inplace=False)
(fc1): Linear(in_features=1280, out_features=128, bias=True)
(fc2): Linear(in_features=128, out_features=35, bias=True)
)


 1 2 # Initialize trainer trainer = Trainer(model=model, device=device) 
 1 2 3 4 5 6 7 8 # Dataloader text = "Transfer learning with BERT for self-supervised learning" X = np.array(tokenizer.texts_to_sequences([preprocess(text)])) y_filler = label_encoder.encode([np.array([label_encoder.classes[0]]*len(X))]) dataset = CNNTextDataset( X=X, y=y_filler, max_filter_size=max(filter_sizes)) dataloader = dataset.create_dataloader( batch_size=batch_size) 
 1 2 3 4 # Inference y_prob = trainer.predict_step(dataloader) y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob]) label_encoder.decode(y_pred) 

[['attention',
'natural-language-processing',
'self-supervised-learning',
'transfer-learning',
'transformers']]


limitations:

• representation: embeddings are not contextual.
• architecture: extracting signal from encoded inputs is limited by filter widths.

Note

Since we're dealing with simple architectures and fast training times, it's a good opportunity to explore tuning and experiment with k-fold cross validation to properly reach any conclusions about performance.
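For example, here's a rough k-fold sketch. The `train_and_evaluate_fold` helper is hypothetical and would wrap the dataset/dataloader/Trainer setup above; for our multi-label task, an iterative/stratified splitter like the one we used for our data splits would be preferable to plain KFold:

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, y, k=5):
    """Average validation F1 across k folds (sketch)."""
    kf = KFold(n_splits=k, shuffle=True, random_state=1234)
    f1s = []
    for train_index, val_index in kf.split(X):
        # Hypothetical helper: builds dataloaders, trains a fresh model on the
        # fold's train split and returns F1 on the fold's validation split.
        f1 = train_and_evaluate_fold(
            X[train_index], y[train_index], X[val_index], y[val_index])
        f1s.append(f1)
    return np.mean(f1s), np.std(f1s)
```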

## RNN w/ Embeddings

motivation: let's see if processing our embedded tokens in a sequential fashion using recurrent neural networks (RNNs) can yield better performance.

### Set up

 1 2 # Set seeds set_seeds() 
 1 2 3 4 5 # Get data splits preprocessed_df = df.copy() preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True) X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df) X_test_raw = X_test 
 1 2 3 4 5 6 7 8 # Set device cuda = True device = torch.device('cuda' if ( torch.cuda.is_available() and cuda) else 'cpu') torch.set_default_tensor_type('torch.FloatTensor') if device.type == 'cuda': torch.set_default_tensor_type('torch.cuda.FloatTensor') print (device) 

### Tokenizer

 1 2 3 4 5 6 7 # Tokenize char_level = True tokenizer = Tokenizer(char_level=char_level) tokenizer.fit_on_texts(texts=X_train) vocab_size = len(tokenizer) print ("X tokenizer:\n" f" {tokenizer}") 
X tokenizer:
<Tokenizer(num_tokens=39)>

 1 tokenizer.token_to_index 
{' ': 2,
'0': 30,
'1': 31,
'2': 26,
...
'<UNK>': 1,
...
'x': 25,
'y': 21,
'z': 27}

 1 2 3 4 5 6 7 8 # Convert texts to sequences of indices X_train = np.array(tokenizer.texts_to_sequences(X_train)) X_val = np.array(tokenizer.texts_to_sequences(X_val)) X_test = np.array(tokenizer.texts_to_sequences(X_test)) preprocessed_text = tokenizer.sequences_to_texts([X_train[0]])[0] print ("Text to indices:\n" f" (preprocessed) → {preprocessed_text}\n" f" (tokenized) → {X_train[0]}") 
Text to indices:
(preprocessed) → albumentations fast image augmentation library easy use wrapper around libraries
(tokenized) → [ 7 11 20 17 16  3  5  6  7  6  4 10  5  9  2 19  7  9  6  2  4 16  7 14
3  2  7 17 14 16  3  5  6  7  6  4 10  5  2 11  4 20  8  7  8 21  2  3
7  9 21  2 17  9  3  2 23  8  7 13 13  3  8  2  7  8 10 17  5 15  2 11
4 20  8  7  8  4  3  9]


### Data imbalance

We'll factor class weights into our objective function (binary cross entropy with logits) to help with class imbalance. There are many other techniques, such as oversampling underrepresented classes, undersampling, etc., but we'll cover these in a separate lesson on data imbalance.

 1 2 3 4 # Class weights counts = np.bincount([label_encoder.class_to_index[class_] for class_ in all_tags]) class_weights = {i: 1.0/count for i, count in enumerate(counts)} print (f"class counts: {counts},\nclass weights: {class_weights}") 
class counts: [120  41 388 106  41  75  34  73  51  78  64  51  55  93  51 429  33  69
30  51 258  32  49  59  57  60  48  40 213  40  34  46 196  39  39],
class weights: {0: 0.008333333333333333, 1: 0.024390243902439025, 2: 0.002577319587628866, 3: 0.009433962264150943, 4: 0.024390243902439025, 5: 0.013333333333333334, 6: 0.029411764705882353, 7: 0.0136986301369863, 8: 0.0196078431372549, 9: 0.01282051282051282, 10: 0.015625, 11: 0.0196078431372549, 12: 0.01818181818181818, 13: 0.010752688172043012, 14: 0.0196078431372549, 15: 0.002331002331002331, 16: 0.030303030303030304, 17: 0.014492753623188406, 18: 0.03333333333333333, 19: 0.0196078431372549, 20: 0.003875968992248062, 21: 0.03125, 22: 0.02040816326530612, 23: 0.01694915254237288, 24: 0.017543859649122806, 25: 0.016666666666666666, 26: 0.020833333333333332, 27: 0.025, 28: 0.004694835680751174, 29: 0.025, 30: 0.029411764705882353, 31: 0.021739130434782608, 32: 0.00510204081632653, 33: 0.02564102564102564, 34: 0.02564102564102564}


### Datasets

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 class RNNTextDataset(torch.utils.data.Dataset): def __init__(self, X, y): self.X = X self.y = y def __len__(self): return len(self.y) def __str__(self): return f"" def __getitem__(self, index): X = self.X[index] y = self.y[index] return [X, len(X), y] def collate_fn(self, batch): """Processing on a batch.""" # Get inputs batch = np.array(batch, dtype=object) X = batch[:, 0] seq_lens = batch[:, 1] y = np.stack(batch[:, 2], axis=0) # Pad inputs X = pad_sequences(sequences=X) # Cast X = torch.LongTensor(X.astype(np.int32)) seq_lens = torch.LongTensor(seq_lens.astype(np.int32)) y = torch.FloatTensor(y.astype(np.int32)) return X, seq_lens, y def create_dataloader(self, batch_size, shuffle=False, drop_last=False): return torch.utils.data.DataLoader( dataset=self, batch_size=batch_size, collate_fn=self.collate_fn, shuffle=shuffle, drop_last=drop_last, pin_memory=True) 
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 # Create datasets train_dataset = RNNTextDataset( X=X_train, y=y_train) val_dataset = RNNTextDataset( X=X_val, y=y_val) test_dataset = RNNTextDataset( X=X_test, y=y_test) print ("Data splits:\n" f" Train dataset:{train_dataset.__str__()}\n" f" Val dataset: {val_dataset.__str__()}\n" f" Test dataset: {test_dataset.__str__()}\n" "Sample point:\n" f" X: {train_dataset[0][0]}\n" f" seq_len: {train_dataset[0][1]}\n" f" y: {train_dataset[0][2]}") 

Data splits:
Train dataset:
Val dataset:
Test dataset:
Sample point:
X: [ 7 11 20 17 16  3  5  6  7  6  4 10  5  9  2 19  7  9  6  2  4 16  7 14
3  2  7 17 14 16  3  5  6  7  6  4 10  5  2 11  4 20  8  7  8 21  2  3
7  9 21  2 17  9  3  2 23  8  7 13 13  3  8  2  7  8 10 17  5 15  2 11
4 20  8  7  8  4  3  9]
seq_len: 80
y: [0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

  1 2 3 4 5 6 7 8 9 10 11 12 13 14 # Create dataloaders batch_size = 64 train_dataloader = train_dataset.create_dataloader( batch_size=batch_size) val_dataloader = val_dataset.create_dataloader( batch_size=batch_size) test_dataloader = test_dataset.create_dataloader( batch_size=batch_size) batch_X, batch_seq_lens, batch_y = next(iter(train_dataloader)) print (batch_X.shape) print ("Sample batch:\n" f" X: {list(batch_X.size())}\n" f" seq_lens: {list(batch_seq_lens.size())}\n" f" y: {list(batch_y.size())}") 
torch.Size([64, 186])
Sample batch:
X: [64, 186]
seq_lens: [64]
y: [64, 35]


### Model

We'll be using a recurrent neural network to process our embedded tokens one at a time (sequentially). If you're not familiar with RNNs, be sure to check out the RNN lesson where we walk through every component of the architecture.

$$\text{RNN forward pass for a single time step } X_t$$:

$h_t = tanh(W_{hh}h_{t-1} + W_{xh}X_t+b_h)$

| Variable | Description |
| --- | --- |
| $$N$$ | batch size |
| $$E$$ | embeddings dimension |
| $$H$$ | # of hidden units |
| $$W_{hh}$$ | RNN weights $$\in \mathbb{R}^{H \times H}$$ |
| $$h_{t-1}$$ | previous timestep's hidden state $$\in \mathbb{R}^{N \times H}$$ |
| $$W_{xh}$$ | input weights $$\in \mathbb{R}^{E \times H}$$ |
| $$X_t$$ | input at time step $$t \in \mathbb{R}^{N \times E}$$ |
| $$b_h$$ | hidden units bias $$\in \mathbb{R}^{H \times 1}$$ |
| $$h_t$$ | output from the RNN for time step $$t$$ |
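Here's a tiny sketch (toy dimensions) of that single forward step, just to tie the equation to tensor shapes; note that the GRU we use below is a gated variant of this vanilla RNN step:

```python
import torch

N, E, H = 4, 8, 16  # batch size, embedding dim, hidden units

W_hh = torch.randn(H, H)     # RNN weights
W_xh = torch.randn(E, H)     # input weights
b_h = torch.randn(H)         # hidden units bias
h_prev = torch.zeros(N, H)   # h_{t-1}
x_t = torch.randn(N, E)      # embedded input at time step t

h_t = torch.tanh(h_prev @ W_hh + x_t @ W_xh + b_h)  # (N, H)
print(h_t.shape)
```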

 1 2 3 4 5 # Arguments embedding_dim = 128 rnn_hidden_dim = 128 hidden_dim = 128 dropout_p = 0.5 
 1 2 3 4 5 6 7 8 def gather_last_relevant_hidden(hiddens, seq_lens): """Extract and collect the last relevant hidden state based on the sequence length.""" seq_lens = seq_lens.long().detach().cpu().numpy() - 1 out = [] for batch_index, column_index in enumerate(seq_lens): out.append(hiddens[batch_index, column_index]) return torch.stack(out) 
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 class RNN(nn.Module): def __init__(self, embedding_dim, vocab_size, rnn_hidden_dim, hidden_dim, dropout_p, num_classes, padding_idx=0): super(RNN, self).__init__() # Initialize embeddings self.embeddings = nn.Embedding(embedding_dim=embedding_dim, num_embeddings=vocab_size, padding_idx=padding_idx) # RNN self.rnn = nn.GRU(embedding_dim, rnn_hidden_dim, batch_first=True, bidirectional=True) # FC weights self.dropout = nn.Dropout(dropout_p) self.fc1 = nn.Linear(rnn_hidden_dim*2, hidden_dim) self.fc2 = nn.Linear(hidden_dim, num_classes) def forward(self, inputs): # Inputs x_in, seq_lens = inputs # Embed x_in = self.embeddings(x_in) # Rnn outputs out, h_n = self.rnn(x_in) z = gather_last_relevant_hidden(hiddens=out, seq_lens=seq_lens) # FC layers z = self.fc1(z) z = self.dropout(z) z = self.fc2(z) return z 
 1 2 3 4 5 6 7 # Initialize model model = RNN( embedding_dim=embedding_dim, vocab_size=vocab_size, rnn_hidden_dim=rnn_hidden_dim, hidden_dim=hidden_dim, dropout_p=dropout_p, num_classes=num_classes) model = model.to(device) print (model.named_parameters) 

bound method Module.named_parameters of RNN(
(rnn): GRU(128, 128, batch_first=True, bidirectional=True)
(dropout): Dropout(p=0.5, inplace=False)
(fc1): Linear(in_features=256, out_features=128, bias=True)
(fc2): Linear(in_features=128, out_features=35, bias=True)
)


### Training

 1 2 3 4 # Arguments lr = 2e-3 num_epochs = 200 patience = 10 
```python
# Define loss
class_weights_tensor = torch.Tensor(np.array(list(class_weights.values())))
loss_fn = nn.BCEWithLogitsLoss(weight=class_weights_tensor)
```
 1 2 3 4 # Define optimizer & scheduler optimizer = torch.optim.Adam(model.parameters(), lr=lr) scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau( optimizer, mode='min', factor=0.1, patience=5) 
```python
# Trainer module
trainer = Trainer(
    model=model, device=device, loss_fn=loss_fn,
    optimizer=optimizer, scheduler=scheduler)
```
```python
# Train
best_model = trainer.train(
    num_epochs, patience, train_dataloader, val_dataloader)
```

Epoch: 1 | train_loss: 0.00612, val_loss: 0.00328, lr: 2.00E-03, _patience: 10
Epoch: 2 | train_loss: 0.00325, val_loss: 0.00276, lr: 2.00E-03, _patience: 10
Epoch: 3 | train_loss: 0.00299, val_loss: 0.00267, lr: 2.00E-03, _patience: 10
Epoch: 4 | train_loss: 0.00287, val_loss: 0.00261, lr: 2.00E-03, _patience: 10
...
Epoch: 27 | train_loss: 0.00167, val_loss: 0.00250, lr: 2.00E-04, _patience: 4
Epoch: 28 | train_loss: 0.00160, val_loss: 0.00252, lr: 2.00E-04, _patience: 3
Epoch: 29 | train_loss: 0.00154, val_loss: 0.00250, lr: 2.00E-04, _patience: 2
Epoch: 30 | train_loss: 0.00153, val_loss: 0.00250, lr: 2.00E-04, _patience: 1
Stopping early!


### Evaluation

```python
# Threshold-PR curve
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
train_loss, y_true, y_prob = trainer.eval_step(dataloader=train_dataloader)
precisions, recalls, thresholds = precision_recall_curve(y_true.ravel(), y_prob.ravel())
plt.plot(thresholds, precisions[:-1], "r--", label="Precision")
plt.plot(thresholds, recalls[:-1], "b-", label="Recall")
plt.ylabel("Performance")
plt.xlabel("Threshold")
plt.legend(loc='best')
```
```python
# Best threshold for f1
threshold = find_best_threshold(y_true.ravel(), y_prob.ravel())
threshold
```
0.22973001
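`find_best_threshold` comes from the notebook rather than this lesson. As a rough sketch (an assumption, not the notebook's exact implementation), such a helper could simply sweep candidate cutoffs and keep the one that maximizes F1 on the flattened multi-label predictions:

```python
from sklearn.metrics import f1_score

def find_best_threshold(y_true, y_prob):
    """Hypothetical sketch: return the probability cutoff that maximizes F1
    over a coarse grid of candidate thresholds."""
    best_threshold, best_f1 = 0.5, 0.0
    for threshold in np.arange(0.05, 0.95, 0.05):
        f1 = f1_score(y_true, (y_prob >= threshold).astype(int))
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold
```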


```python
# Determine predictions using threshold
test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob])
```
```python
# Evaluate
performance = get_metrics(
    y_true=y_test, y_pred=y_pred, classes=label_encoder.classes)
print (json.dumps(performance['overall'], indent=2))
```

{
"precision": 0.3170755112080674,
"recall": 0.20761471963996597,
"f1": 0.22826804744644114,
"num_samples": 480.0
}


### Inference

Note

Detailed inspection and inference in the notebook.

limitations: since we're using character embeddings, our encoded sequences are quite long (>100 tokens), so the RNN may suffer from memory issues over such long sequences. We also can't process our tokens in parallel because we're restricted to sequential processing.

Note

Don't be afraid to experiment with stacking models if they're able to extract unique signal from your encoded data, for example applying CNNs on the outputs from the RNN (outputs from all tokens, not just the last relevant one).
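As a rough illustration of that idea (placeholder tensors, not part of the lesson's code), a 1-D convolution over all of the RNN's outputs followed by max-pooling over time would look something like this, where `rnn_hidden_dim` and `hidden_dim` are the arguments defined above:

```python
# Hypothetical sketch (illustrative shapes only): CNN on top of all RNN outputs
rnn_out = torch.randn(64, 186, rnn_hidden_dim*2)     # (N, max_seq_len, 2*rnn_hidden_dim) from the GRU
conv = nn.Conv1d(in_channels=rnn_hidden_dim*2, out_channels=hidden_dim,
                 kernel_size=3, padding=1)           # SAME-width filters over the time dimension
z = conv(rnn_out.transpose(1, 2))                    # (N, hidden_dim, max_seq_len)
z = torch.max(z, dim=2).values                       # (N, hidden_dim): max-pool over time, then feed the FC layers
```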

## Transformers w/ Contextual Embeddings

motivation:

• representation: we want better representation for our input tokens via contextual embeddings where the token representation is based on the specific neighboring tokens. We can also use sub-word tokens, as opposed to character or word tokens, since they can hold more meaningful representations for many of our keywords, prefixes, suffixes, etc. without having to use filters with specific widths.
• architecture: we want to use Transformers to attend (in parallel) to all the tokens in our input, as opposed to being limited by filter spans (CNNs) or memory issues from sequential processing (RNNs).
Transformer base architecture [source]

### Set up

```python
# Set seeds
set_seeds()
```
```python
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
X_test_raw = X_test
```
```python
# Set device
cuda = True
device = torch.device('cuda' if (
    torch.cuda.is_available() and cuda) else 'cpu')
torch.set_default_tensor_type('torch.FloatTensor')
if device.type == 'cuda':
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
print (device)
```

#### Tokenizer

We'll be using the BertTokenizer to tokenize our input text into sub-word tokens.

```python
from transformers import DistilBertTokenizer
from transformers import BertTokenizer
```
```python
# Load tokenizer and model
# tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('allenai/scibert_scivocab_uncased')
vocab_size = len(tokenizer)
print (vocab_size)
```


31090

```python
# Tokenize inputs
encoded_input = tokenizer(X_train.tolist(), return_tensors='pt', padding=True)
X_train_ids = encoded_input['input_ids']
X_train_masks = encoded_input['attention_mask']
print (X_train_ids.shape, X_train_masks.shape)
encoded_input = tokenizer(X_val.tolist(), return_tensors='pt', padding=True)
X_val_ids = encoded_input['input_ids']
X_val_masks = encoded_input['attention_mask']
print (X_val_ids.shape, X_val_masks.shape)
encoded_input = tokenizer(X_test.tolist(), return_tensors='pt', padding=True)
X_test_ids = encoded_input['input_ids']
X_test_masks = encoded_input['attention_mask']
print (X_test_ids.shape, X_test_masks.shape)
```
torch.Size([1000, 41]) torch.Size([1000, 41])
torch.Size([227, 38]) torch.Size([227, 38])
torch.Size([217, 38]) torch.Size([217, 38])

```python
# Decode
print (f"{X_train_ids[0]}\n{tokenizer.decode(X_train_ids[0])}")
```
tensor([  102,  6160,  1923,   288,  3254,  1572, 18205,  5560,  4578,   626,
23474,   291,  2715, 10558,   103,     0,     0,     0,     0,     0,
0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
0])
[CLS] albumentations fast image augmentation library easy use wrapper around libraries [SEP] [PAD] [PAD] ...

```python
# Sub-word tokens
print (tokenizer.convert_ids_to_tokens(ids=X_train_ids[0]))
```
['[CLS]', 'alb', '##ument', '##ations', 'fast', 'image', 'augmentation', 'library', 'easy', 'use', 'wrap', '##per', 'around', 'libraries', '[SEP]', '[PAD]', ...]


### Data imbalance

```python
# Class weights
counts = np.bincount([label_encoder.class_to_index[class_] for class_ in all_tags])
class_weights = {i: 1.0/count for i, count in enumerate(counts)}
print ("class counts:\n"
    f"  {counts}\n\n"
    "class weights:\n"
    f"  {class_weights}")
```
class counts:
[120  41 388 106  41  75  34  73  51  78  64  51  55  93  51 429  33  69
30  51 258  32  49  59  57  60  48  40 213  40  34  46 196  39  39]

class weights:
{0: 0.008333333333333333, 1: 0.024390243902439025, 2: 0.002577319587628866, 3: 0.009433962264150943, 4: 0.024390243902439025, 5: 0.013333333333333334, 6: 0.029411764705882353, 7: 0.0136986301369863, 8: 0.0196078431372549, 9: 0.01282051282051282, 10: 0.015625, 11: 0.0196078431372549, 12: 0.01818181818181818, 13: 0.010752688172043012, 14: 0.0196078431372549, 15: 0.002331002331002331, 16: 0.030303030303030304, 17: 0.014492753623188406, 18: 0.03333333333333333, 19: 0.0196078431372549, 20: 0.003875968992248062, 21: 0.03125, 22: 0.02040816326530612, 23: 0.01694915254237288, 24: 0.017543859649122806, 25: 0.016666666666666666, 26: 0.020833333333333332, 27: 0.025, 28: 0.004694835680751174, 29: 0.025, 30: 0.029411764705882353, 31: 0.021739130434782608, 32: 0.00510204081632653, 33: 0.02564102564102564, 34: 0.02564102564102564}


### Datasets

```python
class TransformerTextDataset(torch.utils.data.Dataset):
    def __init__(self, ids, masks, targets):
        self.ids = ids
        self.masks = masks
        self.targets = targets

    def __len__(self):
        return len(self.targets)

    def __str__(self):
        return f"<Dataset(N={len(self)})>"

    def __getitem__(self, index):
        ids = torch.tensor(self.ids[index], dtype=torch.long)
        masks = torch.tensor(self.masks[index], dtype=torch.long)
        targets = torch.FloatTensor(self.targets[index])
        return ids, masks, targets

    def create_dataloader(self, batch_size, shuffle=False, drop_last=False):
        return torch.utils.data.DataLoader(
            dataset=self, batch_size=batch_size, shuffle=shuffle,
            drop_last=drop_last, pin_memory=False)
```
```python
# Create datasets
train_dataset = TransformerTextDataset(ids=X_train_ids, masks=X_train_masks, targets=y_train)
val_dataset = TransformerTextDataset(ids=X_val_ids, masks=X_val_masks, targets=y_val)
test_dataset = TransformerTextDataset(ids=X_test_ids, masks=X_test_masks, targets=y_test)
print ("Data splits:\n"
    f"  Train dataset:{train_dataset.__str__()}\n"
    f"  Val dataset: {val_dataset.__str__()}\n"
    f"  Test dataset: {test_dataset.__str__()}\n"
    "Sample point:\n"
    f"  ids: {train_dataset[0][0]}\n"
    f"  masks: {train_dataset[0][1]}\n"
    f"  targets: {train_dataset[0][2]}")
```

Data splits:
Train dataset: <Dataset(N=1000)>
Val dataset: <Dataset(N=227)>
Test dataset: <Dataset(N=217)>
Sample point:
ids: tensor([  102,  6160,  1923,   288,  3254,  1572, 18205,  5560,  4578,   626,
23474,   291,  2715, 10558,   103,     0,     0,     0,     0,     0,
0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
0])
masks: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
targets: tensor([0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
device='cpu')

```python
# Create dataloaders
batch_size = 64
train_dataloader = train_dataset.create_dataloader(
    batch_size=batch_size)
val_dataloader = val_dataset.create_dataloader(
    batch_size=batch_size)
test_dataloader = test_dataset.create_dataloader(
    batch_size=batch_size)
batch = next(iter(train_dataloader))
print ("Sample batch:\n"
    f"  ids: {batch[0].size()}\n"
    f"  masks: {batch[1].size()}\n"
    f"  targets: {batch[2].size()}")
```
Sample batch:
ids: torch.Size([64, 41])
masks: torch.Size([64, 41])
targets: torch.Size([64, 35])


### Model

We're going to use a pretrained BertModel to act as a feature extractor. We'll only use the encoder to receive the sequential and pooled outputs (is_decoder=False is the default).

```python
from transformers import BertModel
```
```python
# transformer = BertModel.from_pretrained("distilbert-base-uncased")
# embedding_dim = transformer.config.dim
transformer = BertModel.from_pretrained("allenai/scibert_scivocab_uncased")
embedding_dim = transformer.config.hidden_size
```
```python
class Transformer(nn.Module):
    def __init__(self, transformer, dropout_p, embedding_dim, num_classes):
        super(Transformer, self).__init__()
        self.transformer = transformer
        self.dropout = torch.nn.Dropout(dropout_p)
        self.fc1 = torch.nn.Linear(embedding_dim, num_classes)

    def forward(self, inputs):
        ids, masks = inputs
        # Tuple unpacking assumes an older transformers version; with newer versions
        # pass return_dict=False or use .last_hidden_state / .pooler_output instead.
        seq, pool = self.transformer(input_ids=ids, attention_mask=masks)
        z = self.dropout(pool)
        z = self.fc1(z)
        return z
```

Note

We decided to work with the pooled output, but we could have just as easily worked with the sequential output (encoder representation for each sub-token) and applied a CNN (or other decoder options) on top of it.
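For instance, one of those "other decoder options" could be masked mean-pooling over the sequential output. This is only an illustrative sketch with placeholder shapes (not code from the lesson); inside `forward()`, the pooled result would feed the same dropout + `fc1` head:

```python
# Hypothetical sketch: masked mean-pooling over the encoder's sequence output
seq = torch.randn(64, 41, 768)             # (N, max_seq_len, hidden_size) from the encoder
masks = torch.randint(0, 2, (64, 41))      # (N, max_seq_len) attention mask
mask = masks.unsqueeze(-1).float()         # (N, max_seq_len, 1)
z = (seq * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)  # (N, hidden_size): average over real tokens only
```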

```python
# Initialize model
dropout_p = 0.5
model = Transformer(
    transformer=transformer, dropout_p=dropout_p,
    embedding_dim=embedding_dim, num_classes=num_classes)
model = model.to(device)
print (model.named_parameters)
```
bound method Module.named_parameters of Transformer(
(transformer): BertModel(
(embeddings): BertEmbeddings(
(position_embeddings): Embedding(512, 768)
(token_type_embeddings): Embedding(2, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(encoder): BertEncoder(
(layer): ModuleList(
(0): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(1): BertLayer(
(attention): BertAttention(
(self): BertSelfAttention(
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(output): BertSelfOutput(
(dense): Linear(in_features=768, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(intermediate): BertIntermediate(
(dense): Linear(in_features=768, out_features=3072, bias=True)
)
(output): BertOutput(
(dense): Linear(in_features=3072, out_features=768, bias=True)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
...
11 more BertLayers
...
)
)
(pooler): BertPooler(
(dense): Linear(in_features=768, out_features=768, bias=True)
(activation): Tanh()
)
)
(dropout): Dropout(p=0.5, inplace=False)
(fc1): Linear(in_features=768, out_features=35, bias=True)
)


### Training

```python
# Arguments
lr = 1e-4
num_epochs = 200
patience = 10
```
```python
# Define loss
class_weights_tensor = torch.Tensor(np.array(list(class_weights.values())))
loss_fn = nn.BCEWithLogitsLoss(weight=class_weights_tensor)
```
```python
# Define optimizer & scheduler
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.1, patience=5)
```
```python
# Trainer module
trainer = Trainer(
    model=model, device=device, loss_fn=loss_fn,
    optimizer=optimizer, scheduler=scheduler)
```
```python
# Train
best_model = trainer.train(
    num_epochs, patience, train_dataloader, val_dataloader)
```

Epoch: 1 | train_loss: 0.00647, val_loss: 0.00354, lr: 1.00E-04, _patience: 10
Epoch: 2 | train_loss: 0.00331, val_loss: 0.00280, lr: 1.00E-04, _patience: 10
Epoch: 3 | train_loss: 0.00295, val_loss: 0.00272, lr: 1.00E-04, _patience: 10
Epoch: 4 | train_loss: 0.00291, val_loss: 0.00271, lr: 1.00E-04, _patience: 10
...
Epoch: 43 | train_loss: 0.00039, val_loss: 0.00130, lr: 1.00E-06, _patience: 4
Epoch: 44 | train_loss: 0.00038, val_loss: 0.00130, lr: 1.00E-06, _patience: 3
Epoch: 45 | train_loss: 0.00038, val_loss: 0.00130, lr: 1.00E-06, _patience: 2
Epoch: 46 | train_loss: 0.00038, val_loss: 0.00130, lr: 1.00E-06, _patience: 1
Stopping early!


### Evaluation

```python
# Threshold-PR curve
train_loss, y_true, y_prob = trainer.eval_step(dataloader=train_dataloader)
precisions, recalls, thresholds = precision_recall_curve(y_true.ravel(), y_prob.ravel())
plt.plot(thresholds, precisions[:-1], "r--", label="Precision")
plt.plot(thresholds, recalls[:-1], "b-", label="Recall")
plt.ylabel("Performance")
plt.xlabel("Threshold")
plt.legend(loc='best')
```
```python
# Best threshold for f1
threshold = find_best_threshold(y_true.ravel(), y_prob.ravel())
threshold
```
0.34790307


```python
# Determine predictions using threshold
test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob])
```
```python
# Evaluate
performance = get_metrics(
    y_true=y_test, y_pred=y_pred, classes=label_encoder.classes)
print (json.dumps(performance['overall'], indent=2))
```

{
"precision": 0.7524809959244634,
"recall": 0.5251264830544388,
"f1": 0.5904032248915119,
"num_samples": 480.0
}


### Inference

Note

Detailed inspection, inference and visualization of attention heads in the notebook.

limitations: Transformers can be quite large, so we'll have to weigh the tradeoffs (size, latency, compute cost, etc.) before deciding on a model.

We're going to go with the CNN-on-embeddings approach and optimize it, since its performance is quite similar to that of the contextualized embeddings via Transformers but at a much lower cost.

```python
# Performance
with open(Path("cnn", "performance.json"), "r") as fp:
    cnn_performance = json.load(fp)
with open(Path("transformers", "performance.json"), "r") as fp:
    transformers_performance = json.load(fp)
print (f'CNN: f1 = {cnn_performance["overall"]["f1"]}')
print (f'Transformer: f1 = {transformers_performance["overall"]["f1"]}')
```
CNN: f1 = 0.6119912020434568
Transformer: f1 = 0.5904032248915119


This was just one run on one split, so you'll want to experiment with k-fold cross validation before reaching any conclusions about performance. Also make sure you take the time to tune these baselines, since their training periods are quite fast (we can achieve an f1 of 0.7 with just a bit of tuning for both the CNN and the Transformer). We'll cover hyperparameter tuning in a few lessons so you can replicate the process here on your own time. We should also benchmark on other important metrics as we iterate, not just precision and recall.

```python
# Size
print (f'CNN: {Path("cnn", "model.pt").stat().st_size/1000000:.1f} MB')
print (f'Transformer: {Path("transformers", "model.pt").stat().st_size/1000000:.1f} MB')
```
CNN: 4.3 MB
Transformer: 439.9 MB
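Size isn't the only operational metric; a quick-and-dirty latency check is just as easy to run. This is only a sketch (`measure_latency` is not part of the lesson's code, and real benchmarking would use warm-up runs, many more batches, and `torch.cuda.synchronize()` on GPU):

```python
import time

def measure_latency(model, dataloader, num_batches=5):
    """Rough, illustrative estimate of inference time (seconds per batch).
    Assumes the model and batches live on the same device (as set up above)."""
    model.eval()
    times = []
    with torch.no_grad():
        for i, batch in enumerate(dataloader):
            if i == num_batches:
                break
            inputs = batch[:-1]          # everything except the targets
            start = time.time()
            model(inputs)
            times.append(time.time() - start)
    return np.mean(times)

print (f"Transformer: {measure_latency(model, test_dataloader):.4f} s/batch")
```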


We'll consider other tradeoffs such as maintenance overhead, bias test passes, etc. as we develop.

Note

Interpretability was not one of our requirements, but note that we could've tweaked the model outputs to deliver it. For example, since we used SAME padding for our CNN, we can use the activation scores to extract influential n-grams. Similarly, we could have used the self-attention weights from our Transformer encoder to find influential sub-tokens, as sketched below.
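As a rough sketch of the Transformer side of that idea (illustrative only; the sample text, the choice of the last layer, and averaging over heads are assumptions, and it requires a transformers version that supports `output_attentions` / `return_dict`):

```python
# Hypothetical sketch: rank sub-tokens by [CLS] self-attention in the last encoder layer
text = ["transfer learning with transformers for text classification"]  # sample input (assumption)
encoded = tokenizer(text, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model.transformer(
        input_ids=encoded["input_ids"].to(device),
        attention_mask=encoded["attention_mask"].to(device),
        output_attentions=True, return_dict=True)
attn = outputs.attentions[-1][0].mean(dim=0)   # last layer, averaged over heads -> (seq_len, seq_len)
cls_attn = attn[0]                             # attention from the [CLS] query to every sub-token
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
for token, score in sorted(zip(tokens, cls_attn.tolist()), key=lambda x: -x[1])[:5]:
    print (f"{token}: {score:.3f}")
```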

## Resources

To cite this lesson, please use:

```bibtex
@article{madewithml,
    title  = "Baselines - Made With ML",
    author = "Goku Mohandas",
    url    = "https://madewithml.com/courses/mlops/baselines/",
    year   = "2021",
}
```