Intuition
Baselines are simple benchmarks which pave the way for iterative development:
Rapid experimentation via hyperparameter tuning thanks to low model complexity.
Discovery of data issues, false assumptions, bugs in code, etc., since the model itself is not complex.
Pareto's principle: we can achieve decent performance with minimal initial effort.
Process
Here is the high level approach to establishing baselines:
Start with the simplest possible baseline to compare subsequent development with. This is often a random (chance) model.
Develop a rule-based approach (when possible) using IFTTT, auxiliary data, etc.
Slowly add complexity by addressing limitations and motivating representations and model architectures.
Weigh tradeoffs (performance, latency, size, etc.) between performant baselines.
Revisit and iterate on baselines as your dataset grows.
We can also iterate on the dataset itself. Instead of fixing the dataset and iterating on models, we can choose a good baseline and iterate on the dataset:
remove or fix data samples (FP, FN)
prepare and transform features
expand or consolidate classes
incorporate auxiliary datasets
identify unique slices to boost
Tradeoffs to consider
When choosing what model architecture(s) to proceed with, what are important tradeoffs to consider? And how can we prioritize them?
Show answer
Prioritization of these tradeoffs depends on your context.
performance: consider both coarse-grained and fine-grained (ex. per-class) performance.
latency: how quickly does your model respond for inference?
size: how large is your model and can you support its storage?
compute: how much will it cost ($, carbon footprint, etc.) to train your model?
interpretability: does your model need to explain its predictions?
bias checks: does your model pass key bias checks?
time to develop: how long do you have to develop the first version?
time to retrain: how long does it take to retrain your model? This is very important to consider if you need to retrain often.
maintenance overhead: who and what will be required to maintain your model versions? The real work with ML begins after deploying v1; you can't just hand the model off to your site reliability team to maintain it the way many teams do with traditional software.
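To make the latency and size tradeoffs concrete, here's a minimal sketch of how we might measure them, assuming a PyTorch model and a representative input batch (both hypothetical placeholders):

import time
import tempfile
from pathlib import Path
import torch

def measure_latency(model, inputs, n_trials=100):
    """Median seconds per forward pass (model and inputs assumed on the same device)."""
    model.eval()
    times = []
    with torch.inference_mode():
        for _ in range(n_trials):
            start = time.perf_counter()
            model(inputs)
            times.append(time.perf_counter() - start)
    return sorted(times)[len(times) // 2]

def measure_size(model):
    """Size of the serialized state dict in MB."""
    with tempfile.NamedTemporaryFile(suffix=".pt") as fp:
        torch.save(model.state_dict(), fp.name)
        return Path(fp.name).stat().st_size / 1e6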
Application
Each application's baseline trajectory varies based on the task and motivations. For our application, we're going to follow this path:
We'll motivate the need for slowly adding complexity from both the representation (ex. embeddings) and architecture (ex. CNNs) views, as well as address the limitations at each step of the way.
If you're unfamiliar with any of the modeling concepts here, be sure to check out the Foundations lessons.
We'll first set up some functions that we'll be using across the different baseline experiments.
def set_seeds(seed=1234):
    """Set seeds for reproducibility."""
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # multi-GPU
def get_data_splits(df, train_size=0.7):
    """Generate balanced data splits."""
    # Get data
    X = df.text.to_numpy()
    y = df.tags

    # Binarize y
    label_encoder = LabelEncoder()
    label_encoder.fit(y)
    y = label_encoder.encode(y)

    # Split
    X_train, X_, y_train, y_ = iterative_train_test_split(X, y, train_size=train_size)
    X_val, X_test, y_val, y_test = iterative_train_test_split(X_, y_, train_size=0.5)

    return X_train, X_val, X_test, y_train, y_val, y_test, label_encoder
We'll define a Trainer object which we will use for training, validation and inference.
class Trainer(object):
    def __init__(self, model, device, loss_fn=None, optimizer=None, scheduler=None):
        # Set params
        self.model = model
        self.device = device
        self.loss_fn = loss_fn
        self.optimizer = optimizer
        self.scheduler = scheduler

    def train_step(self, dataloader):
        """Train step."""
        # Set model to train mode
        self.model.train()
        loss = 0.0

        # Iterate over train batches
        for i, batch in enumerate(dataloader):
            # Step
            batch = [item.to(self.device) for item in batch]  # Set device
            inputs, targets = batch[:-1], batch[-1]
            self.optimizer.zero_grad()  # Reset gradients
            z = self.model(inputs)  # Forward pass
            J = self.loss_fn(z, targets)  # Define loss
            J.backward()  # Backward pass
            self.optimizer.step()  # Update weights

            # Cumulative Metrics
            loss += (J.detach().item() - loss) / (i + 1)

        return loss

    def eval_step(self, dataloader):
        """Validation or test step."""
        # Set model to eval mode
        self.model.eval()
        loss = 0.0
        y_trues, y_probs = [], []

        # Iterate over val batches
        with torch.inference_mode():
            for i, batch in enumerate(dataloader):
                # Step
                batch = [item.to(self.device) for item in batch]  # Set device
                inputs, y_true = batch[:-1], batch[-1]
                z = self.model(inputs)  # Forward pass
                J = self.loss_fn(z, y_true).item()

                # Cumulative Metrics
                loss += (J - loss) / (i + 1)

                # Store outputs
                y_prob = torch.sigmoid(z).cpu().numpy()
                y_probs.extend(y_prob)
                y_trues.extend(y_true.cpu().numpy())

        return loss, np.vstack(y_trues), np.vstack(y_probs)

    def predict_step(self, dataloader):
        """Prediction step."""
        # Set model to eval mode
        self.model.eval()
        y_probs = []

        # Iterate over batches
        with torch.inference_mode():
            for i, batch in enumerate(dataloader):
                # Forward pass w/ inputs
                inputs, targets = batch[:-1], batch[-1]
                z = self.model(inputs)

                # Store outputs
                y_prob = torch.sigmoid(z).cpu().numpy()
                y_probs.extend(y_prob)

        return np.vstack(y_probs)

    def train(self, num_epochs, patience, train_dataloader, val_dataloader):
        best_val_loss = np.inf
        for epoch in range(num_epochs):
            # Steps
            train_loss = self.train_step(dataloader=train_dataloader)
            val_loss, _, _ = self.eval_step(dataloader=val_dataloader)
            self.scheduler.step(val_loss)

            # Early stopping
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_model = self.model
                _patience = patience  # reset _patience
            else:
                _patience -= 1
            if not _patience:  # 0
                print("Stopping early!")
                break

            # Logging
            print(
                f"Epoch: {epoch+1} | "
                f"train_loss: {train_loss:.5f}, "
                f"val_loss: {val_loss:.5f}, "
                f"lr: {self.optimizer.param_groups[0]['lr']:.2E}, "
                f"_patience: {_patience}"
            )
        return best_model
Note
Our dataset is small, so we'll train using the whole dataset, but for larger datasets we should always start by experimenting on a small subset (after shuffling when necessary) so we aren't wasting time and compute. Here's how we can easily do this:
# Shuffling since projects are chronologically organized
if shuffle:
    df = df.sample(frac=1).reset_index(drop=True)

# Subset
if num_samples:
    df = df[:num_samples]
Random
motivation: We want to know what random (chance) performance looks like. All of our subsequent baselines should perform better than this.
# Set seeds
set_seeds()
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")
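To produce the chance-level metrics shown below, we generate uniform random predictions over the label space and evaluate them on the test split. A minimal sketch, assuming scikit-learn's precision_recall_fscore_support (the exact evaluation code may differ slightly):

import json
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Generate random (chance) predictions: each tag is on/off with equal probability
y_pred = np.random.randint(low=0, high=2, size=(len(y_test), len(label_encoder.classes)))

# Evaluate
metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
performance = {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
print(json.dumps(performance, indent=2))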
{
"precision": 0.12590604458654545,
"recall": 0.5203426124197003, ← as expected to be ~0.5
"f1": 0.18469743862395557
}
We made the assumption that there is an equal probability for whether an input has a tag or not but this isn't true. Let's use the train split to figure out what the true probability is.
# Percentage of 1s (tag presence)
tag_p = np.sum(np.sum(y_train)) / (len(y_train) * len(label_encoder.classes))
print(tag_p)
0.06291428571428571
# Generate weighted random predictions
y_pred = np.random.choice(
    np.arange(0, 2),
    size=(len(y_test), len(label_encoder.classes)),
    p=[1 - tag_p, tag_p])
limitations: we didn't use any of the signals from our inputs to affect our predictions, so nothing was learned.
Rule-based
motivation: we want to use signals in our inputs (along with domain expertise and auxiliary data) to determine the labels.
# Set seeds
set_seeds()
Unstemmed
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
# Restrict to relevant tags
print(len(tags_dict))
tags_dict = {tag: tags_dict[tag] for tag in label_encoder.classes}
print(len(tags_dict))
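The aliases map (token/phrase → tag) used below isn't shown in this excerpt. A minimal sketch of how it could be built, under the assumption that each entry in tags_dict carries an "aliases" list (an assumption about its schema):

# Map each preprocessed alias to its canonical tag
aliases = {}
for tag, info in tags_dict.items():
    for alias in info.get("aliases", []):
        aliases[preprocess(alias)] = tag
print(len(aliases))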
def get_classes(text, aliases, tags_dict):
    """If a token matches an alias, then add the corresponding tag
    class (and parent tags if any)."""
    classes = []
    for alias, tag in aliases.items():
        if alias in text:
            # Add tag
            classes.append(tag)
            # Add parent tags
            for parent in tags_dict[tag]["parents"]:
                classes.append(parent)
    return list(set(classes))
# Sample
text = "This project extends gans for data augmentation specifically for object detection tasks."
get_classes(text=preprocess(text), aliases=aliases, tags_dict=tags_dict)
We're looking for exact matches with the aliases which isn't always perfect, for example:
print(aliases[preprocess("gan")])
# print(aliases[preprocess("gans")])  # this won't find any match
print(aliases[preprocess("generative adversarial networks")])
# print(aliases[preprocess("generative adversarial network")])  # this won't find any match
So let's now stem our aliases as well as the tokens in our input text and then look for matches.
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
We'll write proper tests for all of these functions when we move our code to Python scripts.
# Sample
text = "This project extends gans for data augmentation specifically for object detection tasks."
get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)
We can look at overall and per-class performance on our test set.
When considering overall and per-class performance across different models, we should be aware of Simpson's paradox where a model can perform better on every class subset but not overall.
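A sketch of what that evaluation could look like, assuming y_pred holds the encoded rule-based predictions for the test split (names here are illustrative):

from sklearn.metrics import precision_recall_fscore_support

# Overall (weighted across classes)
overall = precision_recall_fscore_support(y_test, y_pred, average="weighted")
print({"precision": overall[0], "recall": overall[1], "f1": overall[2]})

# Per-class
per_class = precision_recall_fscore_support(y_test, y_pred, average=None)
for i, class_ in enumerate(label_encoder.classes):
    print(f"{class_}: precision={per_class[0][i]:.2f}, "
          f"recall={per_class[1][i]:.2f}, f1={per_class[2][i]:.2f}")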
We achieved very high precision at the expense of low recall. Why?
Show answer
Rule-based approaches can yield labels with high certainty when there is an absolute condition match (high precision) but it fails to generalize or learn implicit patterns to capture the rest of the cases (low recall).
Inference
# Infer
text = "Transfer learning with transformers for self-supervised learning"
print(preprocess(text, stem=True))
get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)
transfer learn transform self supervis learn
['self-supervised-learning',
'transfer-learning',
'transformers',
'natural-language-processing']
Now let's see what happens when we replace the word transformers with BERT. Sure we can add this as an alias but doing these kinds of ad-hoc updates can quickly add overhead. This is where it makes sense to learn from the data as opposed to creating explicit rules.
# Infer
text = "Transfer learning with BERT for self-supervised learning"
print(preprocess(text, stem=True))
get_classes(text=preprocess(text, stem=True), aliases=aliases, tags_dict=tags_dict)
transfer learn bert self supervis learn
['self-supervised-learning', 'transfer-learning']
limitations: we failed to generalize or learn any implicit patterns to predict the labels because we treat the tokens in our input as isolated entities.
Ideally we'd spend more time tuning this baseline since it's so simple and quick to train. The same advice applies to all the other models we'll look at as well.
Simple ML
motivation:
representation: use term frequency-inverse document frequency (TF-IDF) to capture the significance of a token to a particular input with respect to all the inputs, as opposed to treating the words in our input text as isolated tokens.
architecture: we want our model to meaningfully extract the encoded signal to predict the output labels.
So far we've treated the words in our input text as isolated tokens and we haven't really captured any meaning between tokens. Let's use term frequency–inverse document frequency (TF-IDF) to capture the significance of a token to a particular input with respect to all the inputs.
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
def fit_and_evaluate(model):
    """Fit and evaluate each model."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    metrics = precision_recall_fscore_support(y_test, y_pred, average="weighted")
    return {"precision": metrics[0], "recall": metrics[1], "f1": metrics[2]}
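fit_and_evaluate expects a model that can consume the raw preprocessed text directly. One candidate (a sketch of the idea, not necessarily the exact model behind the reported results) is a scikit-learn Pipeline that chains a TF-IDF vectorizer with a one-vs-rest classifier for the multi-label targets:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# TF-IDF representation + one independent binary classifier per tag
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
performance = fit_and_evaluate(model)
print(performance)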
representation: TF-IDF representations don't encapsulate much signal beyond frequency but we require more fine-grained token representations.
architecture: we want to develop models that can use better represented encodings in a more contextual manner.
Distributed training
All the training we need to do for our application happens on one worker with one accelerator (GPU); however, we'll want to consider distributed training for very large models or when dealing with large datasets. Distributed training can involve:
data parallelism: workers receive different slices of the larger dataset.
synchronous training uses AllReduce to aggregate gradients and update all the workers' weights at the end of each batch.
asynchronous training uses a universal parameter server to update weights as each worker trains on its slice of the data.
model parallelism: all workers use the same dataset but the model is split amongst them (more difficult to implement than data parallelism because it's difficult to isolate and combine signal from backpropagation).
There are many options for distributed training, such as PyTorch's distributed package, Ray, Horovod, etc.
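As a rough illustration of the data-parallel setup, here's a minimal sketch using PyTorch's DistributedDataParallel (the function name, batch size, and launcher assumptions are illustrative; process ranks would typically come from torchrun):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_ddp(local_rank, model, dataset, batch_size=64):
    # One process per GPU; rank/world size are read from the launcher's environment
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    model = DDP(model.to(local_rank), device_ids=[local_rank])

    # Each worker receives a different shard of the dataset
    sampler = DistributedSampler(dataset)
    dataloader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    # ... training loop: gradients are AllReduced across workers during backward()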
Optimization
Distributed training strategies are great for when our data or models are too large for training but what about when our models are too large to deploy? The following model compression techniques are commonly used to make large models fit within existing infrastructure:
Pruning: remove weights (unstructured) or entire channels (structured) to reduce the size of the network. The objective is to preserve the model's performance while increasing its sparsity.
Quantization: reduce the memory footprint of the weights by reducing their precision (ex. 32 bit to 8 bit). We may lose some precision, but it shouldn't affect performance too much.
Distillation: train smaller networks to "mimic" larger networks by having them reproduce the larger network's layers' outputs.
Distilling the knowledge in a neural network [source]
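For a sense of what the first two techniques look like in code, here's a minimal sketch of unstructured pruning and dynamic quantization with PyTorch utilities (the target model, modules, and amounts are illustrative assumptions):

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Pruning: zero out the 30% smallest-magnitude weights in each linear layer
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Quantization: store linear-layer weights as int8 (dynamic quantization)
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)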
CNN w/ Embeddings
motivation:
representation: we want to have more robust (split tokens to characters) and meaningful embedding-based representations for our input tokens.
architecture: we want to process our encoded inputs using convolution (CNN) filters that can learn to analyze windows of embedded tokens to extract meaningful signal.
Set up
We'll set up the task by setting seeds for reproducibility, creating our data splits and setting the device.
# Get data splits
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True)
X_train, X_val, X_test, y_train, y_val, y_test, label_encoder = get_data_splits(preprocessed_df)
X_test_raw = X_test  # use for later
# Set device
cuda = True
device = torch.device("cuda" if (torch.cuda.is_available() and cuda) else "cpu")
torch.set_default_tensor_type("torch.FloatTensor")
if device.type == "cuda":
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
print(device)
cuda
Tokenizer
We're going to tokenize our input text as character tokens so we can be robust to spelling errors and learn to generalize across tags (ex. learning that RoBERTa, or any other future BERT-based architecture, warrants the same tag as BERT).
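The tokenizer itself isn't shown in this excerpt. Here's a minimal, self-contained sketch of a character-level tokenizer (a hypothetical class, not the exact one used in the lessons):

class CharTokenizer:
    """Map characters to integer indices (index 0 reserved for padding)."""
    def __init__(self):
        self.char_to_index = {}

    def fit_on_texts(self, texts):
        for text in texts:
            for char in text:
                if char not in self.char_to_index:
                    self.char_to_index[char] = len(self.char_to_index) + 1
        return self

    def texts_to_sequences(self, texts):
        # Unknown characters fall back to the padding index in this simplified sketch
        return [[self.char_to_index.get(char, 0) for char in text] for text in texts]

    def sequences_to_texts(self, sequences):
        index_to_char = {i: c for c, i in self.char_to_index.items()}
        return ["".join(index_to_char.get(i, "") for i in sequence) for sequence in sequences]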
# Convert texts to sequences of indices
X_train = np.array(tokenizer.texts_to_sequences(X_train))
X_val = np.array(tokenizer.texts_to_sequences(X_val))
X_test = np.array(tokenizer.texts_to_sequences(X_test))
preprocessed_text = tokenizer.sequences_to_texts([X_train[0]])[0]
print("Text to indices:\n"
      f"  (preprocessed) -> {preprocessed_text}\n"
      f"  (tokenized) -> {X_train[0]}")
We'll factor class weights into our objective function (binary cross entropy with logits) to help with class imbalance. There are many other techniques, such as oversampling from underrepresented classes, undersampling from overrepresented ones, etc., but we'll cover these in a separate lesson on data imbalance.
# Class weights
counts = np.bincount([label_encoder.class_to_index[class_] for class_ in all_tags])
class_weights = {i: 1.0 / count for i, count in enumerate(counts)}
print(f"class counts: {counts},\nclass weights: {class_weights}")
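One way to fold these weights into the objective (a sketch; the exact loss configuration may differ) is via the weight argument of nn.BCEWithLogitsLoss, which rescales each class's contribution to the loss:

import torch
import torch.nn as nn

# Per-class rescaling weights (broadcast across the batch dimension)
class_weights_tensor = torch.Tensor(list(class_weights.values()))
loss_fn = nn.BCEWithLogitsLoss(weight=class_weights_tensor)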
We're going to place our data into a Dataset and use a DataLoader to efficiently create batches for training and evaluation.
def pad_sequences(sequences, max_seq_len=0):
    """Pad sequences to max length in sequence."""
    max_seq_len = max(max_seq_len, max(len(sequence) for sequence in sequences))
    padded_sequences = np.zeros((len(sequences), max_seq_len))
    for i, sequence in enumerate(sequences):
        padded_sequences[i][:len(sequence)] = sequence
    return padded_sequences
class CNNTextDataset(torch.utils.data.Dataset):
    def __init__(self, X, y, max_filter_size):
        self.X = X
        self.y = y
        self.max_filter_size = max_filter_size

    def __len__(self):
        return len(self.y)

    def __str__(self):
        return f"<Dataset(N={len(self)})>"

    def __getitem__(self, index):
        X = self.X[index]
        y = self.y[index]
        return [X, y]

    def collate_fn(self, batch):
        """Processing on a batch."""
        # Get inputs
        batch = np.array(batch)
        X = batch[:, 0]
        y = batch[:, 1]

        # Pad inputs
        X = pad_sequences(sequences=X, max_seq_len=self.max_filter_size)

        # Cast
        X = torch.LongTensor(X.astype(np.int32))
        y = torch.FloatTensor(y.astype(np.int32))

        return X, y

    def create_dataloader(self, batch_size, shuffle=False, drop_last=False):
        return torch.utils.data.DataLoader(
            dataset=self, batch_size=batch_size, collate_fn=self.collate_fn,
            shuffle=shuffle, drop_last=drop_last, pin_memory=True)
# Create datasets
filter_sizes = list(range(1, 11))
train_dataset = CNNTextDataset(X=X_train, y=y_train, max_filter_size=max(filter_sizes))
val_dataset = CNNTextDataset(X=X_val, y=y_val, max_filter_size=max(filter_sizes))
test_dataset = CNNTextDataset(X=X_test, y=y_test, max_filter_size=max(filter_sizes))
print("Data splits:\n"
      f"  Train dataset: {train_dataset.__str__()}\n"
      f"  Val dataset: {val_dataset.__str__()}\n"
      f"  Test dataset: {test_dataset.__str__()}\n"
      "Sample point:\n"
      f"  X: {train_dataset[0][0]}\n"
      f"  y: {train_dataset[0][1]}")
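With the datasets in place, we can create dataloaders using the create_dataloader method defined above (the batch size here is an assumption):

# Create dataloaders
batch_size = 64
train_dataloader = train_dataset.create_dataloader(batch_size=batch_size, shuffle=True)
val_dataloader = val_dataset.create_dataloader(batch_size=batch_size)
test_dataloader = test_dataset.create_dataloader(batch_size=batch_size)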
We'll be using a convolutional neural network on top of our embedded tokens to extract meaningful spatial signal. This time, we'll be using many filter widths to act as n-gram feature extractors. If you're not familiar with CNNs, be sure to check out the CNN lesson where we walk through every component of the architecture.
Let's visualize the model's forward pass.
We'll first tokenize our inputs (batch_size, max_seq_len).
Then we'll embed our tokenized inputs (batch_size, max_seq_len, embedding_dim).
We'll apply convolution via filters (filter_size, embedding_dim, num_filters) followed by batch normalization. Our filters act as character-level n-gram detectors: we use a range of filter sizes, where a filter of size 2 acts as a bi-gram feature extractor, size 3 a tri-gram, and so on.
We'll apply 1D global max pooling which will extract the most relevant information from the feature maps for making the decision.
We feed the pool outputs to a fully-connected (FC) layer (with dropout).
We use one more FC layer to produce the per-class logits; a sigmoid is applied at evaluation/inference to derive class probabilities (this is a multi-label task).
VALID: no padding, the filters only use the "valid" values in the input. If the filter cannot reach all the input values (filters go left to right), the extra values on the right are dropped.
SAME: adds padding evenly to the right (preferred) and left sides of the input so that all values in the input are processed.
We add SAME padding so that the convolutional outputs have the same width as our inputs. The amount of padding can be determined from the convolution output-width equation: we want the output width to equal the input width \(W\), so we solve for \(P\):
\[ \frac{W-F+2P}{S} + 1 = W \]
\[ P = \frac{S(W-1) - W + F}{2} \]
If \(P\) is not a whole number, we round up (using math.ceil) and place the extra padding on the right side.
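Putting the forward pass described above into code, here's a minimal sketch of the model (layer sizes, the shared hyperparameter names, and the SAME-padding computation are illustrative; the exact architecture may differ):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNN(nn.Module):
    def __init__(self, embedding_dim, vocab_size, num_filters,
                 filter_sizes, hidden_dim, dropout_p, num_classes):
        super().__init__()
        self.filter_sizes = filter_sizes
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.conv = nn.ModuleList(
            [nn.Conv1d(embedding_dim, num_filters, f) for f in filter_sizes])
        self.batch_norm = nn.ModuleList(
            [nn.BatchNorm1d(num_filters) for _ in filter_sizes])
        self.dropout = nn.Dropout(dropout_p)
        self.fc1 = nn.Linear(num_filters * len(filter_sizes), hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, inputs):
        x_in, = inputs  # (batch_size, max_seq_len)
        x_in = self.embeddings(x_in).transpose(1, 2)  # (N, embedding_dim, max_seq_len)
        seq_len = x_in.shape[2]

        z = []
        for i, f in enumerate(self.filter_sizes):
            # SAME padding (stride=1): P = (S(W-1) - W + F) / 2, extra padding on the right
            p = (seq_len - 1) - seq_len + f
            padding = (p // 2, math.ceil(p / 2))
            _z = self.conv[i](F.pad(x_in, padding))  # (N, num_filters, seq_len)
            _z = self.batch_norm[i](_z)
            _z = F.max_pool1d(_z, _z.shape[2]).squeeze(2)  # global max pool -> (N, num_filters)
            z.append(_z)

        # Concatenate pooled outputs, then FC layers
        z = self.dropout(torch.cat(z, dim=1))
        z = F.relu(self.fc1(z))
        return self.fc2(z)  # logits; sigmoid is applied at evaluation/inference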
# Determining the best threshold
def find_best_threshold(y_true, y_prob):
    """Find the best threshold for maximum F1."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
    f1s = (2 * precisions * recalls) / (precisions + recalls)
    return thresholds[np.argmax(f1s)]
# Best threshold for f1
threshold = find_best_threshold(y_true.ravel(), y_prob.ravel())
threshold
0.23890994
How can we do better?
How can we improve on our process of identifying and using the appropriate threshold?
Show answer
Plot PR curves for all classes (not just overall) to ensure a given global threshold doesn't deliver very poor performance for any particular class.
Determine different thresholds for different classes and use them during inference
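A minimal sketch of the second idea, reusing find_best_threshold column by column (assuming y_true and y_prob from the evaluation step above):

# Determine a separate threshold per class
class_thresholds = {}
for i, class_ in enumerate(label_encoder.classes):
    class_thresholds[class_] = find_best_threshold(y_true[:, i], y_prob[:, i])

# Apply per-class thresholds during inference
thresholds = np.array([class_thresholds[class_] for class_ in label_encoder.classes])
y_pred = (y_prob >= thresholds).astype(int)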
# Determine predictions using threshold
test_loss, y_true, y_prob = trainer.eval_step(dataloader=test_dataloader)
y_pred = np.array([np.where(prob >= threshold, 1, 0) for prob in y_prob])
# Save artifacts
dir = Path("cnn")
dir.mkdir(parents=True, exist_ok=True)
tokenizer.save(fp=Path(dir, "tokenizer.json"))
label_encoder.save(fp=Path(dir, "label_encoder.json"))
torch.save(best_model.state_dict(), Path(dir, "model.pt"))
with open(Path(dir, "performance.json"), "w") as fp:
    json.dump(performance, indent=2, sort_keys=False, fp=fp)
# Dataloader
text = "Transfer learning with BERT for self-supervised learning"
X = np.array(tokenizer.texts_to_sequences([preprocess(text)]))
y_filler = label_encoder.encode([np.array([label_encoder.classes[0]] * len(X))])
dataset = CNNTextDataset(X=X, y=y_filler, max_filter_size=max(filter_sizes))
dataloader = dataset.create_dataloader(batch_size=batch_size)
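From here, a sketch of generating the predicted tags for this sample, assuming the trained trainer and the threshold determined above:

# Inference
y_prob = trainer.predict_step(dataloader)
y_pred = (y_prob >= threshold).astype(int)
tags = [label_encoder.classes[i] for i in np.where(y_pred[0] == 1)[0]]
print(tags)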
architecture: extracting signal from encoded inputs is limited by filter widths.
Since we're dealing with simple architectures and fast training times, it's a good opportunity to explore tuning and experiment with k-fold cross validation to properly reach any conclusions about performance.
Tradeoffs
We're going to go with the embeddings via CNN approach and optimize it because performance is quite similar to the contextualized embeddings via transformers approach but at much lower cost.
# Performance
with open(Path("cnn", "performance.json"), "r") as fp:
    cnn_performance = json.load(fp)
print(f'CNN: f1 = {cnn_performance["f1"]}')
CNN: f1 = 0.6612833723992106
This was just one run on one split, so you'll want to experiment with k-fold cross validation before reaching any conclusions about performance. Also take the time to tune these baselines since their training runs are quite fast (we can achieve an f1 of ~0.7 with just a bit of tuning for both the CNN and Transformer). We'll cover optimization in a few lessons so you can replicate the process here on your own time. We should also benchmark on other important metrics as we iterate, not just precision and recall.
We'll consider other tradeoffs such as maintenance overhead, behavioral test performances, etc. as we develop.
Interpretability was not one of our requirements, but note that we could've tweaked model outputs to deliver it. For example, since we used SAME padding for our CNN, we can use the activation scores to extract influential n-grams.