Appropriately splitting our dataset (multi-label) for training, validation and testing.
Intuition
Why do we need it?
To determine the efficacy of our models, we need to have an unbiased measuring approach. To do this, we split our dataset into training, validation, and testing data splits. Here is the process:
Use the training split to train the model.
Here the model will have access to both inputs and outputs to optimize its internal weights.
After each loop (epoch) of the training split, we will use the validation split to determine model performance.
Here the model will not use the outputs to optimize its weights but instead, we will use the performance to optimize training hyperparameters such as the learning rate, etc.
After training stops (epoch(s)), we will use the testing split to perform a one-time assessment of the model.
This is our best measure of how the model may behave on new, unseen data. Note that training stops when the performance improvement is not significant or any other stopping criteria that we may have specified.
How can we do it?
We need to ensure that our data is properly split so we can trust our evaluations. A few criteria are:
the dataset (and each data split) should be representative of data we will encounter
equal distributions of output values across all splits
shuffle your data if it's organized in a way that prevents input variance
avoid random shuffles if your task can suffer from data leaks (ex. time-series)
Note
You need to clean your data first before splitting, at least for the features that splitting depends on. So the process is more like: preprocessing (global, cleaning) → splitting → preprocessing (local, transformations).
Application
Label encoding
Before we split our dataset, we're going to encode our output labels by assigning each tag a unique index.
```python
import numpy as np
import random
```
```python
# Set seeds for reproducibility
seed = 42
np.random.seed(seed)
random.seed(seed)
```
We need to shuffle our data since the latest projects are at the front of the dataset and certain tags are trending now compared to a year ago. If we don't shuffle before creating our data splits, then our model will only be trained on the earlier tags and will perform poorly on the others.
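A minimal sketch of the shuffle, assuming the projects live in a pandas DataFrame named `df` (a hypothetical name here):

```python
import pandas as pd

# Shuffle all rows of the (hypothetical) projects DataFrame in a reproducible way
df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
```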
```python
import itertools
import json

class LabelEncoder(object):
    """Label encoder for tag labels."""
    def __init__(self, class_to_index={}):
        self.class_to_index = class_to_index
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())

    def __len__(self):
        return len(self.class_to_index)

    def __str__(self):
        return f"<LabelEncoder(num_classes={len(self)})>"

    def fit(self, y):
        classes = np.unique(list(itertools.chain.from_iterable(y)))
        for i, class_ in enumerate(classes):
            self.class_to_index[class_] = i
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())
        return self

    def encode(self, y):
        y_one_hot = np.zeros((len(y), len(self.class_to_index)), dtype=int)
        for i, item in enumerate(y):
            for class_ in item:
                y_one_hot[i][self.class_to_index[class_]] = 1
        return y_one_hot

    def decode(self, y):
        classes = []
        for i, item in enumerate(y):
            indices = np.where(item == 1)[0]
            classes.append([self.index_to_class[index] for index in indices])
        return classes

    def save(self, fp):
        with open(fp, 'w') as fp:
            contents = {'class_to_index': self.class_to_index}
            json.dump(contents, fp, indent=4, sort_keys=False)

    @classmethod
    def load(cls, fp):
        with open(fp, 'r') as fp:
            kwargs = json.load(fp=fp)
        return cls(**kwargs)
```
Since we're dealing with multilabel classification, we're going to convert our label indices into one-hot representation where each input's set of labels is represented by a binary array.
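As a quick sanity check, here's a hypothetical usage sketch (the tag lists below are made up for illustration):

```python
# Hypothetical tag lists, just to illustrate fit/encode/decode
tags = [["computer-vision", "cnn"], ["nlp"], ["nlp", "transformers"]]
label_encoder = LabelEncoder()
label_encoder.fit(tags)
print(label_encoder.classes)    # sorted unique tags
y = label_encoder.encode(tags)  # (num_inputs, num_classes) binary array
print(label_encoder.decode(y))  # back to lists of tags per input
```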
For traditional multi-class tasks (each input has one label), we want to ensure that each data split has similar class distributions. However, our task is multi-label classification (an input can have many labels) which complicates the stratification process.
First, we'll naively split our dataset randomly and show the large deviations between the (adjusted) class distributions across the splits. We'll use scikit-learn's train_test_split function to do the splits.
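A sketch of that naive random split, assuming `X` holds the inputs, `y` the one-hot encoded labels, and a 70/15/15 split (the exact proportions are an assumption):

```python
from sklearn.model_selection import train_test_split

# Assumed split proportions
train_size, val_size, test_size = 0.7, 0.15, 0.15

# Randomly split off the training set, then split the remainder evenly into val/test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=train_size, random_state=seed)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, train_size=0.5, random_state=seed)
```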
```python
from collections import Counter
from skmultilearn.model_selection.measures import get_combination_wise_output_matrix

# Get counts for each class
counts = {}
counts['train_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_train, order=1) for combination in row)
counts['val_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_val, order=1) for combination in row)
counts['test_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_test, order=1) for combination in row)
```
It's hard to compare these because our train and test proportions are different. Let's see what the distribution looks like once we balance it out. What do we need to multiply our test ratio by so that we have the same amount as our train ratio?
\[ \alpha * N_{test} = N_{train} \]
\[ \alpha = \frac{N_{train}}{N_{test}} \]
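For example, with an assumed 70/15/15 split, α = 0.70 / 0.15 ≈ 4.67, so each validation and test count gets multiplied by roughly 4.67 before being compared against the train counts.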
```python
# Adjust counts across splits
for k in counts['val_counts'].keys():
    counts['val_counts'][k] = int(counts['val_counts'][k] * (train_size / val_size))
for k in counts['test_counts'].keys():
    counts['test_counts'][k] = int(counts['test_counts'][k] * (train_size / test_size))
```
We can see how much deviation there is in our naive data splits by computing the standard deviation of each split's class counts from the mean (the ideal split).
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}} \]
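The snippet below operates on a DataFrame, `dist_df`, with one row per split and one column per class; its construction isn't shown in this excerpt, but a minimal sketch (pandas assumed) might look like this:

```python
import pandas as pd

# Assumed construction: rows = splits, columns = classes (missing classes filled with 0)
dist_df = pd.DataFrame({
    "train": counts['train_counts'],
    "val": counts['val_counts'],
    "test": counts['test_counts'],
}).T.fillna(0)
```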
```python
# Standard deviation
np.mean(np.std(dist_df.to_numpy(), axis=0))
```
9.936725114942407
Note
For simple multiclass classification, you can specify how to stratify the split by adding the stratify keyword argument. But our task is multilabel classification, so we'll need to use other techniques to create even splits.
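For reference, a minimal sketch of the single-label case (`y_single` is a hypothetical array with one class index per input):

```python
# Single-label case: class proportions are preserved across splits via the stratify argument
X_train, X_test, y_train, y_test = train_test_split(
    X, y_single, train_size=0.7, stratify=y_single, random_state=seed)
```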
Stratified split
Now we'll apply iterative stratification via the skmultilearn library, which essentially splits each input into subsets (where each label is considered individually) and then distributes the samples, starting with the label that has the fewest "positive" samples and working up to the labels that have the most.
```python
from skmultilearn.model_selection import IterativeStratification

def iterative_train_test_split(X, y, train_size):
    """Custom iterative train test split which
    'maintains balanced representation with respect
    to order-th label combinations.'
    """
    stratifier = IterativeStratification(
        n_splits=2, order=1,
        sample_distribution_per_fold=[1.0 - train_size, train_size])
    train_indices, test_indices = next(stratifier.split(X, y))
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    return X_train, X_test, y_train, y_test
```
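Assuming `X` and `y` are NumPy arrays (so they can be indexed by the returned indices) and reusing the assumed split proportions from earlier, the helper might be applied twice to carve out all three splits:

```python
# First carve out the training set, then split the remainder 50/50 into val/test
X_train, X_rest, y_train, y_rest = iterative_train_test_split(X, y, train_size=train_size)
X_val, X_test, y_val, y_test = iterative_train_test_split(X_rest, y_rest, train_size=0.5)
```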
Let's see what the adjusted counts look like for these stratified data splits.
```python
# Get counts for each class
counts = {}
counts['train_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_train, order=1) for combination in row)
counts['val_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_val, order=1) for combination in row)
counts['test_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_test, order=1) for combination in row)
```
```python
# Adjust counts across splits
for k in counts['val_counts'].keys():
    counts['val_counts'][k] = int(counts['val_counts'][k] * (train_size / val_size))
for k in counts['test_counts'].keys():
    counts['test_counts'][k] = int(counts['test_counts'][k] * (train_size / test_size))
```
```python
# Standard deviation
np.mean(np.std(dist_df.to_numpy(), axis=0))
```
3.142338654518357
The standard deviation is much lower but not 0 (perfect splits) because, keep in mind, an input can have any combination of classes yet each input can only belong to one of the data splits.
Note
Iterative stratification essentially creates splits while "trying to maintain balanced representation with respect to order-th label combinations". We used order=1 for our iterative split, which means we cared about providing a representative distribution of each individual tag across the splits. But we can account for higher-order label relationships as well, where we may care about the distribution of label combinations.
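As a sketch, accounting for label pairs would just mean passing a higher order to the stratifier inside the helper above (order=2 here is an assumption, not part of the lesson):

```python
# Hypothetical variant: stratify with respect to pairs of labels instead of individual labels
stratifier = IterativeStratification(
    n_splits=2, order=2,
    sample_distribution_per_fold=[1.0 - train_size, train_size])
```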