Intuition
To determine the efficacy of our models, we need to have an unbiased measuring approach. To do this, we split our dataset into training, validation, and testing data splits.
Use the training split to train the model.
Here the model will have access to both inputs and outputs to optimize its internal weights.
After each loop (epoch) of the training split, we will use the validation split to determine model performance.
Here the model will not use the outputs to optimize its weights; instead, we'll use the validation performance to optimize training hyperparameters such as the learning rate.
After training stops (after some number of epochs), we will use the testing split to perform a one-time assessment of the model.
This is our best measure of how the model may behave on new, unseen data. Note that training stops when the performance improvement is no longer significant or when any other stopping criteria we've specified are met.
Creating proper data splits
What are the criteria we should focus on to ensure proper data splits?
the dataset (and each data split) should be representative of data we will encounter
equal distributions of output values across all splits
shuffle your data if it's organized in a way that prevents input variance
avoid random shuffles if your task can suffer from data leaks (ex. time-series)
We need to clean our data before splitting, at least for the features that splitting depends on. So the process is more like: preprocessing (global, cleaning) → splitting → preprocessing (local, transformations).
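To make that ordering concrete, here's a minimal sketch (clean, split, and transform are hypothetical placeholder helpers, not functions defined in this lesson):

df = clean(df)                           # global preprocessing (e.g. drop rows with missing tags)
train_df, val_df, test_df = split(df)    # create the data splits
train_df, val_df, test_df = transform(train_df, val_df, test_df)  # local preprocessing (fit on train only)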
Label encoding
Before we split our dataset, we're going to encode our output labels by assigning each tag a unique index.
import numpy as np
import random
# Set seeds for reproducibility
seed = 42
np.random.seed(seed)
random.seed(seed)
We need to shuffle our data since the most recent projects appear first and certain tags are trending now compared to a year ago. If we create our data splits without shuffling, our model will only be trained on the earlier tags and perform poorly on the others.
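For example, a minimal shuffle sketch, assuming the projects are loaded into a pandas DataFrame named df (a name not defined in this section):

# Shuffle the rows so the splits aren't biased by chronological order
df = df.sample(frac=1, random_state=seed).reset_index(drop=True)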
import itertools
import json

class LabelEncoder(object):
    """Label encoder for tag labels."""
    def __init__(self, class_to_index={}):
        self.class_to_index = class_to_index
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())

    def __len__(self):
        return len(self.class_to_index)

    def __str__(self):
        return f"<LabelEncoder(num_classes={len(self)})>"

    def fit(self, y):
        classes = np.unique(list(itertools.chain.from_iterable(y)))
        for i, class_ in enumerate(classes):
            self.class_to_index[class_] = i
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())
        return self

    def encode(self, y):
        y_one_hot = np.zeros((len(y), len(self.class_to_index)), dtype=int)
        for i, item in enumerate(y):
            for class_ in item:
                y_one_hot[i][self.class_to_index[class_]] = 1
        return y_one_hot

    def decode(self, y):
        classes = []
        for i, item in enumerate(y):
            indices = np.where(item == 1)[0]
            classes.append([self.index_to_class[index] for index in indices])
        return classes

    def save(self, fp):
        with open(fp, "w") as fp:
            contents = {"class_to_index": self.class_to_index}
            json.dump(contents, fp, indent=4, sort_keys=False)

    @classmethod
    def load(cls, fp):
        with open(fp, "r") as fp:
            kwargs = json.load(fp=fp)
            return cls(**kwargs)
If you're not familiar with the @classmethod decorator, learn more about it from our Python lesson.
Since we're dealing with multilabel classification, we're going to convert our label indices into one-hot representation where each input's set of labels is represented by a binary array.
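For instance, a rough usage sketch of the encoder (the tag names below are made up for illustration):

# Fit on lists of tags, then encode/decode multi-hot arrays
label_encoder = LabelEncoder()
label_encoder.fit([["computer-vision", "pytorch"], ["nlp"]])
label_encoder.encode([["computer-vision", "nlp"]])   # array([[1, 1, 0]])
label_encoder.decode(np.array([[1, 1, 0]]))          # [["computer-vision", "nlp"]]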
For traditional multi-class tasks (each input has one label), we want to ensure that each data split has similar class distributions. However, our task is multi-label classification (an input can have many labels) which complicates the stratification process.
First, we'll naively split our dataset randomly and show the large deviations between the (adjusted) class distributions across the splits. We'll use scikit-learn's train_test_split function to do the splits.
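As a rough sketch of those naive splits (the 70/15/15 ratios and the X, y variables here are assumptions for illustration):

from sklearn.model_selection import train_test_split

# Naive random splits: 70% train, then split the remainder evenly into val/test
X_train, X_, y_train, y_ = train_test_split(X, y, train_size=0.7, random_state=seed)
X_val, X_test, y_val, y_test = train_test_split(X_, y_, train_size=0.5, random_state=seed)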
# Get counts for each class
counts = {}
counts["train_counts"] = Counter(
    str(combination)
    for row in get_combination_wise_output_matrix(y_train, order=1)
    for combination in row)
counts["val_counts"] = Counter(
    str(combination)
    for row in get_combination_wise_output_matrix(y_val, order=1)
    for combination in row)
counts["test_counts"] = Counter(
    str(combination)
    for row in get_combination_wise_output_matrix(y_test, order=1)
    for combination in row)
It's hard to compare these because our train and test proportions are different. Let's see what the distribution looks like once we balance it out. What do we need to multiply our test ratio by so that we have the same amount as our train ratio?
\[ \alpha * N_{test} = N_{train} \]
\[ \alpha = \frac{N_{train}}{N_{test}} \]
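For example, assuming a 70/15/15 train/val/test split (these exact ratios are an assumption):
\[ \alpha = \frac{0.70}{0.15} \approx 4.67 \]
so every val and test count gets scaled up by roughly 4.67x before we compare it to the train counts.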
# Adjust counts across splits
for k in counts["val_counts"].keys():
    counts["val_counts"][k] = int(counts["val_counts"][k] * (train_size / val_size))
for k in counts["test_counts"].keys():
    counts["test_counts"][k] = int(counts["test_counts"][k] * (train_size / test_size))
Adjusted counts per class across the splits:

class   train  val   test
(15,)   314    270   266
(19,)   37     37    28
(33,)   26     18    42
(21,)   26     9     18
(32,)   145    135   102
(14,)   33     37    46
(2,)    274    247   284
(20,)   191    154   158
(24,)   41     32    42
(5,)    55     42    51
(16,)   26     18    14
(9,)    56     51    51
(22,)   34     42    28
(12,)   41     46    18
(23,)   40     51    37
(0,)    90     93    46
(25,)   44     32    42
(34,)   28     9     42
(28,)   136    196   163
(10,)   44     60    32
(26,)   30     56    28
(27,)   31     18    23
(13,)   63     65    74
(17,)   45     65    46
(3,)    64     79    116
(1,)    27     14    51
(7,)    50     32    74
(11,)   34     28    51
(18,)   21     14    28
(4,)    33     14    23
(6,)    24     23    23
(29,)   24     32    42
(8,)    32     46    42
(31,)   33     32    28
(30,)   23     18    32
We can see how much deviation there is in our naive data splits by computing the standard deviation of each split's class counts from the mean (the ideal split).
\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}} \]
# Standard deviation
np.mean(np.std(dist_df.to_numpy(), axis=0))
9.936725114942407
For simple multiclass classification, you can specify how to stratify the split by adding the stratify keyword argument. But our task is multilabel classification, so we'll need to use other techniques to create even splits.
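For reference, a hedged sketch of what that looks like in the single-label (multiclass) case:

# Only valid when each input has exactly one label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=seed)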
Stratified split
Now we'll apply iterative stratification via the skmultilearn library, which essentially splits each input into subsets (where each label is considered individually) and then distributes the samples, starting with the fewest "positive" samples and working up to the inputs that have the most labels.
from skmultilearn.model_selection import IterativeStratification

def iterative_train_test_split(X, y, train_size):
    """Custom iterative train test split which
    'maintains balanced representation with respect to order-th label combinations.'
    """
    stratifier = IterativeStratification(
        n_splits=2, order=1, sample_distribution_per_fold=[1.0 - train_size, train_size])
    train_indices, test_indices = next(stratifier.split(X, y))
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    return X_train, X_test, y_train, y_test
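Here's how this function might be applied to produce all three splits (the exact ratios are assumptions):

# Carve out the training split, then divide the remainder evenly into val/test
X_train, X_, y_train, y_ = iterative_train_test_split(X, y, train_size=0.7)
X_val, X_test, y_val, y_test = iterative_train_test_split(X_, y_, train_size=0.5)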
Let's see what the adjusted counts look like for these stratified data splits.
# Get counts for each class
counts = {}
counts["train_counts"] = Counter(
    str(combination)
    for row in get_combination_wise_output_matrix(y_train, order=1)
    for combination in row)
counts["val_counts"] = Counter(
    str(combination)
    for row in get_combination_wise_output_matrix(y_val, order=1)
    for combination in row)
counts["test_counts"] = Counter(
    str(combination)
    for row in get_combination_wise_output_matrix(y_test, order=1)
    for combination in row)
# Adjust counts across splits
for k in counts["val_counts"].keys():
    counts["val_counts"][k] = int(counts["val_counts"][k] * (train_size / val_size))
for k in counts["test_counts"].keys():
    counts["test_counts"][k] = int(counts["test_counts"][k] * (train_size / test_size))
# Standard deviation
np.mean(np.std(dist_df.to_numpy(), axis=0))
3.142338654518357
The standard deviation is much better but not 0 (perfect splits) because an input can have any combination of classes, yet each input can belong to only one of the data splits.
Iterative stratification essentially creates splits while "trying to maintain balanced representation with respect to order-th label combinations". We used order=1 for our iterative split, which means we cared about providing a representative distribution of each tag across the splits. But we can account for higher-order label relationships as well, where we may care about the distribution of label combinations.
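As a hedged sketch, the same stratifier could be asked to balance pairwise label combinations instead (assuming the same X, y and split ratio as above):

# order=2: balance the distribution of label pairs across the two folds
stratifier = IterativeStratification(
    n_splits=2, order=2, sample_distribution_per_fold=[0.3, 0.7])
train_indices, test_indices = next(stratifier.split(X, y))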