
Splitting

Repository · Notebook

Appropriately splitting our dataset (multi-label) for training, validation and testing.

Intuition

Why do we need it?

To determine the efficacy of our models, we need to have an unbiased measuring approach. To do this, we split our dataset into training, validation, and testing data splits. Here is the process:

  1. Use the training split to train the model.

    Here the model will have access to both inputs and outputs to optimize its internal weights.

  2. After each loop (epoch) of the training split, we will use the validation split to determine model performance.

    Here the model will not use the outputs to optimize its weights but instead, we will use the performance to optimize training hyperparameters such as the learning rate, etc.

  3. After training stops (epoch(s)), we will use the testing split to perform a one-time assessment of the model.

    This is our best measure of how the model may behave on new, unseen data. Note that training stops when the performance improvement is not significant or any other stopping criteria that we may have specified.
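
To make this flow concrete, here's a minimal sketch with a toy dataset and a generic scikit-learn model standing in for our actual project (the specific model and hyperparameter grid are just placeholders):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data standing in for our actual dataset
X = np.random.randn(1000, 10)
y = np.random.randint(0, 2, size=1000)

# 70/15/15 split
X_train, X_, y_train, y_ = train_test_split(X, y, train_size=0.7, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_, y_, train_size=0.5, random_state=42)

# Steps 1 & 2: train on the training split and use validation performance to choose hyperparameters
best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C).fit(X_train, y_train)
    score = model.score(X_val, y_val)  # validation outputs guide tuning, not weight updates
    if score > best_score:
        best_C, best_score = C, score

# Step 3: one-time assessment on the testing split
final_model = LogisticRegression(C=best_C).fit(X_train, y_train)
print(f"test accuracy: {final_model.score(X_test, y_test):.2f}")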

How can we do it?

We need to ensure that our data is properly split so we can trust our evaluations. A few criteria are:

  • the dataset (and each data split) should be representative of data we will encounter
  • equal distributions of output values across all splits
  • shuffle your data if it's organized in a way that prevents input variance
  • avoid random shuffles if your task can suffer from data leaks (ex. time-series); see the chronological split sketch right after this list
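
For the time-series case, a minimal sketch of a leak-free chronological split (toy arrays standing in for real time-ordered data):

import numpy as np

# Toy time-ordered data (rows assumed sorted from oldest to newest)
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Chronological split: no shuffling, so future samples never leak into training
train_end = int(0.70 * len(X))
val_end = int(0.85 * len(X))
X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]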

Note

You need to clean your data first before splitting, at least for the features that splitting depends on. So the process is more like: preprocessing (global, cleaning) → splitting → preprocessing (local, transformations).
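
As a minimal sketch of that ordering (a toy DataFrame and a StandardScaler stand in for our actual cleaning and transformations):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Global preprocessing (cleaning) on the full dataset, before splitting
df = pd.DataFrame({"text": [" foo ", "bar", None, "baz"], "value": [1.0, 2.0, 3.0, 4.0]})
df = df.dropna().assign(text=lambda d: d.text.str.strip())

# Splitting
train_df, test_df = train_test_split(df, train_size=0.75, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()

# Local preprocessing (transformations) fit on the training split only, to avoid leakage
scaler = StandardScaler().fit(train_df[["value"]])
train_df[["value"]] = scaler.transform(train_df[["value"]])
test_df[["value"]] = scaler.transform(test_df[["value"]])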

Application

Label encoding

Before we split our dataset, we're going to encode our output labels by assigning each tag a unique index.

import itertools  # used by the LabelEncoder below
import json  # used by the LabelEncoder below
import numpy as np
import random

# Set seeds for reproducibility
seed = 42
np.random.seed(seed)
random.seed(seed)

We need to shuffle our data since the most recent projects appear first and certain tags are trending now compared to a year ago. If we don't shuffle before creating our data splits, our model will only be trained on the earlier tags and perform poorly on the others.

# Shuffle
df = df.sample(frac=1).reset_index(drop=True)

# Get data
X = df.text.to_numpy()
y = df.tags

We'll be writing our own LabelEncoder which is based on scikit-learn's implementation.

class LabelEncoder(object):
    """Label encoder for tag labels."""
    def __init__(self, class_to_index=None):
        self.class_to_index = class_to_index or {}  # avoid a shared mutable default argument
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())

    def __len__(self):
        return len(self.class_to_index)

    def __str__(self):
        return f"<LabelEncoder(num_classes={len(self)})>"

    def fit(self, y):
        classes = np.unique(list(itertools.chain.from_iterable(y)))
        for i, class_ in enumerate(classes):
            self.class_to_index[class_] = i
        self.index_to_class = {v: k for k, v in self.class_to_index.items()}
        self.classes = list(self.class_to_index.keys())
        return self

    def encode(self, y):
        y_one_hot = np.zeros((len(y), len(self.class_to_index)), dtype=int)
        for i, item in enumerate(y):
            for class_ in item:
                y_one_hot[i][self.class_to_index[class_]] = 1
        return y_one_hot

    def decode(self, y):
        classes = []
        for i, item in enumerate(y):
            indices = np.where(item == 1)[0]
            classes.append([self.index_to_class[index] for index in indices])
        return classes

    def save(self, fp):
        with open(fp, 'w') as fp:
            contents = {'class_to_index': self.class_to_index}
            json.dump(contents, fp, indent=4, sort_keys=False)

    @classmethod
    def load(cls, fp):
        with open(fp, 'r') as fp:
            kwargs = json.load(fp=fp)
        return cls(**kwargs)

Note

If you're not familiar with the @classmethod decorator, learn more about it from our Python lesson.

# Encode
label_encoder = LabelEncoder()
label_encoder.fit(y)
num_classes = len(label_encoder)

label_encoder.class_to_index

{'attention': 0,
 'autoencoders': 1,
 'computer-vision': 2,
 ...
 'transfer-learning': 31,
 'transformers': 32,
 'unsupervised-learning': 33,
 'wandb': 34}

Since we're dealing with multilabel classification, we're going to convert our label indices into one-hot representation where each input's set of labels is represented by a binary array.

# Sample
label_encoder.encode([["attention", "data-augmentation"]])

array([[1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

# Encode all our labels
y = label_encoder.encode(y)
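
The fitted encoder can also invert the one-hot representation and be persisted for later use (e.g. at inference time). A quick sketch, where the file name label_encoder.json is just an example:

# Decode back to tag names
label_encoder.decode(label_encoder.encode([["attention", "data-augmentation"]]))
# [['attention', 'data-augmentation']]

# Save the fitted mapping and restore it later
label_encoder.save(fp="label_encoder.json")
label_encoder = LabelEncoder.load(fp="label_encoder.json")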

Naive split

For traditional multi-class tasks (each input has one label), we want to ensure that each data split has similar class distributions. However, our task is multi-label classification (an input can have many labels) which complicates the stratification process.

First, we'll naively split our dataset randomly and show the large deviations between the (adjusted) class distributions across the splits. We'll use scikit-learn's train_test_split function to do the splits.

from collections import Counter
import pandas as pd
from sklearn.model_selection import train_test_split
from skmultilearn.model_selection.measures import get_combination_wise_output_matrix

# Split sizes
train_size = 0.7
val_size = 0.15
test_size = 0.15

# Split (train)
X_train, X_, y_train, y_ = train_test_split(X, y, train_size=train_size)

print(f"train: {len(X_train)} ({(len(X_train) / len(X)):.2f})\n"
      f"remaining: {len(X_)} ({(len(X_) / len(X)):.2f})")

train: 1010 (0.70)
remaining: 434 (0.30)

# Split (test)
X_val, X_test, y_val, y_test = train_test_split(
    X_, y_, train_size=0.5)

print(f"train: {len(X_train)} ({len(X_train)/len(X):.2f})\n"
      f"val: {len(X_val)} ({len(X_val)/len(X):.2f})\n"
      f"test: {len(X_test)} ({len(X_test)/len(X):.2f})")

train: 1010 (0.70)
val: 217 (0.15)
test: 217 (0.15)

# Get counts for each class
counts = {}
counts['train_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_train, order=1) for combination in row)
counts['val_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_val, order=1) for combination in row)
counts['test_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_test, order=1) for combination in row)

# View distributions
pd.DataFrame({
    'train': counts['train_counts'],
    'val': counts['val_counts'],
    'test': counts['test_counts']
}).T.fillna(0)

(15,) (19,) (33,) (21,) (32,) (14,) (2,) (20,) (24,) (5,) (16,) (9,) (22,) (12,) (23,) (0,) (25,) (34,) (28,) (10,) (26,) (27,) (13,) (17,) (3,) (1,) (7,) (11,) (18,) (4,) (6,) (29,) (8,) (31,) (30,)
train 314 37 26 26 145 33 274 191 41 55 26 56 34 41 40 90 44 28 136 44 30 31 63 45 64 27 50 34 21 33 24 24 32 33 23
val 58 8 4 2 29 8 53 33 7 9 4 11 9 10 11 20 7 2 42 13 12 4 14 14 17 3 7 6 3 3 5 7 10 7 4
test 57 6 9 4 22 10 61 34 9 11 3 11 6 4 8 10 9 9 35 7 6 5 16 10 25 11 16 11 6 5 5 9 9 6 7

It's hard to compare these because our train and val/test proportions are different. Let's see what the distributions look like once we balance them out. What do we need to multiply our test ratio by so that it's on the same scale as our train ratio?

\[ \alpha \times N_{test} = N_{train} \]
\[ \alpha = \frac{N_{train}}{N_{test}} \]

# Adjust counts across splits
for k in counts['val_counts'].keys():
    counts['val_counts'][k] = int(counts['val_counts'][k] * \
        (train_size/val_size))
for k in counts['test_counts'].keys():
    counts['test_counts'][k] = int(counts['test_counts'][k] * \
        (train_size/test_size))

dist_df = pd.DataFrame({
    'train': counts['train_counts'],
    'val': counts['val_counts'],
    'test': counts['test_counts']
}).T.fillna(0)
dist_df

(15,) (19,) (33,) (21,) (32,) (14,) (2,) (20,) (24,) (5,) (16,) (9,) (22,) (12,) (23,) (0,) (25,) (34,) (28,) (10,) (26,) (27,) (13,) (17,) (3,) (1,) (7,) (11,) (18,) (4,) (6,) (29,) (8,) (31,) (30,)
train 314 37 26 26 145 33 274 191 41 55 26 56 34 41 40 90 44 28 136 44 30 31 63 45 64 27 50 34 21 33 24 24 32 33 23
val 270 37 18 9 135 37 247 154 32 42 18 51 42 46 51 93 32 9 196 60 56 18 65 65 79 14 32 28 14 14 23 32 46 32 18
test 266 28 42 18 102 46 284 158 42 51 14 51 28 18 37 46 42 42 163 32 28 23 74 46 116 51 74 51 28 23 23 42 42 28 32

We can see how much deviation there is in our naive data splits by computing the standard deviation of each split's class counts from the mean (ideal split).

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}} \]

# Standard deviation
np.mean(np.std(dist_df.to_numpy(), axis=0))

9.936725114942407

Note

For simple multiclass classification, you can specify how to stratify the split by adding the stratify keyword argument. But our task is multilabel classification, so we'll need to use other techniques to create even splits.
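
For reference, a minimal sketch of what that looks like for a single-label task (toy data, not our multilabel y):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy single-label targets with imbalanced classes
X_toy = np.random.randn(1000, 5)
y_toy = np.random.choice(["cv", "nlp", "mlops"], size=1000, p=[0.6, 0.3, 0.1])

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=42)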

Stratified split

Now we'll apply iterative stratification via the skmultilearn library, which essentially splits each input into subsets (where each label is considered individually) and then distributes the samples, starting with the rarest labels (fewest "positive" samples) and working up to the inputs that have the most labels.

from skmultilearn.model_selection import IterativeStratification

def iterative_train_test_split(X, y, train_size):
    """Custom iterative train test split which
    'maintains balanced representation with respect
    to order-th label combinations.'
    """
    stratifier = IterativeStratification(
        n_splits=2, order=1, sample_distribution_per_fold=[1.0-train_size, train_size, ])
    train_indices, test_indices = next(stratifier.split(X, y))
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    return X_train, X_test, y_train, y_test
# Get data
X = df.text.to_numpy()
y = df.tags

# Binarize y
label_encoder = LabelEncoder()
label_encoder.fit(y)
y = label_encoder.encode(y)

# Split
X_train, X_, y_train, y_ = iterative_train_test_split(
    X, y, train_size=train_size)
X_val, X_test, y_val, y_test = iterative_train_test_split(
    X_, y_, train_size=0.5)

print(f"train: {len(X_train)} ({len(X_train)/len(X):.2f})\n"
      f"val: {len(X_val)} ({len(X_val)/len(X):.2f})\n"
      f"test: {len(X_test)} ({len(X_test)/len(X):.2f})")

train: 1000 (0.69)
val: 214 (0.15)
test: 230 (0.16)

Let's see what the adjusted counts look like for these stratified data splits.

# Get counts for each class
counts = {}
counts['train_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_train, order=1) for combination in row)
counts['val_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_val, order=1) for combination in row)
counts['test_counts'] = Counter(str(combination) for row in get_combination_wise_output_matrix(
    y_test, order=1) for combination in row)

# Adjust counts across splits
for k in counts['val_counts'].keys():
    counts['val_counts'][k] = int(counts['val_counts'][k] * \
        (train_size/val_size))
for k in counts['test_counts'].keys():
    counts['test_counts'][k] = int(counts['test_counts'][k] * \
        (train_size/test_size))

# View distributions
pd.DataFrame({
    'train': counts['train_counts'],
    'val': counts['val_counts'],
    'test': counts['test_counts']
}).T.fillna(0)

(2,) (4,) (15,) (14,) (30,) (34,) (1,) (26,) (32,) (20,) (33,) (25,) (17,) (21,) (0,) (24,) (27,) (6,) (13,) (3,) (5,) (16,) (9,) (19,) (7,) (28,) (11,) (22,) (8,) (29,) (23,) (31,) (10,) (18,) (12,)
train 272.0 29.0 300.0 36.0 24.0 27.0 29.0 32.0 145.0 181.0 27.0 42.0 49.0 27.0 84.0 40.0 28.0 24.0 65.0 74.0 52.0 19.0 55.0 36.0 51.0 149.0 30.0 34.0 38.0 26.0 41.0 32.0 45.0 21.0 38.0
val 270.0 32.0 298.0 28.0 23.0 28.0 32.0 37.0 112.0 177.0 28.0 42.0 46.0 0.0 84.0 51.0 28.0 23.0 74.0 74.0 56.0 18.0 51.0 32.0 51.0 149.0 46.0 32.0 32.0 32.0 42.0 32.0 46.0 23.0 42.0
test 270.0 23.0 303.0 42.0 23.0 28.0 23.0 37.0 126.0 182.0 28.0 42.0 46.0 23.0 84.0 28.0 28.0 23.0 56.0 74.0 51.0 46.0 56.0 37.0 51.0 149.0 51.0 37.0 28.0 32.0 42.0 32.0 42.0 18.0 37.0

dist_df = pd.DataFrame({
    'train': counts['train_counts'],
    'val': counts['val_counts'],
    'test': counts['test_counts']
}).T.fillna(0)

# Standard deviation
np.mean(np.std(dist_df.to_numpy(), axis=0))

3.142338654518357

The standard deviation is much better but not 0 (perfect splits) because, keep in mind, an input can have any combination of classes, yet each input can only belong to one of the data splits.

Note

Iterative stratification essentially creates splits while "trying to maintain balanced representation with respect to order-th label combinations". We used order=1 for our iterative split, which means we cared about providing a representative distribution of each tag across the splits. But we can account for higher-order label relationships as well, where we may care about the distribution of label combinations (see the sketch below).
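
For example, to inspect (or stratify on) pairwise label combinations, we could bump the order to 2; a sketch using the splits and helpers from above:

from collections import Counter
from skmultilearn.model_selection import IterativeStratification
from skmultilearn.model_selection.measures import get_combination_wise_output_matrix

# Count pairwise (order=2) label combinations in each split
pair_counts = {
    split: Counter(
        str(combination)
        for row in get_combination_wise_output_matrix(y_split, order=2)
        for combination in row
    )
    for split, y_split in [("train", y_train), ("val", y_val), ("test", y_test)]
}

# And pass order=2 to IterativeStratification to balance these pairs directly
stratifier = IterativeStratification(
    n_splits=2, order=2, sample_distribution_per_fold=[1.0-train_size, train_size])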

Resources