
Data Preprocessing


Preprocessing our dataset, through preparations and transformations, to use for training.
Goku Mohandas
Repository · Notebook


Intuition

Data preprocessing can be categorized into two types of processes: preparation and transformation. We'll explore common preprocessing techniques and then walk through the relevant processes for our specific application.

Warning

Certain preprocessing steps are global (don't depend on our dataset, ex. lowercasing text, removing stop words, etc.) while others are local (constructs are learned only from the training split, ex. vocabulary, standardization, etc.). For the local, dataset-dependent preprocessing steps, we want to ensure that we split the data before preprocessing to avoid data leaks.
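
For example, here's a minimal sketch of this pattern using scikit-learn's StandardScaler (chosen purely for illustration; X_train, X_val and X_test are assumed to already exist from the split):

# Learn local constructs from the train split only, then apply them to the other splits
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit (learn mean/std) and transform
X_val = scaler.transform(X_val)          # transform only (no refitting)
X_test = scaler.transform(X_test)        # transform only (no refitting)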

Preparing

Preparing the data involves organizing and cleaning the data.

Joins

Performing SQL joins with existing data tables lets us organize all the relevant data we need into one view, which makes working with our dataset a whole lot easier.

SELECT * FROM A
INNER JOIN B ON A.id = B.id

Warning

We need to be careful to perform point-in-time valid joins to avoid data leaks. For example, Table B may have features for objects in Table A that would not have been available at the time we'd need to perform inference.
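
As an illustrative sketch (assuming both tables carry a timestamp column such as created_on, which is our assumption and not from the lesson), pandas' merge_asof can restrict the join to feature values that existed at or before each event's time:

# Point-in-time join: only use feature values available at (or before) event time
import pandas as pd
A = A.sort_values("created_on")  # events, with an assumed timestamp column
B = B.sort_values("created_on")  # features, with an assumed timestamp column
df = pd.merge_asof(A, B, on="created_on", by="id", direction="backward")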

Missing values

First, we'll have to identify the rows with missing values and once we do, there are several approaches to dealing with them.

  • omit samples with missing values (if only a small subset of samples have them)

    # Drop a row (sample) by index
    df.drop([4, 10, ...])
    # Conditionally drop rows (samples)
    df = df[df.value > 0]
    # Drop samples with any missing feature
    df = df[~df.isnull().any(axis=1)]
    

  • omit the entire feature (if too many samples are missing the value)

    # Drop a column (feature)
    df.drop(["A"], axis=1)
    

  • fill in missing values for features (using domain knowledge, heuristics, etc.)

    # Fill in missing values with mean
    df.A = df.A.fillna(df.A.mean())
    

  • missing values may not always appear "missing" (ex. 0, null, NA, etc.)

    # Replace zeros with NaNs
    import numpy as np
    df.A = df.A.replace({"0": np.nan, 0: np.nan})
    

Outliers (anomalies)

  • craft assumptions about what is a "normal" expected value
    # Ex. Feature value must be within 2 standard deviations
    df[np.abs(df.A - df.A.mean()) <= (2 * df.A.std())]
    
  • be careful not to remove important outliers (ex. fraud)
  • values may not be outliers when we apply a transformation (ex. power law)
  • anomalies can be global (point), contextual (conditional) or collective (individual points are not anomalous but the collective group is an outlier)
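
Complementing the standard-deviation rule above, here's a minimal sketch (our own illustration, not from the lesson) of the common interquartile range (IQR) rule for flagging point anomalies:

# Keep values within [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df.A.quantile(0.25), df.A.quantile(0.75)
iqr = q3 - q1
df = df[(df.A >= q1 - 1.5 * iqr) & (df.A <= q3 + 1.5 * iqr)]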

Feature engineering

  • combine features in unique ways to draw out signal
    # Combine existing features
    df["C"] = df.A + df.B
    

Note

This is also when we would save these features to a central feature store, which offers benefits such as:

  • reducing duplication of effort when engineering features.
  • avoiding training/serving skew when creating and using features.
  • avoiding data leaks with point-in-time validation (ex. during SQL joins).
  • validating and monitoring feature distributions.

Learn more about feature stores and implementing them with Feast in our Feature Stores lesson.

Cleaning

  • use domain expertise and EDA
  • apply constraints via filters
  • ensure data type consistency (see the sketch after this list)
  • images (crop, resize, clip, etc.)
    # Resize
    import cv2
    dims = (width, height)  # cv2.resize expects (width, height)
    resized_img = cv2.resize(src=img, dsize=dims, interpolation=cv2.INTER_LINEAR)
    
  • text (lower, stem, lemmatize, regex, etc.)
    # Lower case the text
    text = text.lower()
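
As referenced above, a small sketch of applying constraints and enforcing data type consistency with pandas (the column names here are hypothetical):

# Apply constraints and enforce consistent data types (hypothetical columns)
import pandas as pd
df = df[df.created_on.notnull()]                  # constraint: timestamp must exist
df["created_on"] = pd.to_datetime(df.created_on)  # consistent datetime type
df["id"] = df.id.astype(int)                      # consistent integer type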
    

Transformations

Transforming the data involves feature encoding and engineering.

Scaling

  • required for models where the scale of the input affects the underlying processes (ex. gradient-based optimization, distance-based methods)
  • learn constructs from train split and apply to other splits (local)
  • don't blindly scale features (ex. categorical features)

  • standardization: rescale values to mean 0, std 1

    # Standardization
    import numpy as np
    x = np.random.random(4) # values between 0 and 1
    print ("x:\n", x)
    print (f"mean: {np.mean(x):.2f}, std: {np.std(x):.2f}")
    x_standardized = (x - np.mean(x)) / np.std(x)
    print ("x_standardized:\n", x_standardized)
    print (f"mean: {np.mean(x_standardized):.2f}, std: {np.std(x_standardized):.2f}")
    
    x: [0.36769939 0.82302265 0.9891467  0.56200803]
    mean: 0.69, std: 0.24
    x_standardized: [-1.33285946  0.57695671  1.27375049 -0.51784775]
    mean: 0.00, std: 1.00
    

  • min-max: rescale values between a min and max

    # Min-max
    import numpy as np
    x = np.random.random(4) # values between 0 and 1
    print ("x:", x)
    print (f"min: {x.min():.2f}, max: {x.max():.2f}")
    x_scaled = (x - x.min()) / (x.max() - x.min())
    print ("x_scaled:", x_scaled)
    print (f"min: {x_scaled.min():.2f}, max: {x_scaled.max():.2f}")
    
    x: [0.20195674 0.99108855 0.73005081 0.02540603]
    min: 0.03, max: 0.99
    x_scaled: [0.18282479 1.         0.72968575 0.        ]
    min: 0.00, max: 1.00
    

  • binning: convert a continuous feature into categorical using bins

    # Binning
    import numpy as np
    x = np.random.random(4) # values between 0 and 1
    print ("x:", x)
    bins = np.linspace(0, 1, 5) # bins between 0 and 1
    print ("bins:", bins)
    binned = np.digitize(x, bins)
    print ("binned:", binned)
    
    x: [0.54906364 0.1051404  0.2737904  0.2926313 ]
    bins: [0.   0.25 0.5  0.75 1.  ]
    binned: [3 1 2 2]
    

  • and many more!

Note

When we move our code from notebooks to Python scripts, we'll be testing all our preprocessing functions (these workflows can also be captured in feature stores and applied as features are updated).

Encoding

  • allows for representing data efficiently (maintains signal) and effectively (learns patterns)

  • label: unique index for categorical value

    # Label encoding
    label_encoder.class_to_index = {
        "attention": 0,
        "autoencoders": 1,
        "convolutional-neural-networks": 2,
        "data-augmentation": 3,
        ... }
    label_encoder.transform(["attention", "data-augmentation"])

    array([0, 3])
    

  • one-hot: representation as binary vector

    # One-hot encoding
    one_hot_encoder.transform(["attention", "data-augmentation"])
    
    array([1, 0, 0, 1, 0, ..., 0])
    

  • embeddings: dense representations capable of representing context

    # Embeddings
    self.embeddings = nn.Embedding(
        embedding_dim=embedding_dim, num_embeddings=vocab_size)
    x_in = self.embeddings(x_in)
    print (x_in.shape)
    
    (len(X), embedding_dim)
    

  • target: represent a categorical feature with the average of the target values that share that categorical value (see the sketch after this list)

  • and many more!
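
As referenced in the target encoding bullet above, a minimal pandas sketch (with a hypothetical categorical feature A and target y, where the means are learned from the train split only):

# Target encoding: replace each category with the mean target value (train split only)
means = train_df.groupby("A")["y"].mean()
train_df["A_target_enc"] = train_df["A"].map(means)
val_df["A_target_enc"] = val_df["A"].map(means)  # unseen categories become NaN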

Extraction

  • signal extraction from existing features
  • combine existing features
  • transfer learning: using a pretrained model as a feature extractor and fine-tuning on its outputs (see the sketch after this list)
  • autoencoders: learn to encode inputs for compressed knowledge representation

  • principal component analysis (PCA): linear dimensionality reduction to project data into a lower dimensional space.

    # PCA
    import numpy as np
    from sklearn.decomposition import PCA
    X = np.array([[-1, -1, 3], [-2, -1, 2], [-3, -2, 1]])
    pca = PCA(n_components=2)
    pca.fit(X)
    print (pca.transform(X))
    print (pca.explained_variance_ratio_)
    print (pca.singular_values_)
    
    [[-1.44245791 -0.1744313 ]
     [-0.1148688   0.31291575]
     [ 1.55732672 -0.13848446]]
    [0.96838847 0.03161153]
    [2.12582835 0.38408396]
    

  • counts (ngram): sparse representation of text as a matrix of token counts, useful if the feature values have lots of meaningful, separable signal.

    # Counts (ngram)
    from sklearn.feature_extraction.text import CountVectorizer
    y = [
        'acetyl acetone',
        'acetyl chloride',
        'chloride hydroxide',
    ]
    vectorizer = CountVectorizer()
    y = vectorizer.fit_transform(y)
    print (vectorizer.get_feature_names())
    print (y.toarray())
    # 💡 Repeat above with char-level ngram vectorizer
    # vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 3)) # uni, bi and trigrams
    
    ['acetone', 'acetyl', 'chloride', 'hydroxide']
    [[1 1 0 0]
     [0 1 1 0]
     [0 0 1 1]]
    

  • similarity: similar to count vectorization but based on similarities in tokens

  • and many more!
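
As referenced in the transfer learning bullet above, a minimal PyTorch sketch (using a torchvision ResNet purely for illustration, not the model we'll use in this course) of a pretrained model as a frozen feature extractor:

# Use a pretrained model as a (frozen) feature extractor
import torch
import torch.nn as nn
from torchvision import models
model = models.resnet18(pretrained=True)
model.fc = nn.Identity()  # drop the classification head, keep the pooled features
for param in model.parameters():
    param.requires_grad = False  # freeze the backbone
features = model(torch.randn(8, 3, 224, 224))  # shape: (8, 512)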

Note

Often, teams will want to reuse the same features for different tasks, so how can we avoid duplication of effort? One solution is a feature store, which enables sharing of features and the workflows around feature pipelines. We'll cover feature stores during Production.

Application

For our application, we'll be implementing a few of these preprocessing steps that are relevant for our dataset.

Feature engineering

We can combine existing input features to create new meaningful signal (helping the model learn). However, there's usually no simple way to know if certain feature combinations will help or not without empirically experimenting with the different combinations. Here, we could use a project's title and description separately as features but we'll combine them to create one input feature.

# Combine title and description into one input feature
df['text'] = df.title + " " + df.description

Filtering

In the same vein, we can also reduce the size of our data by placing constraints on what data is worth annotating or labeling. Here we decide to keep only tags above a certain frequency threshold because tags with fewer samples won't be adequate for training.

def filter(l, include=[], exclude=[]):
    """Filter a list using inclusion and exclusion lists of items."""
    filtered = [item for item in l if item in include and item not in exclude]
    return filtered

We're going to include only these tags because they're the tags we care about and we've allowed authors to add any tag they want (noise). We'll also be excluding some general tags because they are automatically added when their child tags are present.

# Inclusion/exclusion criteria for tags
include = list(tags_dict.keys())
exclude = ['machine-learning', 'deep-learning',  'data-science',
           'neural-networks', 'python', 'r', 'visualization']

Note

Since we're constraining the output space here, we'll want to monitor the prevalence of new tags over time so we can capture them.

# Filter tags for each project
df.tags = df.tags.apply(filter, include=include, exclude=exclude)
tags = Counter(itertools.chain.from_iterable(df.tags.values))

We're also going to restrict the mapping to only tags that are above a certain frequency threshold. The tags that don't have enough projects will not have enough samples to model their relationships.

@widgets.interact(min_tag_freq=(0, tags.most_common()[0][1]))
def separate_tags_by_freq(min_tag_freq=30):
    tags_above_freq = Counter(tag for tag in tags.elements()
                                    if tags[tag] >= min_tag_freq)
    tags_below_freq = Counter(tag for tag in tags.elements()
                                    if tags[tag] < min_tag_freq)
    print ("Most popular tags:\n", tags_above_freq.most_common(5))
    print ("\nTags that just made the cut:\n", tags_above_freq.most_common()[-5:])
    print ("\nTags that just missed the cut:\n", tags_below_freq.most_common(5))

Most popular tags:
 [('natural-language-processing', 429),
  ('computer-vision', 388),
  ('pytorch', 258),
  ('tensorflow', 213),
  ('transformers', 196)]

Tags that just made the cut:
 [('time-series', 34),
  ('flask', 34),
  ('node-classification', 33),
  ('question-answering', 32),
  ('pretraining', 30)]

Tags that just missed the cut:
 [('model-compression', 29),
  ('fastai', 29),
  ('graph-classification', 29),
  ('recurrent-neural-networks', 28),
  ('adversarial-learning', 28)]
# Filter tags that have fewer than <min_tag_freq> occurrences
min_tag_freq = 30
tags_above_freq = Counter(tag for tag in tags.elements()
                          if tags[tag] >= min_tag_freq)
df.tags = df.tags.apply(filter, include=list(tags_above_freq.keys()))

Cleaning

After applying our filters, it's important that we remove any samples that didn't make the cut. In our case, we'll want to remove samples that have no remaining tags (after the low-frequency tags were filtered out).

# Remove projects with no more remaining relevant tags
df = df[df.tags.map(len) > 0]
print (f"{len(df)} projects")
1444 projects

And since we're dealing with text data, we can apply some of the common preparation processes:

  1. lower (conditional)
    text = text.lower()
    
  2. remove stopwords (from NLTK package)
    import re
    pattern = re.compile(r'\b(' + r'|'.join(stopwords) + r')\b\s*')
    text = pattern.sub('', text)
    
  3. spacing and filters
    text = re.sub(r"([-;;.,!?<=>])", r" \1 ", text)
    text = re.sub(filters, r"", text)
    text = re.sub(' +', ' ', text)  # remove multiple spaces
    text = text.strip()
    
  4. remove URLs using regex (discovered during EDA)
    text = re.sub(r'http\S+', '', text)
    
  5. stemming (conditional)
    text = " ".join([porter.stem(word) for word in text.split(' ')])
    

We can apply our preprocessing steps to our text feature in the dataframe by wrapping all these processes under a function.
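
A minimal sketch of what such a function could look like (composing the steps above; the exact signature, stopword list and filter characters are our assumptions, not necessarily what the lesson's implementation uses):

import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = stopwords.words("english")  # assumed stopword list
porter = PorterStemmer()
FILTERS = r"[!\"'#$%&()*\+,\-./:;<=>?@\\\[\]^_`{|}~]"  # assumed filter characters

def preprocess(text, lower=True, stem=False):
    """Conditionally preprocess raw text."""
    if lower:
        text = text.lower()
    text = re.sub(r"http\S+", "", text)  # remove URLs
    pattern = re.compile(r"\b(" + r"|".join(STOPWORDS) + r")\b\s*")
    text = pattern.sub("", text)  # remove stopwords
    text = re.sub(r"([-;;.,!?<=>])", r" \1 ", text)  # add spacing around punctuation
    text = re.sub(FILTERS, r"", text)  # remove filter characters
    text = re.sub(" +", " ", text)  # remove multiple spaces
    text = text.strip()
    if stem:
        text = " ".join([porter.stem(word) for word in text.split(" ")])
    return text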

# Apply to dataframe
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=False)
print (f"{df.text.values[0]}\n\n{preprocessed_df.text.values[0]}")
Albumentations Fast image augmentation library and easy to use wrapper around other libraries.
albumentations fast image augmentation library easy use wrapper around libraries

Transformations

Many of the transformations we're going to do are model-specific. For example, for our simple baselines we may do label encoding → tf-idf while for the more involved architectures we may do label encoding → one-hot encoding → embeddings. We'll cover these in the next suite of lessons as we implement each of the baselines.

Note

In the next section we'll be performing exploratory data analysis (EDA) on our preprocessed dataset. However, the order of these steps can be reversed depending on how well the problem is defined. If we're unsure about how to prepare the data, we can use EDA to figure it out. In fact, in our dashboard lesson, we interactively apply data processing and EDA back and forth until we've finalized our constraints.


To cite this lesson, please use:

@article{madewithml,
    author       = {Goku Mohandas},
    title        = { Preprocessing - Made With ML },
    howpublished = {\url{https://madewithml.com/}},
    year         = {2021}
}