
Preprocessing Data for Machine Learning

Goku Mohandas

Preparing and transforming our data for modeling.
Repository · Notebook


Intuition

Data preprocessing can be categorized into two types of processes: preparation and transformation.

Note

Certain preprocessing steps are global (don't depend on our dataset, ex. removing stop words) and others are local (constructs are learned only from the training split, ex. vocabulary). For the local, dataset-dependent preprocessing steps, we want to ensure that we split the data first before preprocessing to avoid data leaks.
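
For example, a scaler's mean and standard deviation should be learned from the training split and then reused on the other splits. A minimal sketch, assuming scikit-learn and hypothetical X_train/X_val arrays:

    # Fit local preprocessing constructs on the training split only (sketch)
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    X_train = np.random.random((100, 3))  # hypothetical training features
    X_val = np.random.random((20, 3))     # hypothetical validation features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the train split
    X_val_scaled = scaler.transform(X_val)          # apply the same constructs (no refitting)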

Preparing

Preparing the data involves organizing and cleaning the data.

  • joins:
    • performing SQL joins with existing data tables to organize all the relevant data you need in one view.
  • typing:
    • ensure that all the values for a specific feature are of the same data type, otherwise you won't be able to compare them.
  • missing values (see the sketch after this list):
    • omit samples with missing values (if only a small subset are missing it)
    • omit the entire feature (if too many samples are missing the value)
    • fill in missing values for features (using domain knowledge, heuristics, etc.)
    • may not always seem "missing" (ex. 0, null, NA, etc.)
  • outliers (anomalies):
    • craft assumptions about what is a "normal" expected value
    • be careful about removing important outliers (ex. fraud)
    • anomalies can be global (point), contextual (conditional) or collective
  • clean:
    • use domain expertise and EDA
    • images (crop, resize, clip, etc.)
    • text (lower, stem, lemmatize, regex, etc.)
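
A rough sketch of a few of these preparation steps with pandas (the columns and fill strategy below are hypothetical):

    # Preparation sketch with pandas (hypothetical columns and fill strategy)
    import pandas as pd
    df = pd.DataFrame({
        "age": ["25", "31", None, "47"],         # stored as strings (typing issue)
        "income": [55000, None, 72000, 68000]})  # contains a missing value
    df["age"] = pd.to_numeric(df["age"])  # typing: enforce a consistent data type
    df = df.dropna(subset=["age"])  # omit samples missing a critical feature
    df["income"] = df["income"].fillna(df["income"].median())  # fill with a heuristic (median)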

Note

You need to clean your data first before splitting, at least for the features that splitting depends on. So the process is more like: preprocessing (global, cleaning) → splitting → preprocessing (local, transformations). We covered splitting first since many preprocessing transformations depend on the training split.

Transforming

Transforming the data involves feature encoding and engineering.

Warning

Before transformation, be sure to detect (and potentially remove) outliers using distributions and/or domain expertise. You should constantly revisit these explicit decisions because they may change over time and you don’t want to be including or removing data you shouldn’t be.
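
One common heuristic for flagging candidates, sketched below with an interquartile range (IQR) fence on a hypothetical feature, is only a starting point; domain expertise decides what to actually do with them:

    # Flag outliers with an IQR fence (sketch; thresholds depend on the data/domain)
    import numpy as np
    x = np.array([12, 14, 13, 15, 14, 13, 98])  # hypothetical feature values
    q1, q3 = np.percentile(x, [25, 75])
    lower, upper = q1 - 1.5*(q3 - q1), q3 + 1.5*(q3 - q1)
    print (x[(x < lower) | (x > upper)])  # [98]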

Scaling

  • required for most models that are not decision tree based
  • learn constructs from train split and apply to other splits (local)
  • don't blindly scale features (ex. categorical features)

  • standardization: rescale values to mean 0, std 1

    # Standardization
    import numpy as np
    x = np.random.random(4) # values between 0 and 1
    print ("x:\n", x)
    print (f"mean: {np.mean(x):.2f}, std: {np.std(x):.2f}")
    x_standardized = (x - np.mean(x)) / np.std(x)
    print ("x_standardized:\n", x_standardized)
    print (f"mean: {np.mean(x_standardized):.2f}, std: {np.std(x_standardized):.2f}")
    
    x: [0.36769939 0.82302265 0.9891467  0.56200803]
    mean: 0.69, std: 0.24
    x_standardized: [-1.33285946  0.57695671  1.27375049 -0.51784775]
    mean: 0.00, std: 1.00
    

  • min-max: rescale values between a min and max

    # Min-max
    import numpy as np
    x = np.random.random(4) # values between 0 and 1
    print ("x:", x)
    print (f"min: {x.min():.2f}, max: {x.max():.2f}")
    x_scaled = (x - x.min()) / (x.max() - x.min())
    print ("x_scaled:", x_scaled)
    print (f"min: {x_scaled.min():.2f}, max: {x_scaled.max():.2f}")
    
    x: [0.20195674 0.99108855 0.73005081 0.02540603]
    min: 0.03, max: 0.99
    x_scaled: [0.18282479 1.         0.72968575 0.        ]
    min: 0.00, max: 1.00
    

  • binning: convert a continuous feature into categorical using bins

    # Binning
    import numpy as np
    x = np.random.random(4) # values between 0 and 1
    print ("x:", x)
    bins = np.linspace(0, 1, 5) # bins between 0 and 1
    print ("bins:", bins)
    binned = np.digitize(x, bins)
    print ("binned:", binned)
    
    x: [0.54906364 0.1051404  0.2737904  0.2926313 ]
    bins: [0.   0.25 0.5  0.75 1.  ]
    binned: [3 1 2 2]
    

  • and many more!

Note

When we move our code from notebooks to Python scripts, we'll be testing all our preprocessing functions (these workflows can also be captured in feature stores and applied as features are updated).

Encoding

  • allows for representing data efficiently (maintains signal) & effectively (learns pattern)

  • label: unique index for categorical value

    # Label encoding
    label_encoder.class_to_index = {
        "attention": 0,
        "autoencoders": 1,
        "convolutional-neural-networks": 2,
        "data-augmentation": 3,
        ... }
    label_encoder.transform(["attention", "data-augmentation"])
    
    array([0, 3])
    

  • one-hot: representation as binary vector

    # One-hot encoding
    one_hot_encoder.transform(["attention", "data-augmentation"])
    
    array([1, 0, 0, 1, 0, ..., 0])
    

  • embeddings: dense representations capable of representing context

    # Embeddings (inside a PyTorch nn.Module's __init__/forward)
    self.embeddings = nn.Embedding(
        embedding_dim=embedding_dim, num_embeddings=vocab_size)
    x_in = self.embeddings(x_in)
    print (x_in.shape)
    
    (len(X), embedding_dim)
    

  • target: represent a categorical feature with the average of the target values that share that categorical value
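
    A minimal sketch of the idea with pandas (hypothetical category and target columns); like other local constructs, the per-category means should be learned from the training split only:

    # Target encoding (sketch with hypothetical data)
    import pandas as pd
    df = pd.DataFrame({
        "city": ["sf", "sf", "nyc", "nyc", "la"],
        "target": [1, 0, 1, 1, 0]})
    means = df.groupby("city")["target"].mean()  # learned from the training split
    df["city_encoded"] = df["city"].map(means)
    print (df)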

  • and many more!

Extraction

  • signal extraction from existing features
  • combine existing features
  • transfer learning: using a pretrained model as a feature extractor and finetuning on its results
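
    For example, a sketch using a pretrained ResNet with its classification head removed (assumes a recent torchvision; the batch here is random data just to show shapes):

    # Pretrained model as a feature extractor (sketch)
    import torch
    import torchvision
    model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])  # drop the final fc layer
    feature_extractor.eval()
    x = torch.randn(4, 3, 224, 224)  # hypothetical batch of images
    with torch.no_grad():
        features = feature_extractor(x).flatten(1)
    print (features.shape)  # torch.Size([4, 512])
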
  • autoencoders: learn to encode inputs for compressed knowledge representation

  • principal component analysis (PCA): linear dimensionality reduction to project data into a lower dimensional space.

    # PCA
    import numpy as np
    from sklearn.decomposition import PCA
    X = np.array([[-1, -1, 3], [-2, -1, 2], [-3, -2, 1]])
    pca = PCA(n_components=2)
    pca.fit(X)
    print (pca.transform(X))
    print (pca.explained_variance_ratio_)
    print (pca.singular_values_)
    
    [[-1.44245791 -0.1744313 ]
     [-0.1148688   0.31291575]
     [ 1.55732672 -0.13848446]]
    [0.96838847 0.03161153]
    [2.12582835 0.38408396]
    

  • counts (ngram): sparse representation of text as a matrix of token counts, useful if feature values have lots of meaningful, separable signal.

    # Counts (ngram)
    from sklearn.feature_extraction.text import CountVectorizer
    y = [
        'acetyl acetone',
        'acetyl chloride',
        'chloride hydroxide',
    ]
    vectorizer = CountVectorizer()
    y = vectorizer.fit_transform(y)
    print (vectorizer.get_feature_names())
    print (y.toarray())
    # 💡 Repeat above with char-level ngram vectorizer
    # vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 3)) # uni, bi and trigrams
    
    ['acetone', 'acetyl', 'chloride', 'hydroxide']
    [[1 1 0 0]
     [0 1 1 0]
     [0 0 1 1]]
    

  • similarity: similar to count vectorization but based on similarities in tokens
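
    One way to illustrate the idea, sketched here with character n-gram TF-IDF and cosine similarity (dedicated similarity encoders exist in libraries like dirty_cat):

    # Similarity between categorical values via char n-grams (sketch)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    categories = ['acetyl acetone', 'acetyl chloride', 'chloride hydroxide']
    vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(1, 3))
    X = vectorizer.fit_transform(categories)
    print (cosine_similarity(X).round(2))  # pairwise similarity matrix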

  • and many more!

Note

Often, teams will want to reuse the same features for different tasks, so how can we avoid duplication of effort? A solution is feature stores, which enable sharing of features and the workflows around feature pipelines. We'll cover feature stores during Production.

Application

Our dataset is pretty straightforward, so we'll only need to use a few of the preprocessing techniques from above. To prepare our data, we're going to clean up our input text (all global actions):

1. lower (conditional)

    text = text.lower()

2. remove stopwords (NLTK package)

    import re
    pattern = re.compile(r'\b(' + r'|'.join(stopwords) + r')\b\s*')
    text = pattern.sub('', text)

3. spacing and filters

    text = re.sub(r"([-;;.,!?<=>])", r" \1 ", text)
    text = re.sub(filters, r"", text)
    text = re.sub(' +', ' ', text)  # remove multiple spaces
    text = text.strip()

4. remove URLs using regex (discovered during EDA)

    text = re.sub(r'http\S+', '', text)

5. stemming (conditional)

    text = " ".join([porter.stem(word) for word in text.split(' ')])

We can apply our preprocessing steps to our text feature in the dataframe.

# Apply to dataframe
preprocessed_df = df.copy()
preprocessed_df.text = preprocessed_df.text.apply(preprocess, lower=True, stem=False)
print (f"{df.text.values[0]}\n\n{preprocessed_df.text.values[0]}")

Albumentations Fast image augmentation library and easy to use wrapper around other libraries.
albumentations fast image augmentation library easy use wrapper around libraries

Info

Our data splits were dependent only on the target labels (tags), which were already cleaned. However, if your splits depend on other features as well, you need to at least clean them first before splitting. So the process is more like: preprocessing (global, cleaning) → splitting → preprocessing (local, transformations).

Many of the transformations we're going to do are model-specific. For example, for our simple baselines we may do label encoding → tf-idf, while for the more involved architectures we may do label encoding → one-hot encoding → embeddings. So we'll cover these in the next suite of lessons as we implement each baseline.


To cite this lesson, please use:

@article{madewithml,
    title  = "Preprocessing - Made With ML",
    author = "Goku Mohandas",
    url    = "https://madewithml.com/courses/mlops/preprocessing/",
    year   = "2021",
}