# Data Preprocessing

Preprocessing our dataset, through preparations and transformations, to use for training.
Goku Mohandas


## Intuition

Data preprocessing can be categorized into two types of processes: preparation and transformation. We'll explore common preprocessing techniques and then walk through the relevant processes for our specific application.

Warning

Certain preprocessing steps are global (don't depend on our dataset, ex. lower casing text, removing stop words, etc.) and others are local (constructs are learned only from the training split, ex. vocabulary, standardization, etc.). For the local, dataset-dependent preprocessing steps, we want to ensure that we split the data first before preprocessing to avoid data leaks.
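To make the split-first rule concrete, here is a minimal sketch of a local preprocessing step (standardization). It assumes a synthetic NumPy feature matrix and a simple 80/20 random split, neither of which comes from our dataset. The mean and standard deviation are learned from the training split only and then reused for the test split.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features (assumption)

# Split FIRST so the test rows never influence the learned statistics
indices = rng.permutation(len(X))
train_idx, test_idx = indices[:80], indices[80:]
X_train, X_test = X[train_idx], X[test_idx]

# Learn the "local" constructs (mean, std) from the training split only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# Apply the same learned statistics to every split
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(2))  # approximately 0 by construction
print(X_test_scaled.mean(axis=0).round(2))   # near 0, but generally not exactly 0
```

If we had computed `mean` and `std` on all of `X` before splitting, information from the test rows would have leaked into the training features.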

## Preparing

Preparing the data involves organizing and cleaning the data.

### Joins

Performing SQL joins with existing data tables to organize all the relevant data you need into one view. This makes working with our dataset a whole lot easier.

    SELECT *
    FROM A
    INNER JOIN B ON A.id = B.id

Warning

We need to be careful to perform point-in-time valid joins to avoid data leaks. For example, Table B may have features for objects in Table A that were not yet available at the time inference would have been needed.
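One way to sketch a point-in-time valid join is with pandas' `merge_asof`, which matches each row with the most recent earlier row in another table. The tables, column names (`request_time`, `feature_time`, `feature`) and values below are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical tables: A holds prediction requests, B holds feature snapshots
A = pd.DataFrame({
    "id": [1, 1, 2],
    "request_time": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-01-10"]),
})
B = pd.DataFrame({
    "id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2020-01-01", "2020-01-15", "2020-01-12"]),
    "feature": [10, 20, 30],
})

# merge_asof requires both frames to be sorted by their time keys
A = A.sort_values("request_time")
B = B.sort_values("feature_time")

# For each request, take the latest feature row for the same id
# that was recorded at or before the request time
joined = pd.merge_asof(
    A, B,
    left_on="request_time",
    right_on="feature_time",
    by="id",
    direction="backward",
)
print(joined[["id", "request_time", "feature"]])
```

The request for `id=2` gets a missing feature value because its only feature snapshot was recorded after the request, which is exactly the leak a plain inner join on `id` would have introduced.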

### Missing values

First, we'll have to identify the rows with missing values and once we do, there are several approaches to dealing with them.

• omit samples with missing values (if only a small subset of samples is missing them)

    # Drop a row (sample) by index
    df.drop([4, 10, ...])
    # Conditionally drop rows (samples)
    df = df[df.value > 0]
    # Drop samples with any missing feature (keep only complete rows)
    df = df[~df.isnull().any(axis=1)]

• omit the entire feature (if too many samples are missing the value)

    # Drop a column (feature)
    df.drop(["A"], axis=1)

• fill in missing values for features (using domain knowledge, heuristics, etc.)

    # Fill in missing values with mean
    df.A = df.A.fillna(df.A.mean())

• values may not always look "missing" (ex. 0, null, NA, etc.)

    # Replace zeros with NaNs
    import numpy as np
    df.A = df.A.replace({"0": np.nan, 0: np.nan})
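Before choosing among these approaches, we first need to identify where values are missing. A minimal sketch, assuming a small hypothetical DataFrame with columns `A` and `B`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [0.5, 0.1, np.nan, 0.7],
})

# Number and fraction of missing values per feature
missing_per_feature = df.isnull().sum()
missing_fraction = df.isnull().mean()

# Rows (samples) with at least one missing feature
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_feature)
print(missing_fraction)
print(rows_with_missing)
```

The per-feature fraction helps decide between dropping a few samples and dropping an entire feature.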

### Outliers (anomalies)

• craft assumptions about what is a "normal" expected value

    # Ex. Feature value must be within 2 standard deviations
    df[np.abs(df.A - df.A.mean()) <= (2 * df.A.std())]

• be careful not to remove important outliers (ex. fraud)
• values may not be outliers when we apply a transformation (ex. power law)
• anomalies can be global (point), contextual (conditional) or collective (individual points are not anomalous but the collective group is an outlier)
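As a small illustration of why values may stop looking like outliers after a transformation, the sketch below applies the same two-standard-deviation rule to raw values and to their logarithms. The data (powers of two) and the helper name `outlier_mask` are assumptions chosen for the example.

```python
import numpy as np

def outlier_mask(values, num_std=2):
    """Flag values more than `num_std` standard deviations from the mean."""
    return np.abs(values - values.mean()) > num_std * values.std()

# Hypothetical heavy-tailed feature: 1, 2, 4, ..., 2048
x = 2.0 ** np.arange(12)

raw_outliers = x[outlier_mask(x)]          # rule applied to raw values
log_outliers = x[outlier_mask(np.log(x))]  # rule applied after a log transform

print("raw outliers:", raw_outliers)
print("log outliers:", log_outliers)
```

On the raw scale the largest value is flagged, while on the log scale the values are evenly spaced and nothing is flagged.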

### Feature engineering

Feature engineering involves combining features in unique ways to draw out signal.

    # Input (use bracket assignment to create a new column)
    df["C"] = df.A + df.B

Tip

Feature engineering can be done in collaboration with domain experts who can guide us on what features to engineer and use.

### Cleaning

Cleaning our data involves applying constraints to make it easier for our models to draw out signal from the data.

• use domain expertise and EDA
• apply constraints via filters
• ensure data type consistency
• remove data points with certain or null column values
• images (crop, resize, clip, etc.)

    # Resize
    import cv2
    dims = (width, height)  # cv2.resize expects dsize as (width, height)
    resized_img = cv2.resize(src=img, dsize=dims, interpolation=cv2.INTER_LINEAR)

• text (lower, stem, lemmatize, regex, etc.)

    # Lower case the text
    text = text.lower()

## Transformations

Transforming the data involves feature encoding and engineering.

### Scaling

• required for models where the scale of the inputs affects training (ex. gradient-based or distance-based models)
• learn constructs from train split and apply to other splits (local)
• don't blindly scale features (ex. categorical features)

• standardization: rescale values to mean 0, std 1

    # Standardization
    import numpy as np
    x = np.random.random(4)  # values between 0 and 1
    print("x:\n", x)
    print(f"mean: {np.mean(x):.2f}, std: {np.std(x):.2f}")
    x_standardized = (x - np.mean(x)) / np.std(x)
    print("x_standardized:\n", x_standardized)
    print(f"mean: {np.mean(x_standardized):.2f}, std: {np.std(x_standardized):.2f}")

    x: [0.36769939 0.82302265 0.9891467  0.56200803]
    mean: 0.69, std: 0.24
    x_standardized: [-1.33285946  0.57695671  1.27375049 -0.51784775]
    mean: 0.00, std: 1.00


• min-max: rescale values between a min and max

    # Min-max
    import numpy as np
    x = np.random.random(4)  # values between 0 and 1
    print("x:", x)
    print(f"min: {x.min():.2f}, max: {x.max():.2f}")
    x_scaled = (x - x.min()) / (x.max() - x.min())
    print("x_scaled:", x_scaled)
    print(f"min: {x_scaled.min():.2f}, max: {x_scaled.max():.2f}")

    x: [0.20195674 0.99108855 0.73005081 0.02540603]
    min: 0.03, max: 0.99
    x_scaled: [0.18282479 1.         0.72968575 0.        ]
    min: 0.00, max: 1.00


• binning: convert a continuous feature into categorical using bins

    # Binning
    import numpy as np
    x = np.random.random(4)  # values between 0 and 1
    print("x:", x)
    bins = np.linspace(0, 1, 5)  # bins between 0 and 1
    print("bins:", bins)
    binned = np.digitize(x, bins)
    print("binned:", binned)

    x: [0.54906364 0.1051404  0.2737904  0.2926313 ]
    bins: [0.   0.25 0.5  0.75 1.  ]
    binned: [3 1 2 2]


• and many more!

### Encoding

• allows for representing data efficiently (maintains signal) and effectively (learns patterns, ex. one-hot vs embeddings)

• label: unique index for categorical value

    # Label encoding
    label_encoder.class_to_index = {
        "attention": 0,
        "autoencoders": 1,
        "convolutional-neural-networks": 2,
        "data-augmentation": 3,
        ...
    }
    label_encoder.transform(["attention", "data-augmentation"])

    array([0, 3])


• one-hot: representation as binary vector

    # One-hot encoding
    one_hot_encoder.transform(["attention", "data-augmentation"])

    array([1, 0, 0, 1, 0, ..., 0])


• embeddings: dense representations capable of representing context

    # Embeddings
    self.embeddings = nn.Embedding(
        embedding_dim=embedding_dim, num_embeddings=vocab_size)
    x_in = self.embeddings(x_in)
    print(x_in.shape)

    (len(X), embedding_dim)


• and many more!
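The label and one-hot snippets above rely on encoder objects that are not defined in this section. As a rough, self-contained sketch of the same idea, the hypothetical `one_hot` helper below builds binary vectors with NumPy from a four-class version of the mapping shown earlier (the truncated mapping is an assumption for the example).

```python
import numpy as np

# Four-class version of the label mapping (assumption for illustration)
class_to_index = {
    "attention": 0,
    "autoencoders": 1,
    "convolutional-neural-networks": 2,
    "data-augmentation": 3,
}

def one_hot(labels, class_to_index):
    """Return one binary row per label, with a 1 at the label's index."""
    indices = [class_to_index[label] for label in labels]
    return np.eye(len(class_to_index), dtype=int)[indices]

encoded = one_hot(["attention", "data-augmentation"], class_to_index)
print(encoded)

# Collapsing the rows gives a single multi-hot vector for a multi-label sample
multi_hot = encoded.max(axis=0)
print(multi_hot)
```

Each row has exactly one active position, so the vector length grows with the number of classes; this is one reason dense embeddings are preferred for large vocabularies.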

### Extraction

• signal extraction from existing features
• combine existing features
• transfer learning: using a pretrained model as a feature extractor and finetuning on its results
• autoencoders: learn to encode inputs for compressed knowledge representation

• principal component analysis (PCA): linear dimensionality reduction to project data into a lower dimensional space.

    # PCA
    import numpy as np
    from sklearn.decomposition import PCA
    X = np.array([[-1, -1, 3], [-2, -1, 2], [-3, -2, 1]])
    pca = PCA(n_components=2)
    pca.fit(X)
    print(pca.transform(X))
    print(pca.explained_variance_ratio_)
    print(pca.singular_values_)

    [[-1.44245791 -0.1744313 ]
     [-0.1148688   0.31291575]
     [ 1.55732672 -0.13848446]]
    [0.96838847 0.03161153]
    [2.12582835 0.38408396]


• counts (ngram): sparse representation of text as a matrix of token counts; useful if feature values have lots of meaningful, separable signal.

    # Counts (ngram)
    from sklearn.feature_extraction.text import CountVectorizer
    y = [
        "acetyl acetone",
        "acetyl chloride",
        "chloride hydroxide",
    ]
    vectorizer = CountVectorizer()
    y = vectorizer.fit_transform(y)
    # get_feature_names() was removed in newer scikit-learn releases
    print(vectorizer.get_feature_names_out().tolist())
    print(y.toarray())

    # 💡 Repeat above with char-level ngram vectorizer
    # vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 3))  # uni, bi and trigrams

    ['acetone', 'acetyl', 'chloride', 'hydroxide']
    [[1 1 0 0]
     [0 1 1 0]
     [0 0 1 1]]


• similarity: similar to count vectorization but based on similarities in tokens

• and many more!

We'll often want to retrieve feature values for an entity (user, item, etc.) over time and reuse the same features across different projects. To ensure that we're retrieving the proper feature values and to avoid duplication of effort, we can use a feature store.

Curse of dimensionality

What can we do if a feature has lots of unique values but not enough data points for each unique value (ex. URL as a feature)?

We can encode our data with hashing or use its attributes instead of the exact entity itself. For example, representing a user by their location and favorites as opposed to their user ID, or representing a webpage by its domain as opposed to the exact URL. These methods effectively decrease the total number of unique feature values and increase the number of data points for each.
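Both ideas can be sketched with the Python standard library. The helper names (`hash_bucket`, `url_domain`), the bucket count and the example URLs are assumptions made for illustration, not part of our application.

```python
import hashlib
from urllib.parse import urlparse

def hash_bucket(value, num_buckets=1000):
    """Map an arbitrary string to one of `num_buckets` stable buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def url_domain(url):
    """Represent a webpage by its domain instead of the exact URL."""
    return urlparse(url).netloc

# Hypothetical URLs
urls = [
    "https://madewithml.com/",
    "https://example.com/page-1",
    "https://example.com/page-2",
]

buckets = [hash_bucket(url) for url in urls]
domains = [url_domain(url) for url in urls]

print(buckets)  # at most `num_buckets` unique values, however many URLs we see
print(domains)  # three URLs collapse into two domains
```

Hashing caps the number of unique values at the cost of occasional collisions, while attribute-based encoding (the domain here) groups related entities so that each value has more data points.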

## Application

For our application, we'll be implementing a few of these preprocessing steps that are relevant for our dataset.

### Feature engineering

We can combine existing input features to create new meaningful signal (helping the model learn). However, there's usually no simple way to know if certain feature combinations will help or not without empirically experimenting with the different combinations. Here, we could use a project's title and description separately as features but we'll combine them to create one input feature.

    # Input
    df["text"] = df.title + " " + df.description

### Cleaning

Since we're dealing with text data, we can apply some of the common text preprocessing steps:

!pip install nltk==3.7 -q

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    import re

    nltk.download("stopwords")
    STOPWORDS = stopwords.words("english")
    stemmer = PorterStemmer()

    def clean_text(text, lower=True, stem=False, stopwords=STOPWORDS):
        """Clean raw text."""
        # Lower
        if lower:
            text = text.lower()

        # Remove stopwords
        if len(stopwords):
            pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
            text = pattern.sub("", text)

        # Remove links (before punctuation is stripped, so whole URLs still match)
        text = re.sub(r"http\S+", "", text)

        # Spacing and filters
        text = re.sub(
            r"([!\"'#\$%&()*\+,-./:;<=>?@\\\[\]^_`{|}~])", r" \1 ", text
        )  # add spacing between objects to be filtered
        text = re.sub("[^A-Za-z0-9]+", " ", text)  # remove non alphanumeric chars
        text = re.sub(" +", " ", text)  # remove multiple spaces
        text = text.strip()  # strip white space at the ends

        # Stemming
        if stem:
            text = " ".join([stemmer.stem(word, to_lowercase=lower) for word in text.split(" ")])

        return text

Note

We could definitely try and include emojis, punctuation, etc. because they do have a lot of signal for the task, but it's best to simplify the initial feature set to just what we think is most influential and then slowly introduce other features and assess their utility.

    # Apply to dataframe
    original_df = df.copy()
    df.text = df.text.apply(clean_text, lower=True, stem=False)
    print(f"{original_df.text.values[0]}\n{df.text.values[0]}")

    Comparison between YOLO and RCNN on real world videos Bringing theory to experiment is cool. We can easily train models in colab and find the results in minutes.
    comparison yolo rcnn real world videos bringing theory experiment cool easily train models colab find results minutes


Warning

We'll want to introduce less frequent features as they become more frequent or encode them in a clever way (ex. binning, extract general attributes, common n-grams, mean encoding using other feature values, etc.) so that we can mitigate the feature value dimensionality issue until we're able to collect more data.

### Replace labels

Based on our findings from EDA, we're going to apply several constraints for labeling our data:

• if a data point has a tag that we currently don't support, we'll replace it with "other"
• if a certain tag doesn't have enough samples, we'll replace it with "other"

    import json

    # Accepted tags (external constraint)
    ACCEPTED_TAGS = ["natural-language-processing", "computer-vision", "mlops", "graph-learning"]

    # Out of scope (OOS) tags
    oos_tags = [item for item in df.tag.unique() if item not in ACCEPTED_TAGS]
    oos_tags

    ['reinforcement-learning', 'time-series']

    # Samples with OOS tags
    oos_indices = df[df.tag.isin(oos_tags)].index
    df[df.tag.isin(oos_tags)].head()

|    | id  | created_on          | title                                      | description                                         | tag                    |
|----|-----|---------------------|--------------------------------------------|-----------------------------------------------------|------------------------|
| 3  | 15  | 2020-02-28 23:55:26 | Awesome Monte Carlo Tree Search            | A curated list of Monte Carlo tree search papers... | reinforcement-learning |
| 37 | 121 | 2020-03-24 04:56:38 | Deep Reinforcement Learning in TensorFlow2 | deep-rl-tf2 is a repository that implements a ...   | reinforcement-learning |
| 67 | 218 | 2020-04-06 11:29:57 | Distributional RL using TensorFlow2        | 🐳 Implementation of various Distributional Rei...  | reinforcement-learning |
| 74 | 239 | 2020-04-06 18:39:48 | Prophet: Forecasting At Scale              | Tool for producing high quality forecasts for ...   | time-series            |
| 95 | 277 | 2020-04-07 00:30:33 | Curriculum for Reinforcement Learning      | Curriculum learning applied to reinforcement l...   | reinforcement-learning |

    # Replace this tag with "other"
    df.tag = df.tag.apply(lambda x: "other" if x in oos_tags else x)
    df.loc[oos_indices].head()  # oos_indices holds index labels, so use .loc

|    | id  | created_on          | title                                      | description                                         | tag   |
|----|-----|---------------------|--------------------------------------------|-----------------------------------------------------|-------|
| 3  | 15  | 2020-02-28 23:55:26 | Awesome Monte Carlo Tree Search            | A curated list of Monte Carlo tree search papers... | other |
| 37 | 121 | 2020-03-24 04:56:38 | Deep Reinforcement Learning in TensorFlow2 | deep-rl-tf2 is a repository that implements a ...   | other |
| 67 | 218 | 2020-04-06 11:29:57 | Distributional RL using TensorFlow2        | 🐳 Implementation of various Distributional Rei...  | other |
| 74 | 239 | 2020-04-06 18:39:48 | Prophet: Forecasting At Scale              | Tool for producing high quality forecasts for ...   | other |
| 95 | 277 | 2020-04-07 00:30:33 | Curriculum for Reinforcement Learning      | Curriculum learning applied to reinforcement l...   | other |

We're also going to restrict the mapping to only tags that are above a certain frequency threshold. The tags that don't have enough projects will not have enough samples to model their relationships.

    from collections import Counter
    import ipywidgets as widgets

    # Minimum frequency required for a tag
    min_freq = 75
    tags = Counter(df.tag.values)

    # Tags that just made / missed the cut
    @widgets.interact(min_freq=(0, tags.most_common()[0][1]))
    def separate_tags_by_freq(min_freq=min_freq):
        tags_above_freq = Counter(tag for tag in tags.elements()
                                  if tags[tag] >= min_freq)
        tags_below_freq = Counter(tag for tag in tags.elements()
                                  if tags[tag] < min_freq)
        print("Most popular tags:\n", tags_above_freq.most_common(3))
        print("\nTags that just made the cut:\n", tags_above_freq.most_common()[-3:])
        print("\nTags that just missed the cut:\n", tags_below_freq.most_common(3))

    Most popular tags:
    [('natural-language-processing', 388), ('computer-vision', 356), ('other', 87)]

    Tags that just made the cut:
    [('computer-vision', 356), ('other', 87), ('mlops', 79)]

    Tags that just missed the cut:
    [('graph-learning', 45)]

    def filter(tag, include=[]):
        """Determine if a given tag is to be included."""
        if tag not in include:
            tag = None
        return tag

    # Filter tags that have fewer than min_freq occurrences
    tags_above_freq = Counter(tag for tag in tags.elements()
                              if (tags[tag] >= min_freq))
    df.tag = df.tag.apply(filter, include=list(tags_above_freq.keys()))

    # Fill None with other
    df.tag = df.tag.fillna("other")

### Encoding

We're going to encode our output labels where we'll assign each tag a unique index.

    import numpy as np
    import random

    # Get data
    X = df.text.to_numpy()
    y = df.tag

We'll be writing our own LabelEncoder which is based on scikit-learn's implementation. It's an extremely valuable skill to be able to write clean classes for objects we want to create.

    class LabelEncoder(object):
        """Encode labels into unique indices"""
        def __init__(self, class_to_index=None):
            self.class_to_index = class_to_index or {}  # avoid a shared mutable default
            self.index_to_class = {v: k for k, v in self.class_to_index.items()}
            self.classes = list(self.class_to_index.keys())

        def __len__(self):
            return len(self.class_to_index)

        def __str__(self):
            return f"<LabelEncoder(num_classes={len(self)})>"

        def fit(self, y):
            classes = np.unique(y)
            for i, class_ in enumerate(classes):
                self.class_to_index[class_] = i
            self.index_to_class = {v: k for k, v in self.class_to_index.items()}
            self.classes = list(self.class_to_index.keys())
            return self

        def encode(self, y):
            encoded = np.zeros((len(y)), dtype=int)
            for i, item in enumerate(y):
                encoded[i] = self.class_to_index[item]
            return encoded

        def decode(self, y):
            classes = []
            for i, item in enumerate(y):
                classes.append(self.index_to_class[item])
            return classes

        def save(self, fp):
            with open(fp, "w") as f:
                contents = {"class_to_index": self.class_to_index}
                json.dump(contents, f, indent=4, sort_keys=False)

        @classmethod
        def load(cls, fp):
            with open(fp, "r") as f:
                kwargs = json.load(f)
            return cls(**kwargs)

If you're not familiar with the @classmethod decorator, learn more about it from our Python lesson.

    # Encode
    label_encoder = LabelEncoder()
    label_encoder.fit(y)
    num_classes = len(label_encoder)

    label_encoder.class_to_index

    {'computer-vision': 0,
     'mlops': 1,
     'natural-language-processing': 2,
     'other': 3}

    label_encoder.index_to_class

    {0: 'computer-vision',
     1: 'mlops',
     2: 'natural-language-processing',
     3: 'other'}

    # Encode
    label_encoder.encode(["computer-vision", "mlops", "mlops"])

    array([0, 1, 1])

    # Decode
    label_encoder.decode(np.array([0, 1, 1]))

    ['computer-vision', 'mlops', 'mlops']

    # Encode all our labels
    y = label_encoder.encode(y)
    print(y.shape)

Many of the transformations we're going to do on our input text features are model specific. For example, for our simple baselines we may do label encoding → tf-idf, while for the more involved architectures we may do label encoding → one-hot encoding → embeddings. So we'll cover these in the next suite of lessons as we implement our baselines.

In the next section we'll be performing exploratory data analysis (EDA) on our preprocessed dataset. However, the order of the steps can be reversed depending on how well the problem is defined. If we're unsure about how to prepare the data, we can use EDA to figure it out and vice versa.

To cite this content, please use:

    @article{madewithml,
        author       = {Goku Mohandas},
        title        = { Preprocessing - Made With ML },
        howpublished = {\url{https://madewithml.com/}},
        year         = {2022}
    }