We'll start by preparing our data: ingesting it from its source and splitting it into training, validation, and test splits.

Our data could reside in many different places (databases, files, etc.) and exist in different formats (CSV, JSON, Parquet, etc.). For our application, we'll load the data from a CSV file into a Pandas DataFrame using Pandas' `read_csv` function.
Here is a quick refresher on the Pandas library.
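A minimal sketch of the ingestion step. In practice the CSV would live on disk or at a URL; here we use an in-memory stand-in with the same columns as our dataset:

```python
import pandas as pd
from io import StringIO

# In practice, replace this with the path or URL of your CSV file
csv_data = StringIO(
    "id,created_on,title,description,tag\n"
    "6,2020-02-20 06:43:18,Comparison between YOLO and RCNN,Bringing theory to experiment is cool.,computer-vision\n"
    "9,2020-02-24 16:24:45,Awesome Graph Classification,A collection of graph embedding papers.,other\n"
)

# Load the data into a DataFrame and inspect the first rows
df = pd.read_csv(csv_data)
print(df.head())
```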
|   | id | created_on | title | description | tag |
|---|----|------------|-------|-------------|-----|
| 0 | 6 | 2020-02-20 06:43:18 | Comparison between YOLO and RCNN on real world... | Bringing theory to experiment is cool. We can ... | computer-vision |
| 1 | 7 | 2020-02-20 06:47:21 | Show, Infer & Tell: Contextual Inference for C... | The beauty of the work lies in the way it arch... | computer-vision |
| 2 | 9 | 2020-02-24 16:24:45 | Awesome Graph Classification | A collection of important graph embedding, cla... | other |
| 3 | 15 | 2020-02-28 23:55:26 | Awesome Monte Carlo Tree Search | A curated list of Monte Carlo tree search pape... | other |
| 4 | 25 | 2020-03-07 23:04:31 | AttentionWalk | A PyTorch Implementation of "Watch Your Step: ... | other |
In our data engineering lesson, we'll look at how to continually ingest data from more complex sources (ex. data warehouses).
Next, we need to split our training dataset into `train` and `val` data splits.
- Use the `train` split to train the model. Here the model will have access to both inputs (features) and outputs (labels) to optimize its internal weights.
- After each iteration (epoch) through the training split, use the `val` split to evaluate the model's performance. Here the model does not use the labels to optimize its weights; instead, we use the validation performance to tune training hyperparameters such as the learning rate.
- Finally, use a separate holdout `test` dataset to determine the model's performance after training. This is our best measure of how the model may behave on new, unseen data drawn from a distribution similar to our training dataset.
For our application, we will have a training dataset that we split into `train` and `val` splits, and a separate testing dataset for the `test` split. While we could have one large dataset and divide it into all three splits, it's a good idea to keep a separate test dataset. Over time our training data may grow, so if the test split were carved out of it, the test set would look different every time, making it difficult to compare models against one another and against previous versions.
We can view the class counts in our dataset using Pandas' `value_counts` function.
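As a quick sketch of the call (using a tiny stand-in DataFrame with the same `tag` column as our dataset):

```python
import pandas as pd

# Stand-in for our projects DataFrame; in practice `df` is the loaded dataset
df = pd.DataFrame({"tag": ["nlp", "nlp", "cv", "other"]})

# Frequency of each class label
tag_counts = df["tag"].value_counts()
print(tag_counts)
```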
```
tag
natural-language-processing    310
computer-vision                285
other                          106
mlops                           63
Name: count, dtype: int64
```
For our multi-class task (where each project has exactly one tag), we want to ensure that the data splits have similar class distributions. We can achieve this by specifying the `stratify` keyword argument in sklearn's `train_test_split` function.
Creating proper data splits
What are the criteria we should focus on to ensure proper data splits?
- the dataset (and each data split) should be representative of data we will encounter
- equal distributions of output values across all splits
- shuffle your data if it's organized in a way that prevents input variance
- avoid random shuffles if your task can suffer from data leaks (ex. time-series data)
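A sketch of the stratified split under these criteria (the toy DataFrame and `random_state` below are placeholders; in practice `df` is the loaded training dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for our projects DataFrame
df = pd.DataFrame({
    "text": [f"project {i}" for i in range(20)],
    "tag": ["nlp"] * 10 + ["cv"] * 6 + ["other"] * 4,
})

test_size = 0.2

# Stratify on the label column so both splits preserve class proportions
df_train, df_val = train_test_split(
    df, test_size=test_size, stratify=df["tag"], random_state=1234
)
```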
How can we validate that our data splits have similar class distributions? We can view the frequency of each class in each split:
```
tag
natural-language-processing    248
computer-vision                228
other                           85
mlops                           50
Name: count, dtype: int64
```
Before we view our validation split's class counts, recall that the validation split makes up only `test_size` of the entire dataset. So we need to scale its value counts before we can compare them to the training split's class counts.
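The adjustment can be sketched as scaling the validation counts by the train/val ratio, `(1 - test_size) / test_size` (the raw counts below are hypothetical, chosen only to illustrate the scaling):

```python
import pandas as pd

# Hypothetical class counts for a validation split that is 20% of the data
val_counts = pd.Series(
    {"natural-language-processing": 62, "computer-vision": 57}, name="count"
)
test_size = 0.2

# Scale val counts by (1 - test_size) / test_size so they are
# directly comparable to the train split's counts
adjusted = val_counts * int((1 - test_size) / test_size)
print(adjusted)
```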
```
tag
natural-language-processing    248
computer-vision                228
other                           84
mlops                           52
Name: count, dtype: int64
```
These adjusted counts look very similar to our train split's counts. Now we're ready to explore our dataset!