Exploratory Data Analysis (EDA)
We use exploratory data analysis (EDA) to understand the signals and nuances of our dataset. It's a cyclical process that can be done at various points of our development process (before/after labeling, preprocessing, etc.), depending on how well the problem is defined. For example, if we're unsure how to label or preprocess our data, we can use EDA to figure it out.
We're going to start our project with EDA, a vital (and fun) process that's often misconstrued. Here's how to think about EDA:
- It's not just about visualizing a prescribed set of plots (correlation matrix, etc.).
- The goal is to convince yourself that the data you have is sufficient for the task.
- Use EDA to answer important questions and to make it easier to extract insight.
- It's not a one-time process; as your data grows, you'll want to revisit EDA to catch distribution shifts, anomalies, etc.
Let's answer a few key questions using EDA.
How many data points do we have per tag?
[('natural-language-processing', 388), ('computer-vision', 356), ('mlops', 79), ('reinforcement-learning', 56), ('graph-learning', 45), ('time-series', 31)]
We'll address the data imbalance after creating our train split and before training our model.
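The per-tag counts shown above can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not the lesson's original code: the `tags` list is a hypothetical stand-in for the tag column of the real dataset, and the `tag_shares` dictionary is an illustrative addition that expresses the same counts as fractions.

```python
from collections import Counter

# Hypothetical stand-in for the dataset's tag column (one tag per project).
tags = [
    "natural-language-processing",
    "computer-vision",
    "natural-language-processing",
    "mlops",
    "computer-vision",
    "natural-language-processing",
]

# Count projects per tag, ordered from most to least common.
tag_counts = Counter(tags).most_common()
print(tag_counts)

# Express each count as a share of the dataset to make the imbalance easier to read.
total = sum(count for _, count in tag_counts)
tag_shares = {tag: round(count / total, 2) for tag, count in tag_counts}
print(tag_shares)
```

With the real data, `tags` would come from the dataset's tag column, and the resulting list has the same shape as the output printed above.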
Is there enough signal in the title and description that's unique to each tag? This is important because we want to verify our initial hypothesis that the project's title and description are highly influential features.
Looks like the title text feature has some good signal for the respective classes and matches our intuition. We can repeat this for the description text feature as well. This information will become useful when we decide how to use our features for modeling.
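One simple way to check for tag-specific signal is to list the most frequent title words per tag. The sketch below is an assumption-laden illustration rather than the lesson's own code: it uses plain word counts instead of any particular visualization, and the `projects` rows, the tiny `STOPWORDS` set, and the `top_words_per_tag` helper are all hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical (title, tag) pairs standing in for the real dataset.
projects = [
    ("Image classification with convolutional networks", "computer-vision"),
    ("Object detection for image streams", "computer-vision"),
    ("Text classification with transformers", "natural-language-processing"),
    ("Fine-tuning transformers for text summarization", "natural-language-processing"),
]

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"with", "for", "the", "a", "of", "and"}


def top_words_per_tag(rows, k=3):
    """Return the k most frequent non-stopword title tokens for each tag."""
    counts = defaultdict(Counter)
    for title, tag in rows:
        # Lowercase and keep alphabetic runs only (so "Fine-tuning" splits in two).
        tokens = re.findall(r"[a-z]+", title.lower())
        counts[tag].update(t for t in tokens if t not in STOPWORDS)
    return {tag: counter.most_common(k) for tag, counter in counts.items()}


top_words = top_words_per_tag(projects)
for tag, words in top_words.items():
    print(tag, words)
```

If the top words differ clearly between tags (for example "image" for one tag and "text" for another), that supports the hypothesis that the title carries useful signal. The same helper can be pointed at descriptions by swapping the first element of each row.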
All of the work we've done so far is inside IPython notebooks, but in our dashboard lesson, we'll transfer all of this into an interactive dashboard using Streamlit.