Exploring our dataset for insights with intention.
Repository · Notebook
Intuition
Exploratory data analysis (EDA) is a vital (and fun) step in the data science process, but it's often misconstrued. Here's how to think about EDA:
- It's not just about visualizing a prescribed set of plots (correlation matrix, etc.).
- The goal is to convince yourself that the data you currently have is sufficient for the task.
- Use EDA to answer important questions and to make it easier to extract insight.
- It's not a one-time process; as your data grows, revisit EDA to catch distribution shifts, anomalies, etc.
Application
Tags per project
Q1. How many (post-filtered) tags does each project have? We care about this because we want to make sure we don't overwhelm the user with too many tags (UX constraint).
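One way to answer Q1 is to count how many projects carry one tag, two tags, and so on. The following is a minimal sketch using only the standard library; the `projects` list and its field names are hypothetical stand-ins for the lesson's dataset, not the original notebook code.

```python
from collections import Counter

# Hypothetical sample data: each project has a list of post-filtered tags.
projects = [
    {"title": "Project A", "tags": ["computer-vision"]},
    {"title": "Project B", "tags": ["natural-language-processing", "transformers"]},
    {"title": "Project C", "tags": ["graph-learning"]},
    {"title": "Project D", "tags": ["computer-vision", "segmentation", "pytorch"]},
]


def tags_per_project_histogram(projects):
    """Map 'number of tags' -> 'number of projects with that many tags'."""
    return Counter(len(project["tags"]) for project in projects)


hist = tags_per_project_histogram(projects)

# Print a simple text histogram, smallest tag count first.
for num_tags in sorted(hist):
    print(f"{num_tags} tag(s): {hist[num_tags]} project(s)")
```

If the largest key in `hist` is small, the UX constraint is likely satisfied; a long tail of projects with many tags would suggest capping the number of tags shown.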
Tag distribution
Q2. What are the most popular tags? We care about this because it's important to know about the distribution of tags and what tags just made the cut (for performance).
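A rough sketch of how Q2 could be explored: flatten every project's tags into one stream, count them, then look at both ends of the ranking. The `tag_lists` sample is invented for illustration, and treating "just made the cut" as "tied for the lowest count" is an assumption here, not the lesson's definition.

```python
from collections import Counter
from itertools import chain

# Hypothetical sample data: one list of tags per project.
tag_lists = [
    ["natural-language-processing", "transformers"],
    ["computer-vision"],
    ["natural-language-processing"],
    ["computer-vision", "segmentation"],
    ["natural-language-processing", "graph-learning"],
]


def tag_counts(tag_lists):
    """Count how many projects use each tag."""
    return Counter(chain.from_iterable(tag_lists))


counts = tag_counts(tag_lists)

# Most popular tags, highest count first.
most_popular = counts.most_common(2)

# Tags that barely made the cut: here, those tied for the lowest count.
lowest = min(counts.values())
borderline = sorted(tag for tag, count in counts.items() if count == lowest)

print("Most popular:", most_popular)
print("Borderline:", borderline)
```

Tags near the bottom of the ranking have the fewest examples, so they are the ones most likely to hurt model performance.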
Wordcloud
Q3. Is there enough signal in the title and description that's unique to each tag? This is important because we want to verify our initial hypothesis that the project's title and description are highly influential features.
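A word cloud is essentially a picture of word frequencies, so the same check can be sketched with plain counts: for each tag, which words dominate the titles and descriptions of its projects? The sample `projects`, the tiny `STOPWORDS` set, and the `top_words_for_tag` helper below are all hypothetical and meant only to illustrate the idea.

```python
import re
from collections import Counter

# Small, illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "for", "with", "and", "in", "to", "using", "on"}

# Hypothetical sample data.
projects = [
    {"title": "Image classification with CNNs",
     "description": "Classify image datasets using convolutional networks.",
     "tags": ["computer-vision"]},
    {"title": "Image segmentation",
     "description": "Segment each image into regions.",
     "tags": ["computer-vision"]},
    {"title": "Text summarization",
     "description": "Summarize long text with transformers.",
     "tags": ["natural-language-processing"]},
    {"title": "Text classification",
     "description": "Classify text by topic.",
     "tags": ["natural-language-processing"]},
]


def top_words_for_tag(projects, tag, k=3):
    """Return the k most frequent non-stopword tokens for projects with `tag`."""
    words = Counter()
    for project in projects:
        if tag in project["tags"]:
            text = f"{project['title']} {project['description']}".lower()
            tokens = re.findall(r"[a-z]+", text)
            words.update(token for token in tokens if token not in STOPWORDS)
    return words.most_common(k)


for tag in ("computer-vision", "natural-language-processing"):
    print(tag, top_words_for_tag(projects, tag))
```

If the top words differ clearly between tags, that supports the hypothesis that title and description carry useful signal; heavy overlap would suggest they do not separate the tags well.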
Note
All of the work we've done so far has been inside IPython notebooks, but in our dashboard lesson, we'll transfer all of it into an interactive dashboard using a tool called Streamlit.