Exploratory Data Analysis


Exploratory data analysis (EDA) is a vital (and fun) step in the data science process, but it’s often misconstrued. Here’s how to think about EDA:


Q1. How many tags do the projects have after filtering? We care about this because we want to make sure we don’t overwhelm the user with too many tags (UX constraint).

Distribution of tag counts per project
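
One way to answer Q1 is to count how many projects have one tag, two tags, and so on. The snippet below is a minimal sketch using only the standard library; the `projects` list and its `tags` field are hypothetical stand-ins for the lesson’s filtered dataset, whose actual structure is not shown in this section.

```python
from collections import Counter

# Hypothetical sample data; the real filtered projects are assumed to
# carry a list of tags per project in a similar shape.
projects = [
    {"title": "Project A", "tags": ["pytorch", "computer-vision"]},
    {"title": "Project B", "tags": ["pytorch"]},
    {"title": "Project C", "tags": ["pytorch", "transformers", "attention"]},
    {"title": "Project D", "tags": ["scikit-learn", "regression"]},
]

def tags_per_project_distribution(projects):
    """Map a tag count to the number of projects that have that many tags."""
    return Counter(len(project["tags"]) for project in projects)

distribution = tags_per_project_distribution(projects)

# Print the distribution in ascending order of tag count.
for num_tags, num_projects in sorted(distribution.items()):
    print(f"{num_tags} tag(s): {num_projects} project(s)")
```

If the largest tag count in this distribution is small, the UX concern about overwhelming the user is less likely to apply.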

Q2. What are the most popular tags? We care about this because it’s important to know how the tags are distributed and which tags just made the cut (for performance).

Distribution of tags across projects
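
For Q2, counting how often each tag appears across all projects gives both the most popular tags and the ones sitting right at the cutoff. This is a sketch under assumptions: the `projects` data and the `min_tag_freq` threshold are hypothetical and only illustrate the idea of a frequency cutoff.

```python
from collections import Counter

# Hypothetical sample data standing in for the lesson's projects.
projects = [
    {"title": "Project A", "tags": ["pytorch", "computer-vision"]},
    {"title": "Project B", "tags": ["pytorch", "natural-language-processing"]},
    {"title": "Project C", "tags": ["pytorch"]},
    {"title": "Project D", "tags": ["computer-vision", "scikit-learn"]},
]

# Count every tag occurrence across all projects.
tag_counts = Counter(tag for project in projects for tag in project["tags"])

# The most popular tags, most frequent first.
top_tags = tag_counts.most_common(2)
print(top_tags)

# Hypothetical cutoff: keep only tags that appear at least this many times.
min_tag_freq = 2
kept_tags = {tag for tag, count in tag_counts.items() if count >= min_tag_freq}

# Tags that "just made the cut" are the kept tags with the lowest count.
lowest_kept_count = min(tag_counts[tag] for tag in kept_tags)
just_made_cut = sorted(tag for tag in kept_tags if tag_counts[tag] == lowest_kept_count)
print(just_made_cut)
```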

Q3. Is there enough signal in the title and description that’s unique to each tag? This is important because we want to verify our initial hypothesis that the project’s title and description are highly influential features.

Wordcloud for the tag pytorch
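
A word cloud is essentially a picture of word frequencies, so Q3 can also be checked numerically: collect the titles and descriptions of every project with a given tag and count the words. The sketch below makes several assumptions: the `projects` data, the tiny `STOPWORDS` set, and the simple regex tokenizer are all hypothetical choices for illustration, not the lesson’s actual preprocessing.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with", "using"}

# Hypothetical sample data standing in for the lesson's projects.
projects = [
    {
        "title": "Image classification with PyTorch",
        "description": "Training a convolutional network in PyTorch.",
        "tags": ["pytorch", "computer-vision"],
    },
    {
        "title": "PyTorch transformers tutorial",
        "description": "Fine-tuning a transformer model using PyTorch.",
        "tags": ["pytorch", "natural-language-processing"],
    },
    {
        "title": "Regression in scikit-learn",
        "description": "Linear regression for tabular data.",
        "tags": ["scikit-learn"],
    },
]

def word_frequencies(projects, tag):
    """Count non-stopword tokens in the title and description of projects with `tag`."""
    counts = Counter()
    for project in projects:
        if tag in project["tags"]:
            text = f"{project['title']} {project['description']}".lower()
            # Keep runs of letters, digits, and hyphens as tokens.
            tokens = re.findall(r"[a-z0-9-]+", text)
            counts.update(token for token in tokens if token not in STOPWORDS)
    return counts

pytorch_words = word_frequencies(projects, tag="pytorch")
print(pytorch_words.most_common(5))
```

If the most frequent words for a tag are distinctive (for example, the tag’s own name showing up in the text), that supports the hypothesis that the title and description carry signal for that tag.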

All of the work we’ve done so far is inside IPython notebooks, but in a later lesson we’ll transfer all of this into an interactive dashboard using a tool called Streamlit.
