Exploratory Data Analysis (EDA)


Exploring our dataset for insights with intention.
Goku Mohandas

Intuition

Exploratory data analysis (EDA) is a vital (and fun) step in the data science process, but it's often misconstrued. Here's how to think about EDA:

  • It's not just about producing a prescribed set of plots (correlation matrix, etc.).
  • The goal is to convince yourself that the data you have is sufficient for the task.
  • Use EDA to answer important questions and to make it easier to extract insights.
  • It's not a one-time process; as your data grows, revisit EDA to catch distribution shifts, anomalies, etc.

Application

Let's answer a few key questions for our application using EDA.

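from collections import Counter  # used to count tags below
import itertools  # used to flatten the per-project tag lists
import warnings
import ipywidgets as widgets  # used for the interactive word cloud
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
warnings.filterwarnings("ignore")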

Tags per project

Q1. How many (post-filtered) tags do the projects have? We care about this because we want to make sure we don't overwhelm the user with too many tags (UX constraint).

# Number of tags per project
num_tags_per_project = [len(tags) for tags in df.tags]
num_tags, num_projects = zip(*sorted(Counter(num_tags_per_project).items()))
plt.figure(figsize=(10, 3))
# Pass x and y by keyword; seaborn 0.12+ no longer accepts them positionally
ax = sns.barplot(x=list(num_tags), y=list(num_projects))
plt.title("Tags per project", fontsize=20)
plt.xlabel("Number of tags", fontsize=16)
# Keep the tick labels derived from the data; relabeling them as 1..n would be
# wrong whenever a tag count is missing (e.g. no project has exactly 3 tags)
ax.tick_params(axis="x", labelsize=16)
plt.ylabel("Number of projects", fontsize=16)
plt.show()
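
The plot answers Q1 visually, but a few numbers can make the UX check easier to act on. Below is a minimal sketch, not part of the original lesson: the helper name `summarize_tag_counts` and the `max_tags` limit are assumptions chosen for illustration, and the toy list stands in for `df.tags`.

```python
from collections import Counter


def summarize_tag_counts(tag_lists, max_tags=5):
    """Summarize how many tags each project has.

    `max_tags` is a hypothetical UX limit, not a value from the lesson.
    """
    counts = [len(tags) for tags in tag_lists]
    histogram = dict(sorted(Counter(counts).items()))
    over_limit = sum(1 for n in counts if n > max_tags)
    return {
        "histogram": histogram,  # {number of tags: number of projects}
        "max": max(counts),
        "mean": sum(counts) / len(counts),
        "share_over_limit": over_limit / len(counts),
    }


# Toy data standing in for df.tags (made up for illustration).
toy_tags = [
    ["pytorch"],
    ["nlp", "transformers"],
    ["cv"],
    ["a", "b", "c", "d", "e", "f"],
]
summary = summarize_tag_counts(toy_tags, max_tags=5)
print(summary)
```

With the real data, we could call `summarize_tag_counts(df.tags)` and look at `share_over_limit` to see how often a project would exceed whatever limit the UI can comfortably show.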

Tag distribution

Q2. What are the most popular tags? We care about this because it's important to know how the tags are distributed and which tags just made the cut (for performance).

# Distribution of tags
all_tags = list(itertools.chain.from_iterable(df.tags.values))
tags, tag_counts = zip(*Counter(all_tags).most_common())
plt.figure(figsize=(25, 5))
# Pass x and y by keyword; seaborn 0.12+ no longer accepts them positionally
ax = sns.barplot(x=list(tags), y=list(tag_counts))
plt.title("Tag distribution", fontsize=20)
plt.xlabel("Tag", fontsize=16)
# Rotate the existing tick labels instead of re-setting them by hand
ax.tick_params(axis="x", labelrotation=90, labelsize=14)
plt.ylabel("Number of projects", fontsize=16)
plt.show()
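
With many tags on the x-axis, the tail of the bar chart can be hard to read, and that tail is exactly where the tags that "just made the cut" live. Here is a small sketch for listing them directly. The helper name `tail_tags` and the toy data are assumptions for illustration and are not part of the original lesson.

```python
from collections import Counter
from itertools import chain


def tail_tags(tag_lists, n=3):
    """Return the n least frequent tags as (tag, count) pairs."""
    counts = Counter(chain.from_iterable(tag_lists))
    ranked = counts.most_common()  # sorted from most to least frequent
    return ranked[-n:]


# Toy data standing in for df.tags (made up for illustration).
toy_tags = [["pytorch", "nlp"], ["pytorch"], ["pytorch", "cv"], ["nlp"]]
rarest = tail_tags(toy_tags, n=2)
print(rarest)
```

On the real data, `tail_tags(df.tags, n=5)` would show which tags have the fewest examples to learn from.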

Wordcloud

Q3. Is there enough signal in the title and description that's unique to each tag? This is important because we want to verify our initial hypothesis that the project's title and description are highly influential features.

@widgets.interact(tag=list(tags))
def display_word_cloud(tag="pytorch"):
    # Plot word clouds for the top tags
    plt.figure(figsize=(15, 5))
    subset = df[df.tags.apply(lambda tags: tag in tags)]
    text = subset.text.values
    cloud = WordCloud(
        stopwords=STOPWORDS, background_color="black", collocations=False,
        width=500, height=300).generate(" ".join(text))
    plt.axis("off")
    plt.imshow(cloud)
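
Word clouds are easy to skim but hard to compare across tags. As a rough complement, we can list the most frequent words per tag and check how much they overlap. This is only a sketch under stated assumptions: `top_words_by_tag` and the tiny `ILLUSTRATIVE_STOPWORDS` set are hypothetical names (the lesson itself uses the `STOPWORDS` set from `wordcloud`), tokenization is a plain whitespace split, and the toy texts stand in for `df.text` and `df.tags`.

```python
from collections import Counter

# Tiny stopword set for illustration only; the lesson uses wordcloud.STOPWORDS.
ILLUSTRATIVE_STOPWORDS = {"a", "an", "the", "of", "for", "with", "and", "in", "to"}


def top_words(texts, k=3):
    """Return the k most frequent non-stopword tokens across texts."""
    words = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in ILLUSTRATIVE_STOPWORDS
    )
    return [word for word, _ in words.most_common(k)]


def top_words_by_tag(texts, tag_lists, k=3):
    """Group texts by tag, then return the top k words for each tag."""
    texts_by_tag = {}
    for text, tags in zip(texts, tag_lists):
        for tag in tags:
            texts_by_tag.setdefault(tag, []).append(text)
    return {tag: top_words(tag_texts, k) for tag, tag_texts in texts_by_tag.items()}


# Toy data standing in for df.text and df.tags (made up for illustration).
toy_texts = [
    "image classification with convolutional networks",
    "convolutional networks for image segmentation",
    "text classification with transformers",
    "transformers for text summarization",
]
toy_tags = [["cv"], ["cv"], ["nlp"], ["nlp"]]
result = top_words_by_tag(toy_texts, toy_tags, k=3)
shared = set(result["cv"]) & set(result["nlp"])
print(result, shared)
```

Little or no overlap between the top words of different tags suggests the text carries tag-specific signal; heavy overlap would be a reason to question the hypothesis in Q3.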

All of the work we've done so far is inside IPython notebooks, but in our dashboard lesson, we'll transfer all of this into an interactive dashboard using Streamlit.


To cite this lesson, please use:

@article{madewithml,
    author       = {Goku Mohandas},
    title        = { Exploration - Made With ML },
    howpublished = {\url{https://madewithml.com/}},
    year         = {2021}
}