First we'll import the NumPy and Pandas libraries and set seeds for reproducibility. We'll also download the dataset we'll be working with to disk.
import numpy as np
import pandas as pd

# Set seed for reproducibility
np.random.seed(seed=1234)
We're going to work with the Titanic dataset, which has data on the people who boarded the RMS Titanic in 1912 and whether or not they survived the voyage. It's a common and rich dataset, which makes it well suited to exploratory data analysis with Pandas.
Let's load the data from the CSV file into a Pandas dataframe. The header=0 signifies that the first row (0th index) is a header row which contains the names of each column in our dataset.
# Read from CSV to Pandas DataFrame
url = "https://raw.githubusercontent.com/GokuMohandas/MadeWithML/main/datasets/titanic.csv"
df = pd.read_csv(url, header=0)

# First few items
df.head(3)
Names in the first three rows:
Allen, Miss. Elisabeth Walton
Allison, Master. Hudson Trevor
Allison, Miss. Helen Loraine
These are the different features:
class: class of travel
name: full name of the passenger
age: numerical age
sibsp: number of siblings/spouses aboard
parch: number of parents/children aboard
ticket: ticket number
fare: cost of the ticket
cabin: location of room
embarked: port that the passenger embarked at (C - Cherbourg, S - Southampton, Q - Queenstown)
survived: survival indicator (0 - died, 1 - survived)
Exploratory data analysis (EDA)
Now that we've loaded our data, we're ready to start exploring it to find interesting information.
Be sure to check out our entire lesson focused on EDA in our MLOps course.
We can use .describe() to extract some standard details about our numerical features.
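As a rough illustration of what .describe() returns, here is a minimal sketch on a small made-up dataframe. The name toy_df and all of its values are hypothetical stand-ins rather than rows from the Titanic dataset; with the real data you would call df.describe() directly.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic dataframe (all values are made up)
toy_df = pd.DataFrame({
    "name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "age": [22.0, 38.0, None, 35.0],
    "fare": [10.0, 70.0, 8.0, 52.0],
})

# By default, describe() summarizes only the numerical columns
# (count, mean, std, min, quartiles, max); missing values are excluded from count
summary = toy_df.describe()
print(summary)
```

Note that the non-numerical name column is left out of the summary, and the count for age reflects only the non-missing entries.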
We're now going to use feature engineering to create a column called family_size. We'll first define a function called get_family_size that determines the family size from the number of siblings/spouses (sibsp) and parents/children (parch) aboard with each passenger.
# Lambda expressions to create new features
def get_family_size(sibsp, parch):
    family_size = sibsp + parch
    return family_size
Once we've defined the function, we can use a lambda expression to apply it to each row, passing that row's sibsp and parch values to compute its family size.
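The row-wise application described above might look like the following sketch. It runs on a small made-up dataframe (toy_df and its values are hypothetical) so that it stands on its own; with the real data, the same pattern would be applied to df.

```python
import pandas as pd

def get_family_size(sibsp, parch):
    """Family size as the sum of siblings/spouses and parents/children."""
    return sibsp + parch

# Hypothetical rows standing in for the Titanic dataframe (values are made up)
toy_df = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 2, 1]})

# axis=1 passes one row at a time to the lambda
toy_df["family_size"] = toy_df[["sibsp", "parch"]].apply(
    lambda row: get_family_size(row["sibsp"], row["parch"]), axis=1
)
print(toy_df)
```

For a simple sum like this, adding the two columns directly (toy_df["sibsp"] + toy_df["parch"]) should give the same result; the lambda form is shown because it generalizes to functions that can't be expressed as column arithmetic.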