If you are new to the world of data science, Python's Pandas libraries are some of the best tools for quick data analysis. Pandas are built on Numpy, another popular Python library.
The purpose of this tutorial is to provide 3 tips and tricks with plenty of practice examples that should be part of your tool kit as a Data Scientist.
You will use data from one of the most infamous shipwrecks, the sinking of the RMS Titanic. You've almost certainly heard stories about the sinking of the Titanic in 1912 when it hit an iceberg on its maiden voyage. Astonishingly, there were not enough lifeboats available for all the passengers. This tragic tale got even more interesting and personal when I found out that one of my friend's great-grandfather was on the Titanic. It get's worse - he was the helmsman! Two facts that most people seem to remember are. 1. The women were given preference to the lifeboats over the men and as a result, much more survived. 2. Those in first class were given access to the lifeboats over those in other classes.
In this tutorial, you will have access to a subset of the data and confirm these facts for yourself.
The Kaggle team have generously provided two data sets to perform some machine learning tasks. (https://www.kaggle.com/c/titanic/data) For the purposes of this tutorial we will only use one of them and so we will only work with a partial dataset for the passenger list on the Titanic.
Let's get started!
import pandas as pd
import numpy as np
passengers = pd.read_csv("titanic.csv")
Before you start on any of the Panda tools, you will want to examine the dataframe to get an overview of your dataset. You will notice straight away that although the data set has 891 entries, you have missing values for Age, Cabin and Embarked.
<code class="sourceCode python">passengers.info()</code>
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
<code class="sourceCode python">passengers.head(<span class="dv">5</span>)</code>
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley ... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath ... | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1 Dealing with missing values
You can see that there are missing values in the data. There are a couple of ways you can deal with this. We will only show 5 example outputs to keep the tutorial short but you can run the operations on the whole dataset if you are curious.
1.1 Drop the missing values.
We can use Panda's dropna()
.
passengers.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
This will drop any rows with missing values. Clearly this isn't a good idea in this instance because we will reduce the number of passengers in our data set from 891 to 183. Fortunately, Pandas doesn't make any of the changes to your dataframe object until you change the inplace=False flag to True.
What instead if we wanted to remove any columns with missing values? Modify the code in the line below. The original dataset has 12 columns. As Age, Cabin and Embarked all have null values, you should end up with only 9 columns.
Modify the line below to remove any columns with missing values.
<code class="sourceCode python">tt.dropna().head(<span class="dv">5</span>)</code>
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1 | 1 | Cumings, Mrs. John Bradley ... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath ... | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
11 | 1 | 3 | Sandstrom, Miss. Marguerite ... | female | 4 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58 | 0 | 0 | 113783 | 26.5500 | C103 | S |
Here is the solution:
# Solution. Columns are axis 1.
passengers.dropna(axis=1).head(5)
PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare |
---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 0 | A/5 21171 | 7.2500 |
2 | 1 | 1 | Cumings, Mrs. John Bradley... | female | 1 | 0 | PC 17599 | 71.2833 |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 0 | 0 | STON/O2. 3101282 | 7.9250 |
4 | 1 | 1 | Futrelle, Mrs. Jacques ... | female | 1 | 0 | 113803 | 53.1000 |
5 | 0 | 3 | Allen, Mr. William Henry | male | 0 | 0 | 373450 | 8.0500 |
Another option that you have is to set a threshold. So, for example, you can specify that if 11 of a row's 12 fields has a non-null value, you won't drop that row. You are now left with 733 entries from the original 891 - a definite improvement on 183 with dropping any rows with missing values. Modify the line above to confirm this is the case. Dropping missing values isn't a good option here. Let's try another of Panda's options, filling the missing values.
Modify the line below to set a threshold of 11 non-null values for a row.
<code class="sourceCode python">passengers.dropna().head(<span class="dv">5</span>)</code>
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1 | 1 | Cumings, Mrs. John Bradley ... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath ... | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
11 | 1 | 3 | Sandstrom, Miss. Marguerite ... | female | 4 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58 | 0 | 0 | 113783 | 26.5500 | C103 | S |
Here is the solution:
# Solution
passengers.dropna(thresh=11).head(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley ... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath ... | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1.2 Filling the missing values
There are many statistical schools of thought on what missing or null values should be replaced with. You already know from passengers.describe()
that the average(mean) age is 29.69. You can see that the 6th entry below have missing values replaced.
# The mean
passengers["Age"].fillna(value=29.69).head(10)
0 22.00
1 38.00
2 26.00
3 35.00
4 35.00
5 29.69
6 54.00
7 2.00
8 27.00
9 14.00
Name: Age, dtype: float64
The median is another good value that we could use. Modify the code above to use the median.
Complete the following line to fill in the missing values with the median age
<code class="sourceCode python">passengers[<span class="st">"Age"</span>].fillna(value<span class="op">=</span></code>
File "<ipython-input-17-171bf02c8336>", line 1
passengers["Age"].fillna(value=
^
SyntaxError: unexpected EOF while parsing
Here is the solution:
<code class="sourceCode python">passengers[<span class="st">"Age"</span>].fillna(value<span class="op">=</span>passengers[<span class="st">"Age"</span>].median()).head(<span class="dv">5</span>)</code>
0 22
1 38
2 26
3 35
4 35
Name: Age, dtype: float64
2 The apply function
This is one of the most powerful tools available in Pandas. Apply, allows you to either use Python's built-in functions or to create your own custom function and then run it across a set of your data.
Let's take a look at an example where we create own function and then "apply" that across a set of data.
The Titanic set off from Southampton, stopping off at Cherbourg and Queenstown.
I don't know about you, but I always struggle with remembering the names of French towns and ports. To help me out, I've written a function, get_full_city_name
, that converts the one letter code of the Embarked
column to its full name, for all the towns, French or otherwise . Using Panda's apply function, I can then run this throughout the entire column as shown below. Note that the function apply
is not modifying the original data source but producing a new output instead.
def get_full_city_name(cityCode):
if (cityCode == "S"):
return "Southampton"
elif (cityCode == "C"):
return "Cherbourg"
elif (cityCode == "Q"):
return "Queenstown"
passengers["Embarked"].apply(get_full_city_name).head(10)
0 Southampton
1 Cherbourg
2 Southampton
3 Southampton
4 Southampton
5 Queenstown
6 Southampton
7 Southampton
8 Southampton
9 Cherbourg
Name: Embarked, dtype: object
Now let's use the apply
function, but this time using a lambda expression. Lambda expressions allow you to create functions on the fly. This means you can create a function when you need it and then apply that to the relevant data set.
What would have been the price of the tickets on the Titanic in today's money? Taking into account inflation, prices have increased by about 106 times. In the one-liner below – and that's part of the reason why lambda functions are so useful – the lambda function takes as input the Fare from the data frame, multiplies this by 106 returning today's fare.
# Updating the fares for all the passengers
passengers["Fare"].apply(lambda fare: 106 * fare).head(5)
0 768.5000
1 7556.0298
2 840.0500
3 5628.6000
4 853.3000
Name: Fare, dtype: float64
What change would you have to make to the line below, to determine the maximum fare in today's money?
How can you modify the line below to show the highest fare?
<code class="sourceCode python">passengers[<span class="st">"Fare"</span>].<span class="bu">apply</span>(<span class="kw">lambda</span> fare: <span class="dv">106</span> <span class="op">*</span> fare).head(<span class="dv">5</span>)</code>
0 768.5000
1 7556.0298
2 840.0500
3 5628.6000
4 853.3000
Name: Fare, dtype: float64
Here is the solution:
<code class="sourceCode python">passengers[<span class="st">"Fare"</span>].<span class="bu">apply</span>(<span class="kw">lambda</span> fare: <span class="dv">106</span> <span class="op">*</span> fare).<span class="bu">max</span>()</code>
<code>54306.895199999999</code>
Want to find out more about the apply
function? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply
Recap
- Dealing with missing values (
dropna()
andfillna()
) gives you the freedom to determine how you will deal with missing data. apply()
enables you to run a function (a built-in one or one you have created) across a series of data.
I hope you have had fun exploring this data set. These are the top 3 Panda tips I recommend every Data Scientist has in their tool kit. Let me know in the comments section below if you think another Panda tool should have made the list.
About the Author
Jonathan Fernandes
Jonathan has an undergraduate in Computer Science and an MBA from Warwick University. Data Science combines his love for numbers, coding and statistics. He enjoys working together with people to make sense of their data, allowing them to make more informed decisions.
Enquire now
Fill out the following form and we’ll contact you within one business day to discuss and answer any questions you have about the programme. We look forward to speaking with you.