Quick Pandas tricks every Data Scientist should know

If you are new to the world of data science, Python's Pandas library is one of the best tools for quick data analysis. Pandas is built on NumPy, another popular Python library.

The purpose of this tutorial is to walk through three tips and tricks, with plenty of practice examples, that should be part of your tool kit as a Data Scientist.

You will use data from one of the most infamous shipwrecks, the sinking of the RMS Titanic. You've almost certainly heard stories about the Titanic sinking in 1912 after hitting an iceberg on its maiden voyage; astonishingly, there were not enough lifeboats for all the passengers. This tragic tale got even more interesting and personal when I found out that a friend's great-grandfather was on the Titanic. It gets worse: he was the helmsman! Two facts that most people seem to remember are:

1. Women were given preference for the lifeboats over men and, as a result, many more of them survived.
2. Those in first class were given access to the lifeboats before those in other classes.

In this tutorial, you will have access to a subset of the data and confirm these facts for yourself.

The Kaggle team has generously provided two data sets for machine learning tasks (https://www.kaggle.com/c/titanic/data). For the purposes of this tutorial we will use only one of them, so we will be working with a partial passenger list for the Titanic.

Let's get started!

import pandas as pd
import numpy as np

passengers = pd.read_csv("titanic.csv")

Before you start on any of the Pandas tools, you will want to examine the DataFrame to get an overview of your dataset. You will notice straight away that although the data set has 891 entries, there are missing values for Age, Cabin and Embarked.

passengers.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB
passengers.head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley ... female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath ... female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S

 

1 Dealing with missing values

You can see that there are missing values in the data. There are a couple of ways you can deal with this. To keep the tutorial short, we will only show the first 5 rows of each example output, but you can run the operations on the whole dataset if you are curious.
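Before choosing an approach, it helps to see exactly how many values are missing in each column. Here is a minimal sketch, assuming the same passengers DataFrame loaded above:

# Count the missing (null) values in each column;
# Age, Cabin and Embarked should show non-zero counts
passengers.isnull().sum()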

 

1.1 Dropping the missing values

We can use the Pandas dropna() method.

passengers.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

This will drop any row with missing values. Clearly this isn't a good idea in this instance, because it would reduce the number of passengers in our data set from 891 to 183. Fortunately, Pandas doesn't make any changes to your DataFrame object until you change the inplace=False flag to True; until then it simply returns a new DataFrame, as shown below.
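To see for yourself that the original DataFrame is untouched, you can compare the shapes before and after. This is a quick sketch assuming the same passengers DataFrame:

# dropna() returns a new DataFrame; the original keeps all 891 rows
print(passengers.dropna().shape)   # (183, 12)
print(passengers.shape)            # (891, 12)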

What if instead we wanted to remove any columns with missing values? The original dataset has 12 columns. As Age, Cabin and Embarked all have null values, you should end up with only 9 columns.

Modify the line below to remove any columns with missing values.

passengers.dropna().head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 1 1 Cumings, Mrs. John Bradley ... female 38 1 0 PC 17599 71.2833 C85 C
4 1 1 Futrelle, Mrs. Jacques Heath ... female 35 1 0 113803 53.1000 C123 S
7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
11 1 3 Sandstrom, Miss. Marguerite ... female 4 1 1 PP 9549 16.7000 G6 S
12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S

Here is the solution:

# Solution. Columns are axis 1.
passengers.dropna(axis=1).head(5)
PassengerId Survived Pclass Name Sex SibSp Parch Ticket Fare
1 0 3 Braund, Mr. Owen Harris male 1 0 A/5 21171 7.2500
2 1 1 Cumings, Mrs. John Bradley... female 1 0 PC 17599 71.2833
3 1 3 Heikkinen, Miss. Laina female 0 0 STON/O2. 3101282 7.9250
4 1 1 Futrelle, Mrs. Jacques ... female 1 0 113803 53.1000
5 0 3 Allen, Mr. William Henry male 0 0 373450 8.0500

Another option that you have is to set a threshold. For example, you can specify that a row is kept as long as at least 11 of its 12 fields have non-null values. You are then left with 733 entries from the original 891, a definite improvement on the 183 that remain when you drop every row with missing values. You can confirm this in the exercise below. Even so, dropping missing values isn't a good option here, so we'll then try another Pandas option: filling the missing values.

Modify the line below to set a threshold of 11 non-null values for a row.

passengers.dropna().head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
2 1 1 Cumings, Mrs. John Bradley ... female 38 1 0 PC 17599 71.2833 C85 C
4 1 1 Futrelle, Mrs. Jacques Heath ... female 35 1 0 113803 53.1000 C123 S
7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
11 1 3 Sandstrom, Miss. Marguerite ... female 4 1 1 PP 9549 16.7000 G6 S
12 1 1 Bonnell, Miss. Elizabeth female 58 0 0 113783 26.5500 C103 S

Here is the solution:

# Solution
passengers.dropna(thresh=11).head(5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley ... female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath ... female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
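To confirm the row count without printing the whole frame, you can check the length of the result. A quick sketch, assuming the same DataFrame:

# Rows with at least 11 non-null values out of 12 are kept
len(passengers.dropna(thresh=11))   # 733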

 

1.2 Filling the missing values

There are many statistical schools of thought on what missing or null values should be replaced with. Running passengers.describe() tells you that the average (mean) age is 29.69. In the output below, you can see that the 6th entry has had its missing value replaced with this mean.

# The mean
passengers["Age"].fillna(value=29.69).head(10)
0    22.00
1 38.00
2 26.00
3 35.00
4 35.00
5 29.69
6 54.00
7 2.00
8 27.00
9 14.00
Name: Age, dtype: float64
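Like dropna(), fillna() returns a new Series by default and leaves the original column untouched. A quick check, assuming the same DataFrame:

# The original Age column still has its 177 missing values (891 - 714 non-null)
passengers["Age"].isnull().sum()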

The median is another good value that we could use.

Complete the following line to fill in the missing values with the median age.

passengers["Age"].fillna(value=
  File "<ipython-input-17-171bf02c8336>", line 1
passengers["Age"].fillna(value=
^
SyntaxError: unexpected EOF while parsing

Here is the solution:

passengers["Age"].fillna(value=passengers["Age"].median()).head(5)
0    22
1 38
2 26
3 35
4 35
Name: Age, dtype: float64
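If you want to verify the values used for the fill, you can compute them directly. A minimal sketch, assuming the same DataFrame:

# The mean age quoted above (about 29.69) and the median used in the solution
print(passengers["Age"].mean())
print(passengers["Age"].median())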

2 The apply function

This is one of the most powerful tools available in Pandas. apply() allows you to use either Python's built-in functions or your own custom functions and run them across a set of your data.

Let's take a look at an example where we create our own function and then "apply" it across a set of data.

The Titanic set off from Southampton, stopping off at Cherbourg and Queenstown.

I don't know about you, but I always struggle to remember the names of French towns and ports. To help me out, I've written a function, get_full_city_name, that converts the one-letter code in the Embarked column to the full name of the town, French or otherwise. Using the Pandas apply function, I can then run it across the entire column, as shown below. Note that apply does not modify the original data source; it produces a new output instead.

def get_full_city_name(cityCode):
    if (cityCode == "S"):
        return "Southampton"
    elif (cityCode == "C"):
        return "Cherbourg"
    elif (cityCode == "Q"):
        return "Queenstown"

passengers["Embarked"].apply(get_full_city_name).head(10)

0    Southampton
1 Cherbourg
2 Southampton
3 Southampton
4 Southampton
5 Queenstown
6 Southampton
7 Southampton
8 Southampton
9 Cherbourg
Name: Embarked, dtype: object
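A common alternative to the if/elif chain above is to pass a dictionary to map(), which performs the same lookup in one line. This is just a sketch of an equivalent approach, not part of the tutorial's exercises:

# Map each one-letter code to the full port name
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}
passengers["Embarked"].map(port_names).head(10)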

Now let's use the apply function again, this time with a lambda expression. Lambda expressions allow you to create a function on the fly, at the point where you need it, and apply it to the relevant data set.

What would the price of a ticket on the Titanic be in today's money? Taking inflation into account, prices have increased by a factor of roughly 106. In the one-liner below (and brevity is part of the reason why lambda functions are so useful), the lambda function takes the Fare from the DataFrame, multiplies it by 106, and returns today's fare.

# Updating the fares for all the passengers
passengers["Fare"].apply(lambda fare: 106 * fare).head(5)
0     768.5000
1 7556.0298
2 840.0500
3 5628.6000
4 853.3000
Name: Fare, dtype: float64
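Since apply() returns a new Series rather than modifying the original, you need to assign the result if you want to keep it. A small sketch, storing it in a hypothetical FareToday column:

# Store the inflation-adjusted fare in a new (hypothetical) FareToday column
passengers["FareToday"] = passengers["Fare"].apply(lambda fare: 106 * fare)
passengers[["Fare", "FareToday"]].head(5)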

What change would you have to make to the line below to determine the maximum fare in today's money?

passengers["Fare"].apply(lambda fare: 106 * fare).head(5)
0     768.5000
1 7556.0298
2 840.0500
3 5628.6000
4 853.3000
Name: Fare, dtype: float64

Here is the solution:

passengers["Fare"].apply(lambda fare: 106 * fare).max()
54306.895199999999
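The same idea extends to whole rows: DataFrame.apply with axis=1 passes each row to your function. As a hedged sketch, here is one way to combine the SibSp and Parch columns into a rough family-size figure:

# Apply a function across each row (axis=1), combining two columns per row
passengers.apply(lambda row: row["SibSp"] + row["Parch"] + 1, axis=1).head(5)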

Want to find out more about the apply function? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html#pandas.DataFrame.apply

Recap

  • Dealing with missing values: dropna() and fillna() give you the freedom to decide whether missing data is dropped or filled in.
  • apply() enables you to run a function (a built-in one or one you have created) across a series of data.
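If you'd like to go back and confirm the two facts from the introduction, one quick way (using groupby, which isn't one of the tips above) is to look at the mean of the Survived column per group. A minimal sketch, assuming the same DataFrame:

# Proportion of survivors by sex and by passenger class
print(passengers.groupby("Sex")["Survived"].mean())
print(passengers.groupby("Pclass")["Survived"].mean())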

I hope you have had fun exploring this data set. These are the top 3 Pandas tips I recommend for every Data Scientist's tool kit. Let me know in the comments section below if you think another Pandas tool should have made the list.


About the Author

Jonathan Fernandes

Jonathan has an undergraduate degree in Computer Science and an MBA from Warwick University. Data Science combines his love for numbers, coding and statistics. He enjoys working with people to make sense of their data, allowing them to make more informed decisions.
