Introduction to Plotting
import pandas as pd
import seaborn as sns
%matplotlib inline
Single Variable: Continuous
Histogram
When you have a single continuous variable and want to visualise the distribution of its values in your dataset, a histogram is generally what you need. This groups the values into bins, where each bin is an interval within the range of values your variable can take. The x axis will show the interval of each bin, while the y axis shows the number of values in your dataset that fall within that interval.
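To make the idea of binning concrete, here is a minimal sketch of the counting a histogram does behind the scenes. It assumes numpy is available and uses made-up toy values rather than any dataset from this article.

```python
import numpy as np

# Hypothetical toy values, not taken from any dataset in this article.
values = np.array([1, 2, 2, 3, 5, 8, 9, 10])

# Split the range of the data (1 to 10) into 3 equal-width bins
# and count how many values fall inside each one.
counts, edges = np.histogram(values, bins=3)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

Each pair of neighbouring edges is one bin on the x axis, and each count is the height of the bar drawn above it.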
Let’s load in some data using seaborn’s handy load_dataset() function. The flights dataset has three variables: two ordered categorical (year, month) and one continuous (number of passengers).
Input:
flights = sns.load_dataset('flights')
flights.head(3)
Output:
A simple histogram will show the overall distribution of the passengers variable. This is easy to plot, as pandas dataframes have a built-in method for generating it.
Input:
# Histogram of the passengers column, using the default number of bins
flights.passengers.hist()
Output:
By default, pandas plots histograms using 10 bins, but you can fine-tune this. Displaying more bins gives a more detailed view of the distribution, up to a point: it all depends on how many observations you have overall and how they are distributed. Below, you can see how using 20 bins reveals more about the distribution within each of the 5 larger bins.
Input:
flights.passengers.hist(bins=5) # The blue bars
flights.passengers.hist(bins=20) # The orange bars
Output:
So the passenger numbers range from a little over 100 to a bit over 600, with most values towards the lower end. For a more precise overview, the describe method of a dataframe’s column will give general descriptive statistics.
Input:
flights.passengers.describe()
Output:
count 144.000000
mean 280.298611
std 119.966317
min 104.000000
25% 180.000000
50% 265.500000
75% 360.500000
max 622.000000
Name: passengers, dtype: float64
Boxplot
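For a visual representation of describe, a boxplot will show the minimum and maximum values (the left and right whiskers), the range of values covered by the 25th to 75th percentiles (the box) and the value of the median (the line inside the box). By default, any points lying more than 1.5 times the width of the box beyond its edges are drawn individually as outliers, and the whiskers then stop at the most extreme points inside that limit.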
Input:
sns.boxplot(x=flights.passengers)
Output:
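As a rough sketch of where the parts of a boxplot come from, the numbers can be computed by hand with pandas. The toy values are made up, and the 1.5 x interquartile range rule for the whiskers is my understanding of the default behaviour, so treat it as an assumption.

```python
import pandas as pd

# Hypothetical toy values; the 100 is there to act as an obvious outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1 = values.quantile(0.25)      # left edge of the box
median = values.quantile(0.50)  # line inside the box
q3 = values.quantile(0.75)      # right edge of the box
iqr = q3 - q1                   # width of the box

# Assumed default rule: whiskers reach the most extreme points that lie
# within 1.5 * IQR of the box; anything further out is drawn on its own.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers.tolist())
```

When no value falls outside the fences, as appears to be the case for the passengers column, the whiskers simply end at the minimum and maximum.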
Single Variable: Categorical
Bar chart
When you have a variable which takes on named, rather than numerical, values then the most common way of representing them is with a bar chart.
Here, we’ll load the titanic dataset. Each row is a passenger on the ship, while the class variable gives the class of that passenger’s ticket.
Input:
titanic = sns.load_dataset('titanic')
titanic['class'].value_counts()
Output:
Third 491
First 216
Second 184
Name: class, dtype: int64
You can chain .plot(kind='bar') to the above value_counts() method, but I prefer to use seaborn as you can directly pass it the original data. It will then do the counting for you and allow you more control over appearance. For example, if you do not like the ordering seaborn used for the x axis, then you can set it manually as a list, e.g. order=['Third', 'Second', 'First'].
Input:
# Pass the dataframe and name the column to count on the x axis
sns.countplot(data=titanic, x='class')
Output:
If you want to normalise the counts so as to see relative percentages rather than counts, then you just need to do that to the data before plotting it as a normal barplot.
Input:
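# Normalised counts as a two-column dataframe; naming both columns explicitly
# keeps the result the same across pandas versions.
titanic_normed = titanic['class'].value_counts(normalize=True).rename_axis('class').reset_index(name='proportion')
sns.barplot(data=titanic_normed, x='class', y='proportion')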
Output:
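If it helps to see what the normalisation step is doing, here is a small sketch on made-up ticket classes (not the real titanic data). It compares dividing the counts by their total with the normalize=True shortcut.

```python
import pandas as pd

# Hypothetical ticket classes for eight passengers (not the real titanic data).
tickets = pd.Series(
    ["Third", "Third", "Third", "Third", "First", "First", "Second", "Second"],
    name="class",
)

counts = tickets.value_counts()                  # raw counts per class
by_hand = counts / counts.sum()                  # proportions computed manually
built_in = tickets.value_counts(normalize=True)  # same thing in one call

# Multiply by 100 to read the proportions as percentages.
percentages = (built_in * 100).round(1)
print(percentages)
```

Both routes give the same proportions, so which one you use before plotting is a matter of taste.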
Plotting relationships between variables
Above, we only had a single variable. We examined it by looking at the frequency of values (the histogram) or by plotting descriptive statistics (the boxplot). But often we want to see how one variable is linked to another – as the value of one variable changes, what happens to the value of the other variable?
With continuous and ordered/unordered categorical variables, we have four possible combinations. Let’s look at them in turn.
Continuous x continuous
The mpg dataset contains information about cars, measuring their weight, fuel efficiency and so on. We might expect heavier cars to have lower fuel efficiency.
When plotting continuous variables, the one you place on the x axis should be the independent variable. This is generally some property or value we observe. The y axis should display the dependent variable. This is a function of the values on the x axis and is generally something we measure for each observed value on the x axis. Here, we will place weight on the x axis and miles per gallon on the y axis.
Generally, the best choice of visualisation for this is a scatterplot. Each point represents the relation between a single value on the x axis and its corresponding y value.
Input:
mpg = sns.load_dataset('mpg')
g = sns.scatterplot(data=mpg, x='weight', y='mpg')
Output:
There are several variations on this, which are made available through seaborn’s jointplot. The default will add histograms on the margins, for each of the two variables.
Input:
mpg.head(3)
Output:
Input:
sns.jointplot(data=mpg, x='weight', y='mpg')
Output:
By setting the kind argument to kde, you can instead plot a joint kernel density estimate, with individual density estimates on the margins.
Input:
sns.jointplot(data=mpg, x='weight', y='mpg', kind='kde')
Output:
Or you can set it to hex and plot the values as hexagons, which represent histogram-type bins. This can be very useful if you have a lot of observations in your dataset and plotting all those points is slow or messy.
Input:
sns.jointplot(data=mpg, x='weight', y='mpg', kind='hex')
Output:
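The hexagons are a two-dimensional version of histogram bins. As a loose illustration, the sketch below counts made-up points in a grid of rectangular bins using numpy. Note that this swaps in rectangles for hexagons, so it shows the counting idea rather than what the hex plot computes exactly.

```python
import numpy as np

# Hypothetical toy points, chosen so the bin counts are easy to check by eye.
x = np.array([0.0, 0.1, 0.2, 0.9, 1.0])
y = np.array([0.0, 0.1, 0.9, 0.9, 1.0])

# Cut each axis into 2 intervals, giving a 2 x 2 grid of rectangular bins.
# counts[i, j] is the number of points in x-interval i and y-interval j.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=2)
print(counts)
```

A hex plot colours each cell by a count like these, which is why it stays readable when there are far too many points to draw individually.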
Continuous x unordered categorical
There are a few more options when it comes to jointly plotting continuous and categorical data. In general, the categorical data will go on the x axis and you may need to change the order in which they are displayed.
Let’s look at the relationship between fuel efficiency (continuous) and a car’s country of origin (unordered categorical). Seaborn’s stripplot will make a separate scatterplot for each category and place it on the x axis, with its own colour. It will also stagger the points a little to help see their distribution – this can be controlled with the jitter argument.
Input:
sns.stripplot(data=mpg, x='origin', y='mpg', jitter=0.3)
Output:
The swarmplot does the same but arranges the points so that there is no overlapping.
Input:
sns.swarmplot(data=mpg, x='origin', y='mpg')
Output:
And if you want a boxplot for each category, there is no need to draw them separately and manually place them in a figure – catplot is a great way to plot categorical x continuous data.
Input:
sns.catplot(data=mpg, x='origin', y='mpg', kind='box')
Output:
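For the numbers behind a set of per-category boxplots, one option is to group by the categorical column and describe the continuous one. This is a sketch on a made-up miniature stand-in for the mpg data, with the same column names.

```python
import pandas as pd

# Hypothetical miniature stand-in for the mpg data (made-up numbers).
cars = pd.DataFrame({
    "origin": ["usa", "usa", "usa", "japan", "japan", "japan", "europe", "europe"],
    "mpg": [15.0, 18.0, 21.0, 30.0, 32.0, 37.0, 24.0, 28.0],
})

# One row of summary statistics per category: the numbers a boxplot draws.
summary = cars.groupby("origin")["mpg"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```

Each row of the summary corresponds to one box in the plot, so the table and the figure can be read side by side.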
Continuous x ordered categorical
Sometimes, the categorical data will have a natural order to it. The most common of these is times or dates. This can sensibly be plotted as a line, to show how the continuous variable changes over time. Generally, the categorical data must be unique – no value should appear more than once.
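A quick way to check the uniqueness requirement, and one possible way to deal with repeats, is sketched below with made-up readings. Averaging is only one option and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical readings: two measurements were taken at timepoint 1.
readings = pd.DataFrame({
    "timepoint": [0, 1, 1, 2],
    "signal": [0.2, 0.5, 0.7, 0.4],
})

print(readings["timepoint"].is_unique)  # False: timepoint 1 appears twice

# One option: average the repeated measurements so each timepoint has one value.
averaged = readings.groupby("timepoint", as_index=False)["signal"].mean()
print(averaged["timepoint"].is_unique)  # True
```

As far as I know, seaborn’s lineplot will also aggregate repeated x values for you, but doing it explicitly makes it clear what the line represents.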
The gammas dataset contains fMRI measurements taken from multiple subjects. Let’s look at subject 0, and see how a signal which is dependent on blood oxygen levels (BOLD signal) changed over time in various regions of interest (ROI) in the brain.
Seaborn’s lineplot function has a hue argument that will separate out the three different values for ROI and plot them as their own lines.
Input:
gammas = sns.load_dataset('gammas')
subject_0_data = gammas[(gammas.subject == 0)]
sns.lineplot(data=subject_0_data, x='timepoint', y='BOLD signal', hue='ROI')
Output:
We could also focus on a particular ROI and then see how all subjects compare by setting hue="subject".
Input:
sns.lineplot(data=gammas[gammas.ROI == 'IPS'], x='timepoint', y='BOLD signal', hue='subject', legend=False)
# Remove the legend as it gets in the way with the default plot size.
Output:
Categorical x categorical
The most common non-graphical way of representing two categorical variables jointly is as a contingency table. Each row of the table represents a possible value of one variable and each column a possible value of the other. Each cell holds the number of observations with that pair of values.
We can create that table using pandas’ crosstab function – just tell it which columns of a dataframe to use.
Input:
titanic = sns.load_dataset('titanic')
titanic.head(3)
sex_class = pd.crosstab(titanic.sex, titanic['class'])
sex_class
Output:
We can also normalise the values to show percentages of the overall total, rather than counts.
Input:
sex_class_normed = pd.crosstab(titanic.sex, titanic['class'], normalize=True) * 100
sex_class_normed
Output:
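Normalising over the whole table is not the only choice. As a small sketch on made-up passengers, the normalize argument of crosstab can also be set to 'index' so that each row sums to 100 on its own.

```python
import pandas as pd

# Hypothetical passengers (made-up values, not the real titanic data).
people = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male", "male"],
    "class": ["First", "Third", "First", "Third", "Third", "Third"],
})

# normalize=True: each cell is a share of ALL passengers.
overall = pd.crosstab(people["sex"], people["class"], normalize=True) * 100

# normalize='index': each cell is a share of its own ROW, so rows sum to 100.
by_row = pd.crosstab(people["sex"], people["class"], normalize="index") * 100

print(overall.round(1))
print(by_row.round(1))
```

Which version to plot depends on the question: the overall table shows how common each combination is, while the row-wise table compares the class split within each sex.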
This tabular data is easy to represent visually as a heatmap. This essentially colours in the cells of the table, based on their value. It can be a great way to very quickly communicate the joint distribution of two categorical variables, especially where you want to highlight the fact that some particular combinations are very high or low.
Input:
sns.heatmap(sex_class, cmap='Blues', square=True, annot=True, fmt='g')
Output:
Input:
sns.heatmap(sex_class_normed, cmap='Blues', square=True, annot=True, fmt='.2f', cbar=False)
Output:
Summary
Questions to ask before plotting
Here are the questions to ask before you start plotting:

What is the purpose of my visualisation?
 Show the relationship between variables?
 Illustrate individual distributions of variables?

What kind of variables do I have? For each variable:
 Is it continuous?
 Or is it categorical?

Besides these variables, is there some other informative distinction I want to show? Do my variables come from…
 different groups of people/individuals/companies/locations?
 different time periods?
 different experiments?
 different models?

Have I included all the necessary information?
 Descriptive title?
 Informative caption?
 Axes have suitable labels?
 Units for axes, where appropriate?
 Axes using suitable scale?
 Do I need a legend?
 Do my colours and styling aid readability?
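Several items on the checklist above come down to a few calls on a matplotlib Axes object. Here is a minimal sketch with made-up numbers and labels. I am assuming plain matplotlib here, though seaborn’s axes-level functions such as scatterplot should hand back the same kind of object.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen; not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical measurements (made-up numbers, for illustration only).
weights_kg = [900, 1100, 1300, 1500, 1700]
efficiency_km_per_l = [18.0, 16.5, 14.0, 12.5, 10.0]

fig, ax = plt.subplots()
ax.scatter(weights_kg, efficiency_km_per_l, label="hypothetical cars")

ax.set_title("Heavier cars use more fuel")       # descriptive title
ax.set_xlabel("Weight (kg)")                     # axis label with units
ax.set_ylabel("Fuel efficiency (km per litre)")  # axis label with units
ax.legend()                                      # only needed if groups are shown
```

Captions usually live outside the figure, in the surrounding report or notebook text, so they are not covered by this sketch.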
Cheat sheet: picking a visualisation for your data
And a quick list, linking types of data to types of visualisation:
Single variable
 continuous
  histogram: more visual, big picture, show distribution of ranges of values
  boxplot: more statistical and detailed
 categorical
  barchart: show counts or proportions of values
Joint variables
 continuous x continuous
  scatterplot: show relation between every x and y
  basic jointplot: as above, but with marginal histograms per variable
  kde jointplot: show density estimate of joint values, with marginal density estimates per variable
  hex jointplot: like the basic jointplot, but points are grouped into hexagonal, histogram-type bins
 continuous x unordered categorical
  stripplot: multiple scatterplots arranged on x axis
  swarmplot: as above, but no overlapping points allowed
  catplot with boxplots: replace individual plots with boxplots
 continuous x ordered categorical
  line: shows exactly what values are seen over time
 categorical x categorical
  crosstabulate then heatmap: show relative proportions of joint variables
Next steps

Look into seaborn’s documentation for figure aesthetics and choosing colour palettes – these can make your visualisations look really great. The ones I did here use the default settings and could definitely be improved upon!

Think about how the plots could be improved in terms of the questions under “Have I included all the necessary information?”. Seaborn makes it very easy to add titles and so on to figures.

Seaborn also makes it easy to visualise many aspects of the data at once, rather than individually as we did here. Read the documentation for jointplot and catplot to see how flexible and easy to use these methods are!

Try applying the above to real data that you have, rather than the toy datasets used here.
About the author
Alexander Robertson is a Data Science PhD student at the University of Edinburgh, where his research focuses on variation, usage and change in natural language and also emoji.