# Deep learning for complete beginners: recognising handwritten digits

*[Edited on 19 March 2017, to account for API changes introduced by the release of Keras 2]*

## Introduction

Welcome to the first in a series of blog posts designed to get you quickly up to speed with *deep learning*; from first principles, all the way to discussions of some of the intricate details, with the purpose of achieving respectable performance on two established machine learning benchmarks: MNIST (classification of handwritten digits) and CIFAR-10 (classification of small images across 10 distinct classes – airplane, automobile, bird, cat, deer, dog, frog, horse, ship & truck).

*(Sample images from the MNIST and CIFAR-10 datasets.)*

The accelerated growth of deep learning has led to the development of several very convenient frameworks, which allow us to rapidly construct and prototype our models, as well as offering no-hassle access to established benchmarks such as the aforementioned two. The particular environment we will be using is *Keras*, which I've found to be the most convenient and intuitive for essential use, but still expressive enough to allow detailed model tinkering when it is necessary.

By the end of this part of the tutorial, you should be capable of understanding and producing a simple multilayer perceptron (MLP) deep learning model in Keras, achieving a respectable level of accuracy on MNIST. The next tutorial in the series will explore techniques for handling larger image classification tasks (such as CIFAR-10).

## (Artificial) neurons

While the term "deep learning" allows for a broader interpretation, in practice, for the vast majority of cases, it is applied to the model of (artificial) *neural networks*. These biologically inspired structures attempt to mimic the way in which the neurons in the brain process percepts from the environment and drive decision-making. In fact, a single artificial neuron (sometimes also called a *perceptron*) has a very simple mode of operation – it computes a weighted sum of all of its *inputs* \(\vec{x}\), using a *weight vector* \(\vec{w}\) (along with an additive *bias* term, \(w_0\)), and then potentially applies an *activation function*, \(\sigma\), to the result.

Some of the popular choices for activation functions include (plots given below):

- **identity**: \(\sigma(z) = z\);
- **sigmoid**: especially the *logistic* function, \(\sigma(z) = \frac{1}{1 + \exp(-z)}\), and the *hyperbolic tangent*, \(\sigma(z) = \tanh z\);
- **rectified linear (ReLU)**: \(\sigma(z) = \max(0, z)\).
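As a quick illustration, the activations above can be sketched in a few lines of NumPy (a minimal sketch; the function names are my own, not part of any framework):

```python
import numpy as np

def identity(z):
    return z  # leaves the weighted sum untouched

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes output into (0, 1)

def relu(z):
    return np.maximum(0.0, z)  # "if it's negative, set it to zero"

z = np.array([-2.0, 0.0, 2.0])
# logistic is 0.5 at z = 0; relu zeroes out the negative entry;
# np.tanh gives the hyperbolic tangent directly
```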

Original perceptron models (from the 1950s) were fully linear, i.e. they only employed the identity activation. It soon became evident that tasks of interest are often nonlinear in nature, which led to the use of other activation functions. Sigmoid functions (owing their name to their characteristic "S" shaped plot) provide a nice way to encode initial "uncertainty" of a neuron in a binary decision, when \(z\) is close to zero, coupled with quick saturation as \(z\) shifts in either direction. The two functions presented here are very similar, with the hyperbolic tangent giving outputs within \([-1, 1]\), and the logistic function giving outputs within \([0, 1]\) (and therefore being useful for representing probabilities).

In recent years, ReLU activations (and variations thereof) have become ubiquitous in deep learning – they started out as a simple, "engineer's" way to inject nonlinearity into the model ("if it's negative, set it to zero"), but turned out to be far more successful than the historically more popular sigmoid activations, and also have been linked to the way physical neurons transmit electrical potential. As such, we will focus exclusively on them in this tutorial.

A neuron is completely specified by its weight vector \(\vec{w}\), and the key aim of a *learning algorithm* is to assign appropriate weights to the neuron based on a *training set* of known input/output pairs, such that the notion of a "predictive error/loss" will be minimised when applying the inputs within the training set to the neuron. One typical example of such a learning algorithm is *gradient descent*, which will, for a particular loss function \(E(\vec{w})\), update the weight vector in the direction of steepest descent of the loss function, scaled by a *learning rate* parameter \(\eta\):

\[\vec{w} \leftarrow \vec{w} - \eta \frac{\partial E(\vec{w})}{\partial \vec{w}}\]

The loss function represents our belief of how "incorrect" the neuron is at making predictions under its current parameter values. The simplest such choice of a loss function is the *squared error loss*; for a particular training example \((\vec{x}, y)\) it is defined as the squared difference between the ground-truth label \(y\) and the output of the neuron when given \(\vec{x}\) as input:

\[E(\vec{w}) = \left(y - \sigma\left(w_0 + \sum_{i=1}^n w_ix_i\right)\right)^2\]
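To make the update rule concrete, here is a minimal sketch of one gradient-descent step for a single neuron, assuming the identity activation so that the gradient of the squared error has the closed form \(-2(y - \hat{y})x_i\) (the function and variable names are my own, purely illustrative):

```python
import numpy as np

def gd_step(w, x, y, eta=0.1):
    # w[0] plays the role of the bias w_0; identity activation assumed,
    # so dE/dw_i = -2 * (y - y_hat) * x_i
    x_aug = np.concatenate(([1.0], x))  # prepend a constant 1 for the bias
    y_hat = w @ x_aug                   # the neuron's output
    grad = -2.0 * (y - y_hat) * x_aug   # gradient of the squared error loss
    return w - eta * grad               # step in the direction of steepest descent

w = np.zeros(3)
x, y = np.array([1.0, 2.0]), 1.0
for _ in range(50):
    w = gd_step(w, x, y)
# after enough steps, the neuron's output on x approaches the target y
```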

There are many excellent tutorials online that provide a more in-depth overview of gradient descent algorithms – one of which may be found on this website! Here the framework will take care of the optimisation routines for us, and therefore I will not dedicate further attention to them.

## Enter neural networks (& deep learning)

Once we have a notion of a neuron, it is possible to connect outputs of neurons to inputs of other neurons, giving rise to *neural networks*. In general we will focus on *feedforward* neural networks, where these neurons are typically organised in layers, such that the neurons within a single layer process the outputs of the previous layer. The most potent of such architectures (a *multilayer perceptron* or MLP) fully connects all outputs of a layer to all the neurons in the following layer, as illustrated below.

The output neurons' weights can be updated by direct application of the previously mentioned gradient descent on a given loss function – for other neurons these losses need to be propagated backwards (by applying the chain rule for partial differentiation), thus giving rise to the *backpropagation* algorithm. Similarly as for the basic gradient descent algorithm, I will not focus on the mathematical derivations of the algorithm here, as the framework will be taking care of it for us.

By Cybenko's universal approximation theorem, a (wide enough) MLP with a single hidden layer of sigmoid neurons is capable of approximating *any* continuous real function on a bounded interval; however, the proof of this theorem is not constructive, and therefore does not offer an efficient training algorithm for learning such structures in general. Deep learning represents a response to this: rather than increasing the *width*, increase the *depth*; by definition, *any* neural network with more than one hidden layer is considered deep.

The shift in depth also often allows us to directly feed *raw input data* into the network; in the past, single-layer neural networks were run on *features* extracted from the input by carefully crafted feature functions. This meant that significantly different approaches were needed for, e.g., the problems of *computer vision*, *speech recognition* and *natural language processing*, impeding scientific collaboration across these fields. However, when a network has multiple hidden layers, it gains the capability to learn the feature functions that best describe the raw data *by itself*, making it applicable to *end-to-end* learning and allowing one to use the same kind of networks across a *wide variety* of tasks, eliminating the need to design feature functions as part of the pipeline. I will demonstrate graphical evidence of this in the second part of this tutorial, when we explore convolutional neural networks (CNNs).

## Applying a deep MLP to MNIST

As this post's objective, we will implement the simplest possible deep neural network – an MLP with two hidden layers – and apply it to the MNIST handwritten digit recognition task.

Only the following imports are required:

```
from keras.datasets import mnist # subroutines for fetching the MNIST dataset
from keras.models import Model # basic class for specifying and training a neural network
from keras.layers import Input, Dense # the two types of neural network layer we will be using
from keras.utils import np_utils # utilities for one-hot encoding of ground truth values
```

Next up, we'll define some parameters of our model. These are often called *hyperparameters*, because they are assumed to be fixed before training starts. For the purposes of this tutorial, we will stick to using some sensible values, but keep in mind that properly tuning them is a significant issue, which will be addressed in more detail in a future tutorial.

In particular, we will define:

- The *batch size*, representing the number of training examples being used simultaneously during a single iteration of the gradient descent algorithm;
- The number of *epochs*, representing the number of times the training algorithm will iterate over the entire training set before terminating;
- The *number of neurons* in each of the two hidden layers of the MLP.

```
batch_size = 128 # in each iteration, we consider 128 training examples at once
num_epochs = 20 # we iterate twenty times over the entire training set
hidden_size = 512 # there will be 512 neurons in both hidden layers
```

Now it is time to load and preprocess the MNIST data set. Keras makes this extremely simple, with a fixed interface for fetching and extracting the data from the remote server, directly into NumPy arrays.

To preprocess the input data, we will first flatten the images into 1D (as we will consider each pixel as a separate input feature), and we will then force the pixel intensity values to be in the \([0, 1]\) range by dividing them by \(255\). This is a very simple way to "normalise" the data, and I will be discussing other ways in future tutorials in this series.

A good approach to a classification problem is to use *probabilistic classification*, i.e. to have a single output neuron for each class, outputting a value which corresponds to the probability of the input being of that particular class. This implies a need to transform the training output data into a "one-hot" encoding: for example, if the desired output class is \(3\), and there are five classes overall (labelled \(0\) to \(4\)), then an appropriate one-hot encoding is: \([0\ 0\ 0\ 1\ 0]\). Keras, once again, provides us with an out-of-the-box functionality for doing just that.
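To see what this encoding produces, here is a hypothetical hand-rolled NumPy version of it (Keras's `to_categorical`, used below, does the equivalent job for us):

```python
import numpy as np

def one_hot(labels, num_classes):
    # each row gets a single 1 at the column index given by the label
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

one_hot([3], 5)  # → [[0. 0. 0. 1. 0.]], matching the example above
```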

```
num_train = 60000 # there are 60000 training examples in MNIST
num_test = 10000 # there are 10000 test examples in MNIST
height, width, depth = 28, 28, 1 # MNIST images are 28x28 and greyscale
num_classes = 10 # there are 10 classes (1 per digit)
(X_train, y_train), (X_test, y_test) = mnist.load_data() # fetch MNIST data
X_train = X_train.reshape(num_train, height * width) # Flatten data to 1D
X_test = X_test.reshape(num_test, height * width) # Flatten data to 1D
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255 # Normalise data to [0, 1] range
X_test /= 255 # Normalise data to [0, 1] range
Y_train = np_utils.to_categorical(y_train, num_classes) # One-hot encode the labels
Y_test = np_utils.to_categorical(y_test, num_classes) # One-hot encode the labels
```

Now is the time to actually define our model! To do this we will be using a stack of three `Dense` layers, which correspond to a fully unrestricted MLP structure, linking all of the outputs of the previous layer to the inputs of the next one. We will use ReLU activations for the neurons in the first two layers, and a *softmax* activation for the neurons in the final one. This activation is designed to turn any real-valued vector into a vector of probabilities, and is defined as follows, for the \(j\)-th neuron:

\[\sigma(\vec{z})_j = \frac{\exp(z_j)}{\sum_i \exp(z_i)}\]

An excellent feature of Keras, that sets it apart from frameworks such as TensorFlow, is *automatic inference of shapes*; we only need to specify the shape of the input layer, and afterwards Keras will take care of initialising the weight variables with proper shapes. Once all the layers have been defined, we simply need to identify the input(s) and the output(s) in order to define our model, as illustrated below.

```
inp = Input(shape=(height * width,)) # Our input is a 1D vector of size 784
hidden_1 = Dense(hidden_size, activation='relu')(inp) # First hidden ReLU layer
hidden_2 = Dense(hidden_size, activation='relu')(hidden_1) # Second hidden ReLU layer
out = Dense(num_classes, activation='softmax')(hidden_2) # Output softmax layer
model = Model(inputs=inp, outputs=out) # To define a model, just specify its input and output layers
```

To finish off specifying the model, we need to define our loss function, the optimisation algorithm to use, and which metrics to report.

When dealing with probabilistic classification, it is actually better to use the *cross-entropy* loss, rather than the previously defined squared error. For a particular output probability vector \(\vec{y}\), compared with our (ground truth) one-hot vector \(\vec{\hat{y}}\), the loss (for \(k\)-class classification) is defined by

\[\mathcal{L}(\vec{y}, \vec{\hat{y}}) = - \sum_{i=1}^k \hat{y}_i \ln y_i\]

This loss is better for probabilistic tasks (i.e. ones with logistic/softmax output neurons), primarily because of its manner of derivation – it aims *only* to maximise the model's confidence in the *correct* class, and is not concerned with the distribution of probabilities for other classes (while the squared error loss would dedicate equal attention to getting all of the other class probabilities as close to zero as possible). This is due to the fact that incorrect classes, i.e. classes \(i'\) with \(\hat{y}_{i'} = 0\), eliminate the respective neuron's output from the loss function.
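This behaviour is easy to verify numerically; a minimal sketch of the loss above (the small `eps` term, my own addition, merely guards against taking \(\ln 0\)):

```python
import numpy as np

def cross_entropy(y_true_onehot, y_pred, eps=1e-12):
    # only the term where the one-hot entry is 1 survives the sum
    return -np.sum(y_true_onehot * np.log(y_pred + eps))

truth = np.array([0.0, 1.0, 0.0])
confident = np.array([0.05, 0.90, 0.05])
unsure = np.array([0.30, 0.40, 0.30])
# a confident, correct prediction incurs a lower loss than an unsure one,
# regardless of how the remaining probability mass is spread
```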

The optimisation algorithm used will typically revolve around some form of gradient descent; their key differences revolve around the manner in which the previously mentioned *learning rate*, \(\eta\), is chosen or adapted during training. An excellent overview of such approaches is given by this blog post; here we will use the Adam optimiser, which typically performs well.

As our classes are balanced (there is an equal number of handwritten digits across all ten classes), an appropriate metric to report is the *accuracy*: the proportion of the inputs classified correctly.
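Spelled out, the accuracy metric amounts to the following sketch (assuming one-hot ground truths and probability-vector predictions; Keras computes this for us when we pass `metrics=['accuracy']`):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # an example counts as correct when the predicted class (argmax of the
    # probability vector) matches the true class (argmax of the one-hot label)
    return np.mean(np.argmax(y_true, axis=1) == np.argmax(y_pred, axis=1))
```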

```
model.compile(loss='categorical_crossentropy', # using the cross-entropy loss function
              optimizer='adam', # using the Adam optimiser
              metrics=['accuracy']) # reporting the accuracy
```

Finally, we call the training algorithm with the determined batch size and epoch count. It is good practice to set aside a fraction of the training data to be used just for verification that our algorithm is (still) properly generalising (this is commonly referred to as the *validation set*); here we will hold out \(10\%\) of the data for this purpose.

An excellent out-of-the-box feature of Keras is *verbosity*; it's able to provide detailed real-time pretty-printing of the training algorithm's progress.

```
model.fit(X_train, Y_train, # Train the model using the training set...
          batch_size=batch_size, epochs=num_epochs,
          verbose=1, validation_split=0.1) # ...holding out 10% of the data for validation
model.evaluate(X_test, Y_test, verbose=1) # Evaluate the trained model on the test set!
```

```
Train on 54000 samples, validate on 6000 samples
Epoch 1/20
54000/54000 [==============================] - 9s - loss: 0.2295 - acc: 0.9325 - val_loss: 0.1093 - val_acc: 0.9680
Epoch 2/20
54000/54000 [==============================] - 9s - loss: 0.0819 - acc: 0.9746 - val_loss: 0.0922 - val_acc: 0.9708
Epoch 3/20
54000/54000 [==============================] - 11s - loss: 0.0523 - acc: 0.9835 - val_loss: 0.0788 - val_acc: 0.9772
Epoch 4/20
54000/54000 [==============================] - 12s - loss: 0.0371 - acc: 0.9885 - val_loss: 0.0680 - val_acc: 0.9808
Epoch 5/20
54000/54000 [==============================] - 12s - loss: 0.0274 - acc: 0.9909 - val_loss: 0.0772 - val_acc: 0.9787
Epoch 6/20
54000/54000 [==============================] - 12s - loss: 0.0218 - acc: 0.9931 - val_loss: 0.0718 - val_acc: 0.9808
Epoch 7/20
54000/54000 [==============================] - 12s - loss: 0.0204 - acc: 0.9933 - val_loss: 0.0891 - val_acc: 0.9778
Epoch 8/20
54000/54000 [==============================] - 13s - loss: 0.0189 - acc: 0.9936 - val_loss: 0.0829 - val_acc: 0.9795
Epoch 9/20
54000/54000 [==============================] - 14s - loss: 0.0137 - acc: 0.9950 - val_loss: 0.0835 - val_acc: 0.9797
Epoch 10/20
54000/54000 [==============================] - 13s - loss: 0.0108 - acc: 0.9969 - val_loss: 0.0836 - val_acc: 0.9820
Epoch 11/20
54000/54000 [==============================] - 13s - loss: 0.0123 - acc: 0.9960 - val_loss: 0.0866 - val_acc: 0.9798
Epoch 12/20
54000/54000 [==============================] - 13s - loss: 0.0162 - acc: 0.9951 - val_loss: 0.0780 - val_acc: 0.9838
Epoch 13/20
54000/54000 [==============================] - 12s - loss: 0.0093 - acc: 0.9968 - val_loss: 0.1019 - val_acc: 0.9813
Epoch 14/20
54000/54000 [==============================] - 12s - loss: 0.0075 - acc: 0.9976 - val_loss: 0.0923 - val_acc: 0.9818
Epoch 15/20
54000/54000 [==============================] - 12s - loss: 0.0118 - acc: 0.9965 - val_loss: 0.1176 - val_acc: 0.9772
Epoch 16/20
54000/54000 [==============================] - 12s - loss: 0.0119 - acc: 0.9961 - val_loss: 0.0838 - val_acc: 0.9803
Epoch 17/20
54000/54000 [==============================] - 12s - loss: 0.0073 - acc: 0.9976 - val_loss: 0.0808 - val_acc: 0.9837
Epoch 18/20
54000/54000 [==============================] - 13s - loss: 0.0082 - acc: 0.9974 - val_loss: 0.0926 - val_acc: 0.9822
Epoch 19/20
54000/54000 [==============================] - 12s - loss: 0.0070 - acc: 0.9979 - val_loss: 0.0808 - val_acc: 0.9835
Epoch 20/20
54000/54000 [==============================] - 11s - loss: 0.0039 - acc: 0.9987 - val_loss: 0.1010 - val_acc: 0.9822
10000/10000 [==============================] - 1s
[0.099321320021623111, 0.9819]
```

As can be seen, our model achieves an accuracy of \(\sim 98.2\%\) on the test set; this is quite respectable for such a simple model, despite being outclassed by state-of-the-art approaches enumerated here.

I encourage you to play around with this model: attempt different hyperparameter values/optimisation algorithms/activation functions, add more hidden layers, etc. Eventually, you should be able to achieve accuracies above \(99\%\).

## Conclusion

Throughout this post we have covered the essentials of deep learning, and successfully implemented a simple two-layer deep MLP in Keras, applying it to MNIST, all in under 30 lines of code.

Next time around, we will explore convolutional neural networks (CNNs), resolving some of the issues posed by applying MLPs to larger image tasks (such as CIFAR-10).

## About the author

Petar is currently a Research Assistant in Computational Biology within the Artificial Intelligence Group of the Cambridge University Computer Laboratory, where he is working on developing machine learning algorithms on complex networks, and their applications to bioinformatics. He is also a PhD student within the group, supervised by Dr Pietro Liò and affiliated with Trinity College. He holds a BA degree in Computer Science from the University of Cambridge, having completed the Computer Science Tripos in 2015.

## Just show me the code!

```
from keras.datasets import mnist # subroutines for fetching the MNIST dataset
from keras.models import Model # basic class for specifying and training a neural network
from keras.layers import Input, Dense # the two types of neural network layer we will be using
from keras.utils import np_utils # utilities for one-hot encoding of ground truth values
batch_size = 128 # in each iteration, we consider 128 training examples at once
num_epochs = 20 # we iterate twenty times over the entire training set
hidden_size = 512 # there will be 512 neurons in both hidden layers
num_train = 60000 # there are 60000 training examples in MNIST
num_test = 10000 # there are 10000 test examples in MNIST
height, width, depth = 28, 28, 1 # MNIST images are 28x28 and greyscale
num_classes = 10 # there are 10 classes (1 per digit)
(X_train, y_train), (X_test, y_test) = mnist.load_data() # fetch MNIST data
X_train = X_train.reshape(num_train, height * width) # Flatten data to 1D
X_test = X_test.reshape(num_test, height * width) # Flatten data to 1D
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255 # Normalise data to [0, 1] range
X_test /= 255 # Normalise data to [0, 1] range
Y_train = np_utils.to_categorical(y_train, num_classes) # One-hot encode the labels
Y_test = np_utils.to_categorical(y_test, num_classes) # One-hot encode the labels
inp = Input(shape=(height * width,)) # Our input is a 1D vector of size 784
hidden_1 = Dense(hidden_size, activation='relu')(inp) # First hidden ReLU layer
hidden_2 = Dense(hidden_size, activation='relu')(hidden_1) # Second hidden ReLU layer
out = Dense(num_classes, activation='softmax')(hidden_2) # Output softmax layer
model = Model(inputs=inp, outputs=out) # To define a model, just specify its input and output layers
model.compile(loss='categorical_crossentropy', # using the cross-entropy loss function
              optimizer='adam', # using the Adam optimiser
              metrics=['accuracy']) # reporting the accuracy
model.fit(X_train, Y_train, # Train the model using the training set...
          batch_size=batch_size, epochs=num_epochs,
          verbose=1, validation_split=0.1) # ...holding out 10% of the data for validation
model.evaluate(X_test, Y_test, verbose=1) # Evaluate the trained model on the test set!
```