# Deep learning for complete beginners: convolutional neural networks with Keras

[Edited on 20 March 2017, to account for API changes introduced by the release of Keras 2]

## Introduction

Welcome to the second in a series of blog posts designed to get you quickly up to speed with deep learning, from first principles all the way to discussions of some of the intricate details, with the purpose of achieving respectable performance on two established machine learning benchmarks: MNIST (classification of handwritten digits) and CIFAR-10 (classification of small images across 10 distinct classes – airplane, automobile, bird, cat, deer, dog, frog, horse, ship & truck).

MNIST CIFAR-10

Last time around, I introduced the fundamental concepts of deep learning, and illustrated how models can be rapidly developed and prototyped by leveraging the Keras deep learning framework. In the end, a two-layer multilayer perceptron (MLP) was applied to MNIST, achieving an accuracy level of 98.2%, which can be quite easily improved upon. Ultimately, however, fully connected MLPs will usually not be the model of choice for image-related tasks – it is far more typical to take advantage of a convolutional neural network (CNN) in this case. By the end of this part of the tutorial, you should be capable of understanding and producing a simple CNN in Keras, achieving a respectable level of accuracy on CIFAR-10.

This tutorial will, for the most part, assume familiarity with the previous one in the series.

## Image processing

The previously mentioned multilayer perceptrons represent the most general and powerful feedforward neural network model possible; they are organised in layers, such that every neuron within a layer receives its own copy of all the outputs of the previous layer as its input. This kind of model is perfect for the right kind of problem – learning from a fixed number of (more or less) unstructured parameters.

However, consider what happens to the number of parameters (weights) of such a model when being fed raw image data. CIFAR-10, for example, contains 32×32×3 coloured images: if we are to treat each channel of each pixel as an independent input to an MLP, each neuron of the first hidden layer adds $$\sim$$ 3000 new parameters to the model! The situation quickly becomes unmanageable as image sizes grow larger, way before reaching the kind of images people usually want to work with in real applications.
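To make this growth concrete, here is a small back-of-the-envelope sketch in plain Python. The helper name `dense_layer_params`, the hidden-layer width of 512 and the 224×224 image size are illustrative assumptions, not values fixed by the discussion above.

```python
def dense_layer_params(num_inputs, num_neurons):
    """Weights plus one bias per neuron for a fully connected layer."""
    return num_inputs * num_neurons + num_neurons

cifar_inputs = 32 * 32 * 3       # every channel of every pixel is one input
large_inputs = 224 * 224 * 3     # a larger (illustrative) image size

print(cifar_inputs)                           # weights added per neuron: 3072
print(dense_layer_params(cifar_inputs, 512))  # 1573376
print(dense_layer_params(large_inputs, 512))  # 77070848
```

Even a single modest hidden layer already needs over a million parameters on CIFAR-10-sized inputs, and tens of millions on the larger image.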

A common solution is to downsample the images to a size where MLPs can safely be applied. However, if we directly downsample the image, we potentially lose a wealth of information; it would be great if we could still do some useful processing of the image prior to downsampling, without causing an explosion in parameter count.

## Convolutions

It turns out that there is a very efficient way of pulling this off, and it takes advantage of the structure of the information encoded within an image – it is assumed that pixels that are spatially closer together will "cooperate" on forming a particular feature of interest much more than ones on opposite corners of the image. Also, if a particular (smaller) feature is found to be of great importance when defining an image's label, it will be equally important wherever in the image it is found.

Enter the convolution operator. Given a two-dimensional image, $$I$$, and a small matrix, $$K$$, of size $$h \times w$$ (known as a convolution kernel), which we assume encodes a way of extracting an interesting image feature, we compute the convolved image, $$I * K$$, by overlaying the kernel on top of the image in all possible ways, and recording the sum of elementwise products between the image and the kernel:

$(I * K)_{xy} = \sum_{i=1}^h \sum_{j=1}^w {K_{ij} \cdot I_{x + i - 1, y + j - 1}}$

(In fact, the exact definition would require us to flip the kernel matrix first, but for the purposes of machine learning it is irrelevant whether this is done)
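As a minimal sketch, the formula above can be written directly in NumPy. The function name `convolve2d_valid`, the toy 4×4 image and the horizontal-difference kernel are all illustrative choices; the loops are kept explicit for readability rather than speed, and indices are 0-based here rather than 1-based.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no flipping, no padding, stride 1)."""
    h, w = kernel.shape
    out_h = image.shape[0] - h + 1
    out_w = image.shape[1] - w + 1
    out = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            # sum of elementwise products between the kernel and the patch under it
            out[x, y] = np.sum(kernel * image[x:x + h, y:y + w])
    return out

# A 4x4 image with a vertical edge between the second and third columns
image = np.array([[0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.],
                  [0., 0., 1., 1.]])
# A simple horizontal-difference kernel (an illustrative choice)
kernel = np.array([[-1., 1.]])

edges = convolve2d_valid(image, kernel)
print(edges)  # every row is [0, 1, 0]: the response peaks at the edge
```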

The images below show a diagrammatical overview of the above formula and the result of applying convolution (with two separate kernels) over an image, to act as an edge detector:

## Convolutional and pooling layers

The convolution operator forms the fundamental basis of the convolutional layer of a CNN. The layer is completely specified by a certain number of kernels, $$\vec{K}$$ (along with additive biases, $$\vec{b}$$, one per kernel), and it operates by computing the convolution of the output images of a previous layer with each of those kernels, afterwards adding the biases (one per output image). Finally, an activation function, $$\sigma$$, may be applied to all of the pixels of the output images. Typically, the input to a convolutional layer will have $$d$$ channels (e.g., red/green/blue in the input layer), in which case the kernels are extended to have this number of channels as well, making the final formula of a single output image channel of a convolutional layer (for a kernel $$K$$ and bias $$b$$) as follows:

$\mathrm{conv}(I, K)_{xy} = \sigma\left(b + \sum_{i=1}^h \sum_{j=1}^w \sum_{k=1}^d {K_{ijk} \cdot I_{x + i - 1, y + j - 1, k}}\right)$
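The multi-channel formula above can be sketched in NumPy as follows. This is an illustration only: the name `conv_channel` is hypothetical, ReLU is assumed for $$\sigma$$, and no padding is applied.

```python
import numpy as np

def conv_channel(image, kernel, bias):
    """One output channel: image is (H, W, d), kernel is (h, w, d), bias is a scalar."""
    h, w, d = kernel.shape
    assert image.shape[2] == d, "kernel and image must have the same number of channels"
    out_h = image.shape[0] - h + 1
    out_w = image.shape[1] - w + 1
    out = np.zeros((out_h, out_w))
    for x in range(out_h):
        for y in range(out_w):
            patch = image[x:x + h, y:y + w, :]
            # bias plus the sum over kernel height, width and all d channels
            out[x, y] = bias + np.sum(kernel * patch)
    return np.maximum(out, 0.0)  # sigma is assumed to be ReLU in this sketch

image = np.ones((5, 5, 3))       # a toy 5x5 image with 3 channels
kernel = np.ones((3, 3, 3))      # a 3x3 kernel extended to 3 channels
response = conv_channel(image, kernel, bias=-20.0)
print(response)  # a 3x3 output where every entry is 27 - 20 = 7
```

A full convolutional layer simply repeats this for each of its kernels and stacks the results as the channels of its output.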

Note that, since all we're doing here is addition and scaling of the input pixels, the kernels may be learned from a given training dataset via gradient descent, exactly as the weights of an MLP. In fact, an MLP is perfectly capable of replicating a convolutional layer, but it would require a lot more training time (and data) to learn to approximate that mode of operation.

Finally, let's just note that a convolutional operator is in no way restricted to two-dimensionally structured data: in fact, most machine learning frameworks (Keras included) will provide you with out-of-the-box layers for 1D and 3D convolutions as well!

It is important to note that, while a convolutional layer significantly decreases the number of parameters compared to a fully connected (FC) layer, it introduces more hyperparameters – parameters whose values need to be chosen before training starts.

Namely, the hyperparameters to choose within a single convolutional layer are:

- depth: how many different kernels (and biases) will be convolved with the output of the previous layer;
- height and width of each kernel;
- stride: by how much we shift the kernel in each step to compute the next pixel in the result. This specifies the overlap between individual output pixels, and typically it is set to $$1$$, corresponding to the formula given before. Note that larger strides result in smaller output sizes.
- padding: note that convolution by any kernel larger than $$1\times 1$$ will decrease the output image size – it is often desirable to keep sizes the same, in which case the image is sufficiently padded with zeroes at the edges. This is often called "same" padding, as opposed to "valid" (no) padding. It is possible to add arbitrary levels of padding, but typically the padding of choice will be either same or valid.
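To see how kernel size, stride and padding interact, here is a small sketch that computes the output size along one spatial dimension. It assumes the convention commonly used by TensorFlow-style "valid" and "same" padding; the helper name `conv_output_size` is hypothetical.

```python
import math

def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output length along one spatial dimension of a convolution."""
    if padding == "valid":
        # only positions where the kernel fits entirely inside the image
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # zero-padding chosen so that stride 1 preserves the input size
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")

print(conv_output_size(32, 3))                            # 30
print(conv_output_size(32, 3, padding="same"))            # 32
print(conv_output_size(32, 3, stride=2))                  # 15
print(conv_output_size(32, 3, stride=2, padding="same"))  # 16
```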

As already hinted, convolutions are not typically meant to be the sole operation in a CNN (although there have been promising recent developments on all-convolutional networks); but rather to extract useful features of an image prior to downsampling it sufficiently to be manageable by an MLP.

A very popular approach to downsampling is a pooling layer, which consumes small and (usually) disjoint chunks of the image (typically $$2\times 2$$) and aggregates them into a single value. There are several possible schemes for the aggregation – the most popular being max-pooling, where the maximum pixel value within each chunk is taken. A diagrammatical illustration of $$2\times 2$$ max-pooling is given below.
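A minimal NumPy sketch of max-pooling over disjoint chunks is given here; the function name `max_pool2d` and the toy input are illustrative, and only a single-channel image is handled.

```python
import numpy as np

def max_pool2d(image, pool=2):
    """Max-pool a (H, W) image over disjoint pool x pool chunks."""
    H, W = image.shape
    H2, W2 = H // pool, W // pool
    # drop any rows/columns that do not fill a whole chunk
    trimmed = image[:H2 * pool, :W2 * pool]
    # axes 1 and 3 index the position inside each chunk
    chunks = trimmed.reshape(H2, pool, W2, pool)
    return chunks.max(axis=(1, 3))

image = np.array([[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 1, 5, 6],
                  [2, 2, 7, 8]])
pooled = max_pool2d(image)
print(pooled)  # [[4 2]
               #  [2 8]]
```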

## Putting it all together: a common CNN

Now that we have all the building blocks, let's see what a typical convolutional neural network might look like!

A typical CNN architecture for $$k$$-class image classification can be split into two distinct parts – a chain of repeating $$\mathrm{Conv}\rightarrow\mathrm{Pool}$$ layers (sometimes with more than one convolutional layer at once), followed by a few fully connected layers (taking each pixel of the computed images as an independent input), culminating in a $$k$$-way softmax layer, whose output is trained by minimising a cross-entropy loss. I did not draw the activation functions here to make the sketch clearer, but do keep in mind that typically after every convolutional or fully connected layer, an activation (e.g., ReLU) will be applied to all of the outputs.

Note the effect of a single $$\mathrm{Conv}\rightarrow\mathrm{Pool}$$ pass through the image: it reduces height and width of the individual channels in favour of their number, i.e., depth.

The softmax layer and cross-entropy loss are both introduced in more detail in the previous tutorial. To summarise, a softmax layer converts any vector of real numbers into a vector of probabilities (nonnegative real values that add up to 1). Within this context, the probabilities correspond to the likelihoods that an input image is a member of a particular class. Minimising the cross-entropy loss has the effect of maximising the model's confidence in the correct class, without being concerned for the probabilities for other classes – this makes it a more suitable choice for probabilistic tasks compared to, for example, the squared error loss.
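As a quick refresher, both operations fit in a few lines of NumPy. This is a sketch for a single example; the score vector is made up, and subtracting the maximum before exponentiating is a standard numerical-stability trick rather than part of the definition.

```python
import numpy as np

def softmax(z):
    """Map a vector of real scores to probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, 0.1])  # hypothetical network outputs for 3 classes
probs = softmax(scores)
loss_if_class_0 = cross_entropy(probs, 0)  # the model's most confident class
loss_if_class_2 = cross_entropy(probs, 2)  # the model's least confident class
print(probs, loss_if_class_0, loss_if_class_2)
```

The loss depends only on the probability given to the correct class, so it is small when the model is confident and right, and large when the correct class receives little probability.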

## Detour: Overfitting, regularisation and dropout

This will be the first (and hopefully the only) time that I divert your attention to a seemingly unrelated topic. It regards a very important pitfall of machine learning – overfitting a model to the training data. While this is primarily going to be a major topic of the next tutorial in the series, the negative effects of overfitting will tend to become quite noticeable on networks like the one we are about to build, and we need to introduce a way to properly protect ourselves against it before going any further. Luckily, there is a very simple technique we can use.

Overfitting corresponds to adapting our model to the training set to such extremes that its generalisation potential (performance on samples outside of the training set) is severely limited. In other words, our model might have learned the training set (along with any noise present within it) perfectly, but it has failed to capture the underlying process that generated it. To illustrate, consider a problem of fitting a sine curve, with white additive noise applied to the data points:

Here we have a training set (denoted by blue circles) derived from the original sine wave, along with some noise. If we fit a degree-3 polynomial to this data, we get a fairly good approximation to the original curve. Someone might argue that a degree-14 polynomial would do better; indeed, given we have 15 points, such a fit would perfectly describe the training data. However, in this case, the additional parameters of the model cause catastrophic results: to cope with the inherent noise of the data, anywhere except in the closest vicinity of the training points, our fit is completely off.
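The experiment described above can be reproduced with a short NumPy sketch. The noise level, random seed and evaluation grid are assumptions of mine, so the exact numbers will differ from the figure; `Polynomial.fit` is used because it rescales the inputs internally, which keeps the degree-14 fit numerically well behaved.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
x_train = np.linspace(0.0, 2.0 * np.pi, 15)
y_train = np.sin(x_train) + rng.normal(scale=0.2, size=x_train.shape)

# A dense grid on which to compare each fit with the noise-free sine wave
x_dense = np.linspace(0.0, 2.0 * np.pi, 500)
y_true = np.sin(x_dense)

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

results = {}
for degree in (3, 14):
    fit = Polynomial.fit(x_train, y_train, degree)
    results[degree] = {
        "train": mse(fit(x_train), y_train),       # error on the noisy training points
        "true_curve": mse(fit(x_dense), y_true),   # error against the underlying sine
    }
    print(degree, results[degree])
```

The degree-14 polynomial passes through all 15 training points, so its training error is essentially zero; the quantity to watch is its error against the true curve, which is typically far worse than that of the degree-3 fit.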

Deep convolutional neural networks have a large number of parameters, especially in the fully connected layers. Overfitting might often manifest in the following form: if we don't have sufficiently many training examples, a small group of neurons might become responsible for doing most of the processing, with other neurons becoming redundant; or, at the other extreme, some neurons might actually become detrimental to performance, with several other neurons of their layer ending up doing nothing else but correcting for their errors.

To help our models generalise better in these circumstances, we introduce techniques of regularisation: rather than reducing the number of parameters, we impose constraints on the model parameters during training to keep them from learning the noise in the training data. The particular method I will introduce here is dropout – a technique that initially might seem like "dark magic", but actually helps to eliminate exactly the failure modes described above. Namely, dropout with parameter $$p$$ will, within a single training iteration, go through all neurons in a particular layer and, with probability $$p$$, completely eliminate them from the network throughout the iteration. This has the effect of forcing the neural network to cope with failures, and not to rely on existence of a particular neuron (or set of neurons) – relying more on a consensus of several neurons within a layer. This is a very simple technique that works quite well already for combatting overfitting on its own, without introducing further regularisers. An illustration is given below.
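A minimal NumPy sketch of dropout on one layer's outputs follows. Note one assumption that goes beyond the description above: the surviving activations are rescaled by $$1/(1-p)$$ (so-called "inverted dropout"), a common implementation choice that keeps the expected activation unchanged so that nothing needs adjusting at test time. The function name and toy input are illustrative.

```python
import numpy as np

def dropout(activations, p, rng, training=True):
    """Zero each activation independently with probability p during training."""
    if not training or p == 0.0:
        return activations  # dropout is disabled at test time
    keep_mask = rng.random(activations.shape) >= p
    # Rescale the survivors so the expected activation is unchanged
    # ("inverted dropout"; an assumption of this sketch)
    return activations * keep_mask / (1.0 - p)

rng = np.random.default_rng(42)
layer_output = np.ones(10000)            # toy activations of one layer
dropped = dropout(layer_output, p=0.5, rng=rng)
fraction_dropped = float(np.mean(dropped == 0.0))
print(fraction_dropped)  # close to 0.5
```

A fresh mask is drawn on every call, which corresponds to a different random subset of neurons being removed in each training iteration.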

## Applying a deep CNN to CIFAR-10

As this post's objective, we will implement a deep convolutional neural network – and apply it on the CIFAR-10 image classification task.

Imports are largely similar to last time, apart from the fact that we will be using a wider variety of layers:

from keras.datasets import cifar10 # subroutines for fetching the CIFAR-10 dataset
from keras.models import Model # basic class for specifying and training a neural network
from keras.layers import Input, Convolution2D, MaxPooling2D, Dense, Dropout, Flatten
from keras.utils import np_utils # utilities for one-hot encoding of ground truth values
import numpy as np
Using TensorFlow backend.

As already mentioned, a CNN will typically have more hyperparameters than an MLP. For the purposes of this tutorial, we will also stick to "sensible" hand-picked values for them, but do still keep in mind that later on I will introduce a more proper method for learning them.

The hyperparameters are:

• The batch size, representing the number of training examples being used simultaneously during a single iteration of the gradient descent algorithm;
• The number of epochs, representing the number of times the training algorithm will iterate over the entire training set before terminating [1];
• The kernel sizes in the convolutional layers;
• The pooling size in the pooling layers;
• The number of kernels in the convolutional layers;
• The dropout probability (we will apply dropout after each pooling, and after the fully connected layer);
• The number of neurons in the fully connected layer of the MLP.

batch_size = 32 # in each iteration, we consider 32 training examples at once
num_epochs = 200 # we iterate 200 times over the entire training set
kernel_size = 3 # we will use 3x3 kernels throughout
pool_size = 2 # we will use 2x2 pooling throughout
conv_depth_1 = 32 # we will initially have 32 kernels per conv. layer...
conv_depth_2 = 64 # ...switching to 64 after the first pooling layer
drop_prob_1 = 0.25 # dropout after pooling with probability 0.25
drop_prob_2 = 0.5 # dropout in the FC layer with probability 0.5
hidden_size = 512 # the FC layer will have 512 neurons

Loading and preprocessing the CIFAR-10 dataset is done in exactly the same way as for MNIST, with Keras routines doing most of the work. The sole difference is that now we do not initially consider each pixel an independent input feature, and therefore we do not reshape the input to 1D. We will once again force the pixel intensity values to be in the range $$[0, 1]$$, and use a one-hot encoding for the output labels.

However, this time around, this stage will be done in a more general way, to allow you to adapt it more easily to new datasets: the sizes will be extracted from the dataset rather than hardcoded, the number of classes is inferred from the number of unique labels in the training set, and the normalisation is performed via division by the maximum value in the training set.

N.B.: we will divide the testing set by the maximum of the training set, because our algorithms are not allowed to see the testing data before the learning process is complete, and therefore we are not allowed to compute any statistics on it, other than performing transformations derived entirely from the training set.

(X_train, y_train), (X_test, y_test) = cifar10.load_data() # fetch CIFAR-10 data

num_train, height, width, depth = X_train.shape # there are 50000 training examples in CIFAR-10
num_test = X_test.shape[0] # there are 10000 test examples in CIFAR-10
num_classes = np.unique(y_train).shape[0] # there are 10 image classes

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
max_value = np.max(X_train) # Statistics are computed on the training set only
X_train /= max_value # Normalise data to [0, 1] range
X_test /= max_value # Normalise test data using the training-set maximum

Y_train = np_utils.to_categorical(y_train, num_classes) # One-hot encode the labels
Y_test = np_utils.to_categorical(y_test, num_classes) # One-hot encode the labels

Modelling time! Our network will consist of four Convolution2D layers, with a MaxPooling2D layer following after the second and the fourth convolution. After the first pooling layer, we double the number of kernels (in line with the previously mentioned principle of sacrificing height and width for more depth). Afterwards, the output of the second pooling layer is flattened to 1D (via the Flatten layer), and passed through two fully connected (Dense) layers. ReLU activations will once again be used for all layers except the output dense layer, which will use a softmax activation (for purposes of probabilistic classification).
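Before building the network, it can help to trace how the data shape and the parameter count evolve through the architecture just described. The plain-Python sketch below does this by hand, using the hyperparameter values chosen earlier (3×3 kernels, 2×2 pooling, 32 then 64 kernels, 512 hidden neurons, 10 classes); the helper names are hypothetical, and "same" padding is assumed for every convolution.

```python
def conv_params(kernel_size, in_depth, out_depth):
    """Weights plus biases of a conv layer with square kernels."""
    return kernel_size * kernel_size * in_depth * out_depth + out_depth

def dense_params(num_inputs, num_outputs):
    """Weights plus biases of a fully connected layer."""
    return num_inputs * num_outputs + num_outputs

height, width, depth = 32, 32, 3   # a CIFAR-10 input image
total = 0
trace = []

# Two Conv -> Conv -> Pool stages, with 32 and then 64 kernels
for stage_depth in (32, 64):
    for _ in range(2):
        total += conv_params(3, depth, stage_depth)  # 'same' padding keeps height/width
        depth = stage_depth
    height, width = height // 2, width // 2          # 2x2 max-pooling halves each side
    trace.append((height, width, depth))

flat_size = height * width * depth
total += dense_params(flat_size, 512)   # hidden fully connected layer
total += dense_params(512, 10)          # softmax output layer

print(trace)      # [(16, 16, 32), (8, 8, 64)]
print(flat_size)  # 4096
print(total)      # 2168362
```

Under these assumptions, the first dense layer alone accounts for about two million of the roughly 2.17 million parameters, in line with the earlier remark that most parameters sit in the fully connected layers.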

To regularise our model, a Dropout layer is applied after each pooling layer, and after the first Dense layer. This is another area where Keras shines compared to other frameworks: it has an internal flag that automatically enables or disables dropout, depending on whether the model is currently used for training or testing.

The remainder of the model specification exactly matches our previous setup for MNIST:

- We use the cross-entropy loss function as the objective to optimise (as its derivation is more appropriate for probabilistic tasks);
- We use the Adam optimiser for gradient descent;
- We report the accuracy [2] of the model (as the dataset is balanced across the ten classes);
- We hold out 10% of the data for validation purposes.

inp = Input(shape=(height, width, depth)) # depth goes last in TensorFlow back-end (first in Theano)
# Conv [32] -> Conv [32] -> Pool (with dropout on the pooling layer)
conv_1 = Convolution2D(conv_depth_1, (kernel_size, kernel_size), padding='same', activation='relu')(inp)
conv_2 = Convolution2D(conv_depth_1, (kernel_size, kernel_size), padding='same', activation='relu')(conv_1)
pool_1 = MaxPooling2D(pool_size=(pool_size, pool_size))(conv_2)
drop_1 = Dropout(drop_prob_1)(pool_1)
# Conv [64] -> Conv [64] -> Pool (with dropout on the pooling layer)
conv_3 = Convolution2D(conv_depth_2, (kernel_size, kernel_size), padding='same', activation='relu')(drop_1)
conv_4 = Convolution2D(conv_depth_2, (kernel_size, kernel_size), padding='same', activation='relu')(conv_3)
pool_2 = MaxPooling2D(pool_size=(pool_size, pool_size))(conv_4)
drop_2 = Dropout(drop_prob_1)(pool_2)
# Now flatten to 1D, apply FC -> ReLU (with dropout) -> softmax
flat = Flatten()(drop_2)
hidden = Dense(hidden_size, activation='relu')(flat)
drop_3 = Dropout(drop_prob_2)(hidden)
out = Dense(num_classes, activation='softmax')(drop_3)

model = Model(inputs=inp, outputs=out) # To define a model, just specify its input and output layers

model.compile(loss='categorical_crossentropy', # using the cross-entropy loss function
              optimizer='adam', # using the Adam optimiser
              metrics=['accuracy']) # reporting the accuracy

model.fit(X_train, Y_train, # Train the model using the training set...
          batch_size=batch_size, epochs=num_epochs,
          verbose=1, validation_split=0.1) # ...holding out 10% of the data for validation
model.evaluate(X_test, Y_test, verbose=1)  # Evaluate the trained model on the test set!

This model achieves an accuracy of $$\sim 78.6\%$$ on the test set; for such a difficult task (where human performance is only around $$94\%$$), and given the relative simplicity of this model, this is a respectable result. However, more sophisticated models have recently been able to get as far as $$96.53\%$$.

I appreciate that tinkering with this model might be cumbersome if you do not have a GPU in your possession. I would, however, encourage you to apply a similar model to the previously discussed MNIST dataset; you should be able to break $$99.3\%$$ accuracy on its test set with little to no effort using a CNN with dropout.

## Conclusion

Throughout this post we have covered the essentials of convolutional neural networks, introduced the problem of overfitting, and taken a very brief look at how it can be mitigated via regularisation (by applying dropout). We have also successfully implemented a four-layer deep CNN in Keras and applied it to CIFAR-10, all in under 50 lines of code.

Next time around, we will focus on some assorted topics, tips and tricks that should help you when fine-tuning models at this scale, and extracting more power out of your models while keeping overfitting in check.

Petar is currently a Research Assistant in Computational Biology within the Artificial Intelligence Group of the Cambridge University Computer Laboratory, where he is working on developing machine learning algorithms on complex networks, and their applications to bioinformatics. He is also a PhD student within the group, supervised by Dr Pietro Liò and affiliated with Trinity College. He holds a BA degree in Computer Science from the University of Cambridge, having completed the Computer Science Tripos in 2015.

## Just show me the code!

from keras.datasets import cifar10 # subroutines for fetching the CIFAR-10 dataset
from keras.models import Model # basic class for specifying and training a neural network
from keras.layers import Input, Convolution2D, MaxPooling2D, Dense, Dropout, Activation, Flatten
from keras.utils import np_utils # utilities for one-hot encoding of ground truth values
import numpy as np

batch_size = 32 # in each iteration, we consider 32 training examples at once
num_epochs = 200 # we iterate 200 times over the entire training set
kernel_size = 3 # we will use 3x3 kernels throughout
pool_size = 2 # we will use 2x2 pooling throughout
conv_depth_1 = 32 # we will initially have 32 kernels per conv. layer...
conv_depth_2 = 64 # ...switching to 64 after the first pooling layer
drop_prob_1 = 0.25 # dropout after pooling with probability 0.25
drop_prob_2 = 0.5 # dropout in the FC layer with probability 0.5
hidden_size = 512 # the FC layer will have 512 neurons

(X_train, y_train), (X_test, y_test) = cifar10.load_data() # fetch CIFAR-10 data

num_train, height, width, depth = X_train.shape # there are 50000 training examples in CIFAR-10
num_test = X_test.shape[0] # there are 10000 test examples in CIFAR-10
num_classes = np.unique(y_train).shape[0] # there are 10 image classes

X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
max_value = np.max(X_train) # Statistics are computed on the training set only
X_train /= max_value # Normalise data to [0, 1] range
X_test /= max_value # Normalise test data using the training-set maximum

Y_train = np_utils.to_categorical(y_train, num_classes) # One-hot encode the labels
Y_test = np_utils.to_categorical(y_test, num_classes) # One-hot encode the labels

inp = Input(shape=(height, width, depth)) # depth goes last in TensorFlow back-end (first in Theano)
# Conv [32] -> Conv [32] -> Pool (with dropout on the pooling layer)
conv_1 = Convolution2D(conv_depth_1, (kernel_size, kernel_size), padding='same', activation='relu')(inp)
conv_2 = Convolution2D(conv_depth_1, (kernel_size, kernel_size), padding='same', activation='relu')(conv_1)
pool_1 = MaxPooling2D(pool_size=(pool_size, pool_size))(conv_2)
drop_1 = Dropout(drop_prob_1)(pool_1)
# Conv [64] -> Conv [64] -> Pool (with dropout on the pooling layer)
conv_3 = Convolution2D(conv_depth_2, (kernel_size, kernel_size), padding='same', activation='relu')(drop_1)
conv_4 = Convolution2D(conv_depth_2, (kernel_size, kernel_size), padding='same', activation='relu')(conv_3)
pool_2 = MaxPooling2D(pool_size=(pool_size, pool_size))(conv_4)
drop_2 = Dropout(drop_prob_1)(pool_2)
# Now flatten to 1D, apply FC -> ReLU (with dropout) -> softmax
flat = Flatten()(drop_2)
hidden = Dense(hidden_size, activation='relu')(flat)
drop_3 = Dropout(drop_prob_2)(hidden)
out = Dense(num_classes, activation='softmax')(drop_3)

model = Model(inputs=inp, outputs=out) # To define a model, just specify its input and output layers

model.compile(loss='categorical_crossentropy', # using the cross-entropy loss function
              optimizer='adam', # using the Adam optimiser
              metrics=['accuracy']) # reporting the accuracy

model.fit(X_train, Y_train, # Train the model using the training set...
          batch_size=batch_size, epochs=num_epochs,
          verbose=1, validation_split=0.1) # ...holding out 10% of the data for validation
model.evaluate(X_test, Y_test, verbose=1)  # Evaluate the trained model on the test set!

2. To get a feeling for why accuracy might be inappropriate for unbalanced datasets, consider an extreme case where 90% of the test data belongs to class $$x$$ (this could be, for example, the task of diagnosing patients for an extremely rare disease). In this case, a classifier that just outputs $$x$$ achieves a seemingly impressive accuracy of 90% on the test data, without really doing any learning/generalisation.
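The arithmetic in this footnote can be checked with a tiny plain-Python sketch. The 90/10 split and the class names are made up for illustration, and the fraction of rare-class samples correctly identified is an extra measure I introduce here to show what accuracy hides.

```python
# Hypothetical test set: 90 samples of the common class "x", 10 of the rare class "y"
labels = ["x"] * 90 + ["y"] * 10
predictions = ["x"] * len(labels)   # a "classifier" that always answers "x"

correct = sum(p == t for p, t in zip(predictions, labels))
accuracy = correct / len(labels)

# Fraction of the rare class that was correctly identified
rare_found = sum(p == t == "y" for p, t in zip(predictions, labels))
rare_recall = rare_found / labels.count("y")

print(accuracy)     # 0.9
print(rare_recall)  # 0.0
```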