Going Convolutional
Written by Matthijs Hollemans

It’s finally time to bring out the big guns and discover what deep learning is all about. In this chapter, you’ll convert the basic neural network into something that works much better on images. The secret ingredient is the convolutional layer.

Got GPU?

Having a GPU is no longer a luxury. Unfortunately, Keras and TensorFlow do not yet support Mac GPUs. Modern Macs ship with GPUs from Intel or AMD, while deep learning tools usually cater only to GPUs from NVIDIA. Older Macs may still have an NVIDIA GPU on board, but these are often too old. Using an external eGPU enclosure with an NVIDIA card is an option, but it is not officially supported.

Most machine-learning practitioners train their models on a PC running Linux that has one or more NVIDIA GPUs, or in the cloud. The author has built a Linux PC with a GTX 1080 Ti GPU specifically for this purpose. If you’re serious about deep learning, this is an expense worth making.

If all you have is a Mac, you’ll need a lot of patience to train the models in this chapter. Because we want everyone to be able to follow along, the book’s download includes the full Jupyter notebooks that were used to train the models, as well as the final trained versions, so you can skip training the models if your computer isn’t up to the task.

Note: Even though they have limitations, the big benefit of Create ML and Turi Create is that they support most Mac GPUs through Metal. No big surprise there, as both are provided by Apple. Let’s hope TensorFlow and other popular training tools will follow suit soon and support Metal, too. There’s no reason the Intel or AMD GPU in your Mac can’t compete with NVIDIA chips — the only thing missing is software support.

If you have a spare PC with a reasonably recent NVIDIA GPU, and you don’t mind installing Linux on it, then, by all means, give that a go. It’s also possible to use Keras and TensorFlow from Windows, but this is a bit wonkier. We suggest using Ubuntu from ubuntu.com, the most popular Linux for machine learning.

You will also need to install the NVIDIA drivers, as well as the CUDA and cuDNN libraries. See developer.nvidia.com for more details. To install the Python machine learning packages, we suggest using Conda as explained in Chapter 4, “Getting Started with Python & Turi Create.” The process is very similar on Linux and Windows.

Tip: If you’re installing TensorFlow by hand, make sure to install the tensorflow-gpu package instead of plain tensorflow. You can change this in kerasenv.yaml or run pip install -U tensorflow-gpu. Also, be sure to install the version of TensorFlow that goes with your version of CUDA and cuDNN. If these versions don’t match up, TensorFlow won’t work. Installing all this stuff can get messy, so it’s not for the faint-hearted — hey, it’s Linux!

Your head in the clouds?

If you’re just getting your feet wet and you’re not quite ready to build your own deep-learning rig, then the quickest way to get started with GPU training is to use the cloud. You can even use some of these cloud services for free!

Convolution layers

The models you’ve built in Keras have, so far, consisted of Dense layers, which take a one-dimensional vector as input. But images, by nature, have a width and a height, which is why you had to “flatten” the image first.

Convolution, say what now?

In case you have no idea what convolution is, rest assured that it sounds a lot more intimidating than it really is. Again, it all comes down to dot products.

The convolution window slides over the image, left to right, top to bottom

y[i,j] = w[0,0]*x[i-1,j-1] + w[0,1]*x[i-1,j] + w[0,2]*x[i-1,j+1]
       + w[1,0]*x[i,  j-1] + w[1,1]*x[i,  j] + w[1,2]*x[i,  j+1]
       + w[2,0]*x[i+1,j-1] + w[2,1]*x[i+1,j] + w[2,2]*x[i+1,j+1]
       + bias
Each step computes a single output value from the 3×3 window at the center pixel
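The formula above can be sketched in plain Python. This is only an illustration, not the code Keras runs: the helper name conv3x3 is hypothetical, the image is a single-channel nested list, and pixels outside the image are treated as zero, which is what padding="same" does.

```python
def conv3x3(x, w, bias=0.0):
    """Slide a 3x3 window over a single-channel image x (a list of rows).

    Pixels outside the image count as zero, so the output has the same
    height and width as the input.
    """
    height, width = len(x), len(x[0])

    def pixel(i, j):
        # Zero padding: anything outside the image contributes nothing.
        if 0 <= i < height and 0 <= j < width:
            return x[i][j]
        return 0.0

    y = [[0.0] * width for _ in range(height)]
    for i in range(height):          # top to bottom
        for j in range(width):       # left to right
            total = bias
            for di in range(3):
                for dj in range(3):
                    total += w[di][dj] * pixel(i + di - 1, j + dj - 1)
            y[i][j] = total
    return y


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
# A kernel with a single 1 in the middle copies the input unchanged.
identity = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
# A kernel of all ones sums each 3x3 neighborhood.
box = [[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]]

same = conv3x3(image, identity)
summed = conv3x3(image, box)
```

With the identity kernel the output equals the input, and with the all-ones kernel the center output is the sum of all nine pixels. A trained layer learns weights that sit somewhere between such hand-made extremes.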

Multiple filters

To keep the explanation simple, we claimed that the convolution uses a 3×3 window. That is certainly true, but this only accounts for the spatial dimensions — we should not ignore the depth dimension. Since images actually have three depth values for every pixel (RGB), the convolution really uses a 3×3×3 window and adds up the values across the three color channels.

The convolution kernel is really three-dimensional

The number of filters in the convolution layer determines the depth of its output image
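To make the depth dimension concrete, here is a rough sketch of what happens at a single window position. The function names filter_response and conv_pixel, and the two example filters, are made up for illustration. Each filter is a 3×3×3 block of weights, and every filter contributes exactly one number to the output pixel.

```python
def filter_response(window, kernel, bias=0.0):
    """Dot product of one 3x3xC window with one 3x3xC kernel."""
    total = bias
    for row_w, row_k in zip(window, kernel):
        for pix_w, pix_k in zip(row_w, row_k):
            for value, weight in zip(pix_w, pix_k):  # across channels
                total += value * weight
    return total


def conv_pixel(window, kernels, biases):
    """One output pixel: one number per filter, so depth == len(kernels)."""
    return [filter_response(window, k, b) for k, b in zip(kernels, biases)]


# A 3x3 window of RGB pixels; every pixel is (1, 2, 3) to keep it readable.
window = [[[1, 2, 3] for _ in range(3)] for _ in range(3)]

# Two hypothetical filters: one sums the red channel, one sums everything.
red_only = [[[1, 0, 0] for _ in range(3)] for _ in range(3)]
all_ones = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]

output_pixel = conv_pixel(window, [red_only, all_ones], [0.0, 0.0])
```

Two filters give an output pixel with two channels. A layer with 32 filters would give 32 channels in the same way.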

Your first convnet in Keras

In a new Jupyter notebook, create the following cells. You can also follow along with ConvNet.ipynb.

import numpy as np
from keras.models import Sequential
from keras.layers import *
from keras import optimizers

%matplotlib inline
import matplotlib.pyplot as plt

image_width = 224
image_height = 224
num_classes = 20

model = Sequential()
model.add(Conv2D(32, 3, padding="same", activation="relu",
                 input_shape=(image_height, image_width, 3)))
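model.add(Conv2D(32, 3, padding="same", activation="relu"))
# The pooling and output layers follow the model.summary() listing.
model.add(MaxPooling2D(2))
model.add(Conv2D(64, 3, padding="same", activation="relu"))
model.add(Conv2D(64, 3, padding="same", activation="relu"))
model.add(MaxPooling2D(2))
model.add(Conv2D(128, 3, padding="same", activation="relu"))
model.add(Conv2D(128, 3, padding="same", activation="relu"))
model.add(MaxPooling2D(2))
model.add(Conv2D(256, 3, padding="same", activation="relu"))
model.add(Conv2D(256, 3, padding="same", activation="relu"))
model.add(GlobalAveragePooling2D())
model.add(Dense(num_classes))
model.add(Activation("softmax"))  # softmax is assumed; the summary only says "Activation"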

The flow of the tensors

You can see what happens to the shape of the data in the output of model.summary(). The number of channels gradually goes up from 32 to 256 due to the increasing number of filters in the convolution layers, while the spatial dimensions shrink from 224×224 to 28×28 pixels because of the pooling layers:

Layer (type)                 Output Shape              Param #
conv2d_1 (Conv2D)            (None, 224, 224, 32)      896
conv2d_2 (Conv2D)            (None, 224, 224, 32)      9248
max_pooling2d_1 (MaxPooling2 (None, 112, 112, 32)      0
conv2d_3 (Conv2D)            (None, 112, 112, 64)      18496
conv2d_4 (Conv2D)            (None, 112, 112, 64)      36928
max_pooling2d_2 (MaxPooling2 (None, 56, 56, 64)        0
conv2d_5 (Conv2D)            (None, 56, 56, 128)       73856
conv2d_6 (Conv2D)            (None, 56, 56, 128)       147584
max_pooling2d_3 (MaxPooling2 (None, 28, 28, 128)       0
conv2d_7 (Conv2D)            (None, 28, 28, 256)       295168
conv2d_8 (Conv2D)            (None, 28, 28, 256)       590080
global_average_pooling2d_1 ( (None, 256)               0
dense_1 (Dense)              (None, 20)                5140
activation_1 (Activation)    (None, 20)                0

Total params: 1,177,396
Trainable params: 1,177,396
Non-trainable params: 0
Each filter reads all input channels and produces one output channel
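The shapes and parameter counts in the summary can be checked by hand. The following is a small sketch in plain Python, independent of Keras. The function names conv_params and trace and the tuple-based layer list are made up for this example. It assumes 3×3 kernels with "same" padding and 2×2 pooling, as used in this chapter.

```python
def conv_params(in_channels, filters, kernel=3):
    # Each filter has kernel*kernel*in_channels weights plus one bias.
    return kernel * kernel * in_channels * filters + filters


def trace(layers, height=224, width=224, channels=3):
    """Return (kind, output shape, parameter count) for every layer."""
    rows = []
    shape = (height, width, channels)
    for layer in layers:
        kind = layer[0]
        if kind == "conv":            # 3x3, padding="same": size unchanged
            filters = layer[1]
            params = conv_params(shape[2], filters)
            shape = (shape[0], shape[1], filters)
        elif kind == "maxpool":       # 2x2 pooling halves height and width
            params = 0
            shape = (shape[0] // 2, shape[1] // 2, shape[2])
        elif kind == "gap":           # one average per channel
            params = 0
            shape = (shape[2],)
        elif kind == "dense":         # weights plus one bias per unit
            units = layer[1]
            params = shape[0] * units + units
            shape = (units,)
        else:
            raise ValueError("unknown layer kind: " + kind)
        rows.append((kind, shape, params))
    return rows


layers = [("conv", 32), ("conv", 32), ("maxpool",),
          ("conv", 64), ("conv", 64), ("maxpool",),
          ("conv", 128), ("conv", 128), ("maxpool",),
          ("conv", 256), ("conv", 256),
          ("gap",), ("dense", 20)]

rows = trace(layers)
total_params = sum(params for _, _, params in rows)
for kind, shape, params in rows:
    print(f"{kind:8} {str(shape):18} {params:>8}")
print("Total params:", total_params)
```

The first convolution layer has 3×3×3×32 weights plus 32 biases, which is the 896 shown in the summary, and the grand total comes out at 1,177,396.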

More about pooling

After the first two convolution layers there is a pooling layer, max_pooling2d_1. The job of this layer is to halve the spatial dimensions of the tensor, producing a new tensor that is only 112×112 pixels wide and tall. The number of channels stays the same, 32.

Max pooling reduces each 2×2 block of pixels to a single number
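As a rough illustration of what the pooling layer computes for one channel, here is a plain-Python sketch. The helper name max_pool_2x2 is hypothetical, and a real layer applies the same operation to every channel independently.

```python
def max_pool_2x2(x):
    """Keep the largest value in each non-overlapping 2x2 block of x."""
    pooled = []
    for i in range(0, len(x) - 1, 2):
        row = []
        for j in range(0, len(x[0]) - 1, 2):
            row.append(max(x[i][j], x[i][j + 1],
                           x[i + 1][j], x[i + 1][j + 1]))
        pooled.append(row)
    return pooled


channel = [[1, 3, 2, 0],
           [4, 2, 1, 1],
           [0, 0, 5, 6],
           [1, 2, 7, 8]]

pooled = max_pool_2x2(channel)   # 4x4 in, 2x2 out
```

Only the strongest response in each block survives, which is why the width and height are halved while the number of channels is unchanged.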

The detected features

Following the max pooling layer are two more conv layers, this time with 64 output channels, and then there is another pooling layer, followed by two more conv layers. The model repeats this pattern a few times. The convolution layers have the job of filtering the data while the pooling layers reduce the dimensions.

The learned weights for the first conv layer

Feeling hot hot hot

Back to that very last convolution layer that outputs a 28×28×256 tensor. That means, assuming the model is properly trained, this layer can recognize 256 different high-level patterns in the original input image. Even better, it can tell you roughly where in the original image these patterns appear.

A channel from the final tensor represented as a heatmap
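Because the three pooling layers each halve the width and height, one cell of the 28×28 output lines up with roughly an 8×8 block of the 224×224 input. The sketch below uses that ratio to map the strongest activation in one channel back to input coordinates. The function name strongest_location and the example channel are made up, and the mapping is only approximate, since each output value actually depends on a larger neighborhood of input pixels.

```python
def strongest_location(channel, image_size=224):
    """Find the largest activation and the input-image block it maps to."""
    grid = len(channel)               # 28 for the model in this chapter
    scale = image_size // grid        # 224 // 28 == 8 input pixels per cell
    best_row, best_col = max(
        ((r, c) for r in range(grid) for c in range(grid)),
        key=lambda rc: channel[rc[0]][rc[1]])
    top, left = best_row * scale, best_col * scale
    return (best_row, best_col), (top, left, top + scale, left + scale)


# A made-up 28x28 channel that is zero except for one strong response.
channel = [[0.0] * 28 for _ in range(28)]
channel[5][20] = 3.5

cell, box = strongest_location(channel)
```

Here box is (top, left, bottom, right) in input pixels, so a hot cell in row 5, column 20 points at the region starting 40 pixels down and 160 pixels across.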

Honey, I shrunk the tensors!

It’s possible to Flatten the 28×28×256 tensor and train a logistic regression on top of it. That would turn the tensor into a 200,704-element vector. Recall from the last chapter that the logistic regression already had a hard enough time with just 3,072 features, let alone two-hundred thousand…

Global average pooling
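Global average pooling replaces each channel by its mean, so the output has one number per channel. Here is a minimal sketch in plain Python, using a tiny 2×2×3 tensor in place of the real 28×28×256 one. The function name global_average_pool is hypothetical.

```python
def global_average_pool(tensor):
    """Average each channel of a height x width x channels nested list."""
    height, width = len(tensor), len(tensor[0])
    channels = len(tensor[0][0])
    sums = [0.0] * channels
    for row in tensor:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]


# A tiny 2x2 tensor with 3 channels stands in for the 28x28x256 one.
tensor = [[[1, 10, 0], [3, 10, 0]],
          [[5, 10, 0], [7, 10, 4]]]

pooled = global_average_pool(tensor)

flattened_size = 28 * 28 * 256   # what Flatten would produce
pooled_size = 256                # what global average pooling produces
```

For the model in this chapter, that is the difference between feeding 200,704 values or just 256 into the final Dense layer.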

Training the model

The model you’ve built in the previous sections is a typical convnet design and, although not necessarily optimal, it’s a good start. Let’s see how well this model learns.

images_dir = "snacks/"
train_data_dir = images_dir + "train/"
val_data_dir = images_dir + "val/"
test_data_dir = images_dir + "test/"

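def normalize_pixels(image):
    # Scale pixel values from [0, 255] to [-1, 1].
    return image / 127.5 - 1

from keras.preprocessing.image import ImageDataGenerator

# The arguments below are assumptions built from the names defined above;
# every other option is left at its Keras default.
datagen = ImageDataGenerator(preprocessing_function=normalize_pixels)

batch_size = 64

# Keras expects target_size as (height, width).
train_generator = datagen.flow_from_directory(
                    train_data_dir,
                    target_size=(image_height, image_width),
                    batch_size=batch_size)

val_generator = datagen.flow_from_directory(
                    val_data_dir,
                    target_size=(image_height, image_width),
                    batch_size=batch_size)

test_generator = datagen.flow_from_directory(
                    test_data_dir,
                    target_size=(image_height, image_width),
                    batch_size=batch_size)

index2class = {v:k for k,v in train_generator.class_indices.items()}

histories = []
history = model.fit_generator(
            train_generator,
            validation_data=val_generator)
histories.append(history)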

Going dooooown?

To make a plot of the loss over time, do the following:

def combine_histories():
    # Concatenate the metrics from every training run into one dictionary.
    history = {
        "loss": [],
        "val_loss": [],
        "acc": [],
        "val_acc": []
    }
    for h in histories:
        for k in history.keys():
            history[k] += h.history[k]
    return history

history = combine_histories()
def plot_loss(history):
    fig = plt.figure(figsize=(10, 6))
    plt.plot(history["loss"])
    plt.plot(history["val_loss"])
    plt.legend(["Train", "Validation"])
    plt.show()

The training and validation loss curves

def plot_accuracy(history):
    fig = plt.figure(figsize=(10, 6))
    plt.plot(history["acc"])
    plt.plot(history["val_acc"])
    plt.legend(["Train", "Validation"])
    plt.show()

The training and validation accuracy over time

Learning rate annealing

One trick you can use to give the accuracy a little boost is to change the learning rate. It is currently 1e-3 or 0.001 (set when you compiled the model), and you can change it by doing the following:

import keras.backend as K
K.set_value(model.optimizer.lr,
            K.get_value(model.optimizer.lr) / 10)
The loss after lowering the learning rate
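In this chapter you lower the learning rate by hand. As a sketch of how that decision could be automated, the hypothetical helper below divides the rate by 10 once the validation loss stops improving. The name anneal and the patience and factor parameters are assumptions for illustration, not part of the chapter’s code.

```python
def anneal(learning_rate, val_losses, patience=2, factor=10.0):
    """Return a lower learning rate if the validation loss stopped improving.

    The rate is divided by `factor` when none of the last `patience`
    epochs beat the best validation loss seen before them.
    """
    if len(val_losses) <= patience:
        return learning_rate
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    if recent_best >= best_before:
        return learning_rate / factor
    return learning_rate


lr = 1e-3
still_improving = anneal(lr, [2.5, 2.1, 1.9, 1.8])
plateaued = anneal(lr, [2.5, 2.1, 1.9, 1.9, 2.0])
```

In the first call the loss is still dropping, so the rate stays at 0.001. In the second call the last two epochs bring no improvement, so the rate drops to roughly 0.0001.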

It’s better… but not good enough yet

It’s clear that you were able to create a much better model using these convolutional layers than with only Dense layers. The final test set accuracy for this model is about 40%, compared to only 15% from the last chapter. That’s a big improvement!

Key points

Where to go from here?

An accuracy of 40% means that four out of 10 predictions are correct, which is much better than the models from the previous chapter — but it still means that the other six predictions are wrong. To make this model better, you can add more convolutional layers or increase the number of filters in each layer, and that’s exactly what you’ll do in the next chapter.

© 2021 Razeware LLC
