2. Getting Started with Image Classification
Written by Matthijs Hollemans

Let’s begin your journey into the world of machine learning by creating a binary image classifier.

A classifier is a machine learning model that takes an input of some kind, in this case an image, and determines what sort of “thing” that input represents. An image classifier tells you which category, or class, the image belongs to.

Binary means that the classifier is able to distinguish between two classes of objects. For example, you can have a classifier that will answer either “cat” or “dog” for a given input image, just in case you have trouble telling the two apart.

Being able to tell the difference between only two things may not seem very impressive, but binary classification is used a lot in practice.

In medical testing, it determines whether a patient has a disease, where the “positive” class means the disease is present and the “negative” class means it’s not. Another common example is filtering email into spam/not spam.

There are plenty of questions that have a definite “yes/no” answer, and the machine learning model to use for such questions is a binary classifier. The cats-vs.-dogs classifier can be framed as answering the question: “Is this a picture of a cat?” If the answer is no, it’s a dog.

Image classification is one of the most fundamental computer vision tasks. Advanced applications of computer vision — such as object detection, style transfer, and image generation — all build on the same ideas from image classification, making this a great place to start.

There are many ways to create an image classifier, but by far the best results come from using deep learning. The success of deep learning in image classification is what started the current hype around AI and ML. We wouldn’t want you to miss out on all this exciting stuff, and so the classifier you’ll be building in this chapter uses deep learning under the hood.

Is that snack healthy?

In this chapter you’ll learn how to build an image classifier that can tell the difference between healthy and unhealthy snacks.

To get started, make sure you’ve downloaded the supplementary materials for this chapter and open the HealthySnacks starter project in Xcode.

This is a very basic iPhone app with two buttons, an image view, and a text label at the top:

The “picture frame” button on the left lets you choose a photo from the library using UIImagePickerController. The “camera” button on the right lets you take a picture with the camera (this button is disabled in the simulator).

Once you’ve selected a picture, the app calls classify(image:) in ViewController.swift to decide whether the image is of a healthy snack or not. Currently this method is empty. In this chapter you’ll be adding code to this method to run the classifier.

At this point, it’s a good idea to take a brief look at ViewController.swift to familiarize yourself with the code. It’s pretty standard fare for an iOS app.

In order to do machine learning on the device, you need to have a trained model. For the HealthySnacks app, you’ll need a model that has learned how to tell apart healthy snacks from unhealthy snacks. In this chapter you’ll be using a ready-made model that has already been trained for you, and in the next chapter you’ll learn how to train this model yourself.

The model is trained to recognize the following snacks:

For example, if you point the camera at an apple and snap a picture, the app should say “healthy”. If you point the camera at a hotdog, it should say “unhealthy”.

What the model actually predicts is not just a label (“healthy” or “unhealthy”) but a probability distribution, where each classification is given a probability value:

If your math and statistics are a little rusty, then don’t let terms such as “probability distribution” scare you. A probability distribution is simply a list of positive numbers that add up to 1.0. In this case it is a list of two numbers because this model has two classes:

[0.15, 0.85]

The above prediction is for an image of a waffle with strawberries on top. The model is 85% sure that the object in this picture is unhealthy. Because the predicted probabilities always need to add up to 100% (or 1.0), this outcome also means the classifier is 15% sure this snack is healthy — thanks to the strawberries.

You can interpret these probabilities to be the confidence that the model has in its predictions. A waffle without strawberries would likely score higher for unhealthy, perhaps as much as 98%, leaving only 2% for class healthy. The more confident the model is about its prediction, the more one of the probabilities goes to 100% and the other goes to 0%. When the difference between them is large, as in this example, it means that the model is sure about its prediction.

Ideally, you would have a model that is always confident and never wrong. However, sometimes it’s very hard for the model to draw a solid conclusion about the image. Can you tell whether the food in the following image is mostly healthy or unhealthy?

The less confident the model is, the more both probabilities go towards the middle, or 50%.

When the probability distribution looks like the following, the model just isn’t very sure, and you cannot really trust the prediction — it could be either class.

This happens when the image has elements of both classes — salad and greasy stuff — so it’s hard for the model to choose between the two classes. It also happens when the image is not about food at all, and the model does not know what to make of it.

To recap, the input to the image classifier is an image and the output is a probability distribution, a list of numbers between 0 and 1.

Since you’re going to be building a binary classifier, the probability distribution is made up of just two numbers. The easiest way to decide which class is the winner is to choose the one with the highest predicted probability.
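To make the recap concrete, here is a minimal sketch in plain Swift. The BinaryPrediction type and its property names are invented for this illustration; they are not part of Core ML, Vision or the HealthySnacks project.

```swift
// Hypothetical value type for this example; not part of Core ML or Vision.
struct BinaryPrediction {
  let healthy: Double
  let unhealthy: Double

  // Builds the two-number distribution from a single probability,
  // since the two values always add up to 1.0.
  init(unhealthy: Double) {
    self.unhealthy = unhealthy
    self.healthy = 1.0 - unhealthy
  }

  // The class with the higher probability wins.
  // An exact 50/50 tie falls through to "unhealthy" here, an arbitrary choice.
  var winner: String {
    return healthy > unhealthy ? "healthy" : "unhealthy"
  }

  // Gap between the two probabilities:
  // close to 1 means confident, close to 0 means unsure.
  var margin: Double {
    return abs(healthy - unhealthy)
  }
}

// The [0.15, 0.85] waffle example from earlier in the chapter.
let waffle = BinaryPrediction(unhealthy: 0.85)

// An invented, much less certain prediction for comparison.
let mixedPlate = BinaryPrediction(unhealthy: 0.52)
```

The waffle has a margin of about 0.7, while the mixed plate has a margin of only about 0.04, which is the "could be either class" situation described above.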

Note: To keep things manageable for this book, we only trained the model on twenty types of snacks (ten healthy, ten unhealthy). If you take a picture of something that isn’t in the list of twenty snacks, such as broccoli or pizza, the prediction could be either healthy or unhealthy. The model wasn’t trained to recognize such things and, therefore, what it predicts is anyone’s guess. That said, the model might still guess right on broccoli (it’s green, which is similar to other healthy snacks) and pizza (it’s greasy and therefore unhealthy).

Core ML

For many of the projects in this book, you’ll be using Core ML, Apple’s machine learning framework that was introduced with iOS 11. Core ML makes it really easy to add machine learning models to your app — it’s mostly a matter of dropping a trained model into your app and calling a few API functions.

Xcode even automatically writes most of the code for you.

Of course, Core ML is only easy if you already have a trained model. You can find the model for this chapter, HealthySnacks.mlmodel, in the downloaded resources.

Core ML models are packaged up in a .mlmodel file. This file contains both the structural definition of the model as well as the things it has learned, known as the learned parameters (or the “weights”).

With the HealthySnacks project open in Xcode, drag the HealthySnacks.mlmodel file into the project to add it to the app (or use File ▸ Add Files).

Select HealthySnacks.mlmodel in the Project Navigator and Xcode will show the following:

This is a summary of the Core ML model file. It shows what type of model it is, the size of the model in megabytes and a description.

The HealthySnacks model type is Neural Network Classifier, which means it is an image classifier that uses deep learning techniques. The terms “deep learning” and “neural network” mean pretty much the same thing. According to the description, this model was made using a tool called Turi Create and it uses SqueezeNet v1.1, a popular deep learning architecture for mobile apps.

The main benefit of SqueezeNet is that it’s small. As you can see in Xcode, the size of this model is “only” 5 MB. That is tiny compared to many other deep learning model architectures, which can take up hundreds of MBs. Such large models are usually not a good choice for use in a mobile app. Not only do they make the app download bigger but larger models are also slower and use more battery power.

The Prediction section lists the inputs that the model expects and the outputs that it produces. Since this is an image classifier there is only one input, a color image that must be 227 pixels wide and 227 pixels tall.

You cannot use images with other dimensions. The reason for this restriction is that the SqueezeNet architecture expects an image of exactly this size. If it’s any smaller or any larger, the math used by SqueezeNet doesn’t work out. This means that any image you pick from the photo library or take with the camera must be resized to 227×227 before you can use it with this Core ML model.

Note: If you’re thinking that 227×227 pixels isn’t very big, then you’re right. A typical 12-megapixel photo is 4032×3024, which is more than 200 times as many pixels! But there is a trade-off between image size and processing time. These deep learning models need to do a lot of calculations: For a single 227×227 image, SqueezeNet performs 390 million calculations. Double the number of pixels and the number of calculations roughly doubles too; double both the width and the height and it roughly quadruples. At some point, that just gets out of hand and the model will be too slow to be usable!

Making the image smaller will make the model faster, and it can even help the models learn better since scaling down the image helps to remove unnecessary details that would otherwise just confuse the model. But there’s a limit here too: At some point, the image loses too much detail, and the model won’t be able to do a good job anymore. For image classification, 227×227 is a good compromise. Other typical image sizes used are 224×224 and 299×299.
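If you’d like to check the pixel arithmetic from this note yourself, the following few lines of plain Swift do it. The constant names are made up for this sketch.

```swift
// Pixel counts for a 12-megapixel photo and for the model's input.
let photoPixels = 4032 * 3024   // 12,192,768 pixels
let modelPixels = 227 * 227     // 51,529 pixels

// How many times more pixels the photo has than the model input.
let ratio = Double(photoPixels) / Double(modelPixels)   // a little over 236

// Doubling both the width and the height quadruples the pixel count.
let doubledPixels = (227 * 2) * (227 * 2)               // 206,116 pixels
```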

The HealthySnacks model has two outputs. It puts the probability distribution into a dictionary named labelProbability that will look something like this:

labelProbability = [ "healthy": 0.15, "unhealthy": 0.85 ]

For convenience, the second output from the model is the class label of the top prediction: "healthy" if the probability of the snack being healthy is greater than 50%, "unhealthy" if it’s less than 50%.
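The relationship between the two outputs can be sketched with an ordinary Swift dictionary. This only mimics the shape of the model’s output for illustration; the classLabel constant below is computed by hand here, whereas the real model fills in its second output for you.

```swift
// The example output from the text, as a plain Swift dictionary.
let labelProbability: [String: Double] = ["healthy": 0.15, "unhealthy": 0.85]

// Pick the entry with the largest probability; its key is the top label.
let topEntry = labelProbability.max { $0.value < $1.value }

// Fall back to a placeholder if the dictionary were empty.
let classLabel = topEntry?.key ?? "unknown"
```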

The final section of this model summary to look at is Model Class. When you add an .mlmodel file to a project, Xcode does something smart behind the scenes: It creates a Swift class with all the source code needed to use the model in your app. That means you don’t have to write any code to load the .mlmodel — Xcode has already done the heavy lifting for you.

To see the code that Xcode generated, click the little arrow next to the model name:

Click the arrow to view the generated code

It’s not important, at this point, that you understand exactly what this code does. Just notice that the automatically generated Swift file contains a class HealthySnacks that has an MLModel object property (the main object from the Core ML framework). It also has prediction methods for making the classifications. There also are HealthySnacksInput and HealthySnacksOutput classes that represent the input (an image) and outputs (the probabilities dictionary and the top prediction label) of the model.

At this point, you might reasonably expect that you’re going to use these automatically generated classes to make the predictions. Surprise… you’re not! We’re saving that for the end of the chapter.

There are a few reasons for this, most importantly that the images need to be scaled to 227×227 pixels and placed into a CVPixelBuffer object before you can call the prediction method, and we’d rather not deal with that if we can avoid it. So instead, you’re going to be using yet another framework: Vision.

Note: Core ML models can also have other types of inputs besides images, such as numbers and text. In this first section of the book, you’ll primarily work with images. In later sections, you’ll also do machine learning on other types of data.

Vision

Along with Core ML, Apple also introduced the Vision framework in iOS 11. As you can guess from its name, Vision helps with computer vision tasks. For example, it can detect rectangular shapes and text in images, detect faces and even track moving objects.

Most importantly for you, Vision makes it easy to run Core ML models that take images as input. You can even combine this with other Vision tasks into an efficient image-processing pipeline. For example, in an app that detects people’s emotions, you can build a Vision pipeline that first detects a face in the image and then runs a Core ML-based classifier on just that face to see whether the person is smiling or frowning.

It’s highly recommended that you use Vision to drive Core ML if you’re working with images. Recall that the HealthySnacks model needs a 227×227 image as input, but images from the photo library or the camera will be much larger and are typically not square. Vision will automatically resize and crop the image.

In the automatically generated Swift file for the .mlmodel, you may have noticed that the input image (see HealthySnacksInput) has to be a CVPixelBuffer object, while UIImagePickerController gives you a UIImage instead. Vision can do this conversion for you, so you don’t have to worry about CVPixelBuffer objects.

Finally, Vision also performs a few other tricks, such as rotating the image so that it’s always right-side up, and matching the image’s color to the device’s color space. Without the Vision framework, you’d have to write a lot of additional code! Surely, you’ll agree that it’s much more convenient to let Vision handle all these things.

Note: Of course, if you’re using a model that does not take images as input, you can’t use Vision. In that case, you’ll have to use the Core ML API directly.

The way Vision works is that you create a VNRequest object, which describes the task you want to perform, and then you use a VNImageRequestHandler to execute the request. Since you’ll use Vision to run a Core ML model, the request is a subclass named VNCoreMLRequest. Let’s write some code!

Creating the VNCoreML request

To add image classification to the app, you’re going to implement classify(image:) in ViewController.swift. This method is currently empty.

Here, you’ll use Vision to run the Core ML model and interpret its results. First, add the required imports to the top of the file:

import CoreML
import Vision

Next, you need to create the VNCoreMLRequest object. You typically create this request object once and re-use it for every image that you want to classify. Don’t create a new request object every time you want to classify an image — that’s wasteful.

In ViewController.swift, add the following code inside the ViewController class below the @IBOutlets:

lazy var classificationRequest: VNCoreMLRequest = {
  do {
    // 1
    let healthySnacks = HealthySnacks()
    // 2
    let visionModel = try VNCoreMLModel(
      for: healthySnacks.model)
    // 3
    let request = VNCoreMLRequest(model: visionModel,
                                  completionHandler: {
      [weak self] request, error in
      print("Request is finished!", request.results)
    })
    // 4
    request.imageCropAndScaleOption = .centerCrop
    return request
  } catch {
    fatalError("Failed to create VNCoreMLModel: \(error)")
  }
}()

Here’s what this code does:

1. Create an instance of HealthySnacks. This is the class from the .mlmodel file’s automatically generated code. You won’t use this class directly; you only create it so that you can pass its MLModel object to Vision.

2. Create a VNCoreMLModel object. This is a wrapper object that connects the MLModel instance from the Core ML framework with Vision.

3. Create the VNCoreMLRequest object. This object will perform the actual actions of converting the input image to a CVPixelBuffer, scaling it to 227×227, running the Core ML model, interpreting the results, and so on.

Since Vision requests run asynchronously, you can supply a completion handler that will receive the results. For now, the completion handler just prints something to the Xcode debug output pane. You will flesh this out later.

4. The imageCropAndScaleOption tells Vision how it should resize the photo down to the 227×227 pixels that the model expects.

The code is wrapped up in a do-catch block because loading the VNCoreMLModel object can fail if the .mlmodel file is invalid somehow. That should never happen in this example project, and so you handle this kind of error by crashing the app. It is possible for apps to download an .mlmodel file and, if the download fails, the .mlmodel can get corrupted. In that case, you’ll want to handle this error in a more graceful way.

Note: The classificationRequest variable is a lazy property. In case you’re unfamiliar with lazy properties, this just means that the VNCoreMLRequest object is not created until the very first time you use classificationRequest in the app.
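If lazy properties are new to you, the following self-contained sketch shows the behavior without involving Vision at all. ExpensiveRequest and Classifier are hypothetical stand-ins invented for this example.

```swift
// Hypothetical stand-in for an object that is costly to create,
// such as the Vision request in this chapter.
final class ExpensiveRequest {}

final class Classifier {
  // Counts how often the lazy closure below has run.
  private(set) var creationCount = 0

  // The closure runs the first time `request` is read, and only then.
  lazy var request: ExpensiveRequest = {
    self.creationCount += 1
    return ExpensiveRequest()
  }()
}

let classifier = Classifier()
let countBeforeUse = classifier.creationCount   // nothing created yet

let first = classifier.request    // triggers the one-time creation
let second = classifier.request   // reuses the stored object
let countAfterUse = classifier.creationCount
```

Both reads return the very same object, and the closure runs exactly once, which is why the model in the app is loaded only on the first classification.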

Crop and scale options

It has been mentioned a few times now that the model you’re using, which is based on SqueezeNet, requires input images that are 227×227 pixels. Since you’re using Vision, you don’t really need to worry about this — Vision will automatically scale the image to the correct size. However, there is more than one way to resize an image, and you need to choose the correct method for the model, otherwise it might not work as well as you’d hoped.

What the correct method is for your model depends on how it was trained. When a model is trained, it’s shown many different example images to learn from. Those images have all kinds of different dimensions and aspect ratios, and they also need to be resized to 227×227 pixels. There are different ways to do this and not everyone uses the same method when training their models.

For the best results you should set the request’s imageCropAndScaleOption property so that it uses the same method that was used during training.

Vision offers three possible choices:

centerCrop

scaleFill

scaleFit

The .centerCrop option first resizes the image so that the smallest side is 227 pixels, and then it crops out the center square:

Note that this removes pixels from the left and right edges of the image (or from the top/bottom if the image is in portrait). If the object of interest happens to be in that part of the image, then this will throw away useful information and the classifier may only see a portion of the object. When using .centerCrop it’s essential that the user points the camera so that the object is in the center of the picture.

With .scaleFill, the image gets resized to 227×227 without removing anything from the sides, so it keeps all the information from the original image — but if the original wasn’t square then the image gets squashed. Finally, .scaleFit keeps the aspect ratio intact but compensates by filling in the rest with black pixels.
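To get a feel for the numbers, here is a rough sketch of the geometry behind the three options as they are described above. It is not Vision’s actual implementation; the Size and ResizeMode types and the scaledSize function are invented for this illustration.

```swift
// Hypothetical types for this sketch; Vision does all of this internally.
struct Size: Equatable {
  var width: Double
  var height: Double
}

enum ResizeMode {
  case centerCrop, scaleFill, scaleFit
}

// Returns the size the whole image is scaled to, before it is cropped
// or padded to a square of `target` by `target` pixels.
func scaledSize(of image: Size, mode: ResizeMode, target: Double) -> Size {
  switch mode {
  case .centerCrop:
    // The shortest side becomes `target`; the longer side overflows
    // the square and the overflow is cropped away.
    let scale = target / min(image.width, image.height)
    return Size(width: image.width * scale, height: image.height * scale)
  case .scaleFit:
    // The longest side becomes `target`; the shorter side leaves a gap
    // that is filled with black pixels.
    let scale = target / max(image.width, image.height)
    return Size(width: image.width * scale, height: image.height * scale)
  case .scaleFill:
    // Both sides are forced to `target`, which squashes non-square images.
    return Size(width: target, height: target)
  }
}

// A landscape 12-megapixel photo, resized for a 227×227 model input.
let photo = Size(width: 4032, height: 3024)
let cropped = scaledSize(of: photo, mode: .centerCrop, target: 227)
let fitted = scaledSize(of: photo, mode: .scaleFit, target: 227)
let filled = scaledSize(of: photo, mode: .scaleFill, target: 227)
```

For this photo, center cropping scales the image to roughly 303×227 and then discards about 38 pixels on each side, while scale-to-fit produces a 227×170 image with black bars above and below.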

For the Healthy Snacks app, you’ll use .centerCrop as that’s also the resizing strategy that was used to train the model. Just make sure that the object you’re pointing the camera at is near the center of the picture for the best results. Feel free to try out the other scaling options to see what kind of difference they make to the predictions, if any.

Performing the request

Now that you have the request object, you can implement the classify(image:) method. Add the following code to that method:

func classify(image: UIImage) {
  // 1
  guard let ciImage = CIImage(image: image) else {
    print("Unable to create CIImage")
    return
  }
  // 2
  let orientation = CGImagePropertyOrientation(
    image.imageOrientation)
  // 3
  DispatchQueue.global(qos: .userInitiated).async {
    // 4
    let handler = VNImageRequestHandler(
      ciImage: ciImage,
      orientation: orientation)
    do {
      try handler.perform([self.classificationRequest])
    } catch {
      print("Failed to perform classification: \(error)")
    }
  }
}

The image that you get from UIImagePickerController is a UIImage object but Vision prefers to work with CGImage or CIImage objects. Either will work fine, and they’re both easy to obtain from the original UIImage. The advantage of using a CIImage is that this lets you apply additional Core Image transformations to the image, for more advanced image processing.

Here is what the method does, step-by-step:

1. Converts the UIImage to a CIImage object.

2. The UIImage has an imageOrientation property that describes which way is up when the image is to be drawn. For example, if the orientation is “down,” then the image should be rotated 180 degrees. You need to tell Vision about the image’s orientation so that it can rotate the image if necessary, since Core ML expects images to be upright.

3. Because it may take Core ML a moment or two to do all the calculations involved in the classification (recall that SqueezeNet does 390 million calculations for a single image), it is best to perform the request on a background queue, so as not to block the main thread.

4. Create a new VNImageRequestHandler for this image and its orientation information, then call perform to actually execute the request. Note that perform takes an array of VNRequest objects, so that you can perform multiple Vision requests on the same image if you want to. Here, you just use the VNCoreMLRequest object from the classificationRequest property you made earlier.

The above steps are pretty much the same for any Vision Core ML app.

Because you made the classificationRequest a lazy property, the very first time classify(image:) gets called it will load the Core ML model and set up the Vision request. But it only does this once and then re-uses the same request object for every image. On the other hand, you do need to create a new VNImageRequestHandler every time, because this handler object is specific to the image you’re trying to classify.

Image orientation

When you take a photo with the iPhone’s camera, regardless of how you’re holding the phone, the image data is stored as landscape because that’s the native orientation of the camera sensor.

iOS keeps track of the true orientation of the image with the imageOrientation property. For an image in your photo album, the orientation information is stored in the image file’s EXIF data.

If you’re holding the phone in portrait mode and snap a picture, its imageOrientation will be .right to indicate the camera has been rotated 90 degrees clockwise. 0 degrees means that the phone was in landscape with the Home button on the right.

An imageOrientation of .up means that the image already has the correct side up. This is true for pictures taken in landscape but also for portrait pictures from other sources, such as an image you create in Photoshop.

Most image classification models expect to see the input image with the correct side up. Notice that the Core ML model does not take “image orientation” as an input, so it will see only the “raw” pixels in the image buffer without knowing which side is up.

Image classifiers are typically trained to account for images being horizontally flipped so that they can recognize objects facing left as well as facing right, but they’re usually not trained to deal with images that are rotated by 90, 180 or 270 degrees.

If you pass in an image that is not oriented properly, the model may not give accurate predictions because it has not learned to look at images that way.

This is why you need to tell Vision about the image’s orientation so that it can properly rotate the image’s pixels before they get passed to Core ML. Since Vision uses CGImage or CIImage instead of UIImage, you need to convert the UIImage.Orientation value to a CGImagePropertyOrientation value.
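The conversion has to be done case by case, because the two enums number their cases differently. The sketch below shows the idea with two stand-in enums, UIKitOrientation and ExifOrientation, which are invented for this example and mirror the raw values of UIImage.Orientation and CGImagePropertyOrientation so that the code runs without UIKit or ImageIO. In the app, the same switch would presumably live in an initializer added to CGImagePropertyOrientation, which is what the call in classify(image:) relies on.

```swift
// Stand-in mirroring the raw values of UIImage.Orientation (UIKit).
enum UIKitOrientation: Int {
  case up = 0, down = 1, left = 2, right = 3
  case upMirrored = 4, downMirrored = 5, leftMirrored = 6, rightMirrored = 7
}

// Stand-in mirroring the raw values of CGImagePropertyOrientation,
// which follow the EXIF numbering.
enum ExifOrientation: UInt32 {
  case up = 1, upMirrored = 2, down = 3, downMirrored = 4
  case leftMirrored = 5, right = 6, rightMirrored = 7, left = 8
}

extension ExifOrientation {
  // Map by case name, never by raw value: the numbering differs.
  init(_ orientation: UIKitOrientation) {
    switch orientation {
    case .up: self = .up
    case .upMirrored: self = .upMirrored
    case .down: self = .down
    case .downMirrored: self = .downMirrored
    case .left: self = .left
    case .leftMirrored: self = .leftMirrored
    case .right: self = .right
    case .rightMirrored: self = .rightMirrored
    }
  }
}

// A photo taken while holding the phone in portrait.
let portraitPhoto = ExifOrientation(UIKitOrientation.right)
```

Note how .right has raw value 3 on the UIKit side but 6 on the EXIF side, so copying the raw value across would silently produce the wrong orientation.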

Trying it out

At this point, you can build and run the app and choose a photo.

It’s possible to run this app in the Simulator but only the photo library button is active. The photo library on the Simulator doesn’t contain pictures of snacks by default, but you can add your own by Googling for images and then dragging those JPEGs or PNGs into the Photos app.

Run the app on a device to use the camera, as the Simulator does not support taking pictures.

Take or choose a picture, and the Xcode debug pane will output something like this:

Request is finished! Optional([<VNClassificationObservation: 0x60c00022b940> B09B3F7D-89CF-405A-ABE3-6F4AF67683BB 0.81705 "healthy" (0.917060), <VNClassificationObservation: 0x60c000223580> BC9198C6-8264-4B3A-AB3A-5AAE84F638A4 0.18295 "unhealthy" (0.082940)])

This is the output from the print statement in the completion handler of the VNCoreMLRequest. It prints out the request.results array. As you can see, this array contains two VNClassificationObservation objects, one with the probability for the healthy class (0.917060 or 91.7%) and the other with the probability for the unhealthy class (0.082940 or 8.29%).

Of course, printing stuff to the output pane isn’t very exciting, so let’s properly show these results in the app.

Showing the results

Inside the declaration of lazy var classificationRequest, change the completion handler for the VNCoreMLRequest object to the following:

let request = VNCoreMLRequest(
  model: visionModel,
  completionHandler: { [weak self] request, error in
    // add this
    self?.processObservations(for: request, error: error)
  })

Instead of the print statement that was there previously, you’re now calling a new method, processObservations(for:error:). It’s perfectly possible to put the code that handles the results directly inside the completion handler, but it tends to make the code harder to read.

Add the new method to ViewController.swift:

func processObservations(
  for request: VNRequest,
  error: Error?) {
  // 1
  DispatchQueue.main.async {
    // 2
    if let results = request.results
      as? [VNClassificationObservation] {
      // 3
      if results.isEmpty {
        self.resultsLabel.text = "nothing found"
      } else {
        self.resultsLabel.text = String(
          format: "%@ %.1f%%",
          results[0].identifier,
          results[0].confidence * 100)
      }
    // 4
    } else if let error = error {
      self.resultsLabel.text =
        "error: \(error.localizedDescription)"
    } else {
      self.resultsLabel.text = "???"
    }
    // 5
    self.showResultsView()
  }
}

Here’s what this method does, step-by-step:

1. The request’s completion handler is called on the same background queue from which you launched the request. Because you’re only allowed to call UIKit methods from the main queue, the rest of the code in this method runs on the main queue.

2. The request parameter is of type VNRequest, the base class of VNCoreMLRequest. If everything went well, the request’s results array contains one or more VNClassificationObservation objects. If the cast fails, it’s either because there was an error performing the request and results is nil, or the array contains a different type of observation object, which happens if the model isn’t actually a classifier or the Vision request object wasn’t for a Core ML model.

3. Put the class name in the results label. Assuming the array is not empty, it contains a VNClassificationObservation object for each possible class. Each of these has an identifier (the name of the class: “healthy” or “unhealthy”) and a confidence score. This score is how likely the model thinks the object is of this class; in other words, it’s the probability for that class.

Vision automatically sorts the results by confidence, so results[0] contains the class with the highest confidence, which is the winning class. The app will show both the name and confidence in the results label, where the confidence is shown as a percentage with one decimal, e.g., "healthy 95.0%".

By the way, it should never happen that the array is empty but, in the unlikely case that it is, you show a “nothing found” message in the label.

4. Just in case something went wrong with the request, show an error message. This normally shouldn’t happen, but it’s good to cover all your bases.

5. Finally, show the resultsLabel on the screen. The showResultsView() method performs a nice little animation, which makes it clear to the user that their image has been classified.

And that’s all you need to do. Build and run the app and classify some images!

Pretty cool. With just a few lines of code you’ve added a state-of-the-art image classifier to your app!

Note: When you viewed the Core ML model in Xcode (by selecting the .mlmodel file in the Project navigator), it said that the model had two outputs: a dictionary containing the probabilities and the label for the top prediction. However, the Vision request gives you an array of VNClassificationObservation objects instead. Vision takes that dictionary from Core ML and turns it into its own kind of “observation” objects. Later on, you’ll see how to use Core ML directly, without using Vision, and, in that case, you do get access directly to the model’s outputs.

What if the image doesn’t have a snack?

The app shows the winning class and the confidence it has in this prediction. In the above image on the left, the class is “healthy” and the confidence is 94.8%.

If the output is something like “healthy 95%,” the model feels pretty sure about itself. You’ll see this kind of prediction on pictures of oranges, apples, bananas and so on. Likewise, if the output is “unhealthy 95%,” the model is pretty sure that it’s correct about the snack being unhealthy, and you’ll see this on pictures of pretzels and waffles. That’s good: we like to see confident predictions.

The model used in this app was trained on 20 different types of snacks. But what happens when you show it a kind of snack that it has never seen before, or maybe even a totally different kind of object — maybe something that isn’t even edible?

Since a binary classifier only understands two classes, it puts any picture that you give it into the “healthy” category or into the “unhealthy” category, even if the picture isn’t really of a kind of snack that it knows about.

This particular classifier is trained to tell the difference between healthy and unhealthy snacks, and it should therefore be used only with photos of such snacks. For all other images, say of cute cats, the classifier will give a nonsensical prediction. After all, it only has “healthy” or “unhealthy” to choose from. (And no, we do not endorse having cats as a snack.)

What you want to happen for such an “unsupported” input image is that the model gives a very uncertain prediction, something that is more like a 51%–49% split. In that case, Vision might return two VNClassificationObservation objects like this:

element 0: healthy 51%
element 1: unhealthy 49%

If the model isn’t sure, that’s actually a very acceptable answer: It could be either class. However, since Vision automatically sorts this array by confidence score, the app will show the prediction “healthy” as the winning label. But is it really? Since the model is so uncertain now, changing these percentages only slightly can completely change the outcome:

element 0: unhealthy 52%
element 1: healthy 48%

If you get such a prediction for one of your photos, try taking the same photo again but from a slightly different angle. The small variation between the photos can easily flip the uncertain prediction from one class to the other.

The moral of the story is that when the probabilities get close to 50%–50%, the model doesn’t really know what to make of the image. It’s a good idea to make the app deal with such situations. After all, there is nothing that prevents the user from taking a photo of something that is not a snack.

In processObservations(for:error:), add the following clause to the if statement:

if results.isEmpty {
  . . .
} else if results[0].confidence < 0.8 {
  self.resultsLabel.text = "not sure"
} else {
  . . .

Here, we’ve chosen a threshold value of 0.8 (or 80% confidence). If the model was less confident about its winning prediction than this threshold, you decide that you can’t trust the prediction it made, and the app will say “not sure.”

The threshold value of 0.8 was picked arbitrarily. In practice, you would test this by pointing the phone at many real-world objects to get a feel for which confidence level is trustworthy and below which level the model starts to make too many mistakes. The right value is different for every model. There are also mathematical ways to find a suitable threshold, such as using a Precision-Recall curve or the Receiver Operating Characteristic (ROC) curve.
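One way to make that kind of testing more systematic is to record, for a set of test photos, the winning confidence and whether the winning label was right, and then compare a few candidate thresholds. The following is only a minimal sketch: the Sample type, the evaluate function and the measurements are made up for illustration and are not part of the HealthySnacks app.

```swift
// Hypothetical evaluation record, not part of the HealthySnacks app.
struct Sample {
  let confidence: Double   // confidence of the winning class
  let isCorrect: Bool      // did the winning class match the true label?
}

// For one threshold, report how often the app would give an answer at
// all (coverage) and how often those answers are right (accuracy).
func evaluate(_ samples: [Sample], threshold: Double)
  -> (coverage: Double, accuracy: Double) {
  let answered = samples.filter { $0.confidence >= threshold }
  guard !answered.isEmpty else { return (coverage: 0, accuracy: 0) }
  let correct = answered.filter { $0.isCorrect }.count
  return (coverage: Double(answered.count) / Double(samples.count),
          accuracy: Double(correct) / Double(answered.count))
}

// Made-up measurements, purely for illustration.
let samples = [
  Sample(confidence: 0.95, isCorrect: true),
  Sample(confidence: 0.90, isCorrect: true),
  Sample(confidence: 0.70, isCorrect: false),
  Sample(confidence: 0.60, isCorrect: true),
]

for threshold in [0.5, 0.8] {
  let result = evaluate(samples, threshold: threshold)
  print(threshold, result.coverage, result.accuracy)
}
```

A higher threshold usually means the app says “not sure” more often but is wrong less often when it does answer. Which trade-off is acceptable depends on the app.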

Note: Remember that it doesn’t make sense to test for a confidence below 0.5, as the winning prediction will always have a confidence score of at least 50%. There are only two classes in a binary classifier and their confidence scores need to add up to 100%.

However, it can still happen that you run into a situation like this:

The model was quite confident about this prediction even though the object is far from edible! Sometimes the classifier will give a very confident answer that is totally wrong. This is a limitation of all classifiers.

It’s important to understand that machine learning models will only work reliably when you use them with data that is very similar to the data they’ve been trained on. A model can only make trustworthy predictions on the types of things it has learned about — it will fail spectacularly on anything else. Machine learning often seems like magic… but it does have its limitations.

The only way to fix this kind of problem is to make your model more robust by training it on more images, or by adding a third category so that the model can learn the difference between “healthy snack,” “unhealthy snack,” and “not a snack.” But even then your model will still make errors. Using machine learning for computer vision tasks works really well, but it’s never perfect.

In the chapter on training, you’ll see how you can estimate the quality of the model to get an idea of how well it will work in practice.

What if there’s more than one object in the image?

Image classification always looks at the entire image and tries to find out what the most prominent object in the image is. But nothing stops you from running an image classifier on a picture containing objects from more than one class:

In this example, the classifier has found both an apple and a hot dog, but it seems to think that the hot dog is slightly more important. Perhaps it’s because the hot dog takes up more room in the image, or maybe the model just had a harder time recognizing the apples. In any case, it had to make an impossible choice between two classes that are really supposed to be mutually exclusive and this is what it came up with.

However, based on these percentages, you can’t just say, “This image contains an unhealthy snack.” It does, but it also contains a healthy snack. With the new rule that we just added, the model would say “not sure” for this particular photo, since neither class has over 80% confidence.

But it’s also possible that the model predicts something like 90% healthy or unhealthy for an image such as this. All bets are off, since this is not a problem the HealthySnacks model was really trained for. With an image classifier like this, the input image is really supposed to contain one “main” object, not multiple objects — or at most multiple objects that are all from the same class. The model can’t really handle images with more than one object if they are from different classes.

In any case, image classification works best when there is just a single object in the image. The computer vision task that’s about finding all the objects in an image, and also where they are located in the image, is called object detection and we’ll talk about that in chapter 9, “Beyond Image Classification.”

How does it work?

At this point, you may be wondering exactly how this Core ML model is able to tell apart healthy snacks from unhealthy snacks. The model takes an image as input and produces a probability distribution as output, but what is the magic that makes this happen? Let’s peek under the hood a little.

The HealthySnacks.mlmodel is a so-called neural network classifier. You’ve already seen classification, but you may not know exactly what a neural network is.

Artificial neural networks are inspired by the human brain. The particular neural network used by HealthySnacks is a so-called “convolutional” neural network, which in many ways is similar to how the human visual cortex processes information.

Despite how they’re often depicted in the popular press, it’s really not that useful to think of these artificial neural networks as a computerized version of human brains. Artificial neural networks are only a very crude model of how the human brain works — and not nearly as complicated.

It’s much more constructive to think of a neural network as a pipeline that transforms data in several different stages. A machine learning model is like a Swift function:

let outputs = myModel(inputs)

In the case of an image classifier, the function signature looks like the following, where the input is an image of some kind and the output an array of numbers, the probability distribution over the classes:

func myModel(input: Image) -> [Double] {
  // a lot of fancy math
}

Core ML treats the model as a black box, where input goes into one end and the output comes out the other. Inside this black box it actually looks like a pipeline with multiple stages:

Each of these stages, or layers as we call them, transforms the data in some way. In code, you can think of it as a sequence of map, filter, and reduce operations:

func myModel(input: Image) -> [Double] {
  return input.map({...}).filter({...}).map({...}).reduce({...})
}

That’s really all there is to it. Despite its sci-fi name, a neural network is a very straightforward thing, just a series of successive stages that each transforms the data in its own way, until the data comes out in the form you want. The layers inside an image classifier transform the data from an image into a probability distribution.
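To make the pipeline idea concrete, here is a toy version that actually runs. It is a sketch only: the Layer type, the makeModel function and the three “layers” are invented for this illustration and have nothing to do with the real layers inside HealthySnacks.

```swift
// A layer is just a function that transforms an array of numbers.
typealias Layer = ([Double]) -> [Double]

// Chain the layers: the output of one stage is the input of the next.
func makeModel(_ layers: [Layer]) -> Layer {
  return { input in
    layers.reduce(input) { data, layer in layer(data) }
  }
}

// Three made-up stages, each a very simple operation.
let scale: Layer = { values in values.map { $0 * 2 } }
let clampNegatives: Layer = { values in values.map { max(0, $0) } }
let normalize: Layer = { values in
  let total = values.reduce(0, +)
  return total > 0 ? values.map { $0 / total } : values
}

let toyModel = makeModel([scale, clampNegatives, normalize])
let toyOutput = toyModel([1, -1, 3])
print(toyOutput)
```

Each stage on its own is trivial. The interesting behavior comes from chaining them, which is the same principle a real neural network relies on.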

In modern neural networks, pipelines are not just a straight series of transformations but they can branch and the results of branches can be combined again in a later stage.

For example, the SqueezeNet neural network architecture that the HealthySnacks model is based on looks something like this:

All the magic happens inside the layers that perform the transformations. So surely that must involve lots of complicated math? Well, no. Each individual transformation is a relatively simple mathematical operation. The power of the neural network comes from combining these transformations. By putting many simple transformations together, you end up with a pipeline that can compute the answers to some pretty complex problems.

Early neural networks only used two or three layers (transformations), as training with more layers was fraught with problems. But those problems have been solved in recent years and now we routinely use neural networks with dozens or even hundreds of layers, which is why using these neural nets is called “deep learning.” SqueezeNet has 67 layers although in practice certain types of layers are fused together for better speed.

Into the next dimension

Let’s dive a little deeper into the math, just so you get a better conceptual idea of what these transformations do. Neural networks, like most machine learning models, can only work with numerical data. Fortunately for us, the data we care about in this chapter — the input image and the output probabilities — are all represented as numbers already. Models that work on data such as text would first need to convert that data into numbers.

The input image is 227×227 pixels and is a color image, so you need 227 × 227 × 3 = 154,587 numbers to describe an input image. For the sake of explanation, let’s round this down to 150,000 numbers.

Note: Each pixel needs three numbers because color is stored as RGB: a red, green and blue intensity value. Some images also have a fourth channel, the alpha channel, that stores transparency information, but this is typically not used by image classifiers. It’s OK to use an RGBA image as input, but the classifier will simply ignore the alpha value.

Here’s the big idea: Each of the 227×227 input images can be represented by a unique point in a 150,000-dimensional space.

Whoop, try to wrap your head around that… It’s pretty easy for us humans to think in 3D space but not so much in higher-dimensional spaces, especially not ones with hundreds of thousands of dimensions. But the principle is the same: given 3 numbers (x, y, z) you can describe any point in 3-dimensional space, right? Well, given 150,000 numbers with the RGB values of all the pixels in the image, you end up at a point in 150,000-dimensional space.
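As a rough illustration of what “an image is a point” means in code, the sketch below flattens the pixels of an image into one long array of numbers. The Pixel type and the featureVector function are assumptions made for this example. Core ML does this kind of conversion for you.

```swift
// Hypothetical pixel type: one red, green and blue intensity per pixel.
struct Pixel {
  let r: UInt8
  let g: UInt8
  let b: UInt8
}

// Flatten all pixels into a single array: the coordinates of the point
// that represents this image in pixel space.
func featureVector(from pixels: [Pixel]) -> [Double] {
  return pixels.flatMap { [Double($0.r), Double($0.g), Double($0.b)] }
}

// A solid red 227×227 image, used only to count the coordinates.
let side = 227
let redImage = [Pixel](repeating: Pixel(r: 255, g: 0, b: 0),
                       count: side * side)
let point = featureVector(from: redImage)
print(point.count)
```

The array has 227 × 227 × 3 elements, one coordinate for each dimension of the pixel space.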

By the way, don’t try to think in 150,000 dimensions. Just imagine a 3D space and pretend it’s more than three dimensions. That’s what everyone else does too, since humans simply aren’t capable of visualizing more than three dimensions.

To classify the images, you want to be able to draw a line through this high-dimensional space and say, “All the images containing healthy snacks are on this side of the line, and all the images with unhealthy snacks are on the other side.” If that were possible, then classifying an image would be easy: You’d just have to look at which side of the line the image’s point falls on.

The decision boundary divides up the space into two classes

This line is called the decision boundary. It’s the job of the classifier model to learn where that decision boundary lies. Math alert: It’s not really a line but a hyperplane, which is a subspace that splits the high-dimensional space into two halves. One of the benefits of being a machine learning practitioner is that you get to use cool words such as hyperplane.

The problem is that you cannot draw a nice line — or hyperplane — through the 150,000-dimensional pixel space because ordering the images by their pixel values means that the healthy and unhealthy images are all over the place.

Since pixels capture light intensity, images that have the same color and brightness are grouped together, while images that have different colors are farther apart. Apples can be red or green but, in pixel space, such images are not close together. Candy can also be red or green, so you’ll find pictures of apples mixed up with pictures of candy.

You cannot just look at how red or green something is to decide whether this image contains something healthy or unhealthy.

All the information you need to make a classification is obviously contained in the images, but the way the images are spread out over this 150,000-dimensional pixel space is not very useful. What you want instead is a space where all the healthy snacks are grouped together and all the unhealthy snacks are grouped together, too.

This is where the neural network comes in: The transformations that it performs in each stage of the pipeline will twist, turn, pull and stretch this coordinate space, until all the points that represent healthy snacks will be over on one side and all the points for unhealthy snacks will be on the other, and you can finally draw that line between them.

A concrete example

Here is a famous example that should illustrate the idea. In this example the data is two-dimensional, so each input consists of only two numbers (x, y). This is also a binary classification problem, but in the original coordinate space it’s impossible to draw a straight line between the two classes:

In theory, you could classify this dataset by learning to separate this space using an ellipse instead of a straight line, but that’s rather complicated. It’s much easier to perform a smart transformation that turns the 2D space into a 3D space by giving all points a z-coordinate too. The points from class A (the triangles) get a small z value, the points from class B (the circles) get a larger z value.

Now the picture looks like this:

After applying this transformation, both classes get cleanly separated. You can easily draw a line between them at z = 0.5. Any point with z-coordinate less than 0.5 belongs to class A, and any point with z greater than 0.5 belongs to class B.

The closer a point’s z-coordinate is to the line, the less confident the model is about the class for that point. This also explains why probabilities get closer to 50% when the HealthySnacks model can’t decide whether the snack in the image is healthy or unhealthy. In that case, the image gets transformed to a point that is near the decision boundary. Usually, the decision boundary is a little fuzzy and points with z close to 0.5 could belong to either class A (triangles) or class B (circles).
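Here is a small sketch of this idea in code. The text above doesn’t say how the z-coordinate is computed, so the sketch assumes one common choice: the squared distance from the origin, z = x² + y², which is small for points near the center and larger for points farther out. The logistic function used for the fuzzy boundary and its steepness value are also assumptions made for illustration.

```swift
import Foundation

struct Point2D {
  let x: Double
  let y: Double
}

// Assumed transformation: lift the 2D point into 3D by giving it
// z = x² + y² (its squared distance from the origin).
func lift(_ p: Point2D) -> Double {
  return p.x * p.x + p.y * p.y
}

// After the transformation, a single cut at z = 0.5 separates the classes.
func classify(_ p: Point2D, boundary: Double = 0.5) -> String {
  return lift(p) < boundary ? "A" : "B"
}

// Assumed fuzzy boundary: a logistic curve that is exactly 0.5 on the
// boundary and approaches 0 or 1 as the point moves away from it.
func probabilityOfB(_ p: Point2D, boundary: Double = 0.5,
                    steepness: Double = 10) -> Double {
  return 1 / (1 + exp(-steepness * (lift(p) - boundary)))
}

print(classify(Point2D(x: 0.1, y: 0.2)))        // near the center
print(classify(Point2D(x: 1.0, y: 0.0)))        // farther out
print(probabilityOfB(Point2D(x: 0.5, y: 0.5)))  // right on the boundary
```

A point that lands exactly on the boundary gets a probability of 50% for either class, which matches the “not sure” behavior you saw in the app.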

The cool thing about neural networks is that they can automatically learn to make these kinds of transformations, to convert the input data from a coordinate space where it’s hard to tell the points apart, into a coordinate space where it’s easy. That is exactly what happens when the model is trained. It learns the transformations and how to find the best decision boundary.

To classify a new image, the neural network will apply all the transformations it has learned during training, and then it looks at which side of the line the transformed image falls. And that’s the secret sauce of neural network classification!

The only difference between this simple example and our image classifier is that you’re dealing with 150,000 dimensions instead of two. But the idea, and the underlying mathematics, is exactly the same for 150,000 dimensions as it is for two.

Note: In general, the more complex the data, the deeper the neural network has to be. For the 2D example above, a neural net with just two layers will suffice. For images, which are clearly much more complex, the neural net needs to be deeper because it needs to perform more transformations to get a nice, clean decision boundary.

Over the course of the next chapters, we’ll go into more details about exactly what sort of transformations are performed by the neural network. In a typical deep learning model, these are convolutions (look for patterns made by small groups of pixels, thereby mapping the points from one coordinate space to another), pooling (reduce the size of the image to make the coordinate space smaller), and logistic regression (find where to draw the line / decision boundary).

Multi-class classification

So far, we’ve covered binary classification in which there are only two classes, but it’s also really easy to use a model that can handle multiple classes. This is called… wait for it… a multi-class classifier — or, sometimes, a multinomial classifier.

In this section, you’ll swap out the binary classifier for MultiSnacks.mlmodel, a multi-class classifier that was trained on the exact same data as the binary healthy/unhealthy classifier but that can detect the individual snacks.

Integrating this new model into the app couldn’t be simpler. You can either do this in a copy of your existing app or use the MultiSnacks starter app.

Now, drag the MultiSnacks.mlmodel from this chapter’s downloaded resources into the Xcode project.

If you look at this new .mlmodel file in Xcode, or at the automatically generated code, you’ll notice that it looks exactly the same as before, except that the names of the Swift classes are different (MultiSnacks instead of HealthySnacks) because the name of the .mlmodel file is different, too.

To use this new model, make the following change on the classificationRequest property:

lazy var classificationRequest: VNCoreMLRequest = {
  do {
    let multiSnacks = MultiSnacks()
    let visionModel = try VNCoreMLModel(for: multiSnacks.model)
    . . .

Instead of creating an instance of HealthySnacks, all you need to do is make an instance of MultiSnacks. This is the name of the class that Xcode generated automatically when you added MultiSnacks.mlmodel to the project.

Also change the innermost if statement in processObservations(for:error:) to:

if results.isEmpty {
  self.resultsLabel.text = "nothing found"
} else {
  let top3 = results.prefix(3).map { observation in
    String(format: "%@ %.1f%%", observation.identifier,
           observation.confidence * 100)
  }
  self.resultsLabel.text = top3.joined(separator: "\n")
}

Instead of showing only the best result — the class with the highest confidence score — this now displays the names of the three best classes.

Since the model was trained on 20 different object types, it outputs a probability distribution that looks something like this:

Where previously there were only two values (healthy/unhealthy), there are now 20 possible outcomes, and the 100 total percentage points are distributed over these 20 possible classes, which is why it’s called a probability distribution.

The app displays the three predicted classes with the highest probability values. Since there are now 20 classes, the results array contains 20 VNClassificationObservation objects, sorted from a high to low confidence score. The prefix(3) method grabs elements 0, 1, and 2 from this array (the ones with the highest probabilities), and you use map to turn them into strings.

For the above probability distribution, this gives:

element 0: carrot 72%
element 1: orange 15%
element 2: ice cream 8%

The model is fairly confident about this prediction. The first result has a pretty high score, and so you can probably believe that the image really is of a carrot.

The second result is often fairly reasonable — if you squint, an orange could look like a carrot — but the third result and anything below it can be way off the mark.

Given these confidence scores, that’s OK; the model really didn’t think ice cream was a reasonable guess here at only 8% confidence.

Note: The percentages of these top three choices don’t have to add up to 100%, since there are another 17 classes that will make up the remainder.

Notice that, when you made these changes to the code, you removed the if statement that checked whether the confidence was less than 80%.

That check made sense for a binary classifier but, when you have multiple classes, the best confidence will often be around the 60% mark. That’s still a pretty confident score.

With a binary classifier and two classes, a random guess is correct 50% of the time. But with 20 classes, a random guess would be correct only 1/20th, or 5%, of the time.

When the multi-class model is very unsure about what is in the image, the probability distribution would look more like this:

You could still add a kind of “not sure” threshold, but a more reasonable value would be 0.4, or 40%, instead of the 80% that you used with the binary classifier.
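If you do want such a check, it could look something like the following. This is a sketch rather than the app’s actual code: it works on plain (identifier, confidence) pairs instead of VNClassificationObservation objects so that it can run on its own, and the summary function name is made up.

```swift
import Foundation

// Builds the text for the results label from predictions that are
// already sorted from high to low confidence.
func summary(of results: [(identifier: String, confidence: Double)],
             threshold: Double = 0.4) -> String {
  guard let best = results.first else { return "nothing found" }
  if best.confidence < threshold { return "not sure" }
  return results.prefix(3).map { result in
    "\(result.identifier) "
      + String(format: "%.1f%%", result.confidence * 100)
  }.joined(separator: "\n")
}

// Made-up predictions for illustration.
let confident = [(identifier: "carrot", confidence: 0.72),
                 (identifier: "orange", confidence: 0.15),
                 (identifier: "ice cream", confidence: 0.08)]
let unsure = [(identifier: "muffin", confidence: 0.12),
              (identifier: "cookie", confidence: 0.11)]

print(summary(of: confident))
print(summary(of: unsure))
```

The 0.4 default is the value suggested above. As with the binary classifier, it is a starting point that you would tune by testing.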

Still, just like a binary classifier, the predictions from a multi-class model only make sense if you show it the types of objects that it has been trained to recognize.

If you give the new classifier an image of something that is not one of the 20 kinds of snacks it knows about, such as a dachshund, the model may return a very unsure prediction (“it could be anything”) or a very confident but totally wrong prediction (“it’s a hot dog”).

Again, you can ask: What happens when an image contains objects of more than one class?

Well, just as with the binary classifier, in which predictions became very uncertain (50–50), a similar thing happens, but now the probabilities get divided over more classes:

In this example, the classifier correctly recognizes apples and carrots as the top choices, and it tries to split the probabilities between them.

This is why you’re looking at the top three results instead of just the single best score. In image classification competitions, classifiers are usually scored on how well they do on their five best guesses since, that way, you can deal with one image containing more than one object or with objects that are a little ambiguous. As long as the correct answer is among the best five (or three) guesses, we’re happy.

The top-one accuracy says, “Did the classifier get the most important object right?” while the top-three or top-five accuracy says, “Did it find all of the important objects?” For example, if an image that scored orange 70%, watermelon 21%, and muffin 3% really contained a watermelon and not an orange, it would still be counted as a correct classification.
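The scoring rule described above is short enough to write down. The sketch below assumes the predicted labels are already sorted from most to least confident. The function names are invented for this example.

```swift
// True if the correct label is among the k best guesses.
func isTopKCorrect(_ rankedLabels: [String],
                   trueLabel: String, k: Int) -> Bool {
  return rankedLabels.prefix(k).contains(trueLabel)
}

// Fraction of examples whose correct label is among the k best guesses.
func topKAccuracy(_ examples: [(ranked: [String], trueLabel: String)],
                  k: Int) -> Double {
  guard !examples.isEmpty else { return 0 }
  let hits = examples.filter {
    isTopKCorrect($0.ranked, trueLabel: $0.trueLabel, k: k)
  }.count
  return Double(hits) / Double(examples.count)
}

// The example from the text: orange 70%, watermelon 21%, muffin 3%.
let ranked = ["orange", "watermelon", "muffin"]
let top1 = isTopKCorrect(ranked, trueLabel: "watermelon", k: 1)
let top3 = isTopKCorrect(ranked, trueLabel: "watermelon", k: 3)
print(top1, top3)
```

For the watermelon photo, the top-one check fails because the best guess was orange, while the top-three check succeeds.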

Note: Don’t confuse multi-class with “multi-label.” A multi-class classifier’s job is to choose a single category for an object from multiple categories. A multi-label classifier’s job is to choose as many categories as applicable for the same object. For example, a multi-label snacks classifier could classify an apple as “healthy”, “fruit”, and “red”.
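The difference shows up in how you read the model’s scores. The sketch below is an illustration under assumptions: it supposes that a multi-label model gives each label its own independent score, so the scores need not add up to 100%, and that a label applies when its score reaches 0.5. The function names and scores are made up.

```swift
// Multi-class: pick the single label with the highest score.
func multiClassLabel(_ scores: [String: Double]) -> String? {
  return scores.max { $0.value < $1.value }?.key
}

// Multi-label (assumed rule): keep every label whose own score reaches
// the threshold, so zero, one or many labels can apply.
func multiLabels(_ scores: [String: Double],
                 threshold: Double = 0.5) -> [String] {
  return scores.filter { $0.value >= threshold }.map { $0.key }.sorted()
}

// Made-up scores for a photo of a red apple.
let classScores = ["apple": 0.7, "carrot": 0.2, "candy": 0.1]
let labelScores = ["healthy": 0.9, "fruit": 0.8, "red": 0.7, "candy": 0.1]

print(multiClassLabel(classScores) ?? "none")
print(multiLabels(labelScores))
```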

Bonus: Using Core ML without Vision

You’ve seen how easy it is to use Core ML through the Vision framework. Given the amount of work Vision does for you already, it’s recommended to always use Vision when you’re working with image models. However, it is also possible to use Core ML without Vision, and in this section you’ll see how to do so.

For this section, use the starter project again and add the HealthySnacks.mlmodel to the project.

First, take a detailed look at the auto-generated code, since you’ll use this shortly. To see this source file, first click on HealthySnacks.mlmodel in the Project navigator and then click on the little arrow next to the model name in the “Model Class” section.

This opens HealthySnacks.swift, a special source code file that doesn’t actually exist anywhere in the project.

The main class in this source file is HealthySnacks (located near the bottom of the file). It has a single property named model, which is an instance of MLModel, the main class in the Core ML framework. init loads the .mlmodel from the main bundle.

There are two prediction methods. The first of these takes an object of type HealthySnacksInput and returns a HealthySnacksOutput. The second one is a convenience method that takes a CVPixelBuffer object as input instead. Notice that there are no methods that accept a CGImage or a CIImage like with Vision.

Both HealthySnacksInput and HealthySnacksOutput are classes that implement the MLFeatureProvider protocol. Remember from the previous chapter that “feature” is the term we use for any value that we use for learning. An MLFeatureProvider is an object that describes such features to Core ML.

In the case of the HealthySnacksInput, there is just one feature: an image in the form of a CVPixelBuffer object that is 227 pixels wide and 227 pixels high. Actually the model will treat each R/G/B value in this image as a separate feature, so, technically speaking, the model has 227 × 227 × 3 input features.

The HealthySnacksOutput class provides two features containing the outputs of the model: a dictionary called labelProbability and a string called simply label. The label is simply the name of the class with the highest probability and is provided for convenience.

The dictionary contains the names of the classes and the confidence score for each class, so it’s the same as the probability distribution but in the form of a dictionary instead of an array. The difference with Vision’s array of VNClassificationObservation objects is that the dictionary is not sorted.

Note: The names that Xcode generates for these properties depend on the names of the inputs and outputs in the .mlmodel file. For this particular model, the input is called “image” and so the method becomes prediction(image:). If the input were called something else in the .mlmodel file, such as “data,” then the method would be prediction(data:). The same is true for the names of the outputs in the HealthySnacksOutput class. This is something to be aware of when you’re importing a Core ML model: different models will have different names for the inputs and outputs — another thing you don’t have to worry about when using Vision.

In order to use the HealthySnacks class without Vision, you have to call its prediction(image:) method and give it a CVPixelBuffer containing the image to classify. When the prediction method is done it returns the classification result as a HealthySnacksOutput object.

Next, you’ll write this code. Switch to ViewController.swift and add the following property to ViewController to create an instance of the model:

let healthySnacks = HealthySnacks()

Now, you need a way to convert the UIImage from UIImagePickerController into a CVPixelBuffer. This object is a low-level description of image data, used by Core Video and AVFoundation. You’re probably used to working with images as UIImage or CGImage objects, so you need to convert these to CVPixelBuffers first.

Add the following function to the class:

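func pixelBuffer(for image: UIImage) -> CVPixelBuffer? {
  let model = healthySnacks.model

  // Get the CGImage behind the UIImage and the constraint that
  // describes the model's "image" input. Bail out if either is missing.
  guard let cgImage = image.cgImage,
        let imageConstraint = model.modelDescription
                                   .inputDescriptionsByName["image"]?
                                   .imageConstraint else {
    return nil
  }

  // Resize using one of Vision's crop-and-scale options.
  let imageOptions: [MLFeatureValue.ImageOption: Any] = [
    .cropAndScale: VNImageCropAndScaleOption.scaleFill.rawValue
  ]

  return try? MLFeatureValue(
    cgImage: cgImage,
    constraint: imageConstraint,
    options: imageOptions).imageBufferValue
}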

The constraint is an MLImageConstraint object that describes the image size that is expected by the model input. The options dictionary lets you specify how the image gets resized and cropped. This uses the same options as Vision, but you can also give it a CGRect with a custom cropping region. There is also a version of this MLFeatureValue constructor that lets you pass in an orientation value for the image if it is not upright.

Note: This API is only available from iOS 13 onward. In the downloads for this chapter, we’ve provided a UIImage extension that converts the UIImage to a CVPixelBuffer for older versions of iOS.

Change the classify(image:) method to the following:

func classify(image: UIImage) {
  DispatchQueue.global(qos: .userInitiated).async {
    // 1
    if let pixelBuffer = self.pixelBuffer(for: image) {
      // 2
      if let prediction = try? self.healthySnacks.prediction(
        image: pixelBuffer) {
        // 3
        let results = self.top(1, prediction.labelProbability)
        self.processObservations(results: results)
      } else {
        self.processObservations(results: [])
      }
    }
  }
}

Here’s how this works:

1. Convert the UIImage to a CVPixelBuffer using the helper method. This scales the image to the expected size (227×227). The helper as written does not pass an orientation value, so use the MLFeatureValue constructor that takes an orientation if your photos may not be upright.

2. Call the prediction(image:) method. This can potentially fail (if the image buffer is not 227×227 pixels, for example), which is why you need to use try? and put it inside the if let.

3. The prediction object is an instance of HealthySnacksOutput. You can look at its label property to find the name of the best class, but you want to look at the names of the best scoring classes as well as their probabilities. That’s what the self.top function does.

Because MLModel’s prediction method is synchronous, it blocks the current thread until it’s done. For this simple image classifier, that may not be a big deal as it’s fairly fast, but it’s good practice to do the prediction on a background queue anyway.

Xcode now gives errors because the code calls two methods you still need to add. First, add the top() method:

func top(_ k: Int, _ prob: [String: Double])
  -> [(String, Double)] {
  return Array(prob.sorted { $0.value > $1.value }
                   .prefix(min(k, prob.count)))
}

This looks at the dictionary from prediction.labelProbability and returns the k best predictions as an array of (String, Double) pairs where the string is the label (name of the class) and the Double is the probability / confidence score for that class.

Currently you’re calling top(1, …) because, for the HealthySnacks model, you only care about the highest-scoring class. For the MultiSnacks model, you might call top(3, …) to get the three best results.
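To see what such a call returns, here is a stand-alone sketch you could paste into a playground. It uses a free function named topK, a renamed variant of the method above that returns labeled tuples. The labelProbability dictionary is made up, reusing the numbers from the carrot example earlier in this chapter.

```swift
// Same idea as the top() method, written as a free function so that it
// can run on its own.
func topK(_ k: Int, _ probabilities: [String: Double])
  -> [(label: String, probability: Double)] {
  return probabilities
    .sorted { $0.value > $1.value }   // highest probability first
    .prefix(max(0, k))                // prefix never reads past the end
    .map { (label: $0.key, probability: $0.value) }
}

// Hypothetical model output. Note that a dictionary has no fixed order.
let labelProbability: [String: Double] = [
  "ice cream": 0.08, "carrot": 0.72, "apple": 0.05, "orange": 0.15,
]

let best = topK(3, labelProbability)
for (label, probability) in best {
  print(label, probability)
}
```

Asking for more results than there are classes is harmless: you simply get all of them back, sorted.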

Finally, you can put these (String, Double) pairs into a string to show in the results label:

func processObservations(
        results: [(identifier: String, confidence: Double)]) {
  DispatchQueue.main.async {
    if results.isEmpty {
      self.resultsLabel.text = "nothing found"
    } else if results[0].confidence < 0.8 {
      self.resultsLabel.text = "not sure"
    } else {
      self.resultsLabel.text = String(
        format: "%@ %.1f%%",
        results[0].identifier,
        results[0].confidence * 100)
    }
    self.showResultsView()
  }
}

This is very similar to what you did in the Vision version of the app but the results are packaged slightly differently.

So this actually wasn’t too bad, was it? It may even seem like a bit less work than what you had to do for Vision. But this is a little misleading… There are a few important things the pure Core ML version does not do yet, such as color space matching; this translates from the photo’s color space, which is often sRGB or P3 or even YUV, to the generic RGB space used by the model.

Challenge

Challenge 1: Add SqueezeNet model to the app

Apple provides a number of Core ML models that you can download for free, from https://developer.apple.com/machine-learning/models/.

Your challenge for this chapter is to download the SqueezeNet model and add it to the app. This model is very similar to the classifier you implemented in this chapter, which is also based on SqueezeNet. The main difference is that HealthySnacks is trained to classify 20 different snacks into two groups: healthy or unhealthy. The SqueezeNet model from Apple is trained to understand 1,000 classes of different objects (it’s a multi-class classifier).

Try to add this new model to the app. It should only take the modification of a single line to make this work. That’s how easy it is to integrate Core ML models into your app, because they all work pretty much the same way.

Key points

To recap, doing image classification with Core ML and Vision in your app involves the following steps:

1. Obtain a trained .mlmodel file from somewhere. You can sometimes find pre-trained models on the web (Apple has a few on its website) but usually you’ll have to build your own. You’ll learn how to do this in the next chapter.

2. Add the .mlmodel file to your Xcode project.

3. Create the VNCoreMLRequest object (just once) and give it a completion handler that looks at the VNClassificationObservation objects describing the results.

4. For every image that you want to classify, create a new VNImageRequestHandler object and tell it to perform the VNCoreMLRequest.

These steps will work for any kind of image classification model. In fact, you can copy the code from this chapter and use it with any Core ML image classifier.
