
Chapter 2: Getting Started with Image Classification
Written by Matthijs Hollemans

Let’s begin your journey into the world of machine learning by creating a binary image classifier.

A classifier is a machine learning model that takes an input of some kind, in this case an image, and determines what sort of “thing” that input represents. An image classifier tells you which category, or class, the image belongs to.

Binary means that the classifier is able to distinguish between two classes of objects. For example, you can have a classifier that will answer either “cat” or “dog” for a given input image, just in case you have trouble telling the two apart.

A binary classifier for cats and dogs

Being able to tell the difference between only two things may not seem very impressive, but binary classification is used a lot in practice.

In medical testing, it determines whether a patient has a disease, where the “positive” class means the disease is present and the “negative” class means it’s not. Another common example is filtering email into spam/not spam.

There are plenty of questions that have a definite “yes/no” answer, and the machine learning model to use for such questions is a binary classifier. The cats-vs.-dogs classifier can be framed as answering the question: “Is this a picture of a cat?” If the answer is no, it’s a dog.

Image classification is one of the most fundamental computer vision tasks. Advanced applications of computer vision — such as object detection, style transfer, and image generation — all build on the same ideas from image classification, making this a great place to start.

There are many ways to create an image classifier, but by far the best results come from using deep learning. The success of deep learning in image classification is what started the current hype around AI and ML. We wouldn’t want you to miss out on all this exciting stuff, and so the classifier you’ll be building in this chapter uses deep learning under the hood.

Is that snack healthy?

In this chapter you’ll learn how to build an image classifier that can tell the difference between healthy and unhealthy snacks.

To get started, make sure you’ve downloaded the supplementary materials for this chapter and open the HealthySnacks starter project in Xcode.

This is a very basic iPhone app with two buttons, an image view, and a text label at the top:

The design of the app

The “picture frame” button on the left lets you choose a photo from the library using UIImagePickerController. The “camera” button on the right lets you take a picture with the camera (this button is disabled in the simulator).

Once you’ve selected a picture, the app calls classify(image:) in ViewController.swift to decide whether the image is of a healthy snack or not. Currently this method is empty. In this chapter you’ll be adding code to this method to run the classifier.

At this point, it’s a good idea to take a brief look at ViewController.swift to familiarize yourself with the code. It’s pretty standard fare for an iOS app.

In order to do machine learning on the device, you need to have a trained model. For the HealthySnacks app, you’ll need a model that has learned how to tell healthy snacks apart from unhealthy snacks. In this chapter you’ll be using a ready-made model that has already been trained for you, and in the next chapter you’ll learn how to train this model yourself.

The model is trained to recognize the following snacks:

The categories of snacks

For example, if you point the camera at an apple and snap a picture, the app should say “healthy”. If you point the camera at a hotdog, it should say “unhealthy”.

What the model actually predicts is not just a label (“healthy” or “unhealthy”) but a probability distribution, where each classification is given a probability value:

An example probability distribution

If your math and statistics are a little rusty, then don’t let terms such as “probability distribution” scare you. A probability distribution is simply a list of positive numbers that add up to 1.0. In this case it is a list of two numbers because this model has two classes:

[0.15, 0.85]

The above prediction is for an image of a waffle with strawberries on top. The model is 85% sure that the object in this picture is unhealthy. Because the predicted probabilities always need to add up to 100% (or 1.0), this outcome also means the classifier is 15% sure this snack is healthy — thanks to the strawberries.

You can interpret these probabilities as the confidence the model has in its predictions. A waffle without strawberries would likely score even higher for unhealthy, perhaps as much as 98%, leaving only 2% for the healthy class. The more confident the model is about its prediction, the closer one of the probabilities gets to 100% and the other to 0%. When the difference between them is large, as in this example, the model is sure about its prediction.

Ideally, you would have a model that is always confident and never wrong. However, sometimes it’s very hard for the model to draw a solid conclusion about the image. Can you tell whether the food in the following image is mostly healthy or unhealthy?

What is this even?

The less confident the model is, the closer both probabilities move toward the middle, or 50%.

When the probability distribution looks like the following, the model just isn’t very sure, and you cannot really trust the prediction — it could be either class.

An unconfident prediction

This happens when the image has elements of both classes — salad and greasy stuff — so it’s hard for the model to choose between the two classes. It also happens when the image is not about food at all, and the model does not know what to make of it.

To recap, the input to the image classifier is an image and the output is a probability distribution, a list of numbers between 0 and 1.

Since you’re going to be building a binary classifier, the probability distribution is made up of just two numbers. The easiest way to decide which class is the winner is to choose the one with the highest predicted probability.
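
For two classes that is just a comparison of two numbers. Here’s a minimal sketch, using the waffle example from earlier (this isn’t code from the app, just an illustration):

// The prediction for the waffle with strawberries:
let probabilities: [String: Double] = ["healthy": 0.15, "unhealthy": 0.85]

// The winner is simply the entry with the largest probability.
if let winner = probabilities.max(by: { $0.value < $1.value }) {
  print(winner.key)    // unhealthy
  print(winner.value)  // 0.85
}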

Note: To keep things manageable for this book, we only trained the model on twenty types of snacks (ten healthy, ten unhealthy). If you take a picture of something that isn’t in the list of twenty snacks, such as broccoli or pizza, the prediction could be either healthy or unhealthy. The model wasn’t trained to recognize such things and, therefore, what it predicts is anyone’s guess. That said, the model might still guess right on broccoli (it’s green, which is similar to other healthy snacks) and pizza (it’s greasy and therefore unhealthy).

Core ML

For many of the projects in this book, you’ll be using Core ML, Apple’s machine learning framework that was introduced with iOS 11. Core ML makes it really easy to add machine learning models to your app — it’s mostly a matter of dropping a trained model into your app and calling a few API functions.

Looking at the mlmodel file

For this model, the output you care about is labelProbability, a dictionary that maps each class name to the probability the model assigned to it:

labelProbability = [ "healthy": 0.15, "unhealthy": 0.85 ]
Click the arrow to view the generated code
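
Clicking that arrow shows the Swift class that Xcode automatically generated for the model. The exact code varies between Xcode versions, but for this model the important parts boil down to roughly the following outline (a simplified sketch, not the literal generated file):

// HealthySnacks.swift: simplified outline of the generated wrapper
class HealthySnacks {
  let model: MLModel                        // the underlying Core ML model

  // Takes a 227×227 image and runs the classifier on it.
  func prediction(image: CVPixelBuffer) throws -> HealthySnacksOutput
}

class HealthySnacksOutput {
  // The probability distribution over the two classes,
  // for example ["healthy": 0.15, "unhealthy": 0.85].
  let labelProbability: [String: Double]
}

You’ll call this generated class directly in the bonus section at the end of this chapter; until then, Vision takes care of talking to it for you.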

Vision

Along with Core ML, Apple also introduced the Vision framework in iOS 11. As you can guess from its name, Vision helps with computer vision tasks. For example, it can detect rectangular shapes and text in images, detect faces and even track moving objects.

Creating the VNCoreML request

To add image classification to the app, you’re going to implement classify(image:) in ViewController.swift. This method is currently empty. Start by adding the following imports and the classificationRequest property to ViewController.swift:

import CoreML
import Vision
lazy var classificationRequest: VNCoreMLRequest = {
  do {
    // 1: Create an instance of the Core ML model.
    let healthySnacks = HealthySnacks()
    // 2: Wrap the model in a VNCoreMLModel so Vision can work with it.
    let visionModel = try VNCoreMLModel(
      for: healthySnacks.model)
    // 3: Create the Vision request with a completion handler that is
    //    called when the classification is done.
    let request = VNCoreMLRequest(model: visionModel,
                                  completionHandler: {
      [weak self] request, error in
      print("Request is finished!", request.results)
    })
    // 4: Tell Vision how to resize images to the model's expected input size.
    request.imageCropAndScaleOption = .centerCrop
    return request
  } catch {
    fatalError("Failed to create VNCoreMLModel: \(error)")
  }
}()

Crop and scale options

It has been mentioned a few times now that the model you’re using, which is based on SqueezeNet, requires input images that are 227×227 pixels. Since you’re using Vision, you don’t really need to worry about this — Vision will automatically scale the image to the correct size. However, there is more than one way to resize an image, and you need to choose the correct method for the model, otherwise it might not work as well as you’d hoped.

The centerCrop option

The scaleFill and scaleFit options
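
In code, the choice is a single property on the request. For reference, here is what each of the three options does (a summary, not new API):

// Scale until the short side is 227 pixels, then crop the center square.
// Part of the image is cut off, but nothing gets distorted. This is what
// the HealthySnacks app uses.
request.imageCropAndScaleOption = .centerCrop

// Squash the whole image into 227×227, ignoring its aspect ratio.
request.imageCropAndScaleOption = .scaleFill

// Fit the whole image inside 227×227, keeping the aspect ratio and
// padding the rest of the square.
request.imageCropAndScaleOption = .scaleFit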

Performing the request

Now that you have the request object, you can implement the classify(image:) method. Add the following code to that method:

func classify(image: UIImage) {
  // 1: Convert the UIImage to a CIImage, which is what Vision expects.
  guard let ciImage = CIImage(image: image) else {
    print("Unable to create CIImage")
    return
  }
  // 2: Capture the image's orientation so Vision knows which way is up.
  let orientation = CGImagePropertyOrientation(
    image.imageOrientation)
  // 3: Run the classification on a background queue to keep the UI responsive.
  DispatchQueue.global(qos: .userInitiated).async {
    // 4: Create a handler for this image and perform the request on it.
    let handler = VNImageRequestHandler(
      ciImage: ciImage,
      orientation: orientation)
    do {
      try handler.perform([self.classificationRequest])
    } catch {
      print("Failed to perform classification: \(error)")
    }
  }
}

Image orientation

When you take a photo with the iPhone’s camera, regardless of how you’re holding the phone, the image data is stored as landscape because that’s the native orientation of the camera sensor.

This cat is not right-side up
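
UIKit and ImageIO describe orientation with two different types, and the CGImagePropertyOrientation(image.imageOrientation) call in classify(image:) relies on an initializer that converts between them, which the SDK doesn’t provide out of the box. If your project doesn’t already include such a helper, a minimal version looks like this:

import ImageIO
import UIKit

extension CGImagePropertyOrientation {
  // Map each UIKit orientation case onto its EXIF counterpart.
  init(_ orientation: UIImage.Orientation) {
    switch orientation {
    case .up:            self = .up
    case .upMirrored:    self = .upMirrored
    case .down:          self = .down
    case .downMirrored:  self = .downMirrored
    case .left:          self = .left
    case .leftMirrored:  self = .leftMirrored
    case .right:         self = .right
    case .rightMirrored: self = .rightMirrored
    @unknown default:    self = .up
    }
  }
}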

Trying it out

At this point, you can build and run the app and choose a photo.

Request is finished! Optional([<VNClassificationObservation: 0x60c00022b940> B09B3F7D-89CF-405A-ABE3-6F4AF67683BB 0.81705 "healthy" (0.917060), <VNClassificationObservation: 0x60c000223580> BC9198C6-8264-4B3A-AB3A-5AAE84F638A4 0.18295 "unhealthy" (0.082940)])

Showing the results

Inside the declaration of lazy var classificationRequest, change the completion handler for the VNCoreMLRequest object to the following:

let request = VNCoreMLRequest(
  model: visionModel,
  completionHandler: { [weak self] request, error in
    // add this
    self?.processObservations(for: request, error: error)
  })

Then add the processObservations(for:error:) method to ViewController.swift:

func processObservations(
  for request: VNRequest,
  error: Error?) {
  // 1: Always update the UI from the main queue.
  DispatchQueue.main.async {
    // 2: The results, if any, are VNClassificationObservation objects.
    if let results = request.results
      as? [VNClassificationObservation] {
      // 3: Show the winning class and its confidence, or a fallback message.
      if results.isEmpty {
        self.resultsLabel.text = "nothing found"
      } else {
        self.resultsLabel.text = String(
          format: "%@ %.1f%%",
          results[0].identifier,
          results[0].confidence * 100)
      }
    // 4: Something went wrong while performing the request.
    } else if let error = error {
      self.resultsLabel.text =
        "error: \(error.localizedDescription)"
    } else {
      self.resultsLabel.text = "???"
    }
    // 5: Make the results view visible.
    self.showResultsView()
  }
}
Predictions on a few test images

The app shows the winning class and the confidence it has in this prediction. In the above image on the left, the class is “healthy” and the confidence is 94.8%.

What if the image doesn’t have a snack?

When the photo isn’t of a snack at all, or has elements of both classes in it, the predicted probabilities tend to end up close to the middle, like these:

element 0: healthy 51%
element 1: unhealthy 49%

element 0: unhealthy 52%
element 1: healthy 48%

In that case the winning class doesn’t mean much, so it’s reasonable to refuse to give an answer when the confidence is too low. In processObservations(for:error:), extend the if-statement like this:

if results.isEmpty {
  . . .
} else if results[0].confidence < 0.8 {
  self.resultsLabel.text = "not sure"
} else {
  . . .
}
Yeah, I wouldn’t eat this either

What if there’s more than one object in the image?

Image classification always looks at the entire image and tries to find out what the most prominent object in the image is. But nothing stops you from running an image classifier on a picture containing objects from more than one class:

Make up your mind!

How does it work?

At this point, you may be wondering exactly how this Core ML model is able to tell apart healthy snacks from unhealthy snacks. The model takes an image as input and produces a probability distribution as output, but what is the magic that makes this happen? Let’s peek under the hood a little.

let outputs = myModel(inputs)

func myModel(input: Image) -> [Double] {
  // a lot of fancy math
}
The model is a pipeline

func myModel(input: Image) -> [Double] {
  return input.map({...}).filter({...}).map({...}).reduce({...})
}
Part of the SqueezeNet pipeline

Into the next dimension

Let’s dive a little deeper into the math, just so you get a better conceptual idea of what these transformations do. Neural networks, like most machine learning models, can only work with numerical data. Fortunately for us, the data we care about in this chapter — the input image and the output probabilities — are all represented as numbers already. Models that work on data such as text would first need to convert that data into numbers.

Pretend this is 150,000 dimensions
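
Where does a number like 150,000 come from? Each pixel of the 227×227 input image contributes one number per color channel, so assuming the usual three RGB channels the image really is one long list of values:

// One value per pixel per color channel (assuming RGB):
let width = 227, height = 227, channels = 3
let dimensions = width * height * channels
print(dimensions)   // 154587, roughly 150,000 numbers per image

To the model, every photo is just a point in this enormous space, and classification amounts to deciding which side of the decision boundary that point falls on.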

The decision boundary divides up the space into two classes

A concrete example

Here is a famous example that should illustrate the idea. In this example the data is two-dimensional, so each input consists of only two numbers (x, y). This is also a binary classification problem, but in the original coordinate space it’s impossible to draw a straight line between the two classes:

An impossible classification problem…

…but easy after transforming the data
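
The classic version of this picture has one class forming a ring around the other, so no straight line in (x, y) space can separate them. But add a coordinate that measures each point’s distance from the origin, and a single straight threshold does the job. Here is a sketch of that transformation (the points are made up for illustration):

// Two made-up points, one from each class: the inner class sits near
// the origin, the outer class forms a ring around it.
let innerPoint = (x: 0.3, y: -0.2)
let outerPoint = (x: 1.8, y: 1.1)

// The transformation: map (x, y) to its squared distance from the origin.
func transform(_ p: (x: Double, y: Double)) -> Double {
  return p.x * p.x + p.y * p.y
}

// In the transformed space, a single threshold separates the classes.
print(transform(innerPoint))  // 0.13, clearly below the threshold
print(transform(outerPoint))  // 4.45, clearly above it

The layers of a neural network perform the same kind of trick, except they learn which transformations to apply from the training data.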

Multi-class classification

So far, we’ve covered binary classification in which there are only two classes, but it’s also really easy to use a model that can handle multiple classes. This is called… wait for it… a multi-class classifier — or, sometimes, a multinomial classifier.

Recognizing multiple classes

lazy var classificationRequest: VNCoreMLRequest = {
  do {
    let multiSnacks = MultiSnacks()
    let visionModel = try VNCoreMLModel(for: multiSnacks.model)
    . . .

And in processObservations(for:error:), change the code that fills in the results label so that it shows the top three predictions:

if results.isEmpty {
  self.resultsLabel.text = "nothing found"
} else {
  let top3 = results.prefix(3).map { observation in
    String(format: "%@ %.1f%%", observation.identifier,
           observation.confidence * 100)
  }
  self.resultsLabel.text = top3.joined(separator: "\n")
}
The new probability distribution

element 0: carrot 72%
element 1: orange 15%
element 2: ice cream 8%
When the multi-class model is unsure

Image with multiple types of fruit

Bonus: Using Core ML without Vision

You’ve seen how easy it is to use Core ML through the Vision framework. Given the amount of work Vision does for you already, it’s recommended to always use Vision when you’re working with image models. However, it is also possible to use Core ML without Vision, and in this section you’ll see how to do so.

Viewing the generated code

let healthySnacks = HealthySnacks()

func pixelBuffer(for image: UIImage) -> CVPixelBuffer? {
  let model = healthySnacks.model

  // Ask the model what size and pixel format it expects for the "image" input.
  let imageConstraint = model.modelDescription
                             .inputDescriptionsByName["image"]!
                             .imageConstraint!

  // Without Vision, you choose the crop-and-scale behavior yourself.
  let imageOptions: [MLFeatureValue.ImageOption: Any] = [
    .cropAndScale: VNImageCropAndScaleOption.scaleFill.rawValue
  ]

  // Let Core ML convert the UIImage into a CVPixelBuffer of the right size.
  return try? MLFeatureValue(
    cgImage: image.cgImage!,
    constraint: imageConstraint,
    options: imageOptions).imageBufferValue
}
func classify(image: UIImage) {
  DispatchQueue.global(qos: .userInitiated).async {
    // 1: Convert the UIImage into the CVPixelBuffer the model expects.
    if let pixelBuffer = self.pixelBuffer(for: image) {
      // 2: Make the prediction by calling the generated class directly.
      if let prediction = try? self.healthySnacks.prediction(
        image: pixelBuffer) {
        // 3: Find the highest-scoring class and display it.
        let results = self.top(1, prediction.labelProbability)
        self.processObservations(results: results)
      } else {
        self.processObservations(results: [])
      }
    }
  }
}
func top(_ k: Int, _ prob: [String: Double]) 
  -> [(String, Double)] {
  return Array(prob.sorted { $0.value > $1.value }
                   .prefix(min(k, prob.count)))
}
func processObservations(
        results: [(identifier: String, confidence: Double)]) {
  DispatchQueue.main.async {
    if results.isEmpty {
      self.resultsLabel.text = "nothing found"
    } else if results[0].confidence < 0.8 {
      self.resultsLabel.text = "not sure"
    } else {
      self.resultsLabel.text = String(
        format: "%@ %.1f%%",
        results[0].identifier,
        results[0].confidence * 100)
    }
    self.showResultsView()
  }
}

Challenge

Challenge 1: Add the SqueezeNet model to the app

Apple provides a number of Core ML models that you can download for free, from https://developer.apple.com/machine-learning/models/.
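
Download the SqueezeNet model from that page, add it to the project and switch the app over to it. If you get stuck: the wiring is exactly the same as for HealthySnacks and MultiSnacks. Assuming the file you dragged in is named SqueezeNet.mlmodel, so that Xcode generates a SqueezeNet class, the request could be created like this (a sketch of one possible solution):

lazy var classificationRequest: VNCoreMLRequest = {
  do {
    // SqueezeNet is the class Xcode generates from SqueezeNet.mlmodel.
    let squeezeNet = SqueezeNet()
    let visionModel = try VNCoreMLModel(for: squeezeNet.model)
    let request = VNCoreMLRequest(model: visionModel,
                                  completionHandler: { [weak self] request, error in
      self?.processObservations(for: request, error: error)
    })
    request.imageCropAndScaleOption = .centerCrop
    return request
  } catch {
    fatalError("Failed to create VNCoreMLModel: \(error)")
  }
}()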

Key points

To recap, doing image classification with Core ML and Vision in your app involves the following steps:

- Add the trained .mlmodel file to your Xcode project; Xcode automatically generates a Swift class for it.
- Create a VNCoreMLModel from the generated class and a VNCoreMLRequest that uses it, with a completion handler to receive the results.
- For every image you want to classify, create a VNImageRequestHandler with the image and its orientation, and perform the request on a background queue.
- Read the VNClassificationObservation objects from the request’s results; each one pairs a class name with a confidence value, and the first one is the winner.
