ML Kit Tutorial for iOS: Recognizing Text in Images

In this ML Kit tutorial, you’ll learn how to leverage Google’s ML Kit to detect and recognize text. By David East.


Understanding Image Scaling

The default scanned-text.png image is 654×999 (width × height); however, the UIImageView has a “Content Mode” of “Aspect Fit,” which scales the image to 375×369 in the view. ML Kit works with the image at its actual size and returns each element’s frame in that coordinate space. Drawing those full-size frames onto the smaller, scaled image produces a confusing result.

Compare actual size vs scaled size

In the picture above, notice the differences between the scaled size and the actual size. You can see that the frames match up on the actual size. To get the frames in the right place, you need to calculate the scale of the image versus the view.

The formula is fairly simple (👀…fairly):

  1. Calculate the resolutions of the view and image.
  2. Determine the scale by comparing resolutions.
  3. Calculate the height, width and origin point (x and y) by multiplying each by the scale.
  4. Use those data points to create a new CGRect.

If that sounds confusing, it’s OK! You’ll understand when you see the code.

Calculating the Scale

Open ScaledElementProcessor.swift and add the following method:

// 1
private func createScaledFrame(
  featureFrame: CGRect, 
  imageSize: CGSize, viewFrame: CGRect) 
  -> CGRect {
  let viewSize = viewFrame.size
    
  // 2
  let resolutionView = viewSize.width / viewSize.height
  let resolutionImage = imageSize.width / imageSize.height
    
  // 3
  var scale: CGFloat
  if resolutionView > resolutionImage {
    scale = viewSize.height / imageSize.height
  } else {
    scale = viewSize.width / imageSize.width
  }
    
  // 4
  let featureWidthScaled = featureFrame.size.width * scale
  let featureHeightScaled = featureFrame.size.height * scale
    
  // 5
  let imageWidthScaled = imageSize.width * scale
  let imageHeightScaled = imageSize.height * scale
  let imagePointXScaled = (viewSize.width - imageWidthScaled) / 2
  let imagePointYScaled = (viewSize.height - imageHeightScaled) / 2
    
  // 6
  let featurePointXScaled = imagePointXScaled + featureFrame.origin.x * scale
  let featurePointYScaled = imagePointYScaled + featureFrame.origin.y * scale
    
  // 7
  return CGRect(x: featurePointXScaled,
                y: featurePointYScaled,
                width: featureWidthScaled,
                height: featureHeightScaled)
}

Here’s what’s going on in the code:

  1. This method takes in the frame of the detected element (a CGRect), the size of the original image (a CGSize) and the frame of the UIImageView (a CGRect).
  2. The resolutions (aspect ratios) of the view and the image are calculated by dividing each one’s width by its height.
  3. The scale is determined by comparing the two resolutions. If the view’s resolution is the larger one, Aspect Fit constrains the image by height, so you scale by height; otherwise, you scale by width.
  4. The width and height of the element’s frame are multiplied by the scale to produce the scaled width and height.
  5. The image’s position inside the view must be accounted for as well. Aspect Fit centers the scaled image in the view, so these lines calculate the image’s scaled size and its centered x and y offsets. Without that offset, the frames would sit in the wrong position even if their sizes were correct.
  6. The element’s new origin is calculated by scaling its original origin and adding the image offsets from the previous step.
  7. A scaled CGRect is returned, configured with calculated origin and size.
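
If you’d like to see the numbers, here’s a standalone sketch of the same math using made-up sizes — a 654×999 image shown with Aspect Fit in a hypothetical 375×667 view, with an arbitrary element frame:

import CoreGraphics

// Hypothetical sizes: a 654×999 image shown Aspect Fit in a 375×667 view.
let imageSize = CGSize(width: 654, height: 999)
let viewSize = CGSize(width: 375, height: 667)
let featureFrame = CGRect(x: 40, y: 100, width: 200, height: 30)

// Compare aspect ratios to pick the scale factor (steps 2 and 3).
let resolutionView = viewSize.width / viewSize.height    // ≈ 0.56
let resolutionImage = imageSize.width / imageSize.height // ≈ 0.65
let scale = resolutionView > resolutionImage
  ? viewSize.height / imageSize.height
  : viewSize.width / imageSize.width                     // ≈ 0.57

// Aspect Fit centers the scaled image, so work out its offset in the view (step 5).
let imageOriginX = (viewSize.width - imageSize.width * scale) / 2   // 0
let imageOriginY = (viewSize.height - imageSize.height * scale) / 2 // ≈ 47

// Scale the element frame and shift it by the image's offset (steps 4, 6 and 7).
let scaledFrame = CGRect(
  x: imageOriginX + featureFrame.origin.x * scale, // ≈ 23
  y: imageOriginY + featureFrame.origin.y * scale, // ≈ 104
  width: featureFrame.size.width * scale,          // ≈ 115
  height: featureFrame.size.height * scale)        // ≈ 17
print(scaledFrame)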

Now that you have a scaled CGRect, you can go from scribbles to sgraffito. Yes, that’s a thing. Look it up and thank me in your next Scrabble game.

Go to process(in:callback:) in ScaledElementProcessor.swift and modify the innermost for loop to use the following code:

for element in line.elements {
  let frame = self.createScaledFrame(
    featureFrame: element.frame,
    imageSize: image.size, 
    viewFrame: imageView.frame)
  
  let shapeLayer = self.createShapeLayer(frame: frame)
  let scaledElement = ScaledElement(frame: frame, shapeLayer: shapeLayer)
  scaledElements.append(scaledElement)
}

The newly added code creates a scaled frame, which is then used to create a correctly positioned shape layer.

Build and run. You should see the frames drawn in the right places. What a master painter you are!

Frames that are scaled to the image

Enough with default photos; time to capture something from the wild!

Taking Photos with the Camera

The project has the camera and library picker code already set up in an extension at the bottom of ViewController.swift. If you try to use it right now, you’ll notice that none of the frames match up. That’s because it’s still using the old frames from the preloaded image! You need to remove those and draw new ones when you take or select a photo.

Add the following method to ViewController:

private func removeFrames() {
  guard let sublayers = frameSublayer.sublayers else { return }
  for sublayer in sublayers {
    sublayer.removeFromSuperlayer()
  }
}

This method removes all sublayers from the frame sublayer using a for loop. This gives you a clean canvas for the next photo.

To consolidate the detection code, add the following new method to ViewController:

// 1
private func drawFeatures(
  in imageView: UIImageView, 
  completion: (() -> Void)? = nil
  ) {
  // 2
  removeFrames()
  processor.process(in: imageView) { text, elements in
    elements.forEach { element in
      self.frameSublayer.addSublayer(element.shapeLayer)
    }
    self.scannedText = text
    // 3
    completion?()
  }
}

Here’s what changed:

  1. This method takes in the UIImageView and an optional completion callback so that you know when it’s done.
  2. Frames are automatically removed before processing a new image.
  3. Trigger the completion callback once everything is done.
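
If you ever need to run code after the frames are drawn, the completion makes that straightforward — a trivial, hypothetical example:

drawFeatures(in: imageView) {
  print("Finished drawing text frames")
}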

Now, replace the call to processor.process(in:callback:) in viewDidLoad() with the following:

drawFeatures(in: imageView)

Scroll down to the class extension and locate imagePickerController(_:didFinishPickingMediaWithInfo:); add this line of code to the end of the if block, after imageView.image = pickedImage:

drawFeatures(in: imageView)

When you shoot or select a new photo, this code ensures that the old frames are removed and replaced by the ones from the new photo.

Build and run. If you’re on a real device (not a simulator), take a picture of printed text. You might see something strange:

Gibberish text detection

What’s going on here?

That garbled result is an image orientation issue, and you’ll fix it next.

Dealing With Image Orientations

This app is locked in portrait orientation. It’s tricky to redraw the frames when the device rotates, so it’s easier to restrict the user for now.

This restriction requires the user to take portrait photos. Behind the scenes, UIImagePickerController stores portrait photos rotated 90 degrees. You don’t see the rotation because the UIImageView uses the image’s orientation metadata to display it upright. The detector, however, receives the rotated UIImage data.

The rotated photo

This leads to some confusing results. ML Kit lets you specify the orientation of the photo in a VisionImageMetadata object. Setting the proper orientation returns the correct text, but the frames are still drawn for the rotated photo.

This is how ML Kit sees the photo, so the frames are drawn incorrectly.
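
For reference, here’s a rough sketch of what that alternative would look like inside ScaledElementProcessor.swift, which already imports the Firebase ML Kit module. Treat it as illustrative only — rotatedImage stands in for the picker’s photo, and the tutorial takes a different route below:

// Illustrative only: describe the rotation to ML Kit instead of fixing the pixels.
let metadata = VisionImageMetadata()
// .rightTop is the EXIF-style value for a portrait photo that the camera
// stored rotated 90 degrees (UIImage.Orientation.right).
metadata.orientation = .rightTop
let visionImage = VisionImage(image: rotatedImage)
visionImage.metadata = metadata
// The recognizer would now return the correct text, but element frames
// would still be expressed in the rotated photo's coordinate space.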

Therefore, you need to fix the photo orientation to always be in the “up” position. The project contains an extension named +UIImage.swift. This extension adds a method to UIImage that changes the orientation of any photo to the up position. Once the photo is in the correct orientation, everything will run smoothly!
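
The project’s implementation of that helper isn’t shown here, but a typical version redraws the image into a new graphics context, which bakes the orientation metadata into the pixels. A minimal sketch, for reference only:

import UIKit

extension UIImage {
  func fixOrientation() -> UIImage {
    // Nothing to do if the pixel data is already stored upright.
    guard imageOrientation != .up else { return self }
    // Drawing honors the orientation metadata, so the redrawn image's
    // pixels end up in the .up orientation.
    UIGraphicsBeginImageContextWithOptions(size, false, scale)
    defer { UIGraphicsEndImageContext() }
    draw(in: CGRect(origin: .zero, size: size))
    return UIGraphicsGetImageFromCurrentImageContext() ?? self
  }
}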

Open ViewController.swift and, in imagePickerController(_:didFinishPickingMediaWithInfo:), replace imageView.image = pickedImage with the following:

// 1
let fixedImage = pickedImage.fixOrientation()
// 2
imageView.image = fixedImage

Here’s what changed:

  1. The newly selected image, pickedImage, is rotated back to the up position.
  2. Then, you assign the rotated image to the imageView.
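
With both changes in place, the picker delegate ends up looking something like the sketch below — details such as the exact info-dictionary keys may differ in your starter project:

func imagePickerController(
  _ picker: UIImagePickerController,
  didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
  if let pickedImage = info[.originalImage] as? UIImage {
    // 1. Rotate the photo's pixels to the .up position.
    let fixedImage = pickedImage.fixOrientation()
    // 2. Display the corrected image.
    imageView.image = fixedImage
    // 3. Remove stale frames and draw new ones for this photo.
    drawFeatures(in: imageView)
  }
  dismiss(animated: true, completion: nil)
}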

Build and run. Take that photo again. You should see everything in the right place.

Working ML Kit frames