Natural Language Processing on iOS with Turi Create

In this Natural Language Processing tutorial, you’ll learn how to train a Core ML model from scratch, then use that model within an iOS application. By Michael Katz.


Natural Language Processing, or NLP, is the discipline of taking unstructured text and discerning some characteristics about it. To help you do this, Apple’s operating systems provide functions to understand text using the following techniques:

  • Language identification
  • Lemmatization (identification of the root form of a word)
  • Named entity recognition (proper names of people, places and organizations)
  • Parts of speech identification
  • Tokenization

In this tutorial, you’ll use tokenization and a custom machine learning model, or ML model, to identify the author of a given poem, or at least the poet the poem most closely emulates.

Note: This tutorial assumes you’re already familiar with the basics of iOS development and Swift. If you’re new to either of these topics, check out our iOS Development and Swift Language tutorials.

Getting Started

You may have already run into NLP in apps where sections of text are automatically turned into links or tags, or when text is automatically analyzed for emotional charge (called “sentiment” in the biz). NLP is widely used by Apple for data detectors, keyboard auto-suggest and Siri suggestions. Apps with search capabilities often use NLP to find related information and to efficiently transform input text into a canonical and, therefore, indexable form.

The app you’ll be building for this tutorial will take a poem and compare its text against a Core ML model that’s trained with the words from poems of famous, i.e. public-domain, authors. To build the Core ML model, you’ll be using Turi Create, an open-source Python project from Apple that creates and trains ML models.

Turi Create won’t make you a machine learning expert because it hides almost all the internal workings and mathematics involved. On the plus side, it means you don’t have to be a machine learning expert to use machine learning algorithms in your app!

App Overview

The first thing you need to do is download the starter app. You can find the Download Materials link at the top or bottom of this tutorial.

Inside the downloaded materials, you’ll find two project folders, a JSON file, and a Core ML model. Don’t worry about the JSON and ML model files; you’ll use those a bit later. Open the KeatsOrYeats-starter folder and fire up the KeatsOrYeats.xcodeproj inside.

Once the app is running, copy and paste in your favorite Yeats poem. Here’s an example, “On Being Asked for a War Poem,” by William Butler Yeats:

I think it better that in times like these
A poet's mouth be silent, for in truth
We have no gift to set a statesman right;
He has had enough of meddling who can please
A young girl in the indolence of her youth,
Or an old man upon a winter’s night.

Press Return to run the analysis. At the top, you’ll see the app’s results, indicating “P. Laureate” wrote the poem! The app reports this prediction with 50% confidence, along with a 10% confidence that the poem matches the works of “P. Inglorious”.

This is obviously not correct, but that’s because the results are hard-coded and there’s no actual analysis.

Download Every Known Poem

Sometimes it’s useful to start developing an app using a simple, brute-force approach. The first-order solution to author identification is to get a copy of every known poem, or at least known poems by a set list of poets. That way, the app can do a simple string compare and see if the poem matches any of the authors. As Robert Burns once said, “Easy peasy.”

Nice try, but there are two major problems. First, poems (especially older ones) don’t always have canonical formatting (like line breaks, spacing and punctuation), so it’s hard to do a blind string compare. Second, your full-featured app should identify which author the entered poem most resembles, even if the poem isn’t known to the app or isn’t actually a work by that author.

There’s got to be a better way… And there is! Machine learning lets you create models of text you can then use to classify never-before-seen text into a known category.

Intro to Machine Learning: Text-Style

There are many different algorithms covered under the umbrella of machine learning. This CGPGrey video gives an excellent layman’s introduction.

The main takeaway is that the resulting model is a mathematical black box: it takes input text, transforms it and produces a decision or, in this case, a probability that the text matches a given author. Inside that box is a series of weighted values that compute that probability. These weights are “discovered” (refined) over a series of epochs, where the weights are adjusted to reduce the overall error.

The simplest model is a linear regression, which fits a line to a series of points. You may be familiar with the old equation y = mx + b. In this case, you have a series of known x and y values, and training the model means figuring out the “m” (the weight, or slope) and “b” (the intercept).

In a standard training scenario, training starts with a guess for m and b, computes the error and then, over successive epochs, nudges those values closer and closer to the ones that minimize the error. When presented with a never-before-seen “x”, the model can then predict what the “y” value will be. Here is an in-depth article on how it works with Turi Create.
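To make that nudging concrete, here’s a minimal sketch of such a training loop in plain Python — a hand-rolled gradient descent on a handful of points, purely for illustration (Turi Create does all of this for you, and its real algorithms are more sophisticated):

```python
# Toy linear regression: recover m and b from points that lie on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

m, b = 0.0, 0.0   # initial guesses for the weight and intercept
lr = 0.02         # learning rate: how far to nudge each epoch

for epoch in range(5000):
    # Gradients of the mean squared error with respect to m and b
    grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    m -= lr * grad_m
    b -= lr * grad_b

# The trained model can now predict y for a never-before-seen x
prediction = m * 10 + b
```

After training, m is close to 2 and b is close to 1, so the prediction for x = 10 lands near 21.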

Of course, real-world models are far more complicated and take into account many different input variables.

Bag of Words

Machine learning inspects and analyzes an input’s features. Features in this context are the important or salient values about the input or, mathematically speaking, the independent variables in the computation. From the download materials, go ahead and open corpus.json, which will be the input file for training the model. Inside, you’ll see an array of JSON objects. Take a look at the first item:

{
    "title": "When You Are Old",
    "author": "William Butler Yeats",
    "text": "When you are old and grey and full of sleep,\nAnd nodding by the fire, take down this book,\nAnd slowly read, and dream of the soft look\nYour eyes had once, and of their shadows deep;\nHow many loved your moments of glad grace,\nAnd loved your beauty with love false or true,\nBut one man loved the pilgrim Soul in you,\nAnd loved the sorrows of your changing face;\nAnd bending down beside the glowing bars,\nMurmur, a little sadly, how Love fled\nAnd paced upon the mountains overhead\nAnd hid his face amid a crowd of stars."
}

In this case, a single “input” has three columns: title, author and text. The text column will be the only feature for the model, and title is not taken into account. The author is the class the model is tasked with computing, which is sometimes called the label or dependent variable.
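To see those roles in code, here’s a small stdlib-only sketch that parses one record in the same shape as corpus.json (with the poem text truncated) and separates the feature from the label:

```python
import json

# One record shaped like an entry in corpus.json (text shortened here)
record = json.loads('''{
    "title": "When You Are Old",
    "author": "William Butler Yeats",
    "text": "When you are old and grey and full of sleep"
}''')

feature = record["text"]    # the only feature the model will see
label = record["author"]    # the class (label) the model must predict
# record["title"] is carried along but never used for training
```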

If the whole text is used as the input, then the model basically becomes the naïve straight-up comparison discussed above. Instead, specific aspects of the text have to be fed into the model. The default way of handling text is as a bag of words, or BOW. Imagine breaking up all the text into its individual words and throwing them into a bag so they lose their context, ordering and sentence structure. This way, the only dimension that’s retained is the frequency of the collection of words.

In other words, the BOW is a map of words to word counts.
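A bag of words is easy to build yourself. Here’s a sketch using Python’s `collections.Counter` on two lines from the poem above (the regex tokenization is a simplification of what a real NLP pipeline does):

```python
from collections import Counter
import re

line = ("But one man loved the pilgrim Soul in you, "
        "And loved the sorrows of your changing face")

# Lowercase and split on anything that isn't a letter or apostrophe,
# discarding order, punctuation and sentence structure --
# only word frequencies survive.
words = re.findall(r"[a-z']+", line.lower())
bow = Counter(words)
```

Here `bow["loved"]` and `bow["the"]` are both 2, while every other word maps to 1.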

For this tutorial, each poem gets transformed into a BOW, with the assumption that one author will use similar words across different poems, and that other authors will tend toward different word choices.

Each word then becomes a dimension for optimizing the model. In this paltry example of 518 poems, there are 24,939 different words used.
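On a toy scale, the dimension count is simply the size of the combined vocabulary across the corpus — a sketch with two one-line “poems” (not how Turi Create stores things internally):

```python
from collections import Counter

poems = {
    "Yeats": "when you are old and grey and full of sleep",
    "Keats": "a thing of beauty is a joy for ever",
}

# One bag of words per poem
bows = {author: Counter(text.split()) for author, text in poems.items()}

# Every distinct word across the corpus is one dimension of the model
vocab = set()
for bow in bows.values():
    vocab.update(bow)
```

These two lines share only the word “of”, so the model would have 16 dimensions; the real corpus’ 518 poems yield 24,939.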