
15
Natural Language Transformation, Part 1
Written by Alexis Gallagher

The previous chapter showed you how to use Apple’s Natural Language framework to perform some useful NLP tasks. But Apple only covers the basics — there are many other things you might like to do with natural language. For example, you might answer questions, summarize documents or translate between languages.

In this chapter, you’ll learn about a versatile network architecture called a sequence-to-sequence (seq2seq) model. You’ll add one to the SMDB app you already built, using it to translate movie reviews from Spanish to English, but the same network design has been used for many types of problems, from question answering to generating image captions. Don’t worry if you didn’t already make SMDB — we provide a starter project if you need it. But you can forget about Xcode for a while — seq2seq models require a lower-level framework, so you’ll work with Python and Keras for most of this chapter.

Getting started

Some of the Keras code in this project was initially based on the example found in the file examples/lstm_seq2seq.py inside the Keras GitHub repository github.com/keras-team/keras. This chapter makes stylistic modifications, explains and expands on the code, and shows how to convert the models you build to Core ML and use them in an app.

To work through this chapter and the next, you’ll need access to a Python environment with keras, coremltools and various other packages installed. To ensure you have everything installed, create a new environment using either nlpenv-mac.yml or nlpenv-linux.yml, which you’ll find in projects/notebooks. If you have access to an Nvidia GPU, uncomment the tensorflow-gpu line in the .yml file to greatly increase training speed.

Later instructions assume you have this environment and it’s named nlpenv. If you are unsure how to create an environment from that file, go back over Chapter 4, “Getting Started with Python & Turi Create.”

Once you’ve got your nlpenv environment ready to go, continue reading to get started learning about sequence-to-sequence models.

The sequence-to-sequence model

Inside the chapter resources, you’ll find a text file named spa.txt in projects/notebooks/data/. This file comes originally from manythings.org at http://www.manythings.org/anki/, which provides sentence pairs for many different languages. These pairs were culled from an even larger dataset provided by the Tatoeba Project at www.tatoeba.org.

The first seven lines of spa.txt

Encoder-decoder models

There are multiple ways to accomplish this task. The network architecture you’ll use here is called a sequence-to-sequence, or seq2seq, model. At its most basic level, it works like this:

Text translation with seq2seq model

Seq2seq in depth

Digging a little deeper, the seq2seq model works with sequences both for inputs and outputs. That’s where it gets its name — it transforms a sequence to another sequence. To accomplish this, the encoder and decoder usually both rely on recurrent layers — specifically, this chapter uses the LSTM layer introduced in the sequence classification chapter.

Inference with seq2seq, through the first output token

Decoder portion of seq2seq model during inference

Teacher forcing

That is how inference works. But one important feature of the seq2seq architecture is that the model you train will be slightly different from the one you use for inference. During training, your model will actually process each sample like this:

Training a seq2seq model
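
As a rough illustration of teacher forcing, the sketch below pairs each character the decoder is given with the character it is expected to produce next. It is not the chapter's training code: the helper name teacher_forcing_pairs is hypothetical, and the tab and newline markers are assumed to play the role of start and stop tokens.

```python
START_TOKEN = "\t"  # assumed start-of-sequence marker
STOP_TOKEN = "\n"   # assumed end-of-sequence marker

def teacher_forcing_pairs(target_text):
    """Pair each decoder input character with the character it should predict."""
    full = START_TOKEN + target_text + STOP_TOKEN
    decoder_inputs = full[:-1]   # what the decoder is shown at each step
    decoder_targets = full[1:]   # the same sequence shifted left by one
    return list(zip(decoder_inputs, decoder_targets))

pairs = teacher_forcing_pairs("Hi.")
for seen, expected in pairs:
    print(repr(seen), "->", repr(expected))
```

The point of the shift is that the decoder always sees the correct previous character during training, even when its own prediction at that step would have been wrong.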

Prepare your dataset

First, you need to load your dataset. Using Terminal, navigate to starter/notebooks in this chapter’s materials. Activate your nlpenv environment and launch a new Jupyter notebook. Then run a cell with the following code to load the Spanish-English sequence pairs:

# 1
start_token = "\t"
stop_token = "\n"
# 2
with open("data/spa.txt", "r", encoding="utf-8") as f:
  samples = f.read().split("\n")
samples = [sample.strip().split("\t")
           for sample in samples if len(sample.strip()) > 0]
# 3
samples = [(es, start_token + en + stop_token)
           for en, es in samples if len(es) < 45]
The first two samples after loading the dataset
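
To see what the loading code produces without opening the file, here is a minimal sketch that runs the same three steps on an in-memory string. The two lines in raw_text are made up for illustration and only assume the English, tab, Spanish layout that the loading code expects.

```python
start_token = "\t"
stop_token = "\n"

# Illustrative stand-in for the file contents: English<TAB>Spanish per line.
raw_text = "Go.\tVe.\nHi.\tHola.\n"

# Split into lines, drop empty ones, then split each line on the tab.
rows = [line.strip().split("\t")
        for line in raw_text.split("\n") if len(line.strip()) > 0]

# Swap to (Spanish, English) and wrap the English text in start/stop tokens.
pairs = [(es, start_token + en + stop_token)
         for en, es in rows if len(es) < 45]
print(pairs)
```

Each sample ends up as a (Spanish input, English target) tuple, with the target wrapped in the start and stop tokens.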

In and out of vocabulary

If you’ve followed along with the book thus far, then you already know it’s best to have separate training, validation, and test sets when building machine learning models. Keras can randomly select samples from your training data to use for validation when you train your model, but you won’t rely on that here.

from sklearn.model_selection import train_test_split

train_samples, valid_samples = train_test_split(
  samples, train_size=.8, random_state=42)
# 1
in_vocab = set()
out_vocab = set()

for in_seq, out_seq in train_samples:
  in_vocab.update(in_seq)
  out_vocab.update(out_seq)
# 2
in_vocab_size = len(in_vocab)
out_vocab_size = len(out_vocab)

print(sorted(in_vocab))
print(sorted(out_vocab))
[' ', '!', '"', '$', '%', "'", '(', ')', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '¡', '«', '°', 'º', '»', '¿', 'Á', 'É', 'Ó', 'Ú', 'á', 'è', 'é', 'í', 'ñ', 'ó', 'ö', 'ú', 'ü', 'ś', 'с', '—', '€']
['\t', '\n', ' ', '!', '"', '$', '%', "'", ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '°', 'á', 'ã', 'è', 'é', 'ö', '‘', '’', '₂', '€']

tmp_samples = []
for in_seq, out_seq in valid_samples:
  tmp_in_seq = [c for c in in_seq if c in in_vocab]
  tmp_out_seq = [c for c in out_seq if c in out_vocab]
  tmp_samples.append(
    ("".join(tmp_in_seq), "".join(tmp_out_seq)))
valid_samples = tmp_samples
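
The effect of that filtering step is easier to see on a toy vocabulary. The following sketch assumes a tiny, made-up training set and a hypothetical helper named keep_known; it drops any character the training vocabulary never saw.

```python
# Toy training data, for illustration only.
train = [("hola", "\thello\n"), ("sol", "\tsun\n")]

in_vocab = set()
for src, _ in train:
    in_vocab.update(src)   # adds each character of the source text

def keep_known(text, vocab):
    """Remove characters that are not in the vocabulary."""
    return "".join(c for c in text if c in vocab)

filtered = keep_known("señal", in_vocab)
print(filtered)
```

Here "e" and "ñ" never appear in the toy training inputs, so they are silently removed. That keeps later lookups from failing, at the cost of changing the text the model sees.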

Build your model

In this section, you’ll use Keras to define a seq2seq model that translates Spanish text into English, one character at a time. Get started by importing the Keras functions you’ll need with the following code:

import keras
from keras.layers import Dense, Input, LSTM, Masking
from keras.models import Model

Building the encoder

Now, let’s start building the model. Run the following code in your notebook to define the encoder portion of the seq2seq model:

# 1
latent_dim = 256
# 2
encoder_in = Input(
  shape=(None, in_vocab_size), name="encoder_in")
# 3
encoder_mask = Masking(name="encoder_mask")(encoder_in)
# 4
encoder_lstm = LSTM(
  latent_dim, return_state=True, recurrent_dropout=0.3,
  name="encoder_lstm")
# 5
_, encoder_h, encoder_c = encoder_lstm(encoder_mask)

Building the decoder

Your decoder definition will look quite similar to that of your encoder, with a few minor but important differences. Run a cell with the following code now:

# 1
decoder_in = Input(
  shape=(None, out_vocab_size), name="decoder_in")
decoder_mask = Masking(name="decoder_mask")(decoder_in)
# 2
decoder_lstm = LSTM(
  latent_dim, return_sequences=True, return_state=True,
  dropout=0.2, recurrent_dropout=0.3, name="decoder_lstm")
# 3
decoder_lstm_out, _, _ = decoder_lstm(
  decoder_mask, initial_state=[encoder_h, encoder_c])
# 4
decoder_dense = Dense(
  out_vocab_size, activation="softmax", name="decoder_out")
decoder_out = decoder_dense(decoder_lstm_out)

Connecting the encoder and decoder

With the encoder and decoder defined, run the following code to combine them into a seq2seq model:

# 1
seq2seq_model = Model([encoder_in, decoder_in], decoder_out)
# 2
seq2seq_model.compile(
  optimizer="rmsprop", loss="categorical_crossentropy")
seq2seq_model.summary()
Keras seq2seq model for training
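
If you want to sanity-check the parameter counts that summary() reports, the standard formulas for LSTM and Dense layers are enough. The sketch below is plain arithmetic, not Keras code. The vocabulary sizes 101 and 87 come from counting the entries in the two vocabularies printed earlier, so they are an assumption that only holds if your split produces the same character sets.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair, plus one bias per output.
    return input_dim * units + units

# Assumed sizes: counted from the printed vocabularies; yours may differ.
in_vocab_size, out_vocab_size, latent_dim = 101, 87, 256

encoder_params = lstm_param_count(in_vocab_size, latent_dim)
decoder_params = lstm_param_count(out_vocab_size, latent_dim)
dense_params = dense_param_count(latent_dim, out_vocab_size)
print(encoder_params, decoder_params, dense_params)
```

The Masking and Input layers contribute no trainable parameters, so these three numbers should account for the whole model.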

Train your model

So far, you’ve defined your model’s architecture in Keras and loaded a dataset. But before you can train with that data, you need to do a bit more preparation.

Numericalization

OK, full disclosure: Neural networks can’t process text. It might seem like a bad time to bring this up, well into a chapter about natural language processing with neural networks, but there it is. Remember from what you learned elsewhere in this book: Neural networks are really just a bunch of math, and that means they only work with numbers. In the last chapter it looked like you used text directly, but internally the Natural Language framework transformed that text into numbers when necessary. This process is sometimes called numericalization. Now you’ll learn one way to perform such conversions yourself.

# 1
in_token2int = {token : i
                for i, token in enumerate(sorted(in_vocab))}
# 2
out_token2int = {token : i
                 for i, token in enumerate(sorted(out_vocab))}
out_int2token = {i : token
                 for token, i in out_token2int.items()}
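
Here is a small round trip through dictionaries built the same way, using a made-up six-character vocabulary so the result is easy to follow.

```python
# Illustrative vocabulary: start token, stop token and four letters.
vocab = set("\thola\n")

token2int = {token: i for i, token in enumerate(sorted(vocab))}
int2token = {i: token for token, i in token2int.items()}

encoded = [token2int[c] for c in "hola"]
decoded = "".join(int2token[i] for i in encoded)
print(encoded, decoded)
```

Because tab and newline sort ahead of printable characters, the start and stop tokens land at indices 0 and 1 in this sketch.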

One-hot encoding

While neural networks require numeric input, they don’t want just any numbers. In this case, the numbers are stand-ins for text. But if you use these values as is, it will confuse the network because it appears as though some ordinal relationship exists that doesn’t. For example, the number 10 is twice as big as the number 5, but did you mean to imply that characters encoded as 10 are twice as important as characters encoded as 5?

One-hot encoded values
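
A minimal pure-Python sketch of one-hot encoding follows. It uses nested lists instead of NumPy arrays, and the three-character mapping is hypothetical.

```python
def one_hot(text, token2int):
    """Encode each character as a vector with a single 1.0 at its index."""
    rows = []
    for ch in text:
        row = [0.0] * len(token2int)
        row[token2int[ch]] = 1.0
        rows.append(row)
    return rows

token2int = {"a": 0, "b": 1, "c": 2}   # hypothetical mapping
encoded = one_hot("cab", token2int)
for row in encoded:
    print(row)
```

Every row has the same magnitude, so no character looks numerically larger or more important than another.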

Batching and padding

To keep things in more manageable chunks, you’ll split the logic to one-hot encode training batches into two functions. The first will create appropriately sized NumPy arrays filled with zeros, and the second will place ones into those arrays at the correct locations to encode the sequences.

import numpy as np

def make_batch_storage(batch_size, in_seq_len, out_seq_len):
  enc_in_seqs = np.zeros(
    (batch_size, in_seq_len, in_vocab_size),
    dtype=np.float32)
  dec_in_seqs = np.zeros(
    (batch_size, out_seq_len, out_vocab_size),
    dtype=np.float32)
  dec_out_seqs = np.zeros(
    (batch_size, out_seq_len, out_vocab_size),
    dtype=np.float32)

  return enc_in_seqs, dec_in_seqs, dec_out_seqs
Mixed-length sequences without padding

Mixed-length sequences with padding
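
The sketch below pads a batch of one-hot sequences with all-zero rows and then marks which time steps are real. The helper names are hypothetical. The is_padding check mirrors what a Keras Masking layer does with its default mask value of 0: a time step is skipped when every value in it equals the mask value.

```python
def pad_one_hot_batch(batch, vocab_size):
    """Pad every sequence with all-zero rows up to the longest length."""
    max_len = max(len(seq) for seq in batch)
    return [seq + [[0.0] * vocab_size for _ in range(max_len - len(seq))]
            for seq in batch]

def is_padding(row):
    # An all-zero row can never be a real one-hot character.
    return all(value == 0.0 for value in row)

batch = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],  # three characters
    [[0.0, 1.0]],                          # one character
]
padded = pad_one_hot_batch(batch, vocab_size=2)
mask = [[not is_padding(row) for row in seq] for seq in padded]
print(mask)
```

Padding only to the longest sequence in each batch, rather than the longest in the whole dataset, keeps the arrays as small as possible.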

def encode_batch(samples):
  # 1
  batch_size = len(samples)
  max_in_length = max([len(seq) for seq, _ in samples])
  max_out_length = max([len(seq) for _, seq in samples])

  enc_in_seqs, dec_in_seqs, dec_out_seqs = \
    make_batch_storage(
      batch_size, max_in_length, max_out_length)
  # 2
  for i, (in_seq, out_seq) in enumerate(samples):
    for time_step, token in enumerate(in_seq):
      enc_in_seqs[i, time_step, in_token2int[token]] = 1

    for time_step, token in enumerate(out_seq):
      dec_in_seqs[i, time_step, out_token2int[token]] = 1
    # 3
    for time_step, token in enumerate(out_seq[1:]):
      dec_out_seqs[i, time_step, out_token2int[token]] = 1

  return enc_in_seqs, dec_in_seqs, dec_out_seqs
from seq2seq_util import Seq2SeqBatchGenerator

batch_size = 64
train_generator = Seq2SeqBatchGenerator(
  train_samples, batch_size, encode_batch)
valid_generator = Seq2SeqBatchGenerator(
  valid_samples, batch_size, encode_batch)

Training with early stopping

Warning: Running the following cell will take considerable time. Expect it to run for multiple hours even with a GPU. If you don’t want to wait that long, change the epochs value to something small, like 10 or even just one or two. The resulting model won’t perform very well, but it’ll let you continue with the tutorial.

# 1
from keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(
  monitor="val_loss", patience=5, restore_best_weights=True)
# 2
seq2seq_model.fit_generator(
  train_generator, validation_data=valid_generator,
  epochs=500, callbacks=[early_stopping])
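
To build some intuition for the patience setting, here is a simplified model of patience-based early stopping. It is a sketch of the idea, not Keras's implementation, and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch) as zero-based epoch indices."""
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]  # made-up validation losses
stop, best = early_stopping_epoch(losses, patience=3)
print(stop, best)
```

Training stops three epochs after the last improvement. Restoring the best weights then corresponds to going back to the epoch reported as best rather than keeping the final, worse one.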

Inference with sequence-to-sequence models

The model you’ve trained so far isn’t actually useful for inference — at least, not in its current form. Why is that? Because the decoder portion of the model requires the correctly translated text as one of its inputs! What good is a translation model that needs you to do the translations?

Assembling an inference model

First, separate the encoder and decoder into two models. Keras makes this easy. You declare a new Model and pass it the input and output layers you want to use, like this:

inf_encoder = Model(encoder_in, [encoder_h, encoder_c])
inf_encoder.summary()
Keras encoder model for inference

# 1
inf_dec_h_in = Input(shape=(latent_dim,), name="decoder_h_in")
inf_dec_c_in = Input(shape=(latent_dim,), name="decoder_c_in")
# 2
inf_dec_lstm_out, inf_dec_h_out, inf_dec_c_out = decoder_lstm(
  decoder_in, initial_state=[inf_dec_h_in, inf_dec_c_in])
# 3
inf_dec_out = decoder_dense(inf_dec_lstm_out)
# 4
inf_decoder = Model(
  [decoder_in, inf_dec_h_in, inf_dec_c_in],
  [inf_dec_out, inf_dec_h_out, inf_dec_c_out])
inf_decoder.summary()
Keras decoder model for inference

max_out_seq_len = max(len(seq) for _, seq in samples)
start_token_idx = out_token2int[start_token]
stop_token_idx = out_token2int[stop_token]

Running inference

With those constants defined, you’re ready to actually use your models to translate text. Define the following function in your notebook. It takes a one-hot encoded sequence, such as the ones encode_batch creates, along with an encoder-decoder model pair, and returns the sequence’s translation:

def translate_sequence(one_hot_seq, encoder, decoder):
  # 1
  encoding = encoder.predict(one_hot_seq)
  # 2
  decoder_in = np.zeros(
    (1, 1, out_vocab_size), dtype=np.float32)
  # 3
  translated_text = ""
  done_decoding = False
  decoded_idx = start_token_idx
  while not done_decoding:
    # 4
    decoder_in[0, 0, decoded_idx] = 1
    # 5
    decoding, h, c = decoder.predict([decoder_in] + encoding)
    # 6
    encoding = [h, c]
    # 7
    decoder_in[0, 0, decoded_idx] = 0
    # 8
    decoded_idx = np.argmax(decoding[0, -1, :])
    # 9
    if decoded_idx == stop_token_idx:
      done_decoding = True
    else:
      translated_text += out_int2token[decoded_idx]
    # 10
    if len(translated_text) >= max_out_seq_len:
      done_decoding = True

  return translated_text
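
The control flow of that loop can be tried without a trained model. The sketch below swaps the Keras decoder for a hypothetical toy_step function that simply spells out whatever text it is handed as state. The greedy loop around it has the same shape: feed the last token and state, take the highest-scoring token, and stop on the stop token or at the length limit.

```python
START, STOP = "\t", "\n"

def greedy_decode(step_fn, initial_state, max_len):
    """Repeatedly pick the highest-scoring next token until STOP or max_len."""
    state = initial_state
    token = START
    output = []
    while len(output) < max_len:
        scores, state = step_fn(token, state)
        token = max(scores, key=scores.get)  # argmax over the candidates
        if token == STOP:
            break
        output.append(token)
    return "".join(output)

def toy_step(token, state):
    # Hypothetical stand-in for a decoder: the state is the text left to emit.
    target = state[0] if state else STOP
    scores = {ch: 0.0 for ch in "abc" + STOP}
    scores[target] = 1.0
    return scores, state[1:]

full = greedy_decode(toy_step, "cab", max_len=10)
clipped = greedy_decode(toy_step, "cab", max_len=2)
print(full, clipped)
```

The length limit matters because a real decoder is not guaranteed to ever predict the stop token.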
from seq2seq_util import test_predictions

test_predictions(valid_samples[:100],
                 inf_encoder, inf_decoder,
                 encode_batch, translate_sequence)
Great results on validation samples: Source, Target, Model Output

Good results on validation samples: Source, Target, Model Output

Bad results on validation samples: Source, Target, Model Output

Almost-right-but-horribly-wrong results on validation samples: Source, Target, Model Output

Converting your model to Core ML

So far, you’ve used teacher forcing to train a Keras seq2seq model to translate Spanish text to English, then you used those trained layers to create separate encoder and decoder models that work without you needing to provide them with the correct translation. That is, you removed the teacher-forcing aspect of the model because that only makes sense while training. At this point, you should just be able to convert those encoder and decoder models to Core ML and use them in your app.

# 1
coreml_enc_in = Input(
  shape=(None, in_vocab_size), name="encoder_in")
coreml_enc_lstm = LSTM(
  latent_dim, return_state=True, name="encoder_lstm")
coreml_enc_out, _, _ = coreml_enc_lstm(coreml_enc_in)
coreml_encoder_model = Model(coreml_enc_in, coreml_enc_out)
# 2
coreml_encoder_model.output_layers = \
  coreml_encoder_model._output_layers
# 3
inf_encoder.save_weights("Es2EnCharEncoderWeights.h5")
coreml_encoder_model.load_weights("Es2EnCharEncoderWeights.h5")

import coremltools

coreml_encoder = coremltools.converters.keras.convert(
  coreml_encoder_model,
  input_names="encodedSeq", output_names="ignored")
coreml_encoder.save("Es2EnCharEncoder.mlmodel")

coreml_dec_in = Input(shape=(None, out_vocab_size))
coreml_dec_lstm = LSTM(
  latent_dim, return_sequences=True, return_state=True,
  name="decoder_lstm")
coreml_dec_lstm_out, _, _ = coreml_dec_lstm(coreml_dec_in)
coreml_dec_dense = Dense(out_vocab_size, activation="softmax")
coreml_dec_out = coreml_dec_dense(coreml_dec_lstm_out)
coreml_decoder_model = Model(coreml_dec_in, coreml_dec_out)

coreml_decoder_model.output_layers = \
  coreml_decoder_model._output_layers

inf_decoder.save_weights("Es2EnCharDecoderWeights.h5")
coreml_decoder_model.load_weights("Es2EnCharDecoderWeights.h5")
coreml_decoder = coremltools.converters.keras.convert(
  coreml_decoder_model,
  input_names="encodedChar", output_names="nextCharProbs")
coreml_decoder.save("Es2EnCharDecoder.mlmodel")

Quantization

The models you’ve saved are fine for use in an iOS app, but there’s one more simple step you should always consider. With apps, download size matters. Your model stores its weights and biases as 32-bit floats. But you could use 16-bit floats instead. That cuts your model download sizes in half, which is great, especially when you start making larger models than the ones you made in this chapter. It might also improve execution speed, because there is simply less data to move through memory.

def convert_to_fp16(mlmodel_filename):
  basename = mlmodel_filename[:-len(".mlmodel")]
  spec = coremltools.utils.load_spec(mlmodel_filename)

  spec_16bit = coremltools.utils.\
    convert_neural_network_spec_weights_to_fp16(spec)

  coremltools.utils.save_spec(
    spec_16bit, f"{basename}16Bit.mlmodel")
convert_to_fp16("Es2EnCharEncoder.mlmodel")
convert_to_fp16("Es2EnCharDecoder.mlmodel")
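
To get a feel for the size and precision trade-off, the following sketch packs a few invented weights as 32-bit and 16-bit floats using Python's struct module. It only illustrates the number formats and says nothing about how coremltools stores weights internally.

```python
import struct

weights = [0.1, -1.5, 3.14159, 0.0005]  # made-up weight values

# "f" is a 32-bit float and "e" is a 16-bit (half precision) float.
as_fp32 = struct.pack(f"<{len(weights)}f", *weights)
as_fp16 = struct.pack(f"<{len(weights)}e", *weights)

# Read the half-precision bytes back to see how much was lost.
restored = struct.unpack(f"<{len(weights)}e", as_fp16)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(as_fp32), len(as_fp16), max_error)
```

The 16-bit buffer is half the size, and the rounding error stays small for values in the typical range of trained weights. It is still worth checking translation quality after converting.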

Numericalization dictionaries

One last thing: When you use your models in your iOS app, you’ll need to do the same one-hot encoding you did here to convert input sequences from Spanish characters into the integers your encoder expects, and then convert your decoder’s numerical output into English characters.

import json

with open("esCharToInt.json", "w") as f:
  json.dump(in_token2int, f)
with open("intToEnChar.json", "w") as f:
  json.dump(out_int2token, f)
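
One detail to keep in mind when reading these files back: JSON object keys are always strings, so the integer keys of out_int2token are written as "0", "1" and so on. The sketch below shows the round trip on a tiny, made-up dictionary and one way to restore integer keys.

```python
import json

out_int2token = {0: "\t", 1: "\n", 2: "a"}  # small illustrative mapping

# Serialize and parse again, as the app's loader would have to.
restored = json.loads(json.dumps(out_int2token))
print(sorted(restored))

# Convert the string keys back to integers before using them as indices.
as_ints = {int(key): value for key, value in restored.items()}
print(as_ints == out_int2token)
```

Whatever code loads intToEnChar.json needs to perform a similar conversion before looking up characters by integer index.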

Using your model in iOS

Most of this chapter has been about understanding and building sequence-to-sequence models for translating natural language. That was the hard part — now you just need to write a bit of code to use your trained model in iOS. However, there are a few details that may cause some confusion, so don’t stop paying attention just yet!

Looking at the encoder mlmodel file

Looking at the decoder mlmodel file

let esCharToInt = loadCharToIntJsonMap(from: "esCharToInt")
let intToEnChar = loadIntToCharJsonMap(from: "intToEnChar")

import CoreML
func getEncoderInput(_ text: String) -> MLMultiArray? {
  // 1
  let cleanedText = text
    .filter { esCharToInt.keys.contains($0) }
  
  if cleanedText.isEmpty {
    return nil
  }
  
  // 2
  let vocabSize = esCharToInt.count
  let encoderIn = initMultiArray(
    shape: [NSNumber(value: cleanedText.count),
            1,
            NSNumber(value: vocabSize)])

  // 3
  for (i, c) in cleanedText.enumerated() {
    encoderIn[i * vocabSize + esCharToInt[c]!] = 1
  }
  
  return encoderIn
}
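
The index arithmetic in step 3 of getEncoderInput can be checked in isolation. This Python sketch assumes the multi-array is stored contiguously in row-major order with shape [sequence length, 1, vocabulary size], which is what indexing it with a single linear offset relies on. The helper names are hypothetical.

```python
def flat_index(time_step, token_index, vocab_size):
    # Row-major offset into a [seq_len, 1, vocab_size] array.
    return time_step * vocab_size + token_index

def one_hot_flat(token_indices, vocab_size):
    """Build the flattened one-hot buffer for a sequence of token indices."""
    flat = [0.0] * (len(token_indices) * vocab_size)
    for time_step, token_index in enumerate(token_indices):
        flat[flat_index(time_step, token_index, vocab_size)] = 1.0
    return flat

flat = one_hot_flat([2, 0], vocab_size=3)
print(flat)
```

Each character occupies its own block of vocab_size slots, so multiplying the time step by the vocabulary size jumps to the start of that character's block.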
func getDecoderInput(encoderInput: MLMultiArray) ->
  Es2EnCharDecoder16BitInput {
  // 1
  let encoder = Es2EnCharEncoder16Bit()
  let encoderOut = try! encoder.prediction(
    encodedSeq: encoderInput,
    encoder_lstm_h_in: nil,
    encoder_lstm_c_in: nil)
  // 2
  let decoderIn = initMultiArray(
    shape: [NSNumber(value: intToEnChar.count)])
  // 3
  return Es2EnCharDecoder16BitInput(
    encodedChar: decoderIn,
    decoder_lstm_h_in: encoderOut.encoder_lstm_h_out,
    decoder_lstm_c_in: encoderOut.encoder_lstm_c_out)
}

let maxOutSequenceLength = 87
let startTokenIndex = 0
let stopTokenIndex = 1

// 1
guard let encoderIn = getEncoderInput(text) else {
  return nil
}
// 2
let decoderIn = getDecoderInput(encoderInput: encoderIn)
// 3
let decoder = Es2EnCharDecoder16Bit()
var translatedText: [Character] = []
var doneDecoding = false
var decodedIndex = startTokenIndex
while !doneDecoding {
  // 1
  decoderIn.encodedChar[decodedIndex] = 1
  // 2
  let decoderOut = try! decoder.prediction(input: decoderIn)
  // 3
  decoderIn.decoder_lstm_h_in = decoderOut.decoder_lstm_h_out
  decoderIn.decoder_lstm_c_in = decoderOut.decoder_lstm_c_out
  // 4
  decoderIn.encodedChar[decodedIndex] = 0
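}

// The remaining steps belong inside the while loop above, right after
// step 4, as the complete spanishToEnglish function shows.
// 1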
decodedIndex = argmax(array: decoderOut.nextCharProbs)
// 2
if decodedIndex == stopTokenIndex {
  doneDecoding = true
} else {
  translatedText.append(intToEnChar[decodedIndex]!)
}
// 3
if translatedText.count >= maxOutSequenceLength {
  doneDecoding = true
}
// After the while loop ends:
return String(translatedText)

func spanishToEnglish(text: String) -> String? {
  guard let encoderIn = getEncoderInput(text) else {
    return nil
  }

  let decoderIn = getDecoderInput(encoderInput: encoderIn)

  let decoder = Es2EnCharDecoder16Bit()
  var translatedText: [Character] = []
  var doneDecoding = false
  var decodedIndex = startTokenIndex
  
  while !doneDecoding {
    decoderIn.encodedChar[decodedIndex] = 1

    let decoderOut = try! decoder.prediction(input: decoderIn)
    decoderIn.decoder_lstm_h_in = decoderOut.decoder_lstm_h_out
    decoderIn.decoder_lstm_c_in = decoderOut.decoder_lstm_c_out
    decoderIn.encodedChar[decodedIndex] = 0
    
    decodedIndex = argmax(array: decoderOut.nextCharProbs)
    if decodedIndex == stopTokenIndex {
      doneDecoding = true
    } else {
      translatedText.append(intToEnChar[decodedIndex]!)
    }

    if translatedText.count >= maxOutSequenceLength {
      doneDecoding = true
    }
  }
  
  return String(translatedText)
}

// NLTokenizer requires `import NaturalLanguage` at the top of the file.
func getSentences(text: String) -> [String] {
  let tokenizer = NLTokenizer(unit: .sentence)
  tokenizer.string = text
  let sentenceRanges = tokenizer.tokens(
    for: text.startIndex..<text.endIndex)
  return sentenceRanges.map { String(text[$0]) }
}
SMDB app with translated reviews

Let’s talk translation quality

Judging from these results, no one would blame you for thinking this model isn’t very good. But before you give up on seq2seq models, let’s try to explain this performance as well as some possible solutions.

Key points

Where to go from here?

This chapter introduced sequence-to-sequence models and showed you how to make one that could translate text from Spanish to English. Sometimes. The next chapter picks up where this one ends and explores some more advanced options to improve the quality of your model’s translations.

© 2020 Razeware LLC