
Natural Language Transformation, Part 2
Written by Alexis Gallagher

The previous chapter introduced sequence-to-sequence models, and you built one that (sort of) translated Spanish text to English. This chapter introduces other techniques that can improve performance for such tasks. It picks up where you left off, so continue using the same nlpenv environment and SMDB project you already made. It’s inadvisable to read this chapter without first completing that one, but if you’d like a clean starter project, use the final version of SMDB found in Chapter 15’s resources.

Bidirectional RNNs

Your original model predicts the next character using only the characters that appear before it in the sentence. But is that really how people read? Consider the following two English sentences and their Spanish translations (according to Google Translate):

Examples where context after a word matters

The first five words are the same in the English versions of both sentences, but only the first two words end up the same in the Spanish translations. That’s because the meaning of the word “bank” is different in each sentence, but you cannot know that until you’ve read past that word in the sentence. That is, its meaning comes from its context, including the words both before and after it.

In order to consider the full context surrounding each token, you can use what’s called a bidirectional recurrent neural network (BRNN), which processes sequences in both directions, like this:

Bidirectional RNN
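To make the idea concrete, here is a minimal sketch in plain Python rather than the Keras layers used later in this chapter. It assumes a toy one-unit recurrent cell with made-up weights, and the names `rnn_pass` and `encoder_summary` are hypothetical. It shows that two sequences sharing a prefix produce identical forward states over that prefix, while the reverse pass already reflects the tokens that come later.

```python
import math


def rnn_pass(inputs, w_x=0.5, w_h=0.8, reverse=False):
    """Run a toy one-unit RNN and return the hidden state at each position."""
    positions = range(len(inputs))
    order = reversed(positions) if reverse else positions
    states = [0.0] * len(inputs)
    h = 0.0
    for t in order:
        h = math.tanh(w_x * inputs[t] + w_h * h)
        states[t] = h
    return states


def bidirectional_states(inputs):
    """Pair each position's forward state with its reverse state."""
    fwd = rnn_pass(inputs)
    rev = rnn_pass(inputs, reverse=True)
    return list(zip(fwd, rev))


def encoder_summary(inputs):
    """Final state of each direction: forward ends at the last token,
    reverse ends at the first token."""
    states = bidirectional_states(inputs)
    return (states[-1][0], states[0][1])


# Two sequences that differ only in their last element.
a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 2.0, 3.0, -4.0]
states_a = bidirectional_states(a)
states_b = bidirectional_states(b)

# At position 2, the forward states match but the reverse states do not.
print(states_a[2][0] == states_b[2][0])
print(states_a[2][1] == states_b[2][1])
```

The forward state at position 2 has only seen the shared prefix, so it cannot tell the two sequences apart. The reverse state at the same position has already processed the differing final element.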

The forward and reverse layers themselves can be any recurrent type, such as the LSTMs you’ve worked with elsewhere in this book. However, in this chapter, you’ll use a new type called a gated recurrent unit, or GRU.

GRUs were invented after LSTMs and were meant to serve the same purpose of learning longer-term relationships while training more easily than standard recurrent layers. Internally, they are implemented differently from LSTMs, but, from a user’s standpoint, the main difference is that they do not have separate hidden and cell states. Instead, they only have hidden states, which makes them a bit less complicated to work with when you have to manage state directly — like you do with the decoder in a seq2seq model.
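The single-state design is easier to see in code. The following is a rough sketch of one common formulation of a GRU step, reduced to a single unit with arbitrary weights so it runs in plain Python. The parameter names are hypothetical, and the Keras implementation differs in its details (for example, conventions vary on whether the update gate weights the old or the new state).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(x, h, p):
    """One step of a single-unit GRU. Returns only the new hidden state."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])  # reset gate
    # Candidate state: the reset gate scales how much of h is used.
    h_candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    # Blend the old state and the candidate using the update gate.
    return (1.0 - z) * h + z * h_candidate


# Arbitrary weights, chosen only so the example runs.
params = {name: 0.5 for name in
          ("wz", "uz", "bz", "wr", "ur", "br", "wh", "uh", "bh")}

h = 0.0
for x in [1.0, -1.0, 0.5]:
    h = gru_step(x, h, params)
print(h)
```

The only value carried from one step to the next is `h`. An LSTM step would need to carry a second value, the cell state, alongside it.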

So now you’ll try a new version of the model you trained in the previous chapter — one that includes a bidirectional encoder. The Python code for this section is nearly identical to what you wrote for your first seq2seq model. As such, the chapter’s resources include a pre-filled Jupyter notebook for you to run at notebooks/Bidir-Char-Seq2Seq-Starter.ipynb. Or, you can just review the contents of notebooks/Bidir-Char-Seq2Seq-Complete.ipynb, which shows the output from the run used to build the pre-trained bidirectional model included in the notebooks/pre-trained/BidirCharModel/ folder.

If you choose to run the starter notebook then, as in the previous chapter, you should expect to see a few deprecation warnings printed out. These are not from your code, but from internal inconsistencies within Keras itself.

The rest of this section goes over the important differences between this and the previous model you built.

The first difference isn’t out of necessity, but this model uses a larger latent_dim value:

latent_dim = 512

The previous model used 256 dimensions, which meant you passed 512 features from your encoder to your decoder — the LSTM produced two 256-length vectors, one for the hidden state and one for the cell state. GRU layers don’t have a cell state, so they return only a single vector of length latent_dim. Rather than send only half the amount of information to the decoder, the author chose to double the size of the GRUs.

The biggest differences for this model are in the encoder, so let’s go over its definition:

# 1
encoder_in = Input(
  shape=(None, in_vocab_size), name="encoder_in")
encoder_mask = Masking(name="encoder_mask")(encoder_in)
# 2
fwd_enc_gru = GRU(
  latent_dim, recurrent_dropout=0.3, name="fwd_enc_gru")
rev_enc_gru = GRU(
  latent_dim, go_backwards=True, recurrent_dropout=0.3,
  name="rev_enc_gru")
fwd_enc_out = fwd_enc_gru(encoder_mask)
rev_enc_out = rev_enc_gru(encoder_mask)
# 3
encoder_out = Concatenate(name="encoder_out")(
  [fwd_enc_out, rev_enc_out])

This encoder uses a bidirectional RNN with GRU layers. Here’s how you set it up:

  1. The Input and Masking layers are identical to the previous chapter’s encoder.
  2. Rather than creating one recurrent layer, you create two — one that processes the sequence normally and one that processes it in reverse because you set go_backwards=True. You feed the same masking layer into both of these layers.
  3. Finally, you concatenate the outputs from the two GRU layers so the encoder can output them together in a single vector. Notice that, unlike in the previous chapter, here you don’t use the h states and instead use the layer outputs. This wasn’t mentioned before, but that works because the hidden states are the outputs. The reason you used the states for the LSTM was to get at the cell states, which are not returned as outputs like the hidden states are.

As far as the decoder goes, one important difference is in the size of the inputs it expects. We define a new variable called decoder_latent_dim, like this:

decoder_latent_dim = latent_dim * 2

The decoder’s recurrent layer needs twice as many units as the encoder’s did because it accepts a vector that contains the concatenated outputs from two of them — forward and reverse.
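The size arithmetic can be checked with a few lines of plain Python. This is only a sketch: the lists stand in for the real layer outputs, and `prev_latent_dim` is a hypothetical name for the previous chapter's setting of 256.

```python
latent_dim = 512

# Stand-ins for the outputs of the forward and reverse encoder GRUs.
fwd_enc_out = [0.0] * latent_dim
rev_enc_out = [0.0] * latent_dim

# List concatenation mirrors what the Concatenate layer does to the vectors.
encoder_out = fwd_enc_out + rev_enc_out

decoder_latent_dim = latent_dim * 2
print(len(encoder_out) == decoder_latent_dim)

# For comparison, the previous chapter's LSTM passed a hidden state and a
# cell state, each of length 256.
prev_latent_dim = 256
lstm_state_features = 2 * prev_latent_dim
print(lstm_state_features)
```

Because the decoder's initial state must have the same length as its number of units, the concatenated vector dictates the decoder's size.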

The only other differences with the decoder are in the following lines:

decoder_gru = GRU(
  decoder_latent_dim, return_sequences=True,
  return_state=True, dropout=0.2, recurrent_dropout=0.3,
  name="decoder_gru")
decoder_gru_out, _ = decoder_gru(
  decoder_mask, initial_state=encoder_out)

Once again, you use a GRU layer instead of an LSTM, but use decoder_latent_dim instead of latent_dim to account for the forward and reverse states coming from the encoder. Notice the GRU only returns hidden states, which you ignore for now by assigning them to an underscore variable. This differs from the LSTM you used in the previous chapter, which returned both hidden and cell states.

Note: One important detail is that the decoder does not implement a bidirectional network like the encoder does. That’s because the decoder doesn’t actually process whole sequences — it just takes a single character along with state information.
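The loop below is a minimal sketch of that one-token-at-a-time process. `toy_decoder_step` is a hypothetical stand-in for a real decoder prediction, and the `<start>` and `<end>` markers are assumptions made for illustration, not the tokens your SMDB model uses.

```python
def toy_decoder_step(token, state):
    """Hypothetical stand-in for one decoder call.

    Takes the previous token and the current state, and returns the next
    token plus an updated state. Here the "state" is simply the text that
    is still left to emit, so the example runs without a trained model.
    """
    if not state:
        return "<end>", state
    return state[0], state[1:]


def greedy_decode(initial_state, max_len=20):
    token, state = "<start>", initial_state
    output = []
    for _ in range(max_len):
        # Only the previous token and the state are available here.
        # Later tokens have not been generated yet, so there is nothing
        # for a reverse pass to read.
        token, state = toy_decoder_step(token, state)
        if token == "<end>":
            break
        output.append(token)
    return "".join(output)


translation = greedy_decode("hello")
print(translation)
```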

If you run this notebook, or look through the completed one provided, you’ll see a few things. First, this model is much larger than the last one you built — about 5.4 million parameters versus 741 thousand. Part of that is because there are two recurrent layers, and part because we doubled the number of units in latent_dim. Still, each epoch only takes a bit longer to train.

The other thing that stands out is the performance. This model trained to a validation loss of 0.3533 by epoch 128 (before automatically stopping training at epoch 133). Compare that to the previous model, which only achieved a 0.5905 validation loss, and it took 179 epochs to do it. So this model achieved lower loss in fewer epochs, thanks mostly to the additional information gleaned from the bidirectional encoder.

For inference, the only difference is with the encoder’s output. Instead of outputting the encoder’s latent state, you use the concatenated layer encoder_out, like this:

inf_encoder = Model(encoder_in, encoder_out)

The notebook includes code to export your encoder and decoder models to Core ML. There are slight differences to match the new model architecture, but nothing should look unfamiliar to you. It includes the same workarounds you used in the last chapter.

Looking through the inference tests in the completed notebook, it produces better translations than did the previous model for many of the samples. For example:

It does about as well on most — but not all — of the other tests, too. Some of the most interesting are those it gets wrong, but less wrong than the last model did. Such as:

Notice that, in each of these examples, the bidirectional model does better than the previous chapter’s model when translating words that appear near the end of the sentences. That makes sense, since it looks at the sequence in both directions, letting it encode more context for the decoder.

If you’ve worked with recurrent networks in Keras before, then you might have expected this section to use Keras’s Bidirectional layer. Before trying out your new model in Xcode, take a look at the following brief discussion of why we didn’t use that class.

Why not use Keras’s Bidirectional layer?

Keras includes a Bidirectional layer that simplifies the creation of bidirectional RNNs. You initialize it with a single recurrent layer, like an LSTM or GRU layer, and it handles duplicating that as a reversed layer for you. To use it, you’d write something like this for a bidirectional LSTM:

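encoder_lstm = Bidirectional(
  LSTM(latent_dim, return_state=True, recurrent_dropout=0.3))
encoder_out, fwd_enc_h, fwd_enc_c, rev_enc_h, rev_enc_c = \
  encoder_lstm(encoder_mask)

# The equivalent with a GRU, which has no cell states to return:
encoder_gru = Bidirectional(
  GRU(latent_dim, return_state=True, recurrent_dropout=0.3))
encoder_out, fwd_enc_h, rev_enc_h = encoder_gru(encoder_mask)

# Concatenate the forward and reverse states before passing them to
# the decoder. The cell states apply only to the LSTM version.
# Layer names must be unique within a model.
encoder_h = Concatenate(name="encoder_h")(
  [fwd_enc_h, rev_enc_h])
encoder_c = Concatenate(name="encoder_c")(
  [fwd_enc_c, rev_enc_c])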

Using your bidirectional model in Xcode

Open the SMDB project you’ve been working with for the past couple chapters in Xcode, or use the starter project found in this chapter’s resources. Then, add the Es2EnBidirGruCharEncoder16Bit.mlmodel and Es2EnBidirGruCharDecoder16Bit.mlmodel models to SMDB like you’ve done before. If you didn’t train your own, you can find the ones we trained in the notebooks/pre-trained/BidirCharModel folder.

Looking at the encoder mlmodel file

func getBidirDecoderInput(encoderInput: MLMultiArray) ->
  Es2EnBidirGruCharDecoder16BitInput {
  let encoder = Es2EnBidirGruCharEncoder16Bit()
  let encoderOut = try! encoder.prediction(
    oneHotEncodedSeq: encoderInput,
    fwd_enc_gru_h_in: nil, 
    rev_enc_gru_h_in: nil)

  let decoderIn = initMultiArray(
    shape: [NSNumber(value: intToEnChar.count)])

  return Es2EnBidirGruCharDecoder16BitInput(
    encodedChar: decoderIn,
    decoder_gru_h_in: encoderOut.decodersIntialState)
}

// Replace these two lines:
let decoderIn = getDecoderInput(encoderInput: encoderIn)
let decoder = Es2EnCharDecoder16Bit()
// ...with these:
let decoderIn = getBidirDecoderInput(encoderInput: encoderIn)
let decoder = Es2EnBidirGruCharDecoder16Bit()

// Replace the two LSTM state updates:
decoderIn.decoder_lstm_h_in = decoderOut.decoder_lstm_h_out
decoderIn.decoder_lstm_c_in = decoderOut.decoder_lstm_c_out
// ...with the single GRU state update:
decoderIn.decoder_gru_h_in = decoderOut.decoder_gru_h_out
SMDB app with reviews translated by bidirectional character-level model

Beam search

This and the previous chapter have both implied there’s a better option than greedily choosing the token predicted with the highest probability at each timestep. The solution most commonly used is called beam search, and you should strongly consider implementing it if you want to improve the quality of a model’s generated sequences.
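As a rough illustration of the idea, the sketch below runs beam search over a tiny hand-written probability table instead of a trained model. Everything in it is an assumption made for the example: the `TOY_MODEL` table, the `toy_next_probs` function, the `<end>` marker and the scoring by summed log probabilities without length normalization. Setting `beam_width=1` reproduces greedy selection.

```python
import math

# Made-up next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def toy_next_probs(tokens):
    """Return the toy distribution for a prefix, or end the sequence."""
    return TOY_MODEL.get(tokens, {"<end>": 1.0})


def beam_search(next_probs, beam_width, max_len, end_token="<end>"):
    beams = [((), 0.0)]  # each beam is (tokens, summed log probability)
    finished = []
    for _ in range(max_len):
        # Expand every live beam by every possible next token.
        candidates = []
        for tokens, score in beams:
            for token, p in next_probs(tokens).items():
                candidates.append((tokens + (token,), score + math.log(p)))
        # Keep only the best few candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # beams cut off by max_len still count
    return max(finished, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(toy_next_probs, 1, 5)
beam_tokens, beam_score = beam_search(toy_next_probs, 2, 5)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

In this toy table, the greedy choice commits to "a" because it is the most likely first token, and ends with an overall probability of 0.3. Keeping two candidates alive lets the search discover that starting with "b" leads to a more likely sequence overall, with probability 0.36.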


Attention

The previous chapter mentioned an important problem with the encoder portion of your seq2seq model: It needs to encode the entire sequence into a single, fixed-length vector. That limits the length of the input sequences it can successfully handle, because each new token essentially dilutes the stored information.

Attention alignments example from Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations (ICLR 2015)
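Attention addresses this by letting the decoder weigh every encoder position at each step instead of relying on one summary vector. The sketch below is a simplified stand-in, not the method from the paper cited above: it scores positions with a plain dot product, whereas Bahdanau et al. learn an additive scoring function. The function names and the tiny vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_outputs):
    """Return attention weights and the weighted sum of encoder outputs."""
    # Score each encoder position against the current decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    # The context vector is a weighted average of the encoder outputs.
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
               for i in range(len(decoder_state))]
    return weights, context


# One made-up output vector per input token.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_outputs)
print(weights)
print(context)
```

The weights play the role of the alignments in the figure: positions whose encoder output matches the decoder state receive more of the total weight.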

Why use characters at all?

The seq2seq models you’ve made in this book work with sequences at the character level, but why? How much information does a model get from each token when it views sequences this way? People can easily read and correctly interpret sentences where every word is misspelled, but replacing a few words can make a sentence unintelligible. It seems like most individual characters don’t add much information to a sentence, whereas most words do, so shouldn’t translation models consider words instead?

Words as tokens and word embedding

Recall that neural networks require numerical inputs. So far, you’ve been one-hot encoding text prior to using it, but that essentially means your network sees mostly just zeros. What if you could provide more useful information?
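To see the difference between the two representations, consider this small sketch. The vocabulary and the two-dimensional vectors are invented for the example, and real embeddings have far more dimensions. It shows that multiplying a one-hot vector by an embedding matrix just selects one row, which is why an embedding can be stored and used as a simple lookup by integer index.

```python
vocab = ["<pad>", "el", "gato", "duerme"]

# One made-up dense vector per vocabulary entry.
embedding_matrix = [
    [0.0, 0.0],
    [0.1, 0.3],
    [0.7, -0.2],
    [-0.4, 0.9],
]


def one_hot(index, size):
    """A vector of zeros with a single 1 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed_via_one_hot(index):
    """Multiply the one-hot vector by the embedding matrix."""
    encoded = one_hot(index, len(vocab))
    width = len(embedding_matrix[0])
    return [sum(encoded[row] * embedding_matrix[row][col]
                for row in range(len(vocab)))
            for col in range(width)]


def embed_via_lookup(index):
    """Select the row directly, which gives the same result."""
    return embedding_matrix[index]


index = vocab.index("gato")
print(one_hot(index, len(vocab)))
print(embed_via_lookup(index))
```

The one-hot vector grows with the vocabulary and is almost entirely zeros, while the embedding vector stays the same small size and every value in it carries information.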

Marvel characters projected onto D&D alignments

Word relationship example from Pennington, J., Socher, R., and Manning, C. D. (2014) GloVe: Global Vectors for Word Representation.

Word embeddings in iOS

Apple provides support for word embeddings via the MLWordEmbedding and NLEmbedding types. That support makes certain uses of word embeddings straightforward.

import CoreML
import CreateML
import NaturalLanguage

let vectors = [
  "Captain America": [0.0, 1], "Rocket Raccoon": [1, 1],
  "Hulk": [1, 0],  "Loki": [1, -1],
  "Thanos": [0, -1], "Red Skull": [-1, -1],
  "Black Widow": [-1, 0], "Nova Corps": [-1, 1],
]

// marvelModelUrl is a file URL for the saved model, defined elsewhere.
// 1
let embedding1 = try MLWordEmbedding(dictionary: vectors)
try embedding1.write(to: marvelModelUrl)
// 2
let compiledUrl = try MLModel.compileModel(at: marvelModelUrl)
// 3
let embedding2 = try NLEmbedding(contentsOf: compiledUrl)

embedding2.distance(between: "Captain America", 
                    and: "Rocket Raccoon", 
                    distanceType: .cosine)
// => 1.414
embedding2.distance(between: "Captain America", 
                    and: "Loki", 
                    distanceType: .cosine)
// => 1.847
embedding2.distance(between: "Captain America", 
                    and: "Thanos", 
                    distanceType: .cosine)
// => 1.414

func cosineDistance(v: [Double], w: [Double]) -> Double {
  // Dot product of the two vectors
  let innerProduct = zip(v, w)
    .map { $0 * $1 }
    .reduce(0, +)

  // Euclidean length of a vector
  func magnitude(_ x: [Double]) -> Double {
    return sqrt(x
      .map { $0 * $0}
      .reduce(0, +))
  }

  let cos =  innerProduct / (magnitude(v) * magnitude(w))
  return 1 - cos
}

cosineDistance(v: vectors["Captain America"]!, 
               w: vectors["Rocket Raccoon"]!)
// => 0.29
cosineDistance(v: vectors["Captain America"]!, 
               w: vectors["Loki"]!)
// => 1.707
cosineDistance(v: vectors["Captain America"]!, 
               w: vectors["Thanos"]!)
// => 2
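If you'd like to check the hand-written cosine distance values outside of Xcode, the same calculation can be sketched in Python. This is a direct transcription of the formula, one minus the cosine of the angle between the vectors, using the same made-up Marvel coordinates. The name `cosine_distance` is simply the Python-style spelling chosen for this example.

```python
import math

vectors = {
    "Captain America": [0.0, 1.0], "Rocket Raccoon": [1.0, 1.0],
    "Hulk": [1.0, 0.0], "Loki": [1.0, -1.0],
    "Thanos": [0.0, -1.0], "Red Skull": [-1.0, -1.0],
    "Black Widow": [-1.0, 0.0], "Nova Corps": [-1.0, 1.0],
}


def cosine_distance(v, w):
    """Return 1 minus the cosine of the angle between v and w."""
    inner_product = sum(a * b for a, b in zip(v, w))

    def magnitude(x):
        return math.sqrt(sum(a * a for a in x))

    return 1 - inner_product / (magnitude(v) * magnitude(w))


cap = vectors["Captain America"]
print(round(cosine_distance(cap, vectors["Rocket Raccoon"]), 2))  # 0.29
print(round(cosine_distance(cap, vectors["Loki"]), 3))            # 1.707
print(cosine_distance(cap, vectors["Thanos"]))                    # 2.0
```

A distance of 0 means the vectors point the same way, 1 means they are perpendicular and 2 means they point in opposite directions, which matches Captain America and Thanos sitting at opposite ends of the chart.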

Building models with word embeddings

This section points out some changes you’d need to make to your existing seq2seq models in order to have them use word tokens instead of characters. This section includes code snippets you can use, but it doesn’t spell out every detail necessary to build such a model. Don’t worry! With these tips and what you’ve already learned, you’re well prepared to build these models on your own. Consider it a challenge!

import numpy as np

# Give the unknown-word token a random vector with values in [-1, 1)
es_token_vectors[unk_token] = (
  2 * np.random.rand(embedding_dim).astype(np.float32) - 1)

# 1
num_enc_embeddings = in_vocab_size + 1
# 2
pretrained_embeddings = np.zeros(
  (num_enc_embeddings, embedding_dim), dtype=np.float32)
# 3
for i, t in enumerate(in_vocab):
  pretrained_embeddings[i+1] = es_token_vectors[t]

# 1
enc_embeddings = Embedding(
  num_enc_embeddings, embedding_dim,
  weights=[pretrained_embeddings], trainable=False,
  mask_zero=True, name="encoder_embeddings")
# 2
enc_embedded_in = enc_embeddings(encoder_in)

num_dec_embeddings = out_vocab_size + 1
dec_embeddings = Embedding(
  num_dec_embeddings, embedding_dim,
  mask_zero=True, name="decoder_embeddings")

# Token indices start at 1 because 0 is reserved for masking
in_token2int = {token : i + 1
                for i, token in enumerate(in_vocab)}
out_token2int = {token : i + 1
                 for i, token in enumerate(out_vocab)}

# Store each token's integer index for the Embedding layer:
enc_in_seqs[i, time_step] = in_token2int[token]
# ...instead of one-hot encoding it, as the character models did:
# enc_in_seqs[i, time_step, in_token2int[token]] = 1
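To see how these pieces fit together, here is a small, self-contained sketch that uses plain Python lists instead of NumPy and Keras. The two-word vocabulary, the three-dimensional vectors, the `<unk>` marker and the `encode` helper are all hypothetical, chosen only to make the example runnable. It also assumes the unknown token is part of the vocabulary.

```python
import random

embedding_dim = 3
unk_token = "<unk>"

# Made-up pretrained vectors for a tiny Spanish vocabulary.
es_token_vectors = {
    "hola": [0.1, 0.2, 0.3],
    "mundo": [0.4, 0.5, 0.6],
}

# Give the unknown-word token a random vector with values in [-1, 1].
random.seed(0)
es_token_vectors[unk_token] = [
    random.uniform(-1, 1) for _ in range(embedding_dim)]

in_vocab = ["hola", "mundo", unk_token]

# Indices start at 1 so that 0 can stand for padding.
in_token2int = {token: i + 1 for i, token in enumerate(in_vocab)}

# Row 0 stays all zeros. Rows 1 and up hold each token's vector.
num_enc_embeddings = len(in_vocab) + 1
pretrained_embeddings = [
    [0.0] * embedding_dim for _ in range(num_enc_embeddings)]
for i, t in enumerate(in_vocab):
    pretrained_embeddings[i + 1] = es_token_vectors[t]


def encode(tokens, max_len):
    """Map tokens to integer indices, then pad with zeros."""
    ids = [in_token2int.get(t, in_token2int[unk_token]) for t in tokens]
    return ids + [0] * (max_len - len(ids))


encoded = encode(["hola", "amigo", "mundo"], max_len=5)
print(encoded)
```

Each sequence is now a short list of integers rather than a grid of one-hot vectors. A word outside the vocabulary, such as "amigo" here, falls back to the unknown token's index.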

Using word embeddings in iOS

When it comes to using your trained model in an app, it’s similar to what you did in the SMDB project. However, there are a few important caveats:

Key points

Where to go from here?

These past three chapters have only scratched the surface of the field of NLP. Hopefully, they’ve shown you how to accomplish some useful things in your apps, while sparking your interest to research other topics. So what’s next?


© 2020 Razeware LLC
