Scanner Tutorial for macOS

Hai Nguyen

Update 9/25/16: This tutorial has been updated for Xcode 8 and Swift 3.

Update note: This tutorial has been updated to Swift by Hai Nguyen. The original tutorial was written by Vincent Ngo.


In these days of big data, data is stored in a multitude of formats, which poses a challenge to anyone trying to consolidate and make sense of it. If you’re lucky, the data will be in an organized, hierarchical format such as JSON, XML, or CSV. Otherwise, you might have to struggle with endless if/else cases. Either way, manually extracting data is no fun.

Thankfully, Apple provides a set of tools, such as NSRegularExpression, NSDataDetector and Scanner, that you can use to analyze string data in any form, from natural to computer languages. Each of them has its own advantages, but Scanner is by far the easiest to use while remaining powerful and flexible. In this tutorial, you’ll learn how to use its methods to extract information from email messages, in order to build a macOS application with an interface similar to Apple Mail’s, as shown.

Completed-Final-Screen

Although you’ll be building an app for Mac, Scanner is also available on iOS. By the end of this tutorial, you will be ready to parse text on either platform.

Before getting things started, let’s first see what Scanner is capable of!

Scanner Overview

Scanner’s main functionality is to retrieve and interpret substrings and numeric values.

For example, Scanner can analyze a phone number and break it down into components like this:

// 1.
let hyphen = CharacterSet(charactersIn: "-")

// 2.
let scanner = Scanner(string: "123-456-7890")
scanner.charactersToBeSkipped = hyphen

// 3.
var areaCode, firstThreeDigits, lastFourDigits: NSString?

scanner.scanUpToCharacters(from: hyphen, into: &areaCode)          // A
scanner.scanUpToCharacters(from: hyphen, into: &firstThreeDigits)  // B
scanner.scanUpToCharacters(from: hyphen, into: &lastFourDigits)    // C

print(areaCode!, firstThreeDigits!, lastFourDigits!)
// Prints "123 456 7890":
// 123 - area code
// 456 - first three digits
// 7890 - last four digits

Here’s what this code does:

  1. Creates an instance of CharacterSet named hyphen. This will be used as the separator between string components.
  2. Initializes a Scanner object and changes its charactersToBeSkipped from the default value (whitespace and newline characters) to hyphen, so the returned strings will NOT include any hyphens.
  3. areaCode, firstThreeDigits and lastFourDigits will store the parsed values that you get back from the scanner. Because scanUpToCharacters(from:into:) takes an AutoreleasingUnsafeMutablePointer<NSString?> and a native Swift String cannot be passed to it directly, you have to declare these variables as optional NSString objects.
    A. Scans up to the first - character and assigns the characters in front of it to areaCode.
    B. Continues scanning up to the second - and puts the next three digits into firstThreeDigits. Before you invoke scanUpToCharacters(from:into:), the scanner’s reading cursor is at the position of the first -. Because hyphens are skipped, you get the phone number’s second component.
    C. Looks for the next -. With no hyphen left, the scanner reads the rest of the string, returns a successful status, and puts the remaining substring into lastFourDigits.
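
Scanner can interpret numeric values as well as substrings. The following is a minimal sketch, assuming a newer SDK (macOS 10.15 or later) where Scanner offers optional-returning methods such as scanInt() and scanDouble(). The input string and its format are made up for illustration.

```swift
import Foundation

// Hypothetical input in a made-up "<count> x <price>" format.
let orderScanner = Scanner(string: "3 x 2.5")

// The default charactersToBeSkipped (whitespace and newlines) lets the
// scanner step over the spaces between tokens.
let quantity = orderScanner.scanInt()      // reads the leading integer
_ = orderScanner.scanString("x")           // consumes the separator
let unitPrice = orderScanner.scanDouble()  // reads the decimal number

if let quantity = quantity, let unitPrice = unitPrice {
  print("Total:", Double(quantity) * unitPrice)
}
```

Each call returns nil when the expected value is not at the cursor, so a failed scan can be handled with ordinary optional binding.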

That’s all Scanner does. It’s that easy! Now, it’s time to get your application started!

Getting Started

Download the starter project and extract the contents of the ZIP file. Open EmailParser.xcodeproj in Xcode.

You’ll find the following:

  • DataSource.swift contains a pre-made structure that sets up the data source/delegate to populate a table view.
  • PostCell.swift contains all the properties that you need to display each individual data item.
  • Support/Main.storyboard contains a TableView with a custom cell on the left-hand side and a TextView on the right.

You’ll be parsing the data of 49 sample files in the comp.sys.mac.hardware folder. Take a minute to browse through them to see how the data is structured. You’ll be collecting items like Name, Email, and so on into a table so that they are easy to see at a glance.

Note: The starter project uses table views to present the data, so if you’re unfamiliar with table views, check out our macOS NSTableView Tutorial.

Build and run the project to see it in action.

Starter-Initial-Screen

The table view currently displays placeholder labels with a [Field]Value prefix. By the end of the tutorial, those will be replaced with parsed data.

Understanding the Structure of Raw Samples

Before diving straight into parsing, it’s important to understand what you’re trying to achieve. Below is one of the sample files, with the data items you’ll be retrieving highlighted.

Data-Structure-Illustration

In summary, these data items are:

  • From field: this consists of the sender’s name and email. Parsing it can be tricky since the name may come before the email or vice versa; it might even contain one piece but not the other.
  • Subject, Date, Organization and Lines fields: these have values separated by colons.
  • Message segment: this can contain cost information and some of the following keywords: apple, macs, software, keyboard, printers, printer, video, monitor, laser, scanner, disks, cost, price, floppy, card, and phone.

Scanner is awesome; however, working with it can feel a bit cumbersome and far less “Swifty”, so you’ll convert the built-in methods like the one in the phone number example above to ones that return optionals.

Navigate to File\New\File… (or simply press Command+N). Select macOS > Source > Swift File and click Next. Set the file’s name to Scanner+.swift, then click Create.

Open Scanner+.swift and add the following extension:

extension Scanner {
  
  func scanUpToCharactersFrom(_ set: CharacterSet) -> String? {
    var result: NSString?                                                           // 1.
    return scanUpToCharacters(from: set, into: &result) ? (result as? String) : nil // 2.
  }
  
  func scanUpTo(_ string: String) -> String? {
    var result: NSString?
    return self.scanUpTo(string, into: &result) ? (result as? String) : nil
  }
  
  func scanDouble() -> Double? {
    var double: Double = 0
    return scanDouble(&double) ? double : nil
  }
}

These helper methods wrap some of the Scanner methods you’ll use in this tutorial so that they return optionals: an optional String for the first two and an optional Double for the third. All three methods share the same structure:

  1. Defines a result variable to hold the value produced by the scanner.
  2. Uses a ternary operator to check whether the scan was successful. If it was, returns the result (converted to String for the string-based methods); otherwise simply returns nil.

Note: You can wrap other Scanner methods the same way and add them to your arsenal:

  • scanDecimal(_:)
  • scanFloat(_:)
  • scanHexDouble(_:)
  • scanHexFloat(_:)
  • scanHexInt32(_:)
  • scanHexInt64(_:)
  • scanInt(_:)
  • scanInt32(_:)
  • scanInt64(_:)
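
As a sketch of how one of these might be wrapped, here is scanInt(_:) given the same treatment. The name scanInteger() is hypothetical, picked so that it cannot collide with a method Scanner may already provide on newer SDKs.

```swift
import Foundation

extension Scanner {
  // Hypothetical name, chosen so it does not clash with any method
  // that Scanner itself may already provide.
  func scanInteger() -> Int? {
    var value = 0
    return scanInt(&value) ? value : nil
  }
}

let lineCountScanner = Scanner(string: "42 lines")
let lineCount = lineCountScanner.scanInteger()   // the leading number
let notANumber = lineCountScanner.scanInteger()  // "lines" is not an integer, so this is nil
```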

Simple, right? Now go back to the main project and start parsing!

Creating the Data Structure

Navigate to File\New\File… (or simply press Command+N). Select macOS > Source > Swift File and click Next. Set the file’s name to HardwarePost.swift, then click Create.

Open HardwarePost.swift and add the following structure:

struct HardwarePost {
  // MARK: Properties
  
  // The extracted field values are stored in these properties
  let email: String
  let sender: String
  let subject: String
  let date: String
  let organization: String
  let numberOfLines: Int
  let message: String
  
  let costs: [Double]         // cost related information
  let keywords: Set<String>   // set of distinct keywords
}

This code defines the HardwarePost structure, which stores the parsed data. By default, Swift provides a memberwise initializer based on the structure’s properties, but you’ll come back to this later to implement your own custom initializer.

Are you ready for parsing in action with Scanner? Let’s do this.

Creating the Data Parser

Navigate to File\New\File… (or simply press Command+N), select macOS > Source > Swift File and click Next. Set the file’s name to ParserEngine.swift, then click Create.

Open ParserEngine.swift and create the ParserEngine class by adding the following code:

final class ParserEngine {

}

Extracting Metadata Fields

Consider the following sample metadata segment:

Metadata-Segment

Here’s where Scanner comes in and separates the fields and their values. The image below gives you a general visual representation of this structure.

Field-Structure-Illustration

Open ParserEngine.swift and implement this code inside ParserEngine class:

// 1.
typealias Fields = (sender: String, email: String, subject: String, date: String, organization: String, lines: Int)

/// Returns a collection of predefined fields' extracted values
func fieldsByExtractingFrom(_ string: String) -> Fields {
  // 2.
  var (sender, email, subject, date, organization, lines) = ("", "", "", "", "", 0)
  
  // 3.
  let scanner = Scanner(string: string)
  scanner.charactersToBeSkipped = CharacterSet(charactersIn: " :\n")
  
  // 4.
  while !scanner.isAtEnd {                  // A
    let field = scanner.scanUpTo(":") ?? "" // B
    let info = scanner.scanUpTo("\n") ?? "" // C
    
    // D
    switch field {
    case "From": (email, sender) = fromInfoByExtractingFrom(info) // E
    case "Subject": subject = info
    case "Date": date = info
    case "Organization": organization = info
    case "Lines": lines = Int(info) ?? 0
    default: break
    }
  }
  
  return (sender, email, subject, date, organization, lines)
}

Don’t panic! The Xcode error of an unresolved identifier will go away right in the next section.

Here’s what the above code does:

  1. Defines a Fields type alias for the tuple of parsed fields.
  2. Creates variables that will hold the returned values.
  3. Initializes a Scanner instance and changes its charactersToBeSkipped property to a set containing the space, colon and linefeed characters, so that colons are skipped in addition to whitespace.
  4. Obtains values of all the wanted fields by repeating the process below:
    A. Uses while to loop through string’s content until it reaches the end.
    B. Invokes one of the helper methods you created earlier to get the field’s title before :.
    C. Continues scanning up to the end of the line, where the linefeed character \n is located, and assigns the result to info.
    D. Uses switch to find the matching field and stores the info value in the proper variable.
    E. Analyzes the From field by calling fromInfoByExtractingFrom(_:). You’ll implement the method after this section.
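
The field loop above can be tried in isolation. This is a minimal sketch under a few assumptions: it uses the optional-returning scanUpToString(_:) that Scanner provides on newer SDKs (macOS 10.15 or later) instead of the tutorial’s helper, the header text is made up, and the results go into a plain dictionary rather than the Fields tuple.

```swift
import Foundation

// Hypothetical header text in the "Field: value" layout described above.
let headerText = "From: someone@example.com (Some One)\nSubject: Re: Keyboard\nLines: 12"

let fieldScanner = Scanner(string: headerText)
fieldScanner.charactersToBeSkipped = CharacterSet(charactersIn: " :\n")

var headerFields = [String: String]()
while !fieldScanner.isAtEnd {
  // Everything before the first colon is the field name.
  guard let name = fieldScanner.scanUpToString(":") else { break }
  // The colon and the space are skipped; the rest of the line is the value.
  let value = fieldScanner.scanUpToString("\n") ?? ""
  headerFields[name] = value
}
```

Only the colon right after the field name is skipped, so a value such as "Re: Keyboard" keeps its own colon.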

Remember the tricky part of the From field? Hang tight, because you’re going to need help from regular expressions to overcome this challenge.

Note: Regular expressions are a great tool to manipulate strings with patterns, and this NSRegularExpression Tutorial gives a good overview of how to use them.

At the end of ParserEngine.swift, add the following String extension:

private extension String {
  
  func isMatched(_ pattern: String) -> Bool {
    return NSPredicate(format: "SELF MATCHES %@", pattern).evaluate(with: self)
  }
}

This extension defines a private helper method to find whether the string matches a given pattern using regular expressions.

It creates an NSPredicate object with a MATCHES operator using the regular expression pattern. Then it invokes evaluate(with:) to check whether the entire string matches the pattern.

Note: You can read more about NSPredicate in the official Apple documentation.
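
For comparison, the same whole-string check can be sketched without NSPredicate. This is an alternative technique, not the tutorial’s method: it anchors the pattern with ^ and $ and uses range(of:options:) with the .regularExpression option. The helper name matchesEntirely(_:) is hypothetical.

```swift
import Foundation

extension String {
  // Hypothetical helper: true only if the whole string matches the pattern.
  func matchesEntirely(_ pattern: String) -> Bool {
    return range(of: "^(?:\(pattern))$", options: .regularExpression) != nil
  }
}

let parenthesisPattern = ".*[\\s]*\\({1}(.*)"   // email (name)
let anglePattern = ".*[\\s]*<{1}(.*)"           // name <email>

let emailFirst = "oelt0002@student.tc.umn.edu (Bret Oeltjen)"
let nameFirst = "Thomas Kephart <kephart@snowhite.eeap.cwru.edu>"

let emailFirstHasParenthesis = emailFirst.matchesEntirely(parenthesisPattern)
let nameFirstHasParenthesis = nameFirst.matchesEntirely(parenthesisPattern)
let nameFirstHasAngle = nameFirst.matchesEntirely(anglePattern)
```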

Now add the following method inside the ParserEngine implementation, just after fieldsByExtractingFrom(_:) method:

fileprivate func fromInfoByExtractingFrom(_ string: String) -> (email: String, sender: String) {
  let scanner = Scanner(string: string)
  
  // 1.
  /*
   * ROGOSCHP@MAX.CC.Uregina.CA (Are we having Fun yet ???)
   * oelt0002@student.tc.umn.edu (Bret Oeltjen)
   * (iisi owner)
   * mbuntan@staff.tc.umn.edu ()
   * barry.davis@hal9k.ann-arbor.mi.us (Barry Davis)
   */
  if string.isMatched(".*[\\s]*\\({1}(.*)") { // A
    scanner.charactersToBeSkipped = CharacterSet(charactersIn: "() ") // B
    
    let email = scanner.scanUpTo("(")  // C
    let sender = scanner.scanUpTo(")") // D
    
    return (email ?? "", sender ?? "")
  }
  
  // 2.
  /*
   * "Jonathan L. Hutchison" <jh6r+@andrew.cmu.edu>
   * <BR4416A@auvm.american.edu>
   * Thomas Kephart <kephart@snowhite.eeap.cwru.edu>
   * Alexander Samuel McDiarmid <am2o+@andrew.cmu.edu>
   */
  if string.isMatched(".*[\\s]*<{1}(.*)") {
    scanner.charactersToBeSkipped = CharacterSet(charactersIn: "<> ")
    
    let sender = scanner.scanUpTo("<")
    let email = scanner.scanUpTo(">")
    
    return (email ?? "", sender ?? "")
  }
  
  // 3.
  return (string, "unknown") // email only: the string is the email, the sender is unknown
}

After examining the 49 data sets, you end up with three cases to consider:

  • email (name)
  • name <email>
  • email with no name

Here’s what the code does:

  1. Matches string with the first pattern – email (name). If not, continues to the next case.
    A. Looks for zero or more occurrences of any character – .*, followed by zero or more whitespace characters – [\\s]*, followed by one open parenthesis – \\({1} and finally zero or more occurrences of any character – (.*).
    B. Sets the Scanner object’s charactersToBeSkipped to include “(”, “)” and the space character.
    C. Scans up to ( to get the email value.
    D. Scans up to ), which gives you the sender name. Together, these two scans extract the text before ( as the email and the text between ( and ) as the name.

Field-Value-Illustration

  2. Checks whether the given string matches the pattern – name <email>. The if body is practically the same as the first scenario, except that you deal with angle brackets.
  3. Finally, if neither of the two patterns is matched, this is the case where you only have an email. You’ll simply return the string for the email and “unknown” for sender.
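
To see the three cases side by side, here is a standalone sketch. It makes several assumptions: it relies on scanUpToString(_:) from newer SDKs (macOS 10.15 or later), a plain contains(_:) check stands in for the regular expressions, and splitSender(_:) is a hypothetical name. It also trims the space that a scan up to ( or < leaves at the end of the first component.

```swift
import Foundation

// Hypothetical helper that mirrors the three cases listed above.
func splitSender(_ field: String) -> (email: String, sender: String) {
  let scanner = Scanner(string: field)

  // Scans the text before `open`, then the text between `open` and `close`.
  func scanPair(open: String, close: String) -> (String, String) {
    scanner.charactersToBeSkipped = CharacterSet(charactersIn: open + close + " ")
    let first = scanner.scanUpToString(open) ?? ""
    let second = scanner.scanUpToString(close) ?? ""
    return (first.trimmingCharacters(in: .whitespaces),
            second.trimmingCharacters(in: .whitespaces))
  }

  if field.contains("(") {            // email (name)
    let (email, sender) = scanPair(open: "(", close: ")")
    return (email, sender)
  }
  if field.contains("<") {            // name <email>
    let (sender, email) = scanPair(open: "<", close: ">")
    return (email, sender)
  }
  return (field, "unknown")           // email with no name
}

let emailThenName = splitSender("oelt0002@student.tc.umn.edu (Bret Oeltjen)")
let nameThenEmail = splitSender("Thomas Kephart <kephart@snowhite.eeap.cwru.edu>")
let emailOnly = splitSender("someone@example.com")
```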

At this point, you can build the project. The previous compile error is gone.

Starter-Initial-Screen

Note: NSDataDetector would be a better solution for known-data types like phone number, address, and email. You can check out this blog about email validation with NSDataDetector.

You’ve been working with Scanner to analyze and retrieve information from a patterned string. In the next two sections, you’ll learn how to parse unstructured data.

Extracting Cost-Related Information

A good example of parsing unstructured data is to determine whether the email’s body contains cost-related information. To do this, you’ll use Scanner to search for an occurrence of a dollar character: $.

Still working on ParserEngine.swift, add the following implementation inside ParserEngine class:

func costInfoByExtractingFrom(_ string: String) -> [Double] {
  // 1.
  var results = [Double]()
  
  // 2.
  let dollar = CharacterSet(charactersIn: "$")
  
  // 3.
  let scanner = Scanner(string: string)
  scanner.charactersToBeSkipped = dollar
  
  // 4.
  while !scanner.isAtEnd && scanner.scanUpToCharacters(from: dollar, into: nil) {
    results += [scanner.scanDouble()].flatMap { $0 }
  }
  
  return results
}

The code is fairly straightforward:

  1. Defines an empty array to store the cost values.
  2. Creates a CharacterSet object with a $ character.
  3. Initializes a Scanner instance and configures it to ignore the $ character.
  4. Loops through string’s content and, when a $ is found, grabs the number after the $ with your helper method and appends it to the results array.
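
The same idea can be sketched as a standalone function. This version is a variation under stated assumptions rather than the tutorial’s code: it uses the optional-returning Scanner methods from newer SDKs (macOS 10.15 or later), and it consumes each $ explicitly with scanString(_:) instead of adding it to charactersToBeSkipped, so an amount at the very start of the text is also picked up. The name prices(in:) is hypothetical.

```swift
import Foundation

// Hypothetical helper: collects every number that directly follows a "$".
func prices(in text: String) -> [Double] {
  var results = [Double]()
  let scanner = Scanner(string: text)

  while !scanner.isAtEnd {
    _ = scanner.scanUpToString("$")                       // move to the next "$", if any
    guard scanner.scanString("$") != nil else { break }   // no "$" left
    if let value = scanner.scanDouble() {
      results.append(value)
    }
  }
  return results
}

let foundPrices = prices(in: "$5 for the cable, $12.50 for the card")
let noPrices = prices(in: "no prices here")
```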

Parsing the Message

Another example of parsing unstructured data is finding keywords in a given body of text. Your search strategy is to look at every word and check it against a set of keywords to see if it matches. You’ll use whitespace and newline characters as delimiters to pick out the words in the message as you scan.

Keywords-Parser-Illustration

Add the following code at the end of ParserEngine class:

// 1.
let keywords: Set<String> = ["apple", "macs", "software", "keyboard",
                             "printers", "printer", "video", "monitor",
                             "laser", "scanner", "disks", "cost", "price",
                             "floppy", "card", "phone"]

/// Returns the set of keywords found in the given string
func keywordsByExtractingFrom(_ string: String) -> Set<String> {
  // 2.
  var results: Set<String> = []
  
  // 3.
  let scanner = Scanner(string: string)
  
  // 4.
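  // Stop at any whitespace or newline so that words on adjacent lines are not glued together
  while !scanner.isAtEnd, let word = scanner.scanUpToCharactersFrom(.whitespacesAndNewlines)?.lowercased() {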
    if keywords.contains(word) {
      results.insert(word)
    }
  }
  
  return results
}

Here’s what this code does:

  1. Defines the keywords set that you’ll match against.
  2. Creates a Set of String to store the found keywords.
  3. Initializes a Scanner instance. You’ll use the default charactersToBeSkipped, which are the whitespace and newline characters.
  4. For every word found, checks whether it’s one of the predefined keywords. If it is, inserts it into results.
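
Here is a small standalone sketch of the word-by-word search, assuming the optional-returning scanUpToCharacters(from:) from newer SDKs (macOS 10.15 or later). Stripping punctuation from each word is an addition made for this sketch, and matchedKeywords(in:vocabulary:) is a hypothetical name.

```swift
import Foundation

// Hypothetical helper: returns the vocabulary words that occur in the text.
func matchedKeywords(in text: String, vocabulary: Set<String>) -> Set<String> {
  var found = Set<String>()
  let scanner = Scanner(string: text)

  // Each scan returns the next run of non-whitespace characters, or nil at the end.
  while let token = scanner.scanUpToCharacters(from: .whitespacesAndNewlines) {
    // Lowercase the token and drop surrounding punctuation such as "," or ".".
    let word = token.lowercased().trimmingCharacters(in: .punctuationCharacters)
    if vocabulary.contains(word) {
      found.insert(word)
    }
  }
  return found
}

let matches = matchedKeywords(in: "My Keyboard broke.\nThe printer, too.",
                              vocabulary: ["keyboard", "printer", "video"])
```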

There — you have all of the necessary methods to acquire the desired information. Time to put them to good use and create HardwarePost instances for the 49 data files.

Connecting the Parser With Data Samples

Open HardwarePost.swift and add this initializer to the HardwarePost structure:

init(fromData data: Data) {
  // 1.
  let parser = ParserEngine()
  
  // 2.
  let string = String(data: data, encoding: String.Encoding.utf8) ?? ""
  
  // 3.
  let scanner = Scanner(string: string)
  
  // 4.
  let metadata = scanner.scanUpTo("\n\n") ?? ""
  let (sender, email, subject, date, organization, lines) = parser.fieldsByExtractingFrom(metadata)
  
  // 5.
  self.sender = sender
  self.email = email
  self.subject = subject
  self.date = date
  self.organization = organization
  self.numberOfLines = lines
  
  // 6.
  let startIndex = string.characters.index(string.startIndex, offsetBy: scanner.scanLocation) // A
  let message = string[startIndex..<string.endIndex]                      // B
  self.message = message.trimmingCharacters(in: .whitespacesAndNewlines ) // C
  
  // 7.
  costs = parser.costInfoByExtractingFrom(message)
  keywords = parser.keywordsByExtractingFrom(message)
}

Here's how HardwarePost initializes its properties:

  1. Simply creates a ParserEngine object named parser.
  2. Converts data into a String.
  3. Initializes an instance of Scanner to parse the Metadata and Message segments, which are separated by "\n\n".
  4. Scans up to the first \n\n to grab the metadata string, then invokes the parser's fieldsByExtractingFrom(_:) method to obtain all of the metadata fields.
  5. Assigns the parsing results to the HardwarePost properties.
  6. Prepares the message content:
    A. Gets the current reading cursor from scanner with scanLocation and converts it to String.CharacterView.Index, so you can take a substring by range.
    B. Assigns the remaining string that scanner has yet to read to the new message variable.
    C. Since the message value still contains the \n\n where the scanner left off from the previous reading, you need to trim it and assign the new value to the HardwarePost instance's message property.
  7. Invokes the parser's methods with message to retrieve values for the costs and keywords properties.
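
The header and body split in steps 4 and 6 can be sketched on its own. This assumes a newer SDK (macOS 10.15 or later), where Scanner exposes scanUpToString(_:) and a currentIndex property of type String.Index. Using currentIndex avoids converting scanLocation, which counts UTF-16 units, into a character offset. The sample text is made up.

```swift
import Foundation

// Hypothetical raw post: header lines, a blank line, then the body.
let rawPost = "From: someone@example.com\nLines: 2\n\nFirst line\nSecond line\n"

let postScanner = Scanner(string: rawPost)

// Everything before the first blank line is the metadata segment.
let headerSegment = postScanner.scanUpToString("\n\n") ?? ""

// The scanner stops right before "\n\n", so the rest still needs trimming.
let bodySegment = String(rawPost[postScanner.currentIndex...])
  .trimmingCharacters(in: .whitespacesAndNewlines)
```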

At this point, you can create HardwarePost instances directly from the files' data. You are only a few more steps from displaying the final product!

Displaying Parsed Data

Open PostCell.swift and add the following method inside the PostCell class implementation:

func configure(_ post: HardwarePost) {
  
  senderLabel.stringValue = post.sender
  emailLabel.stringValue = post.email
  dateLabel.stringValue = post.date
  subjectLabel.stringValue = post.subject
  organizationLabel.stringValue = post.organization
  numberOfLinesLabel.stringValue = "\(post.numberOfLines)"
  
  // 1.
  costLabel.stringValue = post.costs.isEmpty
    ? "NO"
    : post.costs.map { "\($0)" }.joined(separator: "; ")
  
  // 2.
  keywordsLabel.stringValue = post.keywords.isEmpty
    ? "No keywords found"
    : post.keywords.joined(separator: "; ")
}

This code assigns the post values to the cell labels. costLabel and keywordsLabel require special treatment because they can be empty. Here's what happens:

  1. If the costs array is empty, sets the costLabel string value to NO; otherwise, concatenates the cost values with "; " as a separator.
  2. Similarly, sets the keywordsLabel string value to No keywords found for an empty set of post.keywords; otherwise, joins the keywords with "; ".

You're almost there! Open DataSource.swift. Delete the DataSource initializer init() and add the following code into the class:

let hardwarePosts: [HardwarePost] // 1.

override init() {
  self.hardwarePosts = Bundle.main                                                // 2.
    .urls(forResourcesWithExtension: nil, subdirectory: "comp.sys.mac.hardware")? // 3.
    .flatMap( { try? Data(contentsOf: $0) }).lazy                                 // 4.                                                                    
    .map(HardwarePost.init) ?? []                                                 // 5.
  
  super.init()
}

This is what the code does:

  1. Stores the HardwarePost instances.
  2. Obtains a reference to the application's main Bundle.
  3. Retrieves the URLs of the sample files inside the comp.sys.mac.hardware directory.
  4. Acquires an array of Data instances by reading the file contents with the throwing Data initializer, try? and flatMap(_:). The idea of using flatMap(_:) is to get back an array containing only the elements that are not nil.
  5. Finally, transforms the Data results into HardwarePost objects and assigns them to the DataSource hardwarePosts property.
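
The nil-dropping behavior described in step 4 can be seen in isolation. This sketch assumes Swift 4.1 or later, where the same operation on optionals is spelled compactMap(_:). The input values are made up.

```swift
// Hypothetical inputs; only some of them can be converted to Int.
let rawLineCounts = ["12", "abc", "7"]

// Int(_:) returns nil for "abc"; compactMap(_:) keeps only the non-nil results.
let parsedLineCounts = rawLineCounts.compactMap { Int($0) }
```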

Now you need to set up the table view's data source and delegate so that your app can show your hard work.

Open DataSource.swift. Find numberOfRows(in:) and replace it with the following:

func numberOfRows(in tableView: NSTableView) -> Int {
    return hardwarePosts.count
}

numberOfRows(in:) is part of the table view’s data source protocol; it sets the number of rows of the table view.

Next, find tableView(_:viewFor:row:) and replace the comment that says //TODO: Set up cell view with the code below:

cell.configure(hardwarePosts[row])

The table view invokes its delegate's tableView(_:viewFor:row:) method to set up every individual cell. It gets a reference to the post for that row and invokes PostCell's configure(_:) method to display the data.

Now you need to show the post in the text view when you select a post on the table view. Replace the initial implementation of tableViewSelectionDidChange(_:) with the following:

func tableViewSelectionDidChange(_ notification: Notification) {
  // selectedRow is -1 when no row is selected, so bail out in that case
  guard let tableView = notification.object as? NSTableView,
    tableView.selectedRow >= 0 else {
    return
  }
  textView.string = hardwarePosts[tableView.selectedRow].message
}

tableViewSelectionDidChange(_:) is called when the table view’s selection has changed. When that happens, this code gets the hardware post for the selected row and displays the message in the text view.

Build and run your project.

starter-final

All of the parsed fields are now neatly displayed in the table. Select a cell on the left, and you'll see the corresponding message on the right. Good job!

Where to Go From Here?

Here’s the source code for the completed project.

There is so much more you can do with the data you have parsed. You could write a formatter that converts a HardwarePost object into JSON, XML, CSV or any other format. With your new-found flexibility to represent data in different forms, you can share your data across different platforms.
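
As one possible starting point for such a formatter, here is a minimal sketch that encodes a post to JSON with Codable and JSONEncoder, assuming Swift 4.1 or later. PostSummary is a hypothetical, trimmed-down stand-in for HardwarePost, and the sample values are made up.

```swift
import Foundation

// Hypothetical stand-in for HardwarePost; only a few fields are
// included to keep the sketch short.
struct PostSummary: Codable, Equatable {
  let sender: String
  let email: String
  let subject: String
  let costs: [Double]
  let keywords: [String]
}

// Returns the JSON representation, or nil if encoding fails.
func jsonData(for summary: PostSummary) -> Data? {
  let encoder = JSONEncoder()
  encoder.outputFormatting = .prettyPrinted
  return try? encoder.encode(summary)
}

let samplePost = PostSummary(sender: "Some One",
                             email: "someone@example.com",
                             subject: "Keyboard for sale",
                             costs: [12.5],
                             keywords: ["keyboard"])

if let data = jsonData(for: samplePost),
   let text = String(data: data, encoding: .utf8) {
  print(text)
}
```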

If you're interested in the study of computer languages and how they are implemented, take a class in comparative languages. Your course will likely cover formal languages and BNF grammars—all important concepts in the design and implementation of parsers.

For more information on Scanner and other parsing theory, check out the following resources:

If you have any questions or comments, please join the discussion below!

Hai Nguyen

Hai is an independent iOS / OS X developer with an architectural background. He enjoys engineering user interface and experimenting with parametric design.
