ShazamKit Tutorial for iOS: Getting Started

Learn how to use ShazamKit to find information about specific audio recordings by matching a segment of that audio against a reference catalog of audio signatures. By Saleh Albuga.


You’ve probably heard a song you liked in a restaurant and wanted to know its name and artist. In this situation, the first thing that comes to mind is Shazam.

You simply open Shazam, tap recognize and voilà! The song info is right on your phone.

Apple acquired Shazam in 2018. With the release of Xcode 13 and iOS 15, Apple introduced ShazamKit, a framework you can use to add audio recognition experiences to your apps. Whether you want to show users what song is playing or match a track or a video you created, ShazamKit has got you more than covered.

In this tutorial, you’ll:

  • Understand Shazam’s recognition mechanism.
  • Create DevCompanion, a simple Shazam clone that matches popular, published music and songs.
  • Match custom audio from a video.
  • Change the app content depending on the video playing position.

For this tutorial, you should be familiar with the Shazam app or matching music with Siri. Don’t worry if you’re not. Just play a song on your laptop and ask Siri, “What’s this song?” or download the Shazam app.

Getting Started

Download the starter project by clicking Download Materials at the top or bottom of the tutorial. Open the project, then build and run.

[Screenshot: Simulator showing the starter project running]

DevCompanion has two views:

  • What’s playing?: Where users can match popular music, just like Shazam.
  • Video Content: Where users can see annotations and additional content while watching a SwiftUI video course here on raywenderlich.com.

Open MatchingHelper.swift and take a look at the code. It’s an empty helper class where you’ll write ShazamKit recognition code.

Don’t worry about the rest of the files for now. You’ll see them later in the tutorial when you create a custom audio experience. For now, you’ll learn more about how Shazam recognizes and matches audio.

You’ll also need an Apple Developer account in order to configure an App ID with the ShazamKit App Service.

Note: For this tutorial, you’ll need the latest version of Xcode 13 and a device running iOS 15. As of the time of writing, the Simulator doesn’t support ShazamKit.

Understanding Shazam’s Matching Mechanism

Before writing code and using the ShazamKit API, it’s essential to understand how Shazam works behind the scenes. This technology is exciting!

When you use Shazam, you tap the big recognition button, Tap to Shazam, while a song is playing. The app listens for a couple of seconds and then displays the song information if it finds a match. You can match any part of a song.

This is what goes under the hood:

  1. The app starts using the microphone to record a stream with a predefined buffer size.
  2. The Shazam library, now called ShazamKit, generates a signature from the audio buffer the app just recorded.
  3. Then, ShazamKit sends a query request with this audio signature to the Shazam API. The Shazam service matches the signature against reference signatures of popular music in the Shazam Catalog.
  4. If there’s a match, the API returns the metadata of the track to ShazamKit.
  5. ShazamKit calls the right delegate passing the metadata.
  6. Beyond this point, it’s up to the app logic to display the result with the track information.
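
To make steps 5 and 6 concrete, here's a minimal sketch of the delegate side of that flow. The SHSessionDelegate methods are part of ShazamKit, but the class name and the print statements are hypothetical stand-ins for your app's logic. You'll implement the real delegate on MatchingHelper later in this tutorial.

import ShazamKit

// A hypothetical delegate illustrating steps 5 and 6 above.
class SongResultPresenter: NSObject, SHSessionDelegate {
  // Step 5: ShazamKit found a match and passes back the metadata.
  func session(_ session: SHSession, didFind match: SHMatch) {
    guard let item = match.mediaItems.first else { return }
    // Step 6: the app displays the track information.
    print("Matched: \(item.title ?? "Unknown") by \(item.artist ?? "Unknown")")
  }

  // Called when there's no match or an error occurred.
  func session(
    _ session: SHSession,
    didNotFindMatchFor signature: SHSignature,
    error: Error?
  ) {
    print("No match: \(error?.localizedDescription ?? "no error")")
  }
}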

Next, you’ll learn more about Shazam signatures and catalogs.

Shazam Signatures

Signatures are a fundamental part of the identification process. A signature is a lossy, simplified version of the song that's easier to process and analyze. Shazam creates a signature by generating a spectrogram of the recorded segment, then extracting and identifying the peaks, the loudest parts of the audio.

A signature can't be reversed back into the original audio, which protects the privacy of the recording.

During the identification process, Shazam matches query signatures sent by apps against reference signatures. A reference signature is a signature generated from the whole song or track.

Using signatures for identification, rather than comparing the recorded audio as is, has several benefits. For example, Shazam signatures prevent most background noise from affecting the matching process, so matching works even in noisy conditions.
Signatures are also easier to share, store and index because they have a much smaller footprint than the original audio.

You can learn more about Shazam’s algorithm in this research paper by the founder of Shazam, Avery Wang.
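
To see what generating a signature looks like in code, here's a minimal sketch that builds a reference signature from an audio file with SHSignatureGenerator. The function name, file URL and chunk size are assumptions for illustration; the app you build in this tutorial generates signatures from the microphone stream instead.

import AVFAudio
import ShazamKit

// A sketch: generate a reference signature from a local audio file.
func makeReferenceSignature(from url: URL) throws -> SHSignature {
  let file = try AVAudioFile(forReading: url)
  let generator = SHSignatureGenerator()

  // Read the file in chunks and feed each buffer to the generator.
  let frameCapacity: AVAudioFrameCount = 44_100
  while file.framePosition < file.length {
    guard let buffer = AVAudioPCMBuffer(
      pcmFormat: file.processingFormat,
      frameCapacity: frameCapacity
    ) else { break }
    try file.read(into: buffer)
    try generator.append(buffer, at: nil)
  }

  // Produce the signature for the whole track.
  return generator.signature()
}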

Next, you’ll explore Shazam catalogs.

Shazam Catalogs

As mentioned earlier, Shazam matches signatures against reference signatures. It stores reference signatures and their metadata in catalogs. A signature’s metadata has information about the song, like its name, artist and artwork.

[Illustration: the Shazam Catalog]

The Shazam Catalog has almost all popular songs’ reference signatures and metadata. You can also create a custom catalog locally in an app and store reference signatures and metadata for your audio tracks. You’ll create custom catalogs later in this tutorial.
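
As a quick preview of the API you'll use later, here's a minimal sketch that stores one reference signature and its metadata in an SHCustomCatalog. The title and artist values are placeholders, and the signature is assumed to come from SHSignatureGenerator or a signature file.

import ShazamKit

// A sketch: store a reference signature and its metadata in a custom catalog.
func makeCatalog(with referenceSignature: SHSignature) throws -> SHCustomCatalog {
  let catalog = SHCustomCatalog()

  // Metadata that ShazamKit returns when this signature matches.
  let mediaItem = SHMediaItem(properties: [
    .title: "My Track",   // placeholder title
    .artist: "My Artist"  // placeholder artist
  ])

  try catalog.addReferenceSignature(referenceSignature, representing: [mediaItem])
  return catalog
}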

Enough theory for now. Next, you’ll learn how to make the app identify popular music.

Matching Music Against Shazam’s Catalog

Time to implement the app’s first feature, a simplified Shazam clone. Open MatchingHelper.swift and look at the code:

import AVFAudio
import Foundation
import ShazamKit

class MatchingHelper: NSObject {
  private var session: SHSession?
  private let audioEngine = AVAudioEngine()

  private var matchHandler: ((SHMatchedMediaItem?, Error?) -> Void)?

  init(matchHandler handler: ((SHMatchedMediaItem?, Error?) -> Void)?) {
    matchHandler = handler
  }
}

It’s a helper class that controls the microphone and uses ShazamKit to identify audio. At the top, you can see the code imports ShazamKit along with AVFAudio. You’ll need AVFAudio to use the microphone and capture audio.

MatchingHelper also subclasses NSObject since that’s required by any class that conforms to SHSessionDelegate.

Take a look at MatchingHelper's properties:

  • session: The ShazamKit session you’ll use to communicate with the Shazam service.
  • audioEngine: An AVAudioEngine instance you’ll use to capture audio from the microphone.
  • matchHandler: A handler block the app views will implement. It’s called when the identification process finishes.

The initializer makes sure matchHandler is set when you create an instance of the class.
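
For context, here's roughly how a view could create a MatchingHelper once the class is complete. The closure body is only an illustration of where you'd update your UI state.

// A sketch of creating the helper and reacting to results.
let matcher = MatchingHelper { item, error in
  if let item = item {
    print("Matched: \(item.title ?? "Unknown")")  // update your UI state here
  } else if let error = error {
    print("Match failed: \(error.localizedDescription)")
  }
}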

Add the following method below the initializer:

func match(catalog: SHCustomCatalog? = nil) throws {
  // 1. Instantiate SHSession
  if let catalog = catalog {
    session = SHSession(catalog: catalog)
  } else {
    session = SHSession()
  }

  // 2. Set SHSession delegate
  session?.delegate = self

  // 3. Prepare to capture audio
  let audioFormat = AVAudioFormat(
    standardFormatWithSampleRate: 
      audioEngine.inputNode.outputFormat(forBus: 0).sampleRate,
    channels: 1)
  audioEngine.inputNode.installTap(
    onBus: 0,
    bufferSize: 2048,
    format: audioFormat
  ) { [weak session] buffer, audioTime in 
    // callback with the captured audio buffer
    session?.matchStreamingBuffer(buffer, at: audioTime)
  }

  // 4. Start capturing audio using AVAudioEngine
  try AVAudioSession.sharedInstance().setCategory(.record)
  AVAudioSession.sharedInstance()
    .requestRecordPermission { [weak self] success in
      guard
        success,
        let self = self
      else { return }
      try? self.audioEngine.start()
    }
}

match(catalog:) is the method the rest of the app’s code will use to identify audio with ShazamKit. It takes one optional parameter of type SHCustomCatalog if you want to match against a custom catalog.

Take a look at each step:

  1. First, you create an SHSession and pass a catalog to it if you're using a custom catalog. If you don't provide one, SHSession defaults to the Shazam Catalog, which is exactly what you want for the first part of the app.
  2. You set the SHSession delegate, which you'll implement in a moment.
  3. You call AVAudioEngine's AVAudioNode.installTap(onBus:bufferSize:format:block:), a method that prepares the audio input node. In the callback, which receives the captured audio buffer, you call SHSession.matchStreamingBuffer(_:at:). This converts the audio in the buffer to a Shazam signature and matches it against the reference signatures in the selected catalog.
  4. You set the AVAudioSession category, or mode, to .record. Then, you request microphone permission by calling AVAudioSession's requestRecordPermission(_:), which prompts the user for access the first time the app runs. Finally, you start recording by calling AVAudioEngine.start().

Note: NSMicrophoneUsageDescription is already set in the project's Info.plist.

matchStreamingBuffer(_:at:) takes care of passing the captured audio to ShazamKit. Alternatively, you can use SHSignatureGenerator to generate a signature object and pass it to SHSession's match(_:). However, matchStreamingBuffer(_:at:) is suited to contiguous audio streams, so it fits this use case better.
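
If you're curious about that alternative, here's a minimal sketch, assuming you already have an AVAudioPCMBuffer of recorded audio. The function name and buffer are hypothetical; the streaming approach above is what this tutorial uses.

import AVFAudio
import ShazamKit

// A sketch of the alternative: build a signature yourself, then match it.
// `recordedBuffer` is a hypothetical buffer you captured earlier.
func matchOnce(_ recordedBuffer: AVAudioPCMBuffer, using session: SHSession) throws {
  let generator = SHSignatureGenerator()
  try generator.append(recordedBuffer, at: nil)

  // Ask the session to match the finished signature against its catalog.
  // Results still arrive through the SHSessionDelegate callbacks.
  session.match(generator.signature())
}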

Next, you’ll implement the Shazam Session delegate.