

Vision Tutorial for iOS: Detect Body and Hand Pose

Learn how to detect the number of fingers shown to the camera with help from the Vision framework.



  • Swift 5, iOS 14, Xcode 12

Machine learning is everywhere, so it came as no surprise when Apple announced Core ML in 2017, along with Vision, an image analysis framework. Vision analyzes still images to detect faces, read barcodes, track objects and more. Over the years, Apple has added many cool features to the framework, including the Hand and Body Detection APIs introduced in 2020. In this tutorial, you’ll use these Hand and Body Detection APIs from the Vision framework to bring a touch of magic to a game called StarCount. You’ll count the number of stars falling from the sky using your hands and fingers.

Note: This Vision tutorial assumes a working knowledge of SwiftUI, UIKit and Combine. For more information about SwiftUI, see SwiftUI: Getting Started.

StarCount needs a device with a front-facing camera to function, so you can’t follow along with a simulator.

Finally, it would help if you could prop up your device somewhere; you’ll need both hands to match those high numbers!

Getting Started

Download the starter project using the Download Materials button at the top or bottom of this page. Then, open the starter project in Xcode.

Build and run. Tap Rain in the top left corner and enjoy the scene. Don’t forget to wish on those stars!

Vision Tutorial Starting page with rain button

The magic of raining stars is in StarAnimatorView.swift. It uses UIKit Dynamics APIs. Feel free to take a look if you’re interested.

The app looks nice, but imagine how much better it would look if it showed live video of you in the background! After all, Vision can’t count your fingers if the phone can’t see them.

Getting Ready for Detection

Vision uses still images for detection. Believe it or not, what you see in the camera viewfinder is essentially a stream of still images. Before you can detect anything, you need to integrate a camera session into the game.

Creating the Camera Session

To show a camera preview in an app, you use AVCaptureVideoPreviewLayer, a subclass of CALayer. You use this preview layer in conjunction with a capture session.

Since CALayer needs a UIView to host it, and UIView belongs to UIKit, you need to create a wrapper to use it in SwiftUI. Fortunately, Apple provides an easy way to do this using UIViewRepresentable and UIViewControllerRepresentable.

As a matter of fact, StarAnimator is a UIViewRepresentable so you can use StarAnimatorView, a subclass of UIView, in SwiftUI.

Note: You can learn more about integrating UIKit with SwiftUI in this great video course: Integrating UIKit & SwiftUI.

You’ll create three files in the following section: CameraPreview.swift, CameraViewController.swift and CameraView.swift. Start with CameraPreview.swift.


Create a new file named CameraPreview.swift in the StarCount group and add:

// 1
import UIKit
import AVFoundation

final class CameraPreview: UIView {
  // 2
  override class var layerClass: AnyClass {
    AVCaptureVideoPreviewLayer.self
  }

  // 3
  var previewLayer: AVCaptureVideoPreviewLayer {
    layer as! AVCaptureVideoPreviewLayer
  }
}

Here, you:

  1. Import UIKit since CameraPreview is a subclass of UIView. You also import AVFoundation since AVCaptureVideoPreviewLayer is part of this module.
  2. Next, you override the static layerClass. This makes the root layer of this view of type AVCaptureVideoPreviewLayer.
  3. Then you create a computed property called previewLayer and force cast the root layer of this view to the type you defined in step two. Now you can use this property to access the layer directly when you need to work with it later.

Next, you’ll create a view controller to manage your CameraPreview.


The camera capture code from AVFoundation is designed to work with UIKit, so to get it working nicely in your SwiftUI app you need to make a view controller and wrap it in UIViewControllerRepresentable.

Create CameraViewController.swift in the StarCount group and add:

import UIKit

final class CameraViewController: UIViewController {
  // 1
  override func loadView() {
    view = CameraPreview()
  }

  // 2
  private var cameraView: CameraPreview { view as! CameraPreview }
}

Here you:

  1. Override loadView to make the view controller use CameraPreview as its root view.
  2. Create a computed property called cameraView to access the root view as CameraPreview. You can safely force cast here because you assigned an instance of CameraPreview to view in step one.

Now, you’ll make a SwiftUI view to wrap your new view controller, so you can use it in StarCount.


Create CameraView.swift in the StarCount group and add:

import SwiftUI

// 1
struct CameraView: UIViewControllerRepresentable {
  // 2
  func makeUIViewController(context: Context) -> CameraViewController {
    let cvc = CameraViewController()
    return cvc
  }

  // 3
  func updateUIViewController(
    _ uiViewController: CameraViewController,
    context: Context
  ) {
  }
}

This is what’s happening in the code above:

  1. You create a struct called CameraView which conforms to UIViewControllerRepresentable. This is a protocol for making SwiftUI View types that wrap UIKit view controllers.
  2. You implement the first protocol method, makeUIViewController. Here you initialize an instance of CameraViewController and perform any one time only setups.
  3. updateUIViewController(_: context:) is the other required method of this protocol, where you would make any updates to the view controller based on changes to the SwiftUI data or hierarchy. For this app, you don’t need to do anything here.

After all this work, it’s time to use CameraView in ContentView.

Open ContentView.swift. Insert CameraView at the beginning of the ZStack in body:

CameraView()

Phew! That was a long section. Build and run to see your camera preview.

Starting page after integrating first version of CameraView

Huh! All that work and nothing changed! Why? There’s another piece of the puzzle to add before camera previewing works: an AVCaptureSession. You’ll add that next.

Connecting to the Camera Session

The changes you’ll make here seem long but don’t be afraid. They’re mostly boilerplate code.

Open CameraViewController.swift. Add the following after import UIKit:

import AVFoundation 

Then, add an instance property of type AVCaptureSession inside the class:

private var cameraFeedSession: AVCaptureSession?

It’s good practice to run the capture session when this view controller appears on screen and stop the session when the view is no longer visible, so add the following:

override func viewDidAppear(_ animated: Bool) {
  super.viewDidAppear(animated)
  do {
    // 1
    if cameraFeedSession == nil {
      // 2
      try setupAVSession()
      // 3
      cameraView.previewLayer.session = cameraFeedSession
      cameraView.previewLayer.videoGravity = .resizeAspectFill
    }
    // 4
    cameraFeedSession?.startRunning()
  } catch {
    print(error.localizedDescription)
  }
}

// 5
override func viewWillDisappear(_ animated: Bool) {
  cameraFeedSession?.stopRunning()
  super.viewWillDisappear(animated)
}

func setupAVSession() throws {
}

Here’s a code breakdown:

  1. In viewDidAppear(_:), you check to see if you’ve already initialized cameraFeedSession.
  2. You call setupAVSession(), which is empty for now, but you’ll implement it shortly.
  3. Then, you assign the session to the previewLayer of cameraView and set the video gravity so the preview fills the view.
  4. Next, you start running the session. This makes the camera feed visible.
  5. In viewWillDisappear(_:), turn off the camera feed to preserve battery life and be a good citizen.

Now, you’ll add the missing code to prepare the camera.

Preparing the Camera

Add a new property for the dispatch queue on which Vision will process the camera samples:

private let videoDataOutputQueue = DispatchQueue(
  label: "CameraFeedOutput",
  qos: .userInteractive
)
Add an extension to make the view controller conform to AVCaptureVideoDataOutputSampleBufferDelegate:

extension CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
}

With those two things in place, you can now replace the empty setupAVSession():

func setupAVSession() throws {
  // 1
  guard let videoDevice = AVCaptureDevice.default(
    .builtInWideAngleCamera,
    for: .video,
    position: .front)
  else {
    throw AppError.captureSessionSetup(
      reason: "Could not find a front facing camera."
    )
  }

  // 2
  guard
    let deviceInput = try? AVCaptureDeviceInput(device: videoDevice)
  else {
    throw AppError.captureSessionSetup(
      reason: "Could not create video device input."
    )
  }

  // 3
  let session = AVCaptureSession()
  session.beginConfiguration()
  session.sessionPreset = AVCaptureSession.Preset.high

  // 4
  guard session.canAddInput(deviceInput) else {
    throw AppError.captureSessionSetup(
      reason: "Could not add video device input to the session"
    )
  }
  session.addInput(deviceInput)

  // 5
  let dataOutput = AVCaptureVideoDataOutput()
  if session.canAddOutput(dataOutput) {
    session.addOutput(dataOutput)
    dataOutput.alwaysDiscardsLateVideoFrames = true
    dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
  } else {
    throw AppError.captureSessionSetup(
      reason: "Could not add video data output to the session"
    )
  }

  // 6
  session.commitConfiguration()
  cameraFeedSession = session
}

In the code above you:

  1. Check if the device has a front-facing camera. If it doesn’t, you throw an error.
  2. Next, check if you can use the camera to create a capture device input.
  3. Create a capture session and start configuring it using the high quality preset.
  4. Then check if the session can integrate the capture device input. If yes, add the input you created in step two to the session. You need an input and an output for your session to work.
  5. Next, create a data output and add it to the session. The data output will take samples of images from the camera feed and provide them in a delegate on a defined dispatch queue, which you set up earlier.
  6. Finally, finish configuring the session and assign it to the property you created before.

Build and run. Now you can see yourself behind the raining stars.

Starting page after setting up the camera viewfinder

Note: You need user permission to access the camera on a device. When you start a camera session for the first time, iOS prompts the user to grant access to the camera. You have to give the user a reason why you want the camera permission.

A key-value pair in Info.plist stores the reason. It’s already there in the starter project.
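The key in question is NSCameraUsageDescription. In Info.plist source form it looks something like this (the description string here is illustrative, not necessarily the starter project's exact wording):

```xml
<key>NSCameraUsageDescription</key>
<string>StarCount needs the camera to count the fingers you show it.</string>
```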

With that in place, it’s time to move on to Vision.

Detecting Hands

To use any algorithm in Vision, you generally follow these three steps:

  1. Request: You request the framework detect something for you by defining request characteristics. You use an appropriate subclass of VNRequest.
  2. Handler: Next, you ask the framework to perform a method after the request finishes executing or handling the request.
  3. Observation: Finally, you get potential results or observations back. These observations are instances of VNObservation based on the request you made.

You’ll deal with the request first.
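Before building the real thing, here's the whole request-handler-observation flow in miniature for a single image. This is just a sketch, not part of StarCount; `cgImage` stands in for any image you already have:

```swift
import Vision

func detectHands(in cgImage: CGImage) throws -> [VNHumanHandPoseObservation] {
  // 1. Request: describe what Vision should detect.
  let request = VNDetectHumanHandPoseRequest()
  // 2. Handler: perform the request on one specific image.
  let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
  try handler.perform([request])
  // 3. Observation: read back the results, if any.
  return request.results ?? []
}
```

In StarCount you'll feed Vision camera sample buffers rather than a CGImage, but the three steps stay the same.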


The request for detecting hands is of type VNDetectHumanHandPoseRequest.

Still in CameraViewController.swift, add the following after import AVFoundation to access the Vision framework:

import Vision

Then, inside the class definition, create this instance property:

private let handPoseRequest: VNDetectHumanHandPoseRequest = {
  // 1
  let request = VNDetectHumanHandPoseRequest()
  // 2
  request.maximumHandCount = 2
  return request
}()

Here you:

  1. Create a request for detecting human hands.
  2. Set the maximum number of hands to detect to two. The Vision framework is powerful; it can detect many hands in an image. Since a maximum of ten stars falls in any single drop, two hands with ten fingers will suffice.

Now, it’s time to set up the handler and observation.

Handler and Observation

You can use AVCaptureVideoDataOutputSampleBufferDelegate to get a sample out of the capture stream and start the detection process.

Implement this method in CameraViewController‘s extension, which you created earlier:

func captureOutput(
  _ output: AVCaptureOutput,
  didOutput sampleBuffer: CMSampleBuffer,
  from connection: AVCaptureConnection
) {
  // 1
  let handler = VNImageRequestHandler(
    cmSampleBuffer: sampleBuffer,
    orientation: .up,
    options: [:]
  )

  do {
    // 2
    try handler.perform([handPoseRequest])

    // 3
    guard
      let results = handPoseRequest.results?.prefix(2),
      !results.isEmpty
    else {
      return
    }

    print(results)
  } catch {
    // 4
    cameraFeedSession?.stopRunning()
  }
}
Here’s a code breakdown:

  1. captureOutput(_:didOutput:from:) is called whenever a sample is available. In this method, you create a handler, which is the second step needed to use Vision. You pass the sample buffer you get as an input parameter to perform the request on a single image.
  2. Then, you perform the request. If there are any errors, this method throws them, so it’s in a do-catch block.

    Performing requests is a synchronous operation. Remember the dispatch queue you provided to the delegate callback? That ensures you don’t block the main queue.

    Vision completes the detection process on that background queue.

  3. You get the detection results, or observations, using the request’s results. Here you get the first two items and make sure the results array isn’t empty. As you only asked for two hands when creating the request, this is an extra precaution to ensure you don’t get more than two result items.
    Next, you print the results to the console.
  4. If the request fails, it means something bad happened. In a production environment, you would handle this error better. For now, you can stop the camera session.

Build and run. Put your hands in front of the camera and check out the Xcode console.

Hand in the camera viewfinder

Xcode console when a hand is on camera

In the console, you’ll see observation objects of type VNHumanHandPoseObservation. Next, you’ll extract finger data from these observations. But first, you need to read up on anatomy!


Anatomy to the Rescue!

Vision framework detects hands in a detailed manner. Check out this illustration:

Hand showing landmarks

Each of the circles on this image is a Landmark. Vision can detect a total of 21 landmarks for each hand: four for each finger, four for the thumb and one for the wrist.

Each of these fingers belongs to a joints group, represented in the API by VNHumanHandPoseObservation.JointsGroupName as:

  • .thumb
  • .indexFinger
  • .middleFinger
  • .ringFinger
  • .littleFinger

In each joints group, every individual joint has a name:

  • TIP: The tip of the finger.
  • DIP: The distal interphalangeal joint, or the first joint below the fingertip.
  • PIP: The proximal interphalangeal joint, or the middle joint.
  • MCP: The metacarpophalangeal joint, at the bottom of the finger where it joins the palm.

Finger joints names
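These groups and joint names map directly onto the observation API. As a quick sketch (not code StarCount uses), this is how you could pull just the index finger's joints out of an observation:

```swift
import Vision

func indexFingerJoints(
  in observation: VNHumanHandPoseObservation
) throws -> [VNHumanHandPoseObservation.JointName: VNRecognizedPoint] {
  // Asking for the .indexFinger joints group returns its
  // TIP, DIP, PIP and MCP points, keyed by joint name.
  try observation.recognizedPoints(.indexFinger)
}
```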

The thumb is a bit different. It has a TIP, but the other joints have different names:

  • TIP: The tip of the thumb.
  • IP: The interphalangeal joint, or the first joint below the tip of the thumb.
  • MP: The metacarpophalangeal joint, at the bottom of the thumb where it joins the palm.
  • CMC: The carpometacarpal joint, near the wrist.

Thumb joints names

Many developers don’t think they need math in their careers. Who would’ve thought anatomy would be a prerequisite, too?

With anatomy covered, it’s time to detect fingertips.

Detecting Fingertips

To make things simple, you’ll detect fingertips and draw an overlay on top.

In CameraViewController.swift, add the following to the top of captureOutput(_:didOutput:from:):

var fingerTips: [CGPoint] = []

This will store the detected fingertips. Now replace print(results), which you added in an earlier step, with:

var recognizedPoints: [VNRecognizedPoint] = []

try results.forEach { observation in
  // 1
  let fingers = try observation.recognizedPoints(.all)

  // 2
  if let thumbTipPoint = fingers[.thumbTip] {
    recognizedPoints.append(thumbTipPoint)
  }
  if let indexTipPoint = fingers[.indexTip] {
    recognizedPoints.append(indexTipPoint)
  }
  if let middleTipPoint = fingers[.middleTip] {
    recognizedPoints.append(middleTipPoint)
  }
  if let ringTipPoint = fingers[.ringTip] {
    recognizedPoints.append(ringTipPoint)
  }
  if let littleTipPoint = fingers[.littleTip] {
    recognizedPoints.append(littleTipPoint)
  }
}

// 3
fingerTips = recognizedPoints.filter {
  // Ignore low confidence points.
  $0.confidence > 0.9
}
.map {
  // 4
  CGPoint(x: $0.location.x, y: 1 - $0.location.y)
}

Here you:

  1. Get the points for all fingers.
  2. Look for tip points.
  3. Each VNRecognizedPoint has a confidence. You only want observations with high confidence levels.
  4. Vision algorithms use a coordinate system with lower left origin and return normalized values relative to the pixel dimension of the input image. AVFoundation coordinates have an upper-left origin, so you convert the y-coordinate.
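The flip in step four is worth seeing in isolation. Here it is as a tiny standalone helper (StarCount inlines the same math in the map above):

```swift
import CoreGraphics

// Vision returns normalized coordinates with a lower-left origin;
// AVFoundation uses an upper-left origin, so only y needs flipping.
func convertVisionPoint(_ point: CGPoint) -> CGPoint {
  CGPoint(x: point.x, y: 1 - point.y)
}
```

For example, a Vision point at (0.25, 0.2), near the bottom of the frame, maps to (0.25, 0.8) in AVFoundation's coordinate space, which is still near the bottom.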

You need to do something with those finger tips, so add the following to CameraViewController:

// 1
var pointsProcessorHandler: (([CGPoint]) -> Void)?

func processPoints(_ fingerTips: [CGPoint]) {
  // 2
  let convertedPoints = fingerTips.map {
    cameraView.previewLayer.layerPointConverted(fromCaptureDevicePoint: $0)
  }

  // 3
  pointsProcessorHandler?(convertedPoints)
}

Here you:

  1. Add a property for the closure to run when the framework detects points.
  2. Convert from AVFoundation relative coordinates to UIKit coordinates so you can draw them on screen. You use layerPointConverted, which is a method in AVCaptureVideoPreviewLayer.
  3. You call the closure with the converted points.

In captureOutput(_:didOutput:from:), just after you declare the fingerTips property, add:

defer {
  DispatchQueue.main.sync {
    self.processPoints(fingerTips)
  }
}
This sends your fingertips for processing on the main queue once the method finishes.

Time to show those fingertips to the user!

Displaying Fingertips

pointsProcessorHandler is going to get your detected fingertips on the screen. You have to pass the closure from SwiftUI to this view controller.

Go back to CameraView.swift and add a new property:

var pointsProcessorHandler: (([CGPoint]) -> Void)?

This gives you a place to store the closure in the view.

Then update makeUIViewController(context:) by adding this line before the return statement:

cvc.pointsProcessorHandler = pointsProcessorHandler

This passes the closure to the view controller.

Open ContentView.swift and add the following property to the view definition:

@State private var overlayPoints: [CGPoint] = []

This state variable will hold the points grabbed in CameraView. Replace the CameraView() line with the following:

CameraView {
  overlayPoints = $0
}

This closure is the pointsProcessorHandler you added earlier and is called when you have the detected points. In the closure, you assign the points to overlayPoints.

Finally, add this modifier before the edgesIgnoringSafeArea(.all) modifier:

.overlay(
  FingersOverlay(with: overlayPoints)
    .foregroundColor(.orange)
)

You’re adding an overlay modifier to CameraView. Inside that modifier, you initialize FingersOverlay with the detected points and set the color to orange.
FingersOverlay.swift is inside the starter project. Its only job is to draw points on screen.

Build and run. Check out the orange dots on your fingers. Move your hands around and notice the dots follow your fingers.

Fingers overlay gif

Note: Feel free to change the color in the .overlay modifier if you need.

It’s finally time to add game logic.

Adding Game Logic

While the logic of the game is long, it’s pretty straightforward.

Open GameLogicController.swift and replace the class implementation with:

// 1
private var goalCount = 0

// 2
@Published var makeItRain = false

// 3
@Published private(set) var successBadge: Int?

// 4
private var shouldEvaluateResult = true

// 5
func start() {
  makeItRain = true
}

// 6
func didRainStars(count: Int) {
  goalCount = count
}

// 7
func checkStarsCount(_ count: Int) {
  if !shouldEvaluateResult {
    return
  }

  if count == goalCount {
    shouldEvaluateResult = false
    successBadge = count

    DispatchQueue.main.asyncAfter(deadline: .now() + 3) {
      self.successBadge = nil
      self.makeItRain = true
      self.shouldEvaluateResult = true
    }
  }
}

Here’s a breakdown:

  1. This property stores the number of dropped stars. The player has to guess this value by showing the appropriate number of fingers.
  2. Whenever something sets this published property to true, StarAnimator starts raining.
  3. If the player correctly guesses the number of dropped stars, you assign the goal count to this. The value appears on screen indicating success.
  4. This property prevents an excessive amount of evaluation. If the player guesses the value correctly, this property makes the evaluation stop.
  5. This is how the game starts. You call this when the starting screen appears.
  6. When StarAnimator rains a specific amount of stars, it calls this method to save the goal count in the game’s engine.
  7. This is where the magic happens. You call this method whenever new points are available. It first checks if evaluating the result is possible. If the guessed value is correct, it stops the evaluation, sets the success badge value and resets the engine’s state to initial values after three seconds.

Open ContentView.swift to connect the GameLogicController.

Replace the call to StarAnimator, including its trailing closure, with:

StarAnimator(makeItRain: $gameLogicController.makeItRain) {
  gameLogicController.didRainStars(count: $0)
}

This code reports the number of rained stars to the game engine.

Next, you’ll let the player know they’ve got the answer right.

Adding a Success Badge

Add a computed property for successBadge as follows:

@ViewBuilder
private var successBadge: some View {
  if let number = gameLogicController.successBadge {
    Image(systemName: "\(number).circle.fill")
      .frame(width: 200, height: 200)
      .shadow(radius: 5)
  } else {
    EmptyView()
  }
}
If the successBadge of the game logic controller has a value, you create an image using a system image from SF Symbols. Otherwise, you return an EmptyView, which draws nothing.

Add these two modifiers to the root ZStack:

.onAppear {
  // 1
  gameLogicController.start()
}
// 2
.overlay(successBadge)

Here’s what you added:

  1. When the starting page of the game appears, you start the game.
  2. You draw the success badge on top of everything, using the successBadge property you implemented earlier.

Next, remove the overlay containing the Rain button, since the stars now rain automatically.

Final Step

To make the game work, you need to pass the number of detected points to the game engine. Update the closure that you pass when initializing CameraView in ContentView:

CameraView {
  overlayPoints = $0
  gameLogicController.checkStarsCount($0.count)
}

Build and run. Enjoy the game.

Final game play

More Use Cases

You barely scratched the surface of Hand and Body Detection APIs in Vision. The framework can detect several body landmarks, as illustrated below:

Body landmarks illustration
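Body detection follows the same request-handler-observation pattern you used for hands; only the request and observation types change. A minimal sketch, with `cgImage` again standing in for any input image:

```swift
import Vision

func detectBodyPose(in cgImage: CGImage) throws -> [VNHumanBodyPoseObservation] {
  // Same flow as hand detection, with the body-pose request type.
  let request = VNDetectHumanBodyPoseRequest()
  let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
  try handler.perform([request])
  return request.results ?? []
}
```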

Here are some examples of what you can do with these APIs:

  • Add UI controls to your app driven by the Vision framework. For example, some camera apps let you show a hand gesture to take a picture.
  • Build a fun emoji app where the user shows an emoji with their hands.
  • Build a workout analysis app that tells users whether they’re performing a specific movement correctly.
  • Build a music app that teaches the user to play guitar or ukulele.

Hand detection in Ukulele

Body detection in Workout

The possibilities are endless with Vision.

Where to Go From Here?

Download the final project using the Download Materials button at the top or bottom of this tutorial.

You’ve successfully created a game with Hand Detection APIs! Great job!

There are many great resources for Vision and these specific APIs. To explore this topic in more depth, try:

I hope you liked this tutorial. If you have any questions or comments, please join the discussion below!



