Vision Tutorial for iOS: Detect Body and Hand Pose
Learn how to detect the number of fingers shown to the camera with help from the Vision framework.
Version
- Swift 5, iOS 14, Xcode 12

Machine learning is everywhere, so it came as no surprise when Apple announced its Core ML framework in 2017. Alongside Core ML, Apple introduced Vision, an image analysis framework. Vision analyzes still images to detect faces, read barcodes, track objects and more. Over the years, Apple has added many cool features to this framework, including the Hand and Body Detection APIs introduced in 2020.
In this tutorial, you'll use the Hand and Body Detection APIs from the Vision framework to bring a touch of magic to a game called StarCount. You'll count the number of stars falling from the sky using your hands and fingers.
StarCount needs a device with a front-facing camera to function, so you can’t follow along with a simulator.
Finally, it helps to prop up your device somewhere; you'll need both hands to match those high numbers!
Getting Started
Download the starter project using the Download Materials button at the top or bottom of this page. Then, open the starter project in Xcode.
Build and run. Tap Rain in the top left corner and enjoy the scene. Don’t forget to wish on those stars!
The magic of raining stars is in StarAnimatorView.swift. It uses UIKit Dynamics APIs. Feel free to take a look if you’re interested.
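If you're curious how UIKit Dynamics drives an effect like this, here's a minimal, hypothetical sketch of the core idea: an animator applying gravity to views so they fall. It isn't the actual StarAnimatorView implementation, just an illustration of the mechanism, and the FallingStarsView name is made up for this example.
import UIKit

final class FallingStarsView: UIView {
  // The animator owns the physics simulation for its reference view.
  private lazy var animator = UIDynamicAnimator(referenceView: self)
  // Gravity accelerates every item added to it.
  private let gravity = UIGravityBehavior()

  override init(frame: CGRect) {
    super.init(frame: frame)
    animator.addBehavior(gravity)
  }

  required init?(coder: NSCoder) {
    fatalError("init(coder:) has not been implemented")
  }

  func dropStar(at x: CGFloat) {
    // Start the star above the visible area, then let gravity pull it down.
    let star = UILabel(frame: CGRect(x: x, y: -40, width: 40, height: 40))
    star.text = "⭐️"
    addSubview(star)
    gravity.addItem(star)
  }
}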
The app looks nice, but imagine how much better it would look if it showed live video of you in the background! Besides, Vision can't count your fingers if the phone can't see them.
Getting Ready for Detection
Vision uses still images for detection. Believe it or not, what you see in the camera viewfinder is essentially a stream of still images. Before you can detect anything, you need to integrate a camera session into the game.
Creating the Camera Session
To show a camera preview in an app, you use AVCaptureVideoPreviewLayer, a subclass of CALayer. You use this preview layer in conjunction with a capture session.
Because you need a UIView to host that layer, and UIView belongs to UIKit, you need a wrapper to use it in SwiftUI. Fortunately, Apple provides an easy way to do this with UIViewRepresentable and UIViewControllerRepresentable.
As a matter of fact, StarAnimator is a UIViewRepresentable, so you can use StarAnimatorView, a subclass of UIView, in SwiftUI.
You’ll create three files in the following section: CameraPreview.swift, CameraViewController.swift and CameraView.swift. Start with CameraPreview.swift.
CameraPreview
Create a new file named CameraPreview.swift in the StarCount group and add:
// 1
import UIKit
import AVFoundation
final class CameraPreview: UIView {
  // 2
  override class var layerClass: AnyClass {
    AVCaptureVideoPreviewLayer.self
  }
  // 3
  var previewLayer: AVCaptureVideoPreviewLayer {
    layer as! AVCaptureVideoPreviewLayer
  }
}
Here, you:
- Import UIKit, since CameraPreview is a subclass of UIView. You also import AVFoundation, since AVCaptureVideoPreviewLayer is part of that module.
- Next, you override the static layerClass. This makes the root layer of this view of type AVCaptureVideoPreviewLayer.
- Then you create a computed property called previewLayer and force cast the root layer of this view to the type you defined in step two. Now you can use this property to access the layer directly when you need to work with it later.
Next, you'll create a view controller to manage your CameraPreview.
CameraViewController
The camera capture code from AVFoundation is designed to work with UIKit, so to get it working nicely in your SwiftUI app you need to make a view controller and wrap it in UIViewControllerRepresentable.
Create CameraViewController.swift in the StarCount group and add:
import UIKit
final class CameraViewController: UIViewController {
  // 1
  override func loadView() {
    view = CameraPreview()
  }
  // 2
  private var cameraView: CameraPreview { view as! CameraPreview }
}
Here you:
- Override loadView to make the view controller use CameraPreview as its root view.
- Create a computed property called cameraView to access the root view as CameraPreview. You can safely force cast here because you just assigned an instance of CameraPreview to view in step one.
Now, you’ll make a SwiftUI view to wrap your new view controller, so you can use it in StarCount.
CameraView
Create CameraView.swift in the StarCount group and add:
import SwiftUI
// 1
struct CameraView: UIViewControllerRepresentable {
  // 2
  func makeUIViewController(context: Context) -> CameraViewController {
    let cvc = CameraViewController()
    return cvc
  }
  // 3
  func updateUIViewController(
    _ uiViewController: CameraViewController,
    context: Context
  ) {
  }
}
This is what's happening in the code above:
- You create a struct called CameraView, which conforms to UIViewControllerRepresentable. This is a protocol for making SwiftUI View types that wrap UIKit view controllers.
- You implement the first protocol method, makeUIViewController(context:). Here, you initialize an instance of CameraViewController and perform any one-time setup.
- updateUIViewController(_:context:) is the other required method of this protocol. It's where you'd update the view controller based on changes to the SwiftUI data or hierarchy. For this app, you don't need to do anything here.
After all this work, it's time to use CameraView in ContentView.
Open ContentView.swift. Insert CameraView at the beginning of the ZStack in body:
CameraView()
  .edgesIgnoringSafeArea(.all)
Phew! That was a long section. Build and run to see your camera preview.
Huh! All that work and nothing changed! Why? There's another piece of the puzzle to add before camera previewing works: an AVCaptureSession. You'll add that next.
Connecting to the Camera Session
The changes you’ll make here seem long but don’t be afraid. They’re mostly boilerplate code.
Open CameraViewController.swift. Add the following after import UIKit:
import AVFoundation
Then, add an instance property of type AVCaptureSession inside the class:
private var cameraFeedSession: AVCaptureSession?
It’s good practice to run the capture session when this view controller appears on screen and stop the session when the view is no longer visible, so add the following:
override func viewDidAppear(_ animated: Bool) {
  super.viewDidAppear(animated)
  do {
    // 1
    if cameraFeedSession == nil {
      // 2
      try setupAVSession()
      // 3
      cameraView.previewLayer.session = cameraFeedSession
      cameraView.previewLayer.videoGravity = .resizeAspectFill
    }
    // 4
    cameraFeedSession?.startRunning()
  } catch {
    print(error.localizedDescription)
  }
}
// 5
override func viewWillDisappear(_ animated: Bool) {
  cameraFeedSession?.stopRunning()
  super.viewWillDisappear(animated)
}
func setupAVSession() throws {
}
Here's a code breakdown:
- In viewDidAppear(_:), you check whether you've already initialized cameraFeedSession.
- You call setupAVSession(), which is empty for now, but you'll implement it shortly.
- Then, you assign the session to the previewLayer of cameraView and set the video's resize mode.
- Next, you start running the session. This makes the camera feed visible.
- In viewWillDisappear(_:), you turn off the camera feed to preserve battery life and be a good citizen.
Now, you’ll add the missing code to prepare the camera.
Preparing the Camera
Add a new property for the dispatch queue on which Vision will process the camera samples:
private let videoDataOutputQueue = DispatchQueue(
  label: "CameraFeedOutput",
  qos: .userInteractive
)
Add an extension to make the view controller conform to AVCaptureVideoDataOutputSampleBufferDelegate:
extension CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
}
With those two things in place, you can now replace the empty setupAVSession():
func setupAVSession() throws {
  // 1
  guard let videoDevice = AVCaptureDevice.default(
    .builtInWideAngleCamera,
    for: .video,
    position: .front)
  else {
    throw AppError.captureSessionSetup(
      reason: "Could not find a front facing camera."
    )
  }
  // 2
  guard
    let deviceInput = try? AVCaptureDeviceInput(device: videoDevice)
  else {
    throw AppError.captureSessionSetup(
      reason: "Could not create video device input."
    )
  }
  // 3
  let session = AVCaptureSession()
  session.beginConfiguration()
  session.sessionPreset = AVCaptureSession.Preset.high
  // 4
  guard session.canAddInput(deviceInput) else {
    throw AppError.captureSessionSetup(
      reason: "Could not add video device input to the session"
    )
  }
  session.addInput(deviceInput)
  // 5
  let dataOutput = AVCaptureVideoDataOutput()
  if session.canAddOutput(dataOutput) {
    session.addOutput(dataOutput)
    dataOutput.alwaysDiscardsLateVideoFrames = true
    dataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
  } else {
    throw AppError.captureSessionSetup(
      reason: "Could not add video data output to the session"
    )
  }
  // 6
  session.commitConfiguration()
  cameraFeedSession = session
}
In the code above you:
- Check if the device has a front-facing camera. If it doesn’t, you throw an error.
- Next, check if you can use the camera to create a capture device input.
- Create a capture session and start configuring it using the high quality preset.
- Then check if the session can integrate the capture device input. If yes, add the input you created in step two to the session. You need an input and an output for your session to work.
- Next, create a data output and add it to the session. The data output takes image samples from the camera feed and delivers them to its delegate on the dispatch queue you set up earlier.
- Finally, finish configuring the session and assign it to the property you created before.
Build and run. Now you can see yourself behind the raining stars.
The first time the app accesses the camera, iOS asks the user for permission and shows a reason. A key-value pair in Info.plist, NSCameraUsageDescription, stores that reason. It's already there in the starter project.
With that in place, it’s time to move on to Vision.
Detecting Hands
To use any algorithm in Vision, you generally follow these three steps:
- Request: You ask the framework to detect something for you by defining the request characteristics, using the appropriate subclass of VNRequest.
- Handler: Next, you ask a handler to perform the request on an image or image buffer.
- Observation: Finally, you get potential results, or observations, back. These observations are instances of VNObservation, with a concrete type that matches the request you made.
You'll see all three steps together in the short sketch below.
You’ll deal with the request first.
Request
The request for detecting hands is of type VNDetectHumanHandPoseRequest.
Still in CameraViewController.swift, add the following after import AVFoundation to access the Vision framework:
import Vision
Then, inside the class definition, create this instance property:
private let handPoseRequest: VNDetectHumanHandPoseRequest = {
  // 1
  let request = VNDetectHumanHandPoseRequest()
  // 2
  request.maximumHandCount = 2
  return request
}()
Here you:
- Create a request for detecting human hands.
- Set the maximum number of hands to detect to two. Vision framework is powerful. It can detect many hands in an image. Since a maximum of ten stars falls in any single drop, two hands with ten fingers will suffice.
Now, it’s time to set up the handler and observation.
Handler and Observation
You can use AVCaptureVideoDataOutputSampleBufferDelegate to get a sample out of the capture stream and start the detection process.
Implement this method in CameraViewController's extension, which you created earlier:
func captureOutput(
  _ output: AVCaptureOutput,
  didOutput sampleBuffer: CMSampleBuffer,
  from connection: AVCaptureConnection
) {
  // 1
  let handler = VNImageRequestHandler(
    cmSampleBuffer: sampleBuffer,
    orientation: .up,
    options: [:]
  )
  do {
    // 2
    try handler.perform([handPoseRequest])
    // 3
    guard
      let results = handPoseRequest.results?.prefix(2),
      !results.isEmpty
    else {
      return
    }
    print(results)
  } catch {
    // 4
    cameraFeedSession?.stopRunning()
  }
}
Here's a code breakdown:
- captureOutput(_:didOutput:from:) is called whenever a sample buffer is available. In this method, you create a handler, the second step required to use Vision. You pass in the sample buffer to perform the request on a single image.
- Then, you perform the request. If anything goes wrong, this method throws, so it sits in a do-catch block. Performing requests is a synchronous operation. Remember the dispatch queue you provided to the delegate callback? That ensures you don't block the main queue; Vision completes the detection process on that background queue.
- You get the detection results, or observations, from the request's results. Here, you take the first two items and make sure the results array isn't empty. Since you only asked for two hands when creating the request, this is an extra precaution to ensure you don't process more than two result items. Then, you print the results to the console.
- If the request fails, something bad happened. In a production environment, you'd handle this error more gracefully. For now, you stop the camera session.
Build and run. Put your hands in front of the camera and check out the Xcode console.
In the console, you see observation objects of type VNHumanHandPoseObservation. Next, you'll extract finger data from these observations. But first, you need to read up on anatomy!
Anatomy to the Rescue!
Vision framework detects hands in a detailed manner. Check out this illustration:
Each of the circles on this image is a Landmark. Vision can detect a total of 21 landmarks for each hand: four for each finger, four for the thumb and one for the wrist.
Each of these fingers is in a joints group, represented by the API in VNHumanHandPoseObservation.JointsGroupName as:
- .thumb
- .indexFinger
- .middleFinger
- .ringFinger
- .littleFinger
In each joints group, every individual joint has a name:
- TIP: The tip of the finger.
- DIP: The distal interphalangeal joint, the first joint below the fingertip.
- PIP: The proximal interphalangeal joint, the middle joint of the finger.
- MCP: The metacarpophalangeal joint, at the bottom of the finger where it joins the palm.
The thumb is a bit different. It has a TIP, but its other joints have different names:
- TIP: The tip of the thumb.
- IP: The interphalangeal joint, the first joint below the tip of the thumb.
- MP: The metacarpophalangeal joint, at the bottom of the thumb where it joins the palm.
- CMC: The carpometacarpal joint, near the wrist.
You'll see how these names map to the API in the sketch below.
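As a quick, hedged illustration of how these anatomical names show up in code, here's a sketch that pulls a few individual joints out of a VNHumanHandPoseObservation. The inspectJoints function is hypothetical; the observation is assumed to come from a hand pose request like the one you created above.
import Vision

func inspectJoints(of observation: VNHumanHandPoseObservation) throws {
  // Joints come back grouped by finger...
  let indexFingerJoints = try observation.recognizedPoints(.indexFinger)
  if let indexMCP = indexFingerJoints[.indexMCP] {
    print("Index MCP confidence: \(indexMCP.confidence)")
  }

  // ...or you can ask for a single joint directly.
  let thumbIP = try observation.recognizedPoint(.thumbIP)
  print("Thumb IP is at \(thumbIP.location), confidence \(thumbIP.confidence)")
}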
Many developers don’t think they need math in their careers. Who would’ve thought anatomy would be a prerequisite, too?
With anatomy covered, it’s time to detect fingertips.
Detecting Fingertips
To make things simple, you’ll detect fingertips and draw an overlay on top.
In CameraViewController.swift, add the following to the top of captureOutput(_:didOutput:from:):
var fingerTips: [CGPoint] = []
This will store the detected fingertips. Now replace print(results), which you added in an earlier step, with:
var recognizedPoints: [VNRecognizedPoint] = []
try results.forEach { observation in
  // 1
  let fingers = try observation.recognizedPoints(.all)
  // 2
  if let thumbTipPoint = fingers[.thumbTip] {
    recognizedPoints.append(thumbTipPoint)
  }
  if let indexTipPoint = fingers[.indexTip] {
    recognizedPoints.append(indexTipPoint)
  }
  if let middleTipPoint = fingers[.middleTip] {
    recognizedPoints.append(middleTipPoint)
  }
  if let ringTipPoint = fingers[.ringTip] {
    recognizedPoints.append(ringTipPoint)
  }
  if let littleTipPoint = fingers[.littleTip] {
    recognizedPoints.append(littleTipPoint)
  }
}
// 3
fingerTips = recognizedPoints.filter {
  // Ignore low confidence points.
  $0.confidence > 0.9
}
.map {
  // 4
  CGPoint(x: $0.location.x, y: 1 - $0.location.y)
}
Here you:
- Get the points for all fingers.
- Look for the tip points.
- Filter out low-confidence points. Each VNRecognizedPoint has a confidence; you only keep observations with high confidence levels.
- Convert the coordinates. Vision algorithms use a coordinate system with a lower-left origin and return values normalized to the pixel dimensions of the input image. AVFoundation coordinates have an upper-left origin, so you flip the y-coordinate.
You need to do something with those fingertips, so add the following to CameraViewController:
// 1
var pointsProcessorHandler: (([CGPoint]) -> Void)?
func processPoints(_ fingerTips: [CGPoint]) {
  // 2
  let convertedPoints = fingerTips.map {
    cameraView.previewLayer.layerPointConverted(fromCaptureDevicePoint: $0)
  }
  // 3
  pointsProcessorHandler?(convertedPoints)
}
Here you:
- Add a property for the closure to run when the framework detects points.
- Convert from AVFoundation relative coordinates to UIKit coordinates so you can draw them on screen. You use layerPointConverted(fromCaptureDevicePoint:), a method on AVCaptureVideoPreviewLayer.
- Call the closure with the converted points.
In captureOutput(_:didOutput:from:), just after you declare the fingerTips variable, add:
defer {
  DispatchQueue.main.sync {
    self.processPoints(fingerTips)
  }
}
This sends your detected fingertips to be processed on the main queue once the method finishes.
Time to show those fingertips to the user!
Displaying Fingertips
pointsProcessorHandler is going to get your detected fingertips onto the screen. You have to pass the closure from SwiftUI to this view controller.
Go back to CameraView.swift and add a new property:
var pointsProcessorHandler: (([CGPoint]) -> Void)?
This gives you a place to store the closure in the view.
Then, update makeUIViewController(context:) by adding this line before the return statement:
cvc.pointsProcessorHandler = pointsProcessorHandler
This passes the closure to the view controller.
Open ContentView.swift and add the following property to the view definition:
@State private var overlayPoints: [CGPoint] = []
This state variable will hold the points grabbed in CameraView. Replace the CameraView() line with the following:
CameraView {
  overlayPoints = $0
}
This closure is the pointsProcessorHandler you added earlier; it's called when you have detected points. In the closure, you assign the points to overlayPoints.
Finally, add this modifier before the edgesIgnoringSafeArea(.all) modifier:
.overlay(
  FingersOverlay(with: overlayPoints)
    .foregroundColor(.orange)
)
You're adding an overlay modifier to CameraView. Inside that modifier, you initialize FingersOverlay with the detected points and set the color to orange.
FingersOverlay.swift is inside the starter project. Its only job is to draw points on screen.
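If you're curious what a view like that could look like, here's a hedged sketch of the idea: a SwiftUI Shape that draws a small circle at each point. The PointsOverlay name is made up for this example; it's not the starter project's actual FingersOverlay implementation, though it matches the init(with:) call site you used above.
import SwiftUI

struct PointsOverlay: Shape {
  let points: [CGPoint]

  init(with points: [CGPoint]) {
    self.points = points
  }

  func path(in rect: CGRect) -> Path {
    // Draw a 10-point circle centered on each detected point.
    var path = Path()
    for point in points {
      path.addEllipse(
        in: CGRect(x: point.x - 5, y: point.y - 5, width: 10, height: 10)
      )
    }
    return path
  }
}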
Build and run. Check out the orange dots on your fingers. Move your hands around and notice the dots follow your fingers.
It's finally time to add game logic.
Adding Game Logic
While the logic of the game is long, it’s pretty straightforward.
Open GameLogicController.swift and replace the class implementation with:
// 1
private var goalCount = 0
// 2
@Published var makeItRain = false
// 3
@Published private(set) var successBadge: Int?
// 4
private var shouldEvaluateResult = true
// 5
func start() {
  makeItRain = true
}
// 6
func didRainStars(count: Int) {
  goalCount = count
}
// 7
func checkStarsCount(_ count: Int) {
  if !shouldEvaluateResult {
    return
  }
  if count == goalCount {
    shouldEvaluateResult = false
    successBadge = count
    DispatchQueue.main.asyncAfter(deadline: .now() + 3) {
      self.successBadge = nil
      self.makeItRain = true
      self.shouldEvaluateResult = true
    }
  }
}
Here's a breakdown:
- This property stores the number of dropped stars. The player has to guess this value by showing the appropriate number of fingers.
- Whenever something sets this published property to true, StarAnimator starts raining.
- If the player correctly guesses the number of dropped stars, you assign the goal count to this property. The value appears on screen, indicating success.
- This property prevents excessive evaluation. Once the player guesses the value correctly, it stops further evaluation.
- This is how the game starts. You call this when the starting screen appears.
- When StarAnimator rains a specific number of stars, it calls this method to save the goal count in the game's engine.
- This is where the magic happens. You call this method whenever new points are available. It first checks whether evaluating the result is allowed. If the guessed value is correct, it stops the evaluation, sets the success badge value and resets the engine's state to its initial values after three seconds.
Open ContentView.swift to connect the GameLogicController.
Replace the call to StarAnimator, including its trailing closure, with:
StarAnimator(makeItRain: $gameLogicController.makeItRain) {
  gameLogicController.didRainStars(count: $0)
}
This code reports the number of rained stars to the game engine.
Next, you’ll let the player know they’ve got the answer right.
Adding a Success Badge
Add a computed property for successBadge as follows:
@ViewBuilder
private var successBadge: some View {
  if let number = gameLogicController.successBadge {
    Image(systemName: "\(number).circle.fill")
      .resizable()
      .imageScale(.large)
      .foregroundColor(.white)
      .frame(width: 200, height: 200)
      .shadow(radius: 5)
  } else {
    EmptyView()
  }
}
If the successBadge of the game logic controller has a value, you create an image using a system image from SF Symbols. Otherwise, you return an EmptyView, which draws nothing.
Add these two modifiers to the root ZStack:
.onAppear {
  // 1
  gameLogicController.start()
}
.overlay(
  // 2
  successBadge
    .animation(.default)
)
Here's what you added:
- When the starting page of the game appears, you start the game.
- You draw the success badge on top of everything, using the successBadge property you implemented above.
Next, remove the overlay for the Rain button, since the stars now rain automatically.
Final Step
To make the game work, you need to pass the number of detected points to the game engine. Update the closure that you pass when initializing CameraView in ContentView:
CameraView {
  overlayPoints = $0
  gameLogicController.checkStarsCount($0.count)
}
Build and run. Enjoy the game.
More Use Cases
You barely scratched the surface of Hand and Body Detection APIs in Vision. The framework can detect several body landmarks, as illustrated below:
Here are some examples of what you can do with these APIs:
- Implement gesture-based UI controls in your app using the Vision framework. For example, some camera apps let you show a hand gesture to take a picture.
- Build a fun emoji app where the user can form emoji with hand gestures.
- Build a workout analysis app where users can find out whether they're performing a specific exercise correctly.
- Build a music app to teach the user to play the guitar or ukulele.
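To give you a feel for the body-side APIs, here's a hedged sketch of detecting body pose landmarks in a single image. It mirrors the hand flow you built in this tutorial. The detectBodyPose function, its image input and the joints it reads are illustrative assumptions rather than StarCount code, and it assumes the typed results available with these iOS 14 requests.
import Vision

func detectBodyPose(in image: CGImage) throws {
  // Request: ask Vision to find human body poses.
  let request = VNDetectHumanBodyPoseRequest()

  // Handler: perform the request on a single image.
  let handler = VNImageRequestHandler(cgImage: image, options: [:])
  try handler.perform([request])

  // Observation: read landmarks such as the wrists from each detected body.
  for observation in request.results ?? [] {
    let joints = try observation.recognizedPoints(.all)
    if let leftWrist = joints[.leftWrist], leftWrist.confidence > 0.3 {
      print("Left wrist at \(leftWrist.location)")
    }
    if let rightWrist = joints[.rightWrist], rightWrist.confidence > 0.3 {
      print("Right wrist at \(rightWrist.location)")
    }
  }
}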
The possibilities are endless with Vision.
Where to Go From Here?
Download the final project using the Download Materials button at the top or bottom of this tutorial.
You’ve successfully created a game with Hand Detection APIs! Great job!
There are many great resources for Vision and these specific APIs. To explore this topic in more depth, try:
- WWDC 2020 Video on Body and Hand Detection APIs: A video with many more examples.
- Vision Framework Documentation: The overview of all the features available in Vision framework.
- Detecting Human Body Poses in Images: Documentation and an example app by Apple.
- raywenderlich.com Forums: Ask for help from our awesome community.
I hope you liked this tutorial. If you have any questions or comments, please join the discussion below!