Metal Rendering Pipeline Tutorial

Take a deep dive through the rendering pipeline and create a Metal app that renders primitives on screen, in this excerpt from our book, Metal by Tutorials! By Marius Horga.


The Rendering Pipeline

You finally get to investigate the GPU pipeline! In the following diagram, you can see the stages of the pipeline.

The graphics pipeline takes the vertices through multiple stages during which the vertices have their coordinates transformed between various spaces.

As a Metal programmer, you’re only concerned with the Vertex Processing and Fragment Processing stages, since they’re the only two programmable stages. Later in the tutorial, you’ll write both a vertex shader and a fragment shader. For all the non-programmable pipeline stages, such as Vertex Fetch, Primitive Assembly and Rasterization, the GPU has specially designed hardware units to serve those stages.

Next, you’ll go through each of the stages.

1 – Vertex Fetch

The name of this stage varies among graphics Application Programming Interfaces (APIs). For example, DirectX calls it the Input Assembler.

To start rendering 3D content, you first need a scene. A scene consists of models that have meshes of vertices. One of the simplest models is the cube, which has 6 faces (12 triangles).

You use a vertex descriptor to define the way vertices will be read in, along with their attributes such as position, texture coordinates, normal and color. You do have the option not to use a vertex descriptor and just send an array of vertices in an MTLBuffer. However, if you decide not to use one, you’ll need to know ahead of time how the vertex buffer is organized.
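As a rough illustration, a vertex descriptor for a position-only vertex might look like the following in Swift. This is a minimal sketch rather than the project's actual setup: the float4 format, buffer index 0 and tightly packed stride are all assumptions, chosen to line up with a shader that reads a float4 position from attribute 0.

```swift
import Metal

// Assumed layout: one float4 position per vertex, tightly packed in buffer 0.
let vertexDescriptor = MTLVertexDescriptor()

// Attribute 0 is the position: four floats starting at byte 0 of each vertex.
vertexDescriptor.attributes[0].format = .float4
vertexDescriptor.attributes[0].offset = 0
vertexDescriptor.attributes[0].bufferIndex = 0

// Each vertex occupies exactly one SIMD4<Float> in buffer 0.
vertexDescriptor.layouts[0].stride = MemoryLayout<SIMD4<Float>>.stride
```

You would then assign this object to the vertexDescriptor property of an MTLRenderPipelineDescriptor before creating the pipeline state.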

When the GPU fetches the vertex buffer, the MTLRenderCommandEncoder draw call tells the GPU whether the buffer is indexed. If the buffer is not indexed, the GPU assumes the buffer is an array and reads in one element at a time in order.
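The choice between indexed and non-indexed drawing is made when you encode the draw call. Here is a hedged sketch of that choice; encodeDraw and its parameters are hypothetical names for illustration, and the triangle primitive type, 16-bit indices and buffer index 0 are assumptions.

```swift
import Metal

/// Hypothetical helper: encodes one draw call.
/// Pass `indexBuffer: nil` for geometry that has no index buffer.
func encodeDraw(with encoder: MTLRenderCommandEncoder,
                vertexBuffer: MTLBuffer,
                vertexCount: Int,
                indexBuffer: MTLBuffer?,
                indexCount: Int) {
  // Bind the vertex data to buffer index 0 (assumed to match the descriptor).
  encoder.setVertexBuffer(vertexBuffer, offset: 0, index: 0)

  if let indexBuffer = indexBuffer {
    // Indexed: the GPU looks up each vertex through the index buffer.
    encoder.drawIndexedPrimitives(type: .triangle,
                                  indexCount: indexCount,
                                  indexType: .uint16,
                                  indexBuffer: indexBuffer,
                                  indexBufferOffset: 0)
  } else {
    // Non-indexed: the GPU reads the vertex buffer as an array, in order.
    encoder.drawPrimitives(type: .triangle,
                           vertexStart: 0,
                           vertexCount: vertexCount)
  }
}
```

Keeping both paths in one function is only a convenience for the comparison; in a real renderer each mesh typically knows which kind of draw call it needs.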

This indexing is important because vertices are cached for reuse. For example, a cube has twelve triangles and eight vertices (at the corners). If you don’t index, you’ll have to specify the vertices for each triangle and send thirty-six vertices to the GPU. This may not sound like a lot, but in a model that has several thousand vertices, vertex caching is important!
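To make the numbers concrete, the sketch below builds the cube both ways in plain Swift. The corner coordinates and the ordering of the indices are illustrative assumptions (winding order is ignored); only the counts matter here.

```swift
// Eight unique corners of a cube centered on the origin.
let corners: [SIMD3<Float>] = [
  SIMD3<Float>(-1, -1, -1), SIMD3<Float>( 1, -1, -1),
  SIMD3<Float>( 1,  1, -1), SIMD3<Float>(-1,  1, -1),
  SIMD3<Float>(-1, -1,  1), SIMD3<Float>( 1, -1,  1),
  SIMD3<Float>( 1,  1,  1), SIMD3<Float>(-1,  1,  1)
]

// Two triangles per face, six faces: 36 indices into `corners`.
let indices: [UInt16] = [
  0, 1, 2,  0, 2, 3,   // z = -1 face
  4, 5, 6,  4, 6, 7,   // z = +1 face
  0, 3, 7,  0, 7, 4,   // x = -1 face
  1, 2, 6,  1, 6, 5,   // x = +1 face
  0, 1, 5,  0, 5, 4,   // y = -1 face
  3, 2, 6,  3, 6, 7    // y = +1 face
]

// Without an index buffer, every index needs its own full copy of the vertex.
let expanded = indices.map { corners[Int($0)] }

let vertexStride = MemoryLayout<SIMD3<Float>>.stride
let indexedBytes = corners.count * vertexStride
  + indices.count * MemoryLayout<UInt16>.stride
let expandedBytes = expanded.count * vertexStride

print("indexed: \(corners.count) vertices + \(indices.count) indices, \(indexedBytes) bytes")
print("non-indexed: \(expanded.count) vertices, \(expandedBytes) bytes")
```

The saving grows with every attribute you add to a vertex, because the indexed version stores each attribute once per unique vertex rather than once per triangle corner.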

There is also a second cache for shaded vertices so that vertices that are accessed multiple times are only shaded once. A shaded vertex is one to which color was already applied. But that happens in the next stage.
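The effect of that second cache can be modeled on the CPU. The following is a conceptual sketch only: shade is a hypothetical stand-in for a vertex shader, and the dictionary acts as an unbounded cache, whereas a real hardware cache is small and its behavior is not specified here.

```swift
// Hypothetical stand-in for a vertex shader; counts how often it runs.
var shaderInvocations = 0
func shade(_ position: SIMD3<Float>) -> SIMD4<Float> {
  shaderInvocations += 1
  return SIMD4<Float>(position.x, position.y, position.z, 1)
}

// A quad made of two triangles that share the vertices 0 and 2.
let quad: [SIMD3<Float>] = [
  SIMD3<Float>(-1, -1, 0), SIMD3<Float>(1, -1, 0),
  SIMD3<Float>( 1,  1, 0), SIMD3<Float>(-1, 1, 0)
]
let quadIndices = [0, 1, 2, 0, 2, 3]

// Cache of already-shaded vertices, keyed by vertex index.
var shadedCache: [Int: SIMD4<Float>] = [:]
var shadedOutput: [SIMD4<Float>] = []

for index in quadIndices {
  if let cached = shadedCache[index] {
    shadedOutput.append(cached)          // reuse: no shader work
  } else {
    let shaded = shade(quad[index])      // first visit: run the shader
    shadedCache[index] = shaded
    shadedOutput.append(shaded)
  }
}

print("\(quadIndices.count) indices, \(shaderInvocations) shader invocations")
```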

A special hardware unit called the Scheduler sends the vertices and their attributes on to the Vertex Processing stage.

2 – Vertex Processing

In this stage, vertices are processed individually. You write code to calculate per-vertex lighting and color. More importantly, you send vertex coordinates through various coordinate spaces to reach their position in the final framebuffer.

Now it’s time to see what happens under the hood at the hardware level. Take a look at this modern architecture of an AMD GPU:

Going top-down, the GPU has:

  • 1 Graphics Command Processor: This coordinates the work processes.
  • 4 Shader Engines (SE): An SE is an organizational unit on the GPU that can serve an entire pipeline. Each SE has a geometry processor, a rasterizer and Compute Units.
  • 9 Compute Units (CU) per SE: A CU is nothing more than a group of shader cores.
  • 64 shader cores per CU: A shader core is the basic building block of the GPU where all of the shading work is done.

In total, the 36 CUs have 2304 shader cores. Compare that to the number of cores in your quad-core CPU. Not fair, I know! :]

For mobile devices, the story is a little different. For comparison, take a look at the following image showing a GPU similar to those in recent iOS devices. Instead of having SEs and CUs, the PowerVR GPU has Unified Shading Clusters (USC). This particular GPU model has 6 USCs and 32 cores per USC for a total of only 192 cores.
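If you want to check the core counts quoted above, the arithmetic is just a product of the group sizes. GPULayout below is a hypothetical helper type for this calculation, not a Metal API.

```swift
// Hypothetical helper: describes a GPU as nested groups of shader cores.
struct GPULayout {
  let name: String
  let groupCounts: [Int]   // outermost group first, cores per unit last

  var totalCores: Int {
    return groupCounts.reduce(1, *)
  }
}

// 4 Shader Engines x 9 Compute Units x 64 shader cores.
let amd = GPULayout(name: "AMD", groupCounts: [4, 9, 64])

// 6 Unified Shading Clusters x 32 cores.
let powerVR = GPULayout(name: "PowerVR", groupCounts: [6, 32])

for gpu in [amd, powerVR] {
  print("\(gpu.name): \(gpu.totalCores) shader cores")
}
```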

Note: The iPhone X has the most recent mobile GPU which is entirely designed in-house by Apple. Unfortunately, Apple has not made the GPU hardware specifications public.

So what can you do with that many cores? Since these cores are specialized in both vertex and fragment shading, one obvious thing to do is give all the cores work to do in parallel so that the processing of vertices or fragments is done faster. There are a few rules, though. Inside a CU, you can only process either vertices or fragments at one time. Good thing there are thirty-six of those! Another rule is that you can only process one shader function per SE. Having four SEs lets you combine work in interesting and useful ways. For example, you can run one fragment shader on one SE and a second fragment shader on a second SE at the same time. Or you can separate your vertex shader from your fragment shader and have them run in parallel, but on different SEs.

It’s now time to see vertex processing in action! The vertex shader you’re about to write is minimal but encapsulates most of the necessary vertex shader syntax you’ll need.

Create a new file using the Metal File template and name it Shaders.metal. Then, add this code at the end of the file:

// 1
struct VertexIn {
  float4 position [[ attribute(0) ]];
};

// 2
vertex float4 vertex_main(const VertexIn vertexIn [[ stage_in ]]) {
  return vertexIn.position;
}

Going through this code:

  1. Create a struct VertexIn to describe the vertex attributes that match the vertex descriptor you set up earlier. In this case, just position.
  2. Implement a vertex shader, vertex_main, that takes in VertexIn structs and returns vertex positions as float4 types.

Remember that vertices are indexed in the vertex buffer. The [[ stage_in ]] attribute tells the vertex shader that its argument is the per-vertex input for the current index: the VertexIn struct that the Vertex Fetch stage unpacked and cached for that vertex.

Compute Units can process (at one time) batches of vertices up to their maximum number of shader cores. This batch can fit entirely in the CU cache and vertices can thus be reused as needed. The batch will keep the CU busy until the processing is done but other CUs should become available to process the next batch.
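To picture the batching, here is a small CPU-side sketch that splits a mesh into batches no larger than one CU's core count. The batch size of 64 comes from the AMD diagram above; the vertex count and the makeBatches helper are hypothetical, and real hardware scheduling is more involved than this.

```swift
// One batch per Compute Unit pass, at most one vertex per shader core.
let coresPerComputeUnit = 64

// Hypothetical mesh size, chosen only for illustration.
let meshVertexCount = 2_000

/// Hypothetical helper: splits `vertexCount` vertices into index ranges
/// of at most `batchSize` vertices each.
func makeBatches(vertexCount: Int, batchSize: Int) -> [Range<Int>] {
  return stride(from: 0, to: vertexCount, by: batchSize).map { start in
    start..<min(start + batchSize, vertexCount)
  }
}

let batches = makeBatches(vertexCount: meshVertexCount,
                          batchSize: coresPerComputeUnit)

print("\(batches.count) batches, last one holds \(batches.last?.count ?? 0) vertices")
```

Each range could be handed to whichever CU is free next, which is the behavior the paragraph above describes.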

As soon as the vertex processing is done, the cache is cleared for the next batches of vertices. At this point, vertices are now ordered and grouped, ready to be sent to the primitive assembly stage.

To recap, the CPU sent the GPU a vertex buffer that you created from the model’s mesh. You configured the vertex buffer using a vertex descriptor that tells the GPU how the vertex data is structured. On the GPU, you created a struct to encapsulate the vertex attributes. The vertex shader takes in this struct, as a function argument, and through the [[ stage_in ]] qualifier, acknowledges that position comes from the CPU via the [[ attribute(0) ]] position in the vertex buffer. The vertex shader then processes all the vertices and returns their positions as a float4.
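The recap can also be read as a small CPU-side model of the same flow. This is a conceptual sketch in Swift, not how the GPU actually executes: VertexIn and vertexMain mirror the shader's names, while runVertexStage and the triangle data are hypothetical.

```swift
// Mirrors the shader's VertexIn struct: just a position.
struct VertexIn {
  var position: SIMD4<Float>
}

// Mirrors vertex_main: passes the position through unchanged.
func vertexMain(_ vertexIn: VertexIn) -> SIMD4<Float> {
  return vertexIn.position
}

/// Hypothetical model of Vertex Fetch plus Vertex Processing: for every
/// index, fetch the vertex data, wrap it in VertexIn and run the shader.
func runVertexStage(vertexBuffer: [SIMD4<Float>],
                    indices: [UInt16]) -> [SIMD4<Float>] {
  return indices.map { (index: UInt16) -> SIMD4<Float> in
    // Plays the role of [[ stage_in ]]: the input for the current index.
    let vertexIn = VertexIn(position: vertexBuffer[Int(index)])
    return vertexMain(vertexIn)
  }
}

// A single triangle, drawn twice with opposite index order.
let triangle: [SIMD4<Float>] = [
  SIMD4<Float>(-1, -1, 0, 1),
  SIMD4<Float>( 1, -1, 0, 1),
  SIMD4<Float>( 0,  1, 0, 1)
]
let positions = runVertexStage(vertexBuffer: triangle,
                               indices: [0, 1, 2, 2, 1, 0])
print("processed \(positions.count) vertices")
```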

A special hardware unit called the Distributer sends the grouped blocks of vertices on to the Primitive Assembly stage.