Optimize your high-end games for Apple GPUs: We'll show you how you can use our rendering and debugging tools to eliminate performance issues and make your games great on Apple platforms. Learn from our experiences working with developers at Larian Studios and 4A Games as we help them optimize their games for Apple GPUs.
We'll explore various techniques for improving your game's performance, including optimizing shaders, reducing memory bandwidth utilization, and increasing the overlap of your GPU workloads. We'll also dive into the new GPU Timeline profiling tool in Xcode 13 to identify possible performance bottlenecks in “Divinity: Original Sin 2” when running on iPad.
For this session, you should be familiar with the tile-based deferred rendering architecture in Apple GPUs, and have a working knowledge of Xcode and the Metal API.
Check out “Discover Metal debugging, profiling, and asset creation tools” or the WWDC20 session “Optimize Metal apps and games with GPU counters” to learn more about using our tools to profile graphics workloads.
Welcome to WWDC. Hi, I'm Jonathan Metzgar. I'm a member of the Metal Ecosystem team at Apple. We get to work with game developers to help them get the best graphics performance on our Apple GPUs. Dustin and I are going to show you how we optimize high-end games for Apple GPUs. In this video, I'm going to cover the process that we use to optimize games. Then, I'm going to show you the kinds of optimizations that are used in the games Baldur's Gate 3 and Metro Exodus. And lastly, Dustin is going to do a tools demonstration featuring the game Divinity: Original Sin 2, while he introduces the new GPU Timeline in Xcode 13. Let's dive in and talk about optimization. So, over the past year, we collaborated with Larian Studios and 4A Games to find ways to tune the graphics performance in their games for Apple GPUs. I am sure you'll be excited to see the details, and I want to take a moment and thank both Larian Studios and 4A Games for giving us permission to show development materials in this presentation. Looking back over the course of the year, we have analyzed many games and identified some common scenarios that affect graphics performance. You're probably interested in finding opportunities to optimize your own game, so we have geared this session to emphasize how our GPU tools are especially helpful in pinpointing these problem areas and to suggest ways to solve them. And, in particular, I'd like to share some of the principles our team uses to help developers optimize their games. When we optimize a graphics application, it's important to have a methodology, a set of principles that define how we solve a particular problem. So, let me show you a four-step process. First, you need to choose what data to collect, or measure, so it will help you understand what's happening with your game. Soon after you begin measuring data, you will want to choose some performance targets, or where you want to be when you finish. 
You may decide the in-game location to take your GPU frame captures and Metal system traces, the scene complexity, graphics settings, and other metrics important to you, like frame time. Then, you analyze the data to learn about the behavior of your engine. In-depth analysis helps you find where and why the bottlenecks are occurring. Once you know what is causing a bottleneck, then you can make improvements to the game, but normally you pick one or two at a time, so you can understand the impact of each change. Lastly, you verify your improvements by comparing some new measurements with your original ones. Since optimization is a process, you will go back and repeat until your performance targets have been met. For these games, we use Xcode's Metal Debugger to give us insights about their performance and how their frame graphs are structured, and we use Metal System Trace in Instruments to learn about a game's performance over time. It's a great idea to save a GPU trace file and an Instruments trace file so you can have your before and after data, both before and after optimization. So, I have a little list of things you could consider, or look for, in your game. As I mentioned, Xcode and Instruments are great tools to help you understand your Metal application. Optimization is about getting the best out of several areas, ranging from shader performance to memory bandwidth. Another area is getting good overlap across your vertex, fragment, and compute workloads. And while rendering several frames in flight, some Apple GPUs can overlap workloads between them. I'll show you some pointers to help you with resource dependencies, which might prevent that overlap. And since some developers use a custom workflow for their shaders, I'll show you how compiler settings can affect performance. Lastly, I'll talk about how to reduce the impact of redundant bindings. Let's start with Baldur's Gate 3 from Larian Studios. 
Baldur's Gate 3 is an RPG building on a 20-year gaming legacy and stands out with its cinematic visual effects. Our engagement with Larian Studios helped us identify how they could optimize their amazing rendering engine for Apple GPUs. First, we started with a GPU frame capture, like the Ravaged Beach scene we see here. Then, we break down the scene into a frame graph. The frame graph is a breakdown of the order and purpose of each rendering pass. High-end games have many render passes specializing in achieving a certain visual effect, such as ambient occlusion, shadow mapping, post processing, and so on. Baldur's Gate 3 has a complex frame graph, so this is a simplified version. By using Xcode's Metal Debugger, we capture a GPU trace and use it to see all the render passes in the game. Clicking on Show Dependencies brings up a visualization that you can pan and zoom. It shows how your render passes depend on the results of previous ones to help you understand what's going on. For example, I am zooming into this deferred decal render stage to get more details. Next, I will show you the Instruments tools. We spend time analyzing games using the Instruments trace, using the Metal System Trace, or Game performance templates. Metal System Trace is ideal if you wanna focus on GPU execution and scheduling analysis, and Game Performance expands on that to help you with other issues, like thread stalls or thermal notifications. Let's choose Metal System Trace to see the behavior of our engine from frame to frame. Instruments allows you to view several channels of data along a timeline. Here, we find our first problem: An expensive workload in our render passes. An expensive workload might mean that we need to optimize a shader. For instance, we see a long compute shader holding up the rest of our frame. We call these gaps "bubbles." Let's switch back over to the GPU trace and investigate this further. This is the "before" GPU trace. 
Let's change the grouping from API CALL to PIPELINE STATE. You may notice the pipeline states are sorted by execution time. Let's check the first compute pipeline. We can expand the compute function details to take a closer look at its statistics. Notice here that there are over four-and-a-half-thousand instructions. That's quite a lot. So, what else? Let's see what resources are being used by this compute function. Depending on the input data, this function uses up to 120 textures to produce the output. However, we discovered that only six to 12 are actually used 90% of the time. So, let's talk about how this shader could be improved. Shaders that need to handle many different conditions can reserve more registers than necessary, and this can reduce the number of threads that run in parallel. Splitting your workload into smaller, more focused shaders, which need fewer registers, can improve the utilization of the shader cores. So, instead of selecting the appropriate algorithm in the shader, you would choose the appropriate shader permutation when you issue your GPU workload. Additionally, a shader function which uses too many registers can result in register pressure, when an execution unit runs out of fast register memory and has to use device memory instead. That's one reason to use 16-bit types, like half, when appropriate, since they use half the register space of 32-bit types, like float. In this case, Larian Studios had already optimized their shader to use half-precision floating point, so they decided to create dedicated shader variants instead. So, let's see what happened. When comparing the numbers before, in the box on the left, with the numbers in the box on the right, the number of instructions dropped by 84%, branches by 90%, registers by 25%, and texture reads by 92%. This shader variant is used 90% of the time. We can also see this in the Metal System Trace. Notice here, in the before trace, the bubbles we saw earlier. 
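Before moving on to the after trace, here is a minimal sketch of the idea of choosing a shader permutation when the workload is issued, rather than branching inside one large shader. It is plain Python and does not call Metal. The variant names and the `select_variant` helper are hypothetical; only the texture counts (up to 120 in the general case, 12 or fewer most of the time) come from the session.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShaderVariant:
    name: str          # hypothetical pipeline name
    max_textures: int  # how many input textures this variant can sample


# Hypothetical variants, ordered from most specialized (cheapest) to most general.
VARIANTS = [
    ShaderVariant("composite_small", max_textures=12),
    ShaderVariant("composite_general", max_textures=120),
]


def select_variant(texture_count: int) -> ShaderVariant:
    """Return the cheapest variant that can handle this dispatch."""
    for variant in VARIANTS:
        if texture_count <= variant.max_textures:
            return variant
    raise ValueError(f"no variant supports {texture_count} textures")


# The decision is made on the CPU, once per dispatch.
for count in (6, 12, 40):
    print(count, "textures ->", select_variant(count).name)
```

The point of the design is that the common case pays only for the small variant's instruction count and register footprint, while the general variant remains available for the rare dispatch that needs it.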
And here, in the after trace, they have been minimized. Larian Studios was able to reduce this shader by eight milliseconds, on average. That is a huge win! If you look at your most expensive pipeline state objects and shaders, you may find a complicated shader that could be simplified. This is especially true if the results of that shader are used by a later pass. This was a huge improvement for the game, but short of the developer's performance target. We just mentioned memory as an issue, and one of the features of our GPUs is lossless compression, which is enabled in certain conditions. So, maybe there was a flag we either accidentally set or forgot to set. Lossless compression helps reduce bandwidth by compressing textures when they are stored from tile to device memory. If you look at the Bandwidth Insights on the Summary page, you may notice Lossless Compression warnings for some textures. They tell you that these textures can't be lossless compressed, and you may pay a bandwidth penalty. Metal Debugger will also let you know why these textures can't be lossless compressed. Here we see it's because of the ShaderWrite usage flag. We can see all the usage flags by going to the memory section. Once in the memory section, we can filter by render targets. Then, right click on the table header, choose texture, and then usage. Now, we can sort by usage and find the textures using ShaderWrite. If you set the ShaderWrite or PixelFormatView flag when you create your textures, you will disable lossless compression. Let's take a look at these flags in more detail. The Unknown, ShaderWrite, and PixelFormatView flags prevent your textures from being lossless compressed. The general rule of thumb is to use these flags only when required. For example, you would use the ShaderWrite flag if you use the write() method to store values in a texture from a fragment or compute function. Rendering to a texture bound as a color attachment doesn't require the ShaderWrite flag. 
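The rule of thumb about usage flags can be turned into a simple audit over your own texture inventory. The following is a minimal sketch in plain Python, not Metal code. The flag strings mirror the `MTLTextureUsage` cases named in the session, while the texture names and the `audit` helper are hypothetical.

```python
# Flags that, per the session, prevent a texture from being lossless compressed.
BLOCKS_LOSSLESS = {"unknown", "shaderWrite", "pixelFormatView"}


def flags_blocking_compression(usage):
    """Return the usage flags on one texture that disable lossless compression."""
    return sorted(set(usage) & BLOCKS_LOSSLESS)


def audit(textures):
    """Map texture name -> blocking flags, only for textures that have any."""
    report = {}
    for name, usage in textures.items():
        blocking = flags_blocking_compression(usage)
        if blocking:
            report[name] = blocking
    return report


# Hypothetical render-target inventory.
inventory = {
    "gbuffer_albedo": {"renderTarget", "shaderRead"},
    "ssao_output": {"shaderRead", "shaderWrite"},
    "hdr_color": {"renderTarget", "shaderRead", "pixelFormatView"},
}
for name, flags in audit(inventory).items():
    print(name, "cannot be lossless compressed because of", flags)
```

A texture that shows up in this report is only a candidate for cleanup. Whether the flag can actually be dropped depends on whether a shader really writes to the texture or a view really needs a different pixel format.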
And don't set the PixelFormatView option if you only need to read the component values in a different order. Instead, create a texture view using a swizzle pattern to specify the new order. Similarly, don't set the PixelFormatView option if your texture view only converts between linear space and sRGB. Check the documentation for more information. Shader optimization and lossless compression are two techniques that have helped us out, but another problem area is getting good overlap across the vertex, fragment, and compute channels. Let's take a look at two ways to optimize workloads across channels. First, we'll start by looking at our Metal System Trace again. Here, we can see that we have low overlap on our vertex, fragment, and compute channels. It would be nice to improve this to keep the GPU busy. One way to solve this problem is to see if we can restructure the encoding order in our frame graph. In other words, we want to move this work over to where the vertex stage has very low occupancy. We would like to process those vertices earlier, along with the fragment stage of an earlier render pass. We can think of our frame graph as a list of rendering tasks, like this pseudocode example. Getting good overlap can be as simple as changing the order of your render tasks in your frame graph. Some tasks may rely on results from earlier ones, but not always. It turns out that the CascadedShadowBuffer stage, which is vertex-shader heavy, could be moved a few tasks earlier, since it has few dependencies. And now, we see that our region with low overlap has better utilization of the vertex and fragment channels, giving us another 1 ms win. But there is another optimization that we can try out. Games often have two to three frames in flight. So, a cool feature in our tile-based deferred rendering, or TBDR architecture GPUs, is to overlap workloads from two frames when there are no resource dependencies between them. 
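The frame-graph reordering described above can be sketched as a list of tasks plus a dependency table. This is a minimal illustration in plain Python and not the game's actual frame graph: apart from CascadedShadowBuffer, the task names and the dependency table are hypothetical, and hoisting a task as early as possible is only one possible policy.

```python
def earliest_index(order, deps, task):
    """Earliest slot for `task`: right after the last of its dependencies."""
    positions = [order.index(d) for d in deps.get(task, ())]
    return max(positions) + 1 if positions else 0


def hoist(order, deps, task):
    """Return a new order with `task` moved as early as its dependencies allow."""
    current = order.index(task)
    target = earliest_index(order, deps, task)
    if target >= current:
        return list(order)  # already as early as it can go
    reordered = list(order)
    reordered.pop(current)
    reordered.insert(target, task)
    return reordered


# Hypothetical frame graph: the shadow pass only depends on the depth prepass.
order = ["DepthPrepass", "GBuffer", "SSAO", "DeferredDecals",
         "CascadedShadowBuffer", "Lighting"]
deps = {
    "GBuffer": ["DepthPrepass"],
    "SSAO": ["GBuffer"],
    "DeferredDecals": ["GBuffer"],
    "CascadedShadowBuffer": ["DepthPrepass"],
    "Lighting": ["SSAO", "DeferredDecals", "CascadedShadowBuffer"],
}
print(hoist(order, deps, "CascadedShadowBuffer"))
```

In this model the vertex-heavy shadow pass ends up next to fragment-heavy passes it does not depend on, which is the situation where the vertex and fragment channels have a chance to overlap.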
So, I'm going to show you how to optimize for this possibility. Let's have a look at the GPU track in Instruments once again. Here, you can see that these frames are processed, almost serially. This is caused by using a blit encoder to update constant buffers, like per-frame animation data, and so on. To efficiently update constant buffer data with a discrete GPU, we blit from shared buffers on the CPU to a private buffer on the GPU, which will be used for rendering the frame. This strategy is efficient for GPUs with discrete memory, so you want to keep this behavior for that purpose. If your device has a unified memory architecture, then there is no need to use a blit encoder to copy your data to a private buffer. However, when you use a shared buffer in a ring-buffer pattern, you need to watch out for synchronization issues because visual corruption can happen if your CPU writes to data currently being read by the GPU. Let's see this in action. Here, you can see in this diagram the encoding and rendering of our frames. We are using colors to represent the shared buffers, which are updated at the beginning of the frame: blue for buffer one, green for buffer two, and yellow for buffer three. Ring buffers are typically used to implement queues, which need to use a compact amount of memory. Here, there is no concern of a data race condition with this arrangement, as our writing and reading of our shared buffers is mutually exclusive. It's very common to have latency between encoding the frame and the rendering of a frame. This causes a shift of when the rendering actually begins. As long as the latency isn't too long, you will not have a data race condition. However, what happens if latency continues to increase? Well, this introduces a data race condition, where the main thread is updating its shared buffers during the time the GPU is rendering the frame. And if that happens, you could get visual corruption if elements of your frame are dependent on this data. 
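To make the hazard concrete, here is a minimal timing model in plain Python. It assumes the CPU writes one ring-buffer slot at the start of every frame and the GPU reads that slot after some latency. All of the numbers and the `has_race` helper are hypothetical, and the model ignores everything a real scheduler does.

```python
def has_race(num_buffers, frame_interval_ms, write_ms, latency_ms, render_ms,
             frames=12):
    """True if any CPU write overlaps a GPU read of the same ring-buffer slot."""
    writes = []  # (slot, start, end) of the CPU update for each frame
    reads = []   # (slot, start, end) of the GPU rendering for each frame
    for i in range(frames):
        slot = i % num_buffers
        start = i * frame_interval_ms
        writes.append((slot, start, start + write_ms))
        reads.append((slot, start + latency_ms,
                      start + latency_ms + render_ms))
    for frame_w, (slot_w, w0, w1) in enumerate(writes):
        for frame_r, (slot_r, r0, r1) in enumerate(reads):
            # A frame's own write always precedes its own read, so skip it.
            same_slot = frame_w != frame_r and slot_w == slot_r
            if same_slot and w0 < r1 and r0 < w1:
                return True
    return False


# Three buffers, a frame every 16 ms, 2 ms to update, 16 ms to render.
print("low latency :", has_race(3, 16, 2, latency_ms=10, render_ms=16))
print("high latency:", has_race(3, 16, 2, latency_ms=40, render_ms=16))
```

In this model the race appears once latency plus render time exceeds the time it takes the CPU to cycle back to the same slot. The model only detects the overlap; it does not prevent it.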
In the case of Baldur's Gate 3, removing the private buffer and blit encoder eliminated the synchronization point, but introduced a race condition, which affected their temporal anti-aliasing render pass. So, let's see how to avoid this situation. To avoid this race condition, you need to make sure you are not writing into the same resource the GPU is reading from. For example, you could utilize a completion handler, and then wait until it is safe to update the shared buffer in your encoding thread. But let me show you how we avoided a wait time. We maintained our completion handler, but added an extra buffer to our ring buffer to avoid the wait. The extra buffer is colored purple on the bottom diagram. The memory consumption remains the same as with a discrete GPU. But if you need to save on memory, and the CPU wait time doesn't affect the frame rate of your game, then you can just use three buffers. So, let's look at an easy way to decide how many shared and private buffers to create with a pseudocode example. In this code snippet, you can see how to choose the number of shared and private buffers at initialization time. Once we have created our device, we can check to see if the device has unified memory or not, and then either create an extra shared buffer or use a private buffer. This extra buffer will help reduce the impact of waiting for a completion handler, which we are using to avoid a data race condition. And now, we can see how Fragment workloads from the previous frame overlap with Vertex workloads from the next frame. Overall, this can give us one to two milliseconds, depending on the scene. And, of course, this approach can be applied not only for the constant buffer data we've shown in this example, but for all of the buffer data you transfer from the CPU to the GPU. So, let's review. 
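Before the recap, the buffer-count decision described above can be sketched as follows. This is plain Python, not the pseudocode from the slides. The `BufferPlan` type and `plan_constant_buffers` helper are hypothetical, and the `has_unified_memory` argument stands in for what an engine would read from `MTLDevice.hasUnifiedMemory`. Three frames in flight is an assumption taken from the upper end of the range mentioned in the session.

```python
from dataclasses import dataclass

FRAMES_IN_FLIGHT = 3  # assumption: upper end of "two to three frames in flight"


@dataclass(frozen=True)
class BufferPlan:
    shared_buffers: int   # CPU-visible ring-buffer slots
    private_buffers: int  # GPU-only buffers filled by a blit
    needs_blit: bool


def plan_constant_buffers(has_unified_memory: bool) -> BufferPlan:
    """Choose buffer counts once, at initialization time."""
    if has_unified_memory:
        # Render straight from shared memory. One extra slot reduces the
        # time spent waiting on the completion handler.
        return BufferPlan(FRAMES_IN_FLIGHT + 1, 0, needs_blit=False)
    # Discrete memory: keep the shared staging ring and blit into a
    # private buffer that the GPU renders from.
    return BufferPlan(FRAMES_IN_FLIGHT, 1, needs_blit=True)


for unified in (True, False):
    print("unified memory" if unified else "discrete memory",
          plan_constant_buffers(unified))
```

Both plans allocate four buffers in total, which matches the remark that memory consumption stays the same as on a discrete GPU. The completion handler is still what makes the update safe; the extra slot only makes the wait less likely to stall encoding.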
Larian Studios was able to achieve their performance targets by applying the following optimizations: Optimizing their most expensive shaders to reduce bubbles, opting in to lossless compression to improve bandwidth, overlapping vertex and fragment workloads to get better GPU utilization, and checking for resource dependencies that prevent frame overlap. When they were finished, Larian Studios not only met their performance targets, but got a 33% improvement in frame time for their game. And now, we will look at a different set of optimizations with the game Metro Exodus. Metro Exodus is known for its epic storyline and demanding visual effects, as you can see in this series of game-play clips. After the integration of our suggested optimizations, 4A Games was able to meet their performance targets. So now, let's have a look at an in-game scene from Metro Exodus. Metro Exodus uses a custom workflow to translate render commands into Metal API commands, which is quite common for cross-platform games. The translation layer they are using is optimized for Metal, but some issues can arise when two complex systems come together in practice. So, additional performance tuning was required to meet their project goals. As in the previous game, we start by investigating how a frame is being rendered. Modern renderers have a lot of different techniques involved so first we try to understand the high-level frame graph. Again, we start analysis by looking at the GPU trace. It always gives us useful insights about game performance. So first, let's start with the GPU time, which doesn't meet the developer performance targets. So, let's find the shader or pipeline which is the most time-consuming. To do this, we are going to group by pipeline state once again and look at the most expensive one. Let's quickly look at its statistics. You can see that there is a high number of ALU instructions compared to the total, meaning this is a math-heavy shader. 
We also see that the number of registers being used by the shader is quite high. The number of registers used by a particular shader directly affects how its workload will scale during execution. The higher this number is, the less work can be done in parallel by the GPU. Sometimes it's just a complex shader, such as SSAO in this example, that requires lots of computations and registers, but sometimes the compiler settings can affect the generated instructions and register allocation, as well. Let's also take a look at the shader compiler options. And it turns out, this shader was compiled with the fast math flag disabled. Fast math allows the shader compiler to optimize various instructions, and it is enabled for the Metal shader compiler, by default. However, there might be some cases, for example, using custom shader workflows, that can disable this compilation flag. In this case, we discovered that the translation layer, which 4A Games was using to invoke the compiler, had its default behavior set to not use fast math. So, what is fast math? Fast math is a set of optimizations for floating-point arithmetic that trades some correctness for speed. For example, assumptions can be made that there will be no NaNs, infinities, or signed zeros as either a result or argument. Fast math optimizations can also apply algebraically-equivalent transformations, which may affect the precision in floating-point results. However, in most scenarios, fast math is a great choice for games. This can significantly improve performance, especially in ALU-bound cases. Our recommendation to you is to check your compiler options to verify that you have enabled fast math, if your shaders do not depend on the things that we just mentioned.
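To see why these assumptions matter, here is a small illustration in plain Python. Python floats are IEEE 754 doubles and Python itself never applies fast math, so this does not show what the Metal compiler does. It only shows the kind of behavior that a fast-math compiler is allowed to ignore. The helper names are hypothetical.

```python
import math


def reassociation_example():
    """Two algebraically equal sums that round differently."""
    a, b, c = 1e16, -1e16, 1.0
    left_to_right = (a + b) + c  # a and b cancel first, then c is added
    reassociated = a + (b + c)   # c is lost when added to the huge b
    return left_to_right, reassociated


def nan_self_compare():
    """Under IEEE 754 a NaN is not equal to itself."""
    nan = float("nan")
    return nan != nan


def zero_sign_is_observable():
    """+0.0 and -0.0 compare equal but carry different signs."""
    return 0.0 == -0.0 and math.copysign(1.0, -0.0) == -1.0


print("sums:", reassociation_example())
print("nan != nan:", nan_self_compare())
print("signed zero observable:", zero_sign_is_observable())
```

A shader that detects NaNs with a self-comparison, relies on the sign of zero, or needs a particular summation order is the kind of code that "depends on the things that we just mentioned". Most game shading code does not.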
The fast math flag works at the front- and back-end compiler levels. When you are building your shader source, the front-end shader compiler will select fast math functions, which will be used in intermediate code. This will hint to the back-end shader compiler that it can generate more optimal GPU machine code. Here, you can see how the Instructions and Register counters on the left have been improved in the box on the right after we recompiled this shader. So, after changing the behavior of the translation layer to enable fast math for all the shaders, we got a 21% frame time decrease in our test workload using the built-in game benchmark. So, the next area I wanna talk about is redundant bindings. If we go back to the summary page, and look at the API insights, we can see there are many redundant bindings when rendering the frame. Redundant bindings can be either resources like textures, buffers, and samplers; or render states like depth stencil state, viewport configuration, etc. Repeatedly binding resources might negatively affect your encoding time, but redundant render state changes may also affect the GPU time. Let's have a look at the encoding and GPU times in the Metal System Trace. For a given frame, it takes eight-and-a-half milliseconds for all the commands to be encoded and around 22 milliseconds for the GPU to render this frame. When we investigated the cause of the redundant bindings, we found that the translation layer could be modified to reduce them. So, let me show you a pseudocode example which shows how to check for and reduce redundant bindings. Instead of binding textures directly to the encoder, you can pre-cache them and only bind them if they change. And to minimize interactions with the API, you can set all the textures with one call to the setFragmentTextures method instead of setting them in a loop, one by one. 
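The caching idea can be sketched as follows. This is a minimal model in plain Python, not the translation layer's code. `FakeEncoder` is a hypothetical stand-in that records calls instead of issuing them, and its `set_fragment_textures` method plays the role of the encoder's `setFragmentTextures` call. `TextureBindingCache` is also hypothetical.

```python
class FakeEncoder:
    """Stand-in for a render command encoder; records calls only."""

    def __init__(self):
        self.calls = []

    def set_fragment_textures(self, textures, first_index):
        self.calls.append((first_index, tuple(textures)))


class TextureBindingCache:
    def __init__(self, encoder, slot_count):
        self.encoder = encoder
        self.bound = [None] * slot_count    # what the encoder currently has
        self.pending = [None] * slot_count  # what the next draw wants

    def set_texture(self, texture, index):
        """Record the request; nothing reaches the encoder yet."""
        self.pending[index] = texture

    def flush(self):
        """Issue one call per contiguous run of slots that actually changed."""
        run_start = None
        for i in range(len(self.pending) + 1):
            dirty = (i < len(self.pending)
                     and self.pending[i] != self.bound[i])
            if dirty and run_start is None:
                run_start = i
            elif not dirty and run_start is not None:
                self.encoder.set_fragment_textures(
                    self.pending[run_start:i], run_start)
                self.bound[run_start:i] = self.pending[run_start:i]
                run_start = None


encoder = FakeEncoder()
cache = TextureBindingCache(encoder, slot_count=4)
for _ in range(3):  # three draws that all want the same two textures
    cache.set_texture("albedo", 0)
    cache.set_texture("normal", 1)
    cache.flush()
print(encoder.calls)
```

Call `flush()` right before each draw. Because bindings do not carry over from one command encoder to the next, a real implementation would also reset the cache whenever a new encoder is created.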
Additionally, you can apply a similar approach to other shader stages and other binding types, like buffers and samplers, as well as render states. So, let's see what happened in the Metal System Trace. 4A Games was able to reduce encoding time between 30% and 50%, depending on the scene, because the translation layer wasn't repeatedly binding the same resources and render states. In addition, GPU time decreased by up to three milliseconds and, overall, resulted in a 15% speedup in their in-game benchmark. If you have a few redundant binding warnings, it's not an issue, but we definitely see an impact with hundreds or thousands of redundant bindings. So, avoiding redundant bindings gave us a further 15% reduction in average frame time. After these two improvements, 4A Games was able to meet their performance targets. So now, let's summarize what we learned from optimizing Metro Exodus for Apple GPUs. First, if you're using a custom workflow for shaders, you should check your compiler settings to ensure you are using the best options for your Metal applications. And if you see a lot of redundant binding warnings in the Metal Debugger, I showed you a technique to reduce encoding and GPU time overhead, which you can apply either to your engine or the translation layer that you are using. And now, I'd like to hand it over to Dustin, who's going to talk to you about Divinity: Original Sin 2 and demo the new Xcode GPU timeline features. Thanks, Jonathan. Hi, my name is Dustin, and I work on the GPU Software team here at Apple. And today, I'm excited to show you a hands-on demo optimizing an early build of Larian Studios' hit title, Divinity: Original Sin 2. Last year, Larian announced they were bringing their critically-acclaimed role-playing game Divinity: Original Sin 2 to the iPad. And over the last year, Larian has worked hard optimizing their game to run great on Apple GPUs, and the game is a lot of fun to play. 
Larian was able to achieve these results with the help of a great set of tools in Metal Debugger and Metal System Trace that are getting even better this year in Xcode 13 with the addition of the new GPU Timeline. Let's get started by taking a look at a frame of Divinity: Original Sin 2 I captured earlier. We are here on the Summary Page, which contains an overview of your frame that helps to guide you along the way as you debug and optimize your game. From the Summary Page, we can quickly navigate to all the great tools offered by the Metal Debugger, including the new GPU Timeline. And accessing it is as easy as clicking on the new Performance page here. So, let me go ahead and do that. Introducing the new GPU Timeline. The Timeline has been designed around Apple GPU's unique architecture that allows each GPU pipeline stage to run in parallel. In order to maximize performance, we need to keep all pipeline stages as busy as possible by maximizing overlap, which the Timeline allows you to easily see. The Timeline is composed of two sections. On the top, we have the GPU section, which is composed of separate tracks for each pipeline stage, making it really easy to see which stages are active and running in parallel. Underneath, we have the Counters section, which contains a curated set of important counters, such as shader occupancy, bandwidth, and performance limiters that provides us with deeper insight into how the GPU's system performance changes over the course of your workload. The encoders in the GPU tracks provide us with a lot of useful information, with even more just a click away. Selecting a Render Encoder brings up the Timeline's sidebar, which contains additional information for the currently-selected item. In this case, the sidebar contains render pass information, such as texture details, load/store actions, and the number of draw calls. 
Notice that since Render Encoders are composed of two shader stages, both the vertex and fragment stages are highlighted, as well. If we select the Fragment track instead, the sidebar contains all of the encoders in the Timeline, which can then be sorted based on time. But that's not all because we can expand the Fragment track to reveal the Shader Timeline, which shows all of the shaders used by the encoders during their execution. We can easily identify long-running shaders, as well as which shaders are running in parallel with others. For the Fragment track, we also have two additional tracks for load/store actions. This is useful to be able to see when the GPU is loading and storing attachment textures between local and main memory, and is an important consideration in order to reduce bandwidth usage. Selecting a shader will highlight all the regions on the timeline where it is active, and we can learn more about it from its compiler statistics and runtime performance metrics presented in the sidebar. Expanding the shader timeline shows each shader in its own track, which is useful for understanding the flow of your GPU workload and the order of shader execution. Now that you're a bit more familiar with the new GPU Timeline and thinking of all the ways that you will be able to use it yourself, let me show you just how easy it is to find performance bottlenecks using the GPU Timeline. Shader performance can suffer as a result of many factors, one of which is register pressure, and when this happens, the GPU runs out of fast register memory and has to use main memory instead. A high ALU limiter alone does not indicate a performance bottleneck. It may just be that your shader is math heavy. However, when combined with low shader occupancy, this may be an indicator of a shader experiencing register pressure, which will cause your shader to run slower. 
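The heuristic just described, a high ALU limiter together with low shader occupancy, can be written down as a simple scan over counter samples. This is a minimal sketch in plain Python. The sample format, the thresholds, and the `suspect_regions` helper are all hypothetical, and the GPU Timeline does not export data in this form as far as this sketch assumes.

```python
def suspect_regions(samples, alu_high=80.0, occupancy_low=30.0):
    """Return (start_ms, end_ms) spans where the ALU limiter is high
    while shader occupancy is low.

    `samples` is a time-ordered list of (time_ms, alu_percent,
    occupancy_percent) tuples.
    """
    spans = []
    start = None
    last_time = None
    for time_ms, alu, occupancy in samples:
        flagged = alu >= alu_high and occupancy <= occupancy_low
        if flagged and start is None:
            start = time_ms
        elif not flagged and start is not None:
            spans.append((start, time_ms))
            start = None
        last_time = time_ms
    if start is not None:
        spans.append((start, last_time))  # region runs to the last sample
    return spans


# Hypothetical counter samples: (time in ms, ALU limiter %, occupancy %).
samples = [
    (0.0, 40.0, 70.0),
    (1.0, 90.0, 20.0),
    (2.0, 95.0, 25.0),
    (3.0, 92.0, 60.0),  # still math heavy, but occupancy has recovered
    (4.0, 50.0, 65.0),
]
print(suspect_regions(samples))
```

The sample at 3.0 ms is deliberately not flagged: a high ALU limiter on its own may just mean the shader is math heavy.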
In order to highlight this better for today's demo, let me pin both the ALU track and the shader occupancy track to the top of the Timeline by clicking the "Plus" button here on the left.
As I scan over these two tracks, the first thing I notice is this region here, where the ALU spikes and, at the same time, shader occupancy drops. I can highlight a region on the Timeline to see how long it takes to execute. Notice as I do this, the counters in the sidebar update dynamically based on the selected region. This region here is taking about 3.7 milliseconds to execute. Let's zoom in and take a closer look. It looks like our issue is related to these first four encoders of the Ambient Occlusion pass. Let's see what shaders are being used by taking a look at the shader timeline. Looks like our issue is related to this shader here, as it's the only one being used. From the sidebar's runtime performance metrics, not only is this shader ALU intensive, it is float heavy, as well, so let's take a look at the Floating Point Utilization track. Notice as I hover over this track, this shader is only using F32. F16 is at 0%. From the Timeline, we can navigate directly to shader source by right clicking and opening the shader. Here in the source editor, we can see a simplified version of the shader source for demo purposes. Along with source, we can also see per-line cost information with the help of the shader profiler. Hovering over the shader profiler pie chart provides us with confirmation that this function is likely causing register pressure, as it is both ALU and float heavy. Situations like this are candidates for using F16, which gives us double the amount of registers in places where the full precision of F32 is not required, which will help to reduce register pressure. Metal Debugger makes it really convenient to update source code directly inside the source editor. Let me make this change here that uses an updated version of the shader that uses a mixture of F32 and F16. 
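The storage and precision trade-off between F16 and F32 can be illustrated outside of Metal. This is a minimal sketch in plain Python using the standard `struct` module's half-precision (`e`) and single-precision (`f`) formats. It says nothing about how registers are allocated on the GPU; it only shows that a half value takes half the bytes and carries less precision. The `round_trip` and `fits_in_half` helpers are hypothetical.

```python
import struct

HALF, FLOAT = "<e", "<f"  # IEEE 754 binary16 and binary32
HALF_MAX = 65504.0        # largest finite half-precision value


def round_trip(value, fmt):
    """Store `value` in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def fits_in_half(value):
    """True if the magnitude is within the finite range of half precision."""
    return abs(value) <= HALF_MAX


print("bytes per value:", struct.calcsize(HALF), "vs", struct.calcsize(FLOAT))

occlusion = 0.1  # a term in [0, 1], a typical candidate for half
print("half error :", abs(round_trip(occlusion, HALF) - occlusion))
print("float error:", abs(round_trip(occlusion, FLOAT) - occlusion))

print("0.1 fits in half:", fits_in_half(occlusion))
print("100000.5 fits in half:", fits_in_half(100000.5))
```

Values with a small, bounded range are where the reduced precision is usually acceptable. Large magnitudes or values that accumulate over many operations are where full precision is more likely to be required, which is why the updated shader uses a mixture of both.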
After making this change, I can click the "Reload Shaders" button down here at the bottom, which will trigger a shader update that both recompiles and reprofiles our shader, as well as updating the per-line shader costs. Let's see what effect this change has made by going back to the Timeline. The first thing I'd like to do is see how long those first four encoders of the Ambient Occlusion pass are taking.
Looks like this region here is taking about 2.6 milliseconds to execute. The change we just made has improved our shader execution time by over one millisecond, or 30%, which is a huge improvement. Taking a look at some of the counters from earlier, while ALU is still high, that is to be expected for a math-heavy shader. But notice now, our shader is experiencing less register pressure, as our shader occupancy has improved by almost double. This was accomplished by using a mixture of F32 and F16, which we can see using the Floating Point Utilization track. The GPU Timeline made it really easy for me to identify the issue, navigate to where the problem existed, and get it fixed. The GPU Timeline is a great tool for identifying not only shader performance issues, but also memory bandwidth and many other kinds of issues. I hope you enjoyed this demo of the brand-new GPU Timeline and are already thinking of all the ways that you're going to use it to optimize your games to run even better on Apple GPUs. Thank you, and enjoy the rest of WWDC. Back to Jonathan. Thank you, Dustin, for that amazing demo. And thank you for watching. It was great to share with you how we worked with Larian Studios and 4A Games to take advantage of the features on our Apple GPUs. They provide many ways to improve performance, ranging from lossless compression to overlapping shader workloads. And our tools, like Metal System Trace and the new GPU timeline in Xcode, will really be helpful to you as you improve your games. If there's one thing I can leave you with, a thorough examination of your rendering is essential to delivering a highly-optimized game, and our tools are there to help you with this. If you'd like to learn more, please refer to the related sessions, "Discover Metal debugging, profiling, and asset creation tools" in this year's WWDC, or "Optimize Metal apps and games with GPU counters" from WWDC20. Thank you, and farewell! [music]