Improve Core ML integration with async prediction

Learn how to speed up machine learning features in your app with the latest Core ML execution engine improvements and find out how aggressive asset caching can help with inference and faster model loads. We'll show you some of the latest options for async prediction and discuss considerations for balancing performance with overall memory usage to help you create a highly responsive app. Discover APIs to help you understand and maximize hardware utilization for your models. For more on optimizing Core ML model usage, check out "Use Core ML Tools for machine learning model compression" from WWDC23.

Related Videos

WWDC22
- Explore the machine learning development experience
- Optimize your Core ML usage
WWDC21
- Tune your Core ML models

♪ ♪ Ben: Hi, I'm Ben Levine, an engineer on the Core ML team. Today, I'm going to talk about what's new when it comes to integrating Core ML into your app. Building intelligent experiences in your app has never been easier. The Xcode SDK provides a solid foundation for leveraging and deploying machine learning-powered features. A set of domain specific frameworks give you access to built-in intelligence through simple APIs. The capabilities they provide are powered by models trained and optimized by Apple. These models are executed via Core ML. The Core ML framework provides the engine for running machine learning models on-device. It allows you to easily deploy models customized for your app. It abstracts away the hardware details while leveraging the high-performance compute capabilities of Apple silicon with help from the Accelerate and Metal family of frameworks. Core ML's mission is to help you integrate machine learning models into your app. This year, our focus for Core ML was performance and flexibility. We made improvements in our workflow, API surface, and also our underlying inference engine. Before jumping into the workflow and highlighting new opportunities for you to optimize your Core ML integration, here's an idea of the potential performance benefits that you can get automatically by just updating to the latest OS.
When comparing the relative prediction time between iOS 16 and 17, you'll observe that iOS 17 is simply faster for many of your models. This speedup in the inference engine comes with the OS and doesn't require re-compilation of your models or making any changes to your code. The same is true for other platforms as well. Naturally, the amount of speedup is model and hardware dependent.
Moving to the agenda, I'll start with an overview of the workflow when integrating Core ML into your app. Along the way, I'll highlight optimization opportunities for different parts of the workflow. Then I'll focus on model integration and discuss new APIs and behavior for compute availability, model lifecycle, and asynchronous prediction.
I'll start with an overview of the Core ML workflow. There are two phases for integrating Core ML into your app. First is developing your model, and second is using that model within your app. For model development, you have several options. One of the most convenient ways to develop your own model is to use Create ML. Create ML provides various templates for common machine learning tasks and can leverage the highly optimized models built into the OS. It guides you through the model development workflow and lets you interactively evaluate the results. If you want to learn more, check out this year's Create ML video.
Another way to develop a model is to train one using one of several Python machine learning frameworks. Then, use the CoreMLTools Python package to convert to the Core ML model format. Last, it's important you evaluate your model both in terms of accuracy and performance on Apple hardware. Using feedback from evaluation often results in revisiting some of these steps to further optimize your model. There are many opportunities for optimization in these steps. For training, how you collect and choose your training data is important. It should be consistent with the data being passed to the model when it's deployed and in your users' hands. The model architecture you choose is also important. You may be exploring multiple options, each with their own tradeoffs between training data requirements, accuracy, size, and performance. Many of these tradeoffs may not be fully visible at training time and require a few iterations through the full development flow.
Next is model conversion. Core ML Tools offers various options to help optimize the converted model's precision, footprint, and computation cost. You can select input and output formats that best match your app's data flow to avoid unnecessary copies. If your input shape can vary, you can specify that variation, rather than choosing just one shape or switching amongst multiple shape-specific models. Compute precision can also be explicitly set for the whole model or individual operations. Both float32 and float16 are available. In addition to the precision of computation, you also have some control over how your model parameters are represented. CoreMLTools comes with a set of utilities for post-training weight quantization and compression. These utilities can help you significantly reduce the footprint of your model and improve performance on-device. However, to achieve these benefits, there is some tradeoff in accuracy.
There are some new tools to help you in this space. There's a new optimize submodule in the CoreMLTools package. It unifies and updates the post-training compression utilities and adds new quantization-aware training extensions for PyTorch. This gives you access to data-driven optimizations to help preserve accuracy for quantized models during training. This is coupled with new operations which support activation quantization in Core ML's ML Program model type. Check out this year's session on compressing machine learning models with Core ML to learn more.
Next is evaluation. One option to evaluate your model is to run predictions on the converted model directly from your Python code with CoreMLTools. It will use the same Core ML inference stack that your app code will use and lets you quickly check how your choices during model conversion affect the model's accuracy and performance. Xcode also provides some helpful tools when it comes to evaluation and exploration of your models. Model previews are available for many common model types. This lets you provide some sample inputs to the model and preview the predicted output without having to write any code. Core ML performance reports provide you a breakdown of model computation performance for load, prediction, and compilation times on any attached device. Note that this can be useful to evaluate model architectures even before you've trained them.
Now, stepping back to the overall workflow, the next topic is model integration. Model integration is a part of developing your app. Just like any other resource you use in your app, you want to carefully manage and optimize how you use your Core ML model.
There are three steps in model integration. You first write the application code to use the model. You have code for where and when to load the model, how to prepare the model's input data, make predictions, and use the results.
Then you compile this code along with the model. And third, you test, run, and profile the model running within your app. When it comes to profiling, you may find the Core ML and Neural Engine instruments helpful. This is also an iterative process of design and optimization until you're ready to ship.
There are several new additions this year for optimizing your model integration. First is compute availability. Core ML is supported on all Apple platforms and by default considers all available compute to optimize its execution. This includes the CPU, GPU, and Neural Engine when available. However, the performance characteristics and availability of these compute devices vary across the supported hardware your app may run on. This may impact your users' experience with your ML-powered features or influence your choice in models and configurations. For example, some experiences may require models running on the Neural Engine to meet performance or power requirements.
There's now a new API for runtime inspection of compute device availability. The MLComputeDevice enum captures the type of compute device and the specific compute device's properties within its associated value. With the availableComputeDevices property on MLModel, you can inspect what devices are available to Core ML. For example, this code checks if there is a Neural Engine available. More specifically, it checks if the collection of all available compute devices contains one whose type is Neural Engine.
The next topic for model integration is understanding the model lifecycle. I'll start by reviewing the different model asset types. There are two kinds: source models and compiled models. The source model has a file extension of .mlmodel or .mlpackage. It's an open format designed for construction and editing. The compiled model has a file extension of .mlmodelc. It's designed for runtime access.
In most cases, you add a source model to your app target, then Xcode compiles the model and puts it in the app's resources. At runtime, in order to use your model, you instantiate an MLModel.
Instantiation takes a URL to its compiled form and an optional configuration. The resulting MLModel has loaded all the necessary resources for optimal inference based on the specified configuration and device-specific hardware capabilities. Here's a deeper look into what happens during this load. First, Core ML checks a cache to see if it has already specialized the model based on the configuration and device. If it has, it loads the required resources from the cache and returns. This is called a cached load. If the configuration was not found in the cache, it then triggers a device-specialized compilation for it. Once this process completes, it adds the output to the cache and finishes the load from there. This is called an uncached load. For certain models, the uncached load can take a significant amount of time. However, it's focused on optimizing the model for the device and making subsequent loads as fast as possible.
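As a rough illustration of the compute availability check described earlier combined with the instantiation step, the sketch below inspects `MLModel.availableComputeDevices` for a Neural Engine and then creates an `MLModel` from a compiled model URL and a configuration. The helper names and the idea of picking between a `preferred` and a `fallback` model are assumptions made for illustration, not code from the session.

```swift
import CoreML
import Foundation

/// Returns true when Core ML reports a Neural Engine among the available compute devices.
@available(macOS 14.0, iOS 17.0, tvOS 17.0, watchOS 10.0, *)
func hasNeuralEngine() -> Bool {
    MLModel.availableComputeDevices.contains { device in
        if case .neuralEngine = device { return true }
        return false
    }
}

/// Hypothetical policy: use one compiled model when a Neural Engine is present,
/// and a different (for example, smaller) one otherwise.
@available(macOS 14.0, iOS 17.0, tvOS 17.0, watchOS 10.0, *)
func loadModel(preferred: URL, fallback: URL) throws -> MLModel {
    let configuration = MLModelConfiguration()
    configuration.computeUnits = .all // The default: let Core ML consider all compute.
    let compiledModelURL = hasNeuralEngine() ? preferred : fallback
    // The first load for a given model and configuration may be an uncached load.
    return try MLModel(contentsOf: compiledModelURL, configuration: configuration)
}
```

Both URLs are expected to point at compiled `.mlmodelc` bundles, such as the ones Xcode places in the app's resources.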
During device specialization, Core ML first parses the model and applies general optimization passes to it. It then segments the chain of operations for specific compute devices based on the estimated performance and hardware availability. This segmentation is then cached. The last step is for each of the segments to go through a compute device specific compilation for the compute device they were assigned. This compilation includes further optimizations for the specific compute device and outputs an artifact that the compute device can run. Once complete, Core ML caches these artifacts to be used for subsequent model loads.
Core ML caches the specialized assets on the disk. They're tied to the model's path and configuration. These assets are meant to persist across app launches and reboots of your device. When the device's free disk space is running short, there has been a system update, or the compiled model has been deleted or modified, the operating system deletes the cache. If this happens, the next model load will perform the device specialization again.
In order to find out whether or not your model load is hitting the cache, you can trace your app with the Core ML Instrument and look at the load event. If it has the label "prepare and cache," then it was an uncached load, so Core ML performed the device specialization and cached the result. If the load event has the label "cached," then it was a cached load and did not incur a device specialization. This is new specifically for MLProgram models. Core ML performance reports can also give you visibility into the cost of a load. By default, it shows the median cached load.
It now has the option to display uncached load times as well. Since loading a model can be expensive in terms of latency and memory, here are some general best practices.
First, don't load models during your app's launch on the UI thread. Instead, consider using the async loading API or lazily load the model. Next, keep the model loaded if the application will likely be running many predictions in a row, rather than reloading the model for each prediction in the sequence. Lastly, you can unload the model if your app won't use it for a while. This can help alleviate memory pressure, and thanks to the caching, subsequent loads should be faster. Once your model is loaded, it's time to think about running predictions with the model. I'll jump into a demo to show the new async options.
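Before the demo, here is a minimal sketch of the loading guidance above: load lazily through the async loading API, keep the model around for repeated predictions, and drop it when it won't be needed for a while. `ModelHolder` and its method names are hypothetical. The sketch assumes Swift 5 language mode and doesn't guard against two callers triggering the first load at the same time.

```swift
import CoreML
import Foundation

/// Hypothetical holder that loads a compiled model on first use and can release it later.
actor ModelHolder {
    private let compiledModelURL: URL
    private var cached: MLModel?

    init(compiledModelURL: URL) {
        self.compiledModelURL = compiledModelURL
    }

    /// Returns the loaded model, loading it asynchronously the first time it's requested.
    /// Concurrent first calls may each start a load; the last one to finish is kept.
    func model() async throws -> MLModel {
        if let cached { return cached }
        let loaded = try await MLModel.load(contentsOf: compiledModelURL,
                                            configuration: MLModelConfiguration())
        cached = loaded
        return loaded
    }

    /// Drops the reference so the memory can be reclaimed; a later call to `model()`
    /// reloads it, typically as a cached load.
    func unload() {
        cached = nil
    }
}
```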
To show the new async prediction API, I'll be using an app which displays a gallery of images and allows for applying filters to the images. I'll focus on a colorizing filter that uses a Core ML model which takes a grayscale image as input and outputs a colorized version of the image. Here's an example of the app in action. It starts by loading the original images, which are in grayscale, and then once I select the Colorized image mode, it colorizes the images using Core ML. As I scroll down, the model is definitely working, but it's a bit slower than I expected. Also, if I scroll far down, I notice that it takes quite a while for the images to be colorized.
As I scroll back up, it looks like it was spending time colorizing all of the images along the way. But in my SwiftUI code, I'm using a LazyVGrid to hold the images, so it should be cancelling tasks when views go off screen. Let me take a look at my current implementation to try to understand why the performance is lacking and also why it doesn't respect tasks being cancelled. This is the implementation. Since the synchronous prediction API is not thread safe, the app has to ensure that predictions are run serially on the model. This is achieved by making ColorizingService an actor, which will only allow one call to the colorize method at a time. This actor owns the colorizerModel, which is the auto-generated interface that is produced for the model bundled with the app. The colorize method currently performs two operations. It first prepares the input for the model, which involves resizing the image to match the model's input size. It then runs the input through the model and gets the colorized output. I went ahead and captured an Instruments trace of the app running with the Core ML Instruments template.
When looking at the Instruments trace, it shows that the predictions are run serially, which is ensured by the actor isolation. However, there are gaps around each prediction before the next one is run, which is contributing to the lack of performance. These are a result of the actor isolation being wrapped around not only the model prediction but also the input preparation. One improvement would be to mark the input preparation as a non-isolated method, so it won't block the next colorize request from entering the actor. While this would help, the Core ML predictions themselves would still be serialized, which is the bottleneck of my processing.
To take advantage of concurrency for the Core ML predictions themselves, an option I can consider is the batch prediction API. It takes in a batch of inputs and runs them through the model. Under the hood, Core ML will take advantage of concurrency when possible. Making a batch version of the colorize method is pretty straightforward. However, the challenging part is figuring out how I'll collect the inputs into a batch and pass them to this method. There are actually multiple aspects of this use case that make it difficult to use the batch prediction API. The batch API is best used when there's a known quantity of work to be done. In this case, the number of images to be processed is not fixed but a function of screen size and the amount of scrolling done. I can pick a batch size myself, but I'll have to handle cases where the batch size isn't met but still needs to be processed. Also, I'll have a different UI experience where images are colorized in batches. Lastly, I won't be able to cancel a batch even if the user scrolls away from it.
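For comparison, a batch version of the prediction step might look like the sketch below. It works on `MLModel` and `MLFeatureProvider` directly rather than the auto-generated interface used in the session, and `colorizeBatch` is a hypothetical name. It assumes the inputs have already been prepared.

```swift
import CoreML

/// Runs a batch of already-prepared inputs through the model in a single call.
/// The whole batch is submitted at once, so individual items can't be cancelled.
func colorizeBatch(_ inputs: [MLFeatureProvider],
                   using model: MLModel) throws -> [MLFeatureProvider] {
    let batch = MLArrayBatchProvider(array: inputs)
    let outputs = try model.predictions(fromBatch: batch)
    // Unpack the batch provider into one output per input, in order.
    return (0..<outputs.count).map { outputs.features(at: $0) }
}
```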
Because of these challenges, I'd rather stick with an API that handles one prediction at a time.
This is where the new async prediction API can be very useful. It's thread safe and works well for using Core ML alongside Swift concurrency. To switch to an async design for the code, I first changed the colorize method to be async. I then added the await keyword in front of the prediction call, which is required to use the new async version of the API. Then I changed ColorizingService to be a class rather than an actor. That way, multiple images can be colorized concurrently. Lastly, I added a cancellation check to the start of the method. The async prediction API will do its best to respond to cancellation, especially when multiple predictions are requested concurrently, but it's best to include an extra check at the start in this case. That way, it also avoids preparing the inputs if the task was cancelled before the colorize method was even entered. Now I'll make these changes and re-run the app.
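With those changes applied, the service might look roughly like the sketch below. The session's code uses the auto-generated model interface, which the transcript doesn't show, so this version calls the async `prediction(from:)` on `MLModel` directly. The `prepareInput` closure is a hypothetical stand-in for the image resizing step.

```swift
import CoreGraphics
import CoreML

/// A class rather than an actor, so several colorize calls can be in flight at once.
@available(macOS 14.0, iOS 17.0, tvOS 17.0, watchOS 10.0, *)
final class ColorizingService {
    private let model: MLModel
    /// Hypothetical input preparation, e.g. resizing to the model's input size.
    private let prepareInput: (CGImage) throws -> MLFeatureProvider

    init(model: MLModel, prepareInput: @escaping (CGImage) throws -> MLFeatureProvider) {
        self.model = model
        self.prepareInput = prepareInput
    }

    func colorize(_ image: CGImage) async throws -> MLFeatureProvider {
        // Bail out early so cancelled work doesn't even prepare its input.
        try Task.checkCancellation()
        let input = try prepareInput(image)
        // In an async context, `await` selects the async prediction API.
        return try await model.prediction(from: input)
    }
}
```

A caller such as a SwiftUI `task` modifier can then invoke `colorize` per image, and the work is cancelled when the view goes off screen.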
Just as before, I'll set it to Colorized mode. I can already see the images are being colorized much faster. And if I do a quick scroll to the bottom, the images load almost immediately. Scrolling up a bit, I can verify the images are being colorized as I scroll back up, which means that the colorize calls were successfully cancelled the first time when I did the quick swipe to the bottom. When looking at a trace using this new async design, it shows the predictions are being run on multiple images concurrently. This is denoted by multiple prediction intervals stacked vertically. Since this model runs partially on the Neural Engine, it can also be observed in the Neural Engine Instrument as well. With the initial implementation, which colorized the images serially, colorizing the initial view of images without scrolling took about two seconds.
After switching to the async implementation, which colorized the images concurrently, that time was cut in half to about one second. So overall, I was able to achieve about a 2x improvement in total throughput by taking advantage of the async prediction API and concurrency with my Colorizer model. However, it's important to note that the amount a given model and use case may benefit from a concurrent design heavily depends on several factors, which include the model's operations, the compute units and hardware combination, and other work that the compute devices may be busy with. Also, the ML Program and Pipeline model types will provide the best performance improvements from running predictions concurrently.
Overall, when adding concurrency to your app, you should carefully profile the workload to make sure it's actually benefiting your use case.
Another important thing to keep in mind when adding concurrency to your app is memory usage. Having many sets of model inputs and outputs loaded in memory concurrently can greatly increase the peak memory usage of your application. You can profile this by combining the Core ML Instrument with the Allocations Instrument.
The trace is showing that the memory usage of my app is rising quickly as I load many inputs into memory to run through the colorizer model.
A potential issue is that the colorize method from my code has no flow control, so the number of images being colorized concurrently has no fixed limit. This may not be an issue if the model inputs and outputs are small. However, if they're large, then having many sets of these inputs and outputs in memory at the same time can greatly increase the peak memory usage of the app.
A way to improve this would be to add logic which limits the maximum number of in-flight predictions. This will result in fewer inputs and outputs loaded in memory concurrently, which will decrease the peak memory usage while running predictions. In this example, if there are already two items being worked on, it defers new work items until a previous one has completed. The best strategy will depend on your use case. For example, when streaming data from a camera, you may simply want to drop work instead of deferring it. This way, you avoid accumulating frames and doing work that's no longer temporally relevant.
Stepping back a bit, here's some general guidance on when to use the different prediction APIs.
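Before turning to that guidance, here is one way the deferral logic just described could be sketched. `PredictionGate` and `withPredictionSlot` are hypothetical names, not part of Core ML or the session's code. The gate admits up to `limit` work items and makes later ones wait until a slot is released. Work that is waiting for a slot only notices cancellation once it is admitted.

```swift
/// Hypothetical gate that caps how many predictions are in flight at once.
actor PredictionGate {
    private let limit: Int
    private var inFlight = 0
    private var waiters: [CheckedContinuation<Void, Never>] = []

    init(limit: Int) {
        precondition(limit > 0, "limit must be positive")
        self.limit = limit
    }

    /// Takes a slot immediately if one is free, otherwise suspends until one is handed over.
    func acquire() async {
        if inFlight < limit {
            inFlight += 1
        } else {
            await withCheckedContinuation { waiters.append($0) }
        }
    }

    /// Hands the slot to the oldest waiter, or frees it if nobody is waiting.
    func release() {
        if waiters.isEmpty {
            inFlight -= 1
        } else {
            waiters.removeFirst().resume()
        }
    }
}

/// Runs `work` while holding a slot from `gate`, releasing it on success or failure.
func withPredictionSlot<T>(_ gate: PredictionGate,
                           _ work: () async throws -> T) async throws -> T {
    await gate.acquire()
    do {
        try Task.checkCancellation() // Skip work that was cancelled while waiting.
        let result = try await work()
        await gate.release()
        return result
    } catch {
        await gate.release()
        throw error
    }
}
```

With a gate created as `PredictionGate(limit: 2)`, each colorize call would be wrapped in `withPredictionSlot`, so at most two sets of inputs and outputs are being processed at a time. A camera pipeline that prefers dropping frames would need a different policy.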
If you're in a synchronous context and the time between each input being available is large compared to the model latency, then the sync prediction API works well. If your inputs become available in batches, then the batch prediction API is a natural fit.
If you're in an async context and have a large number of inputs becoming available individually over time, that's when the async API can be most useful.
To wrap up, as you move through the Core ML workflow, there are many opportunities for optimization during both model development and model integration. New compute availability APIs can help you make decisions at runtime based on what hardware is available on the device. Understanding the model lifecycle and caching behavior can help you best decide when and where to load and unload your model. And lastly, the async prediction API can help you integrate Core ML with other async Swift code and also improve throughput by supporting concurrent predictions. This was Ben from the Core ML team, and I am not an AI.