Using Core ML models in Vision makes the creation of powerful Computer Vision applications easy. Learn how easy it is to use custom trained classifiers and object recognition models in a live camera capture. In addition, you'll learn about the latest additions to the Vision Framework along with a deeper dive into some its fundamentals.
Good afternoon, and welcome to WWDC. I know I'm competing now with the coffee and cookies, but I don't know if the coffee is gluten free and the cookies are actually caffeine free, so stick with me. My name is Frank Doepke and I'm going to talk about some interesting stuff that you can do with Computer Vision using Core ML and Division Framework.
So, what are we going to talk about today? First, you might have heard that we have something special for you in terms of custom image classification.
Then we're going to do some object recognition. And last but not least, I'm going to raise your level of Vision awareness by diving into some of the fundamentals.
Now, Custom Image Classification, we've seen the advantages already in some of the presentations earlier, and we see like what we can do with flowers and fruits and I like flowers and fruits as much as everybody else. Sorry, Lizzie, but I thought we'd do something a bit more technical here. So, the idea is, what can we do if we create a shop and we are all geeks, so let's build a shop where we can build robots. So, there are some parts that we need to identify.
And I thought it would be great if I have an app that actually can help customers in my store identify what these objects are.
So, we've got to train a customer classifier for that. Then, once we have our classifier, we build an iOS app around that, that we can actually run on all devices.
And when going through this, I'm going to go through some of the common pitfalls when you actually want to do any of this stuff and try to guide you through that. Let's start with our training.
So, how do we train? We use of course, Create ML.
The first step of course we have to do, is we have to take pictures.
Then we put them in folders and we use the folder names as our classification labels.
Now, the biggest question that everybody has, "How much data do I need?" The first thing is, well, we need a minimum of about 10 images per category, but that's on the low side. You definitely want to have more. The more, the better, actually your classifier will perform better.
Another thing to look out for is highly imbalanced data sets. What do I mean by that? When a data set has like thousands of images in one category, and only ten in the other one, this model will not train really well. So, you want to have like an equal distribution between most of your categories.
Another thing that we actually introduce, is augmentation.
Augmentation will help you to make this model more robust, but it doesn't really replace the variety. So, you still want to have lots of images of your objects that you want to classify. But, with the augmentation, what we're going to do is, we take an image and we perturb it. So, we have noised it, we blur it, we rotate it, flip it, so it looks different to the classifier actually when we train it.
Let's look a little bit under the hood of how our training actually works.
You might have already heard it, the term, transfer learning. And this is what we're going to use in Create ML when we train our classifier.
So, we start with a pretrained model, and that's where all the heavy lifting actually happens. These models train normally for weeks, and with millions of images, and that is the first starting point that you need to actually work with this. Out of this model, we can use this as a feature extractor. This gives us a feature vector which is pretty much a numerical description of what we have in our image.
Now, you bring in your data and we train the set, what we call the last layer, which is the real classifier, on your label data and out comes your custom model. Now, I mentioned already this large, first pretrained model.
And we have something new in Vision for this . This is what we call the Vision FeaturePrint for Scenes.
It's available through Create ML and it allows you to train an image classifier.
It has been trained on a very large data set, and it is capable of categorizing over a thousand categories. That's a pretty good distribution that you can actually use .
And we've already used it. Over the last few years, through some of the user pictures that you've seen in photos, have been actually using this model underneath.
We're also going to continuously improve on that model, but there's a small caveat that I would like to kind of highlight here.
When we come out with a new version of that model, you will not necessarily automatically get the benefits unless you retrain new model. So, if you start developing with this, this year, and you want to you know, take advantage of whatever we come out over the next years, hold onto your data sets so you can actually retrain . A few more things about our feature extractor. It's already on the device, and that was kind of an important decision for us, because it makes the disc footprint for your models significantly smaller.
So, let's compare a little bit.
So, I chose some common, you know, available models that we use today. The first thing would be Resnet. So, if I train my classifier on top of Resnet, how big is my model? Ninety-eight megabytes.
If I use Squeezenet, so Squeezenet is a much smaller model. It's not capable of differentiating as many categories, and that's 5 megabytes. So, it's about saving there, but it will not be as versatile. Now, how about Vision? It's less than a megabyte in most cases. The other thing of course why we believe that this is a good choice for use, it's already optimized. And we know a few things about our hardware or GPUs and CPUS and we really optimized a lot on that model so that it performs best on our devices. So, how do we train ? We start with some labeled images, and bring them into Create ML, and Create ML knows how to extract Vision Feature Print.
It trains our classifier and that classifier is all what will go into our Core ML model. That's why it is so small. Now, when it comes time that I actually want to analyze an image, all I have to do is use my image and model, and now in Vision or in Core ML, it knows another again how to train -- sorry. Not train. In this case, use our Vision Feature Print and we'll the classification. So, that was everything we needed to know about the training.
But, I said there was some caveats that you want to kind of look at when we deal with the app. So, first thing, we only want to actually run our classifier when we really have to.
Classifiers are deep convolutional networks that are pretty computational intensive. So, when we run those, it will definitely use up kind of you know, some electrons running on the CPU and GPU. So, you don't want to use this unless you really have to.
In the example that I'm going to show in my demo laters, I really only want to classify if actually the person looks really at an item and not when just a camera moves around.
So, I'm asking the question, "Am I holding still?" and then I'm going to run my classifier.
How do I do this? By using Vision, I can use Registration. Registration means I can take two images and align them with each other. And you're going to tell me, "Okay, if you shift it by this amount of pixels, this is actually in how they would actually match." This is a pretty cheap and fast algorithm, and it will tell me if I hold the camera still or if anything is moving in front of the camera. I used VN Translational Image Registration Request. I know that is a mouthful.
But that will give me all this information. So, to visualize this first, let's look at the little video. What I'm doing in this video is I show basically a little yellow line that shows me how basically my camera has moved or the Registration requests have moved over the last couple of frames. So, what out for that yellow line and see if it's long, then I have moved the camera around quite a bit, and when I'm holding it still, it should actually be a very small line. So, you see the camera is moving, and now I'm focusing on this, and it gets very short.
So, that's just a good idea. It's like, "Okay, now I'm holding still. Now, I want to run my classifier." Next thing to keep in mind is, have a backup plan. It's always good to have a backup plan.
Classifications can be wrong.
And what that means, even if my classification actually has a high confidence, I need to kind of plan for sometimes that it doesn't work quite correctly.
The one thing that I did in my example here, as you will see later on, I have something wherefore which I don't have a physical object. So, how do I solve that? In my example, I'm using our backward detector that we have in the Vision Framework to read some data backward label to identify this.
Alright, enough of slides.
Who wants to see the demo? Okay, what you see on the right-hand side of the screen is my device. I'm going to start now my little robot shop application. And when you see I'm moving around, there's nothing happening. When I hold still, and I point it at something, I should see a yellow line and then voila, yes step is a stepper motor.
Okay? Let's see what else do we have here on this table? That is my micro controller. That's a stepper motor driver. We can also like you know, pick something up and hold it.
Yes, this is a closed loop belt.
What do we have here? Lead screw. And as I said, you can also look at the barcode here, if I get my cable long enough.
And that is my training course. Hey, Frank? Of course, for that I didn't-- Frank? What's going on? Frank, yes, I'm going to need you to add another robot part to your demo. This is Brett. That's my manager.
That'd be great.
As usual, management, last minute requests. I'll make sure you get another copy of that memo. I'm not going to come in on Saturday for this.
Alright, well what do we have here? We have a motor. Alright, let's see. It might just work. Let me try this.
Do I get away with that? No, it can't really read this object. Perhaps so? It's -- no, it's not a stepper motor. So, that's a bug. I guess we need to fix that.
Who wants to fix this? Who wants to fix this? Alright.
So, what I have to do now is I have to take some pictures of the aforementioned, motor. So, I need to go to the studio and you know, set up the lights, or I use my favorite camera that I already have here.
Now, let's see. We're going to take a bunch of pictures of our motor. And it's kind of important to just kind of vary it and don't have anything else really in the frame.
So, I'm going for a . I need to have at least ten different images. Good choices. I always just like to put it on a different background.
And make sure that we get a few captured here. Perhaps I'm going to hold it a little bit in my hand. Okay, so we have now a number of images.
Now, I'm going to go over to my Mac and actually show you how to actually do then the training work for that. Okay, I'm bringing up my image capture application and let me for a moment just hide my .
If I now look in my Finder, you can actually see I have my training set, which I used already earlier to train and model that I've been using in my application.
And I need to create now a new folder, and let's call that Servo. And from Image Capture, I can now simply take all the pictures that I just captured and drag them into my Servo. Alright, so now we have that added. Now, I need to train my model again since my manager just ruined what I've done earlier. Okay.
I used a simple scripting playground here. Not the UI, just because well, this might be something I want to later on incorporate as a build step into my application.
So, I'm pointing it simply at the data set that we just added our folder to, and I'm just simply going to train my classifier, and in the end, write out my model.
So, what's going to happen now as you can see, we're off to the races.
It's going to go through all these images that we've already put into our folders and extracts the scene print from that. It does all the scaling down that has to happen and will then in the end, train and model based on that. So, it's a pretty complex task, but you see for you, it's really just one line of code, and in the end, you should get out of it, a model that you can actually use in your application.
Let's see as it just finishes. We're almost there. And voila, we have our model.
Now, that model, I've already referenced in my robot shop application. This is what we see here now. As you can see, this is my image classifier. It's 148 kilobytes. That's smaller than the little startup screen that I actually have. So, one thing I want to highlight here already, and we're going to go into that a little bit later. So, this image, that I need to pass into this has to be of a-- a color image and a 299 by 299 pixels. Strange layout but this is actually what a lot of these classifiers will do.
So, now I have hopefully a model that will understand it. Now, I need to go into my sophisticated product database which is just a key list.
And I'm going to add my Servo to that.
So, I'm going to rename this one here. This is Servo. I'm giving it a label. This is actually what we're going to see. This is a servo motor and let's say this is a motor that goes swish, swish.
Very technical. Alright.
Let's see if this works.
I'm going to run my application now.
That's -- okay.
Let's try it. There's our servo motor. Just to put this in perspective. This was a world first that you saw. A classifier being trained live on stage, from photos, all the way into the final application. I was a bit sweating about this demo .
Now we've seen how it works. There's a few things I want to highlight actually when we actually go through the code. So, I promise, we're going to live code here a little bit. Okay, let me take everything away that we don't need to look at right now. And make this a bit bigger. Okay, so, how did I solve all this? So, we started, we actually created a Sequence Request Handler. This is the one that I'm going to use for my registration work, just as Sergei already explained in the earlier session, this is good for tracking objects.
I will create my request, put them into one array, and now what do you see here for the registration? Just keeping like the last 15 registration results, and then I'm going to do some analysis on that and see if actually I'm holding still.
I'm going to keep one buffer that I'm holding onto while I'm analyzing this. This is actually when I run my classification.
And since this can be a longer running task, I'm actually going to run this on a separate queue. Alright, here's some code that I actually just used to open. This is actually like the little panel that we saw. But the important part is actually, "So, how do I setup my Vision task?" So, I'm going to do two tasks. I'm going to do a barcode request and I'm going to do my classification request. So, I set up my barcode request.
And in my completion handler, I simple look at, "Do I get something back?" and also just since I'm only expecting one barcode, I only look at the very first one.
Can I decode it? If I get a string out of it, I use that actually to look up -- that's actually how it worked with my barcodes to see then -- oh, yes, that is my training . Alright, so I add that as one of the requests that I want to run. Now, I'm setting up my classification. So, in this case, what I've done, I used my classifier and I'm loading simply this from my bundle and create my model from there. Now, I'm not using the code completion from Core ML in this case, because this is the only line of Core ML that I'm actually using my whole application, and it allows me to do my custom kind of error handling. But, you can choose also to use the code completion already from Core ML. Both are absolutely valid.
Now, I create my Vision model from that. My Vision Core ML Model, and my request. And again, simply when the request returns, meaning I'm executing my completion handler. I simply look, "What kind of specifications did I get back?" And then I set this threshold here. Now, this is one that I empirically set against a confidence goal. I'm using 0.98. So, a 98% confidence that this is actually correct. Why am I doing that? That allows me to filter out actually when I'm looking at something, and maybe not quite sure what that is. Maybe we'll see that in a moment, actually what that means.
So now, I have all my requests.
When it comes to the time that I actually want to execute them, I created a little function here that actually mean, "analyze on the current image." So, when it's time to analyze it, I get my device orientation, which is important to know how am I holding my phone.
I create an image request handler on that buffer that we currently want to process.
And asynchronously, I let it perform its work.
That's all I have to do basically for actually doing the processing with Core ML and barcode reading.
Now, a few things, just okay. How do I do the scene stability part? So, I have a queue that I can reset. I can add my points into that.
And then simply I had created a function that allows me to look basically through the queue of points that I've recorded. And then setting like, "Well, if they all sum together, only show a distance of like 20 pixels." Again, that's an empirical value that I chose. Then I know the scene is stable. So, I'm holding stable. Nothing is moving in front of my camera.
And then comes our part of catch the output. So, this is the call that actually AV Foundation calls buffers from the camera.
I'm making sure that I you know, hold onto the previous pixel buffer because that's what I'm going to compare against for my registration, and swap those out after I'm done with that.
So, I create my Translational Image Registration Request with my current buffer. And then on the Sequence Request Handler, I simply perform my request.
Now, I get my observations back. I can check if they are all okay. And add them into my array.
And last but not least, I check if the scene's stable.
Then I bring up my little, yellow box which is detection overlay.
I know this is my currently analyzed buffer.
And simply ask it to form its analysis on that. The one thing that you saw that I did at the end of the asynchronous call, in the currently analyzed buffer, I released that buffer. And you see here that I check if there is a buffer currently being used. Now, that allows me to make sure that I'm only really working on one buffer, and I'm not constantly queueing up more and more buffers while they're still running in the background, because that will starve the camera from frames.
Alright, so when we run this, there's a few things I want to highlight actually. So, let me bring up Number 1, our console here a little bit on the bottom. And you can actually see, when I'm running this at first, and so, I'm going to run this right now here. Hopefully you should actually see something.
You see that the confidence scores are pretty low because I'm actually over the . It's not really sure what I'm really looking at. The moment actually I point it at something that it should identify, boom, our confidence score goes really high and that's actually how I'm sure that this is really the object I want to show.
Now, another thing I wanted to demo, let's look actually what happens in terms of the CPU. Alright, so right now, I'm not doing anything. I'm just showing my screen.
So, when I'm just moving the camera around, scene is not stable, I'm using about 22% of the CPU. Now, if I hold it stable and actually the classifier runs, you see how the CPU goes up. And that's why I always recommend only run these tasks when you really need. Alright, that was a lot to take in. Let's go back to the slides and recap a little bit what we have just seen to now.
So, go right to the slides. Okay, recap. First thing, how did we achieve our scene stability? We used a sequence request handler together with the VN Translational Image Registration Request, to compare against the previous frame.
Out of that, we get our translation as of terms of an alignment transform which tells me the X and Y of like how the previous frame has shifted to the current one.
Then we talked about that we want to only analyze the scene when it's stable.
And for that, we created our VN Image Request Handler off the current buffer. And we passed in together both the barcode detection and the classification. So, that allows Vision to do its optimization underneath the covers and can actually perform much faster than if you would run them as separate requests. Next was the part about thinking about how many buffers do I hold in flight? And that's why I say manage your buffers.
Some Vision requests, like these convolutional networks, can take a bit longer. And these longer running tasks are better to perform on a background queue, so that or whatever you do in the camera, can actually continuously running. But to do this particularly with the camera, you do not want to continuously queue up the buffers coming from the camera. So, you want to drop busy buffers. In this case, I said I only work with one. That's actually in my use case scenario works pretty well. So, I have a queue of 1, and that's why I simply held onto one buffer and check, as long as that one is running, I'm not queuing new buffers up. Once I'm done with it, I can reset and work on the next buffer. Now, you might ask, "Why am I using Vision when I can run this model in Core ML? It's a Core ML model." Well, there is one thing why this is important to actually use Vision. Let's go back and look what we saw when we looked at our model. It was the strange number of 299 by 299 pixels. Now, this is simply how this model is trained. This is what it wants to ingest.
But our camera gives us something like 640 by 480 or larger resolutions if you want.
Now, Vision is going to do all the work by taking these buffers as they come from the camera, converts it into RGB and scales it down and you don't have to write any code for that. That makes it much easier to drive these Core ML models for image requests through Vision. So, that was image classification. Next, we talk about object recognition. Now, a little warning. In this demo, actually a live croissant might actually get hurt on stage. So, for the squeamish ones, please look away.
So, what we're using for our object recognition is a model that is based on this YOLO technique, You Only Look Once. That is a pretty fast-running model that allows us to get the bounding boxes off objects and a label for that. And it finds multiple of them in the screen. As you see in the screenshots.
The advantage of those is that I get actually like where they are, but I won't get as many classifications as I can do with like our overall image classifier. The training is also a little bit more involved, and with that, I would like to actually refer you to the Turi Create session that was, I believe, yesterday, where they actually showed you how to train these kind of models. These models are also a little bit bigger. So, how does this look? Let's go over to our demo. Robot shop is closed.
Time for breakfast.
Let me bring up my quick template here and I have my new little application. It is my Breakfast Finder. And what do we see? Oh, we have a croissant, we have a bagel, and we can identify the banana.
And see, they can all be kind of like in the frame. I'm going to detect them.
So, some mention in these cooking shows, normally they show you how to do it, but then pull the prebaked stuff out of the oven.
Well, this model, actually has been baked way before this croissant has been baked, and I can prove this.
It's fresh. And still a croissant. Alright.
It's fresh but still, got to chew.
Let's look quickly how this looks in the code. So, what did I do differently? actually in the terms of like setting up my request.
All I have to do is use my Core ML model, just as I did in the previous example, create my Core ML request, and afterwards, I'm looking actually at simply, "How do I draw my results?" Now this is where we have something new to make this a little bit easier.
And when we look at all of this, we get a new object that is the recognized -- Recognized Object Observation, and out of that, I get my bounding box, and my observation of like the labels.
Now, that's one thing I would like to show you here.
Let's run our application from here and I put a break point.
Alright, we are now on our break point.
So that I only look at the first label. So, what we are doing when we actually process these results, I'm going to take this, let's try object observation, labels. So, what you actually see is that I get more than one back. I get my bagel, my banana, coffee -- I didn't bring any coffee today. Sorry about that. And croissant, egg, and waffle. Now, they are sorted in the order of like the highest confidence on the top. Usually, this is the one that you're interested in. That's why I'm taking the shortcut here and just looking at the first one. But you always get all of the classifications back in terms of an array of the ones that we actually support in the model.
That was our Breakfast Finder. Let's go back to the slides, and this time, I'm pushing the button. Good.
So, we made this possible through a new API and that is our VN Recognized Object Observation.
It comes automatically when we perform a Core ML model request, and if that model is actually using an object detector as a space.
Like in this example, it is based on the YOLO based models. Now, you might say, "Well, I could have already run YOLO like last year. There were a bunch of articles that I saw on the web." But look at how much code they actually had to write to take the output of this model, to then put it into something that you can use. And here, we only had a few lines of code. So, it makes YOLO models really, really easy to use now. Let's go once more over this in the code. So, I create my model.
From the model, I create my request.
And in my completion handler, I can simply look at the area of objects, because we saw we can get multiple objects back. I get my labels from that, my bounding box, and I can show my Breakfast Finder. Now, there's one more thing I would like to highlight in this example.
You saw how these boxes were a little bit jittering around because I was running the detector frame by frame, by frame, by frame. Tracking can often be a better choice here. Why? Tracking is much faster even in terms of like running, than actually these models would run.
So, redetection takes more time than actually running a tracking request.
I can use the tracker basically if I want to follow now an object on the screen because it's a lighter algorithm. It runs faster. And on top of it, we have temporal smoothing so that these boxes will not jitter around anymore and if you see some of our tracking examples, they actually move nicely and smoothly across the screen.
If you want to learn more about tracking, the previous session from my colleague Sergei, actually talks about how to do all the implementation work of that. Alright, last but not least, let's enhance our Vision mastery and go into some of our fundamentals.
Few things are important to know when dealing with the Vision framework.
First and foremost, and this is a common source of problems, image orientation.
Now, not all of our Vision algorithms are orientation agnostic. You might have heard early that we have a new face detector that is orientation agnostic. But the previous one was not.
This means we need to know what is the upright position of the image? And it can be deceiving because if you look at it and preview on the final, my image looks upright, but that is not how it is stored on disk.
There is something that tells us how the device is oriented, and this is called the EXIF Orientation.
So, if an image is captured, that's normally in the sensor orientation, with the EXIF, we know what is actually upright and if you pass an URL into Vision as our input, Vision is actually going to do all that work basically for you and actually read this EXIF information from the file.
But like when -- as we showed in the demos earlier, if I use my live capture feed, I need to actually pass this information in. So, I have to look at what does my orientation from my UI device current orientation and convert this to CG Image Property Orientation because we need it in the form of an EXIF orientation. Next, let's talk a little bit about our coordinate system.
For Vision, the origin is always in the lower, left corner. And all processing is done in the up right -- if the image would be in an upright position, hence the orientation is important.
All our processing is really done in a normalized coordinate space, except the registration where we actually need to know how many pixels . So, normalized coordinates means our coordinates go from zero, zero, to 1,1, in the upper right corner.
Now what you see here is to that performed face and landmark detection request. And you will see that I get the bounding box for my face, and the landmarks are actually reported in relative coordinates to that bounding box. If you need to go back into the image coordinate space, we have utility functions and VNUtils like the VN Image from normalized way to convert back and forth between those coordinates.
Next, let's talk about confidence scores. We touched a little bit on this already during our robot shot example. A lot of our algorithms can express how confident they are in their results.
And that is kind of an important part to know when I want to analyze later on what I get out of these results. So, if I have a low confidence of zero, or do I have a high confidence of 1? Clicker. Here we go. Alright. Unfortunately, not all algorithms will have the same scale in terms of like how they report their confidence scores. For instance, if we look at our text detector, it pretty much always returns a confidence score of 1 because if it doesn't think there's text, it's not going to return the bounding box in the first place.
But as we saw, the classifiers have a very large range actually of what this confidence score could be. Let's look at a few examples.
In my first example, I used an image from our robot shop example and I ran my own model on it. And sure enough, it had a very high confidence, this is a stepper motor.
Now, on this next examples, I'm going to use some of the models that we have in our model gallery.
So, don't get me wrong. I don't want to compare the quality of the models. It's about like actually what did confidence they return and what actually want to do with this.
So, what did this tell us basically when we want to classify this image? Well, it's not that bad, but it's really sure about it either. The confidence score of 0.395 is not particularly high, but yes, it has a sand part, there's some beach involved. So, that's usable basically as a result when I want to search for it, but might I label the image with that? It's probably questionable. Let's look at this next example.
Girl on a scooter. What did the classifier do with this? Well, I'm not sure she's so happy to be called a sweet potato. Let's look at one more example. So, here's a screen shot of my code.
What does the classifier do with that? It thinks it's a website. Computers are so dumb.
So, some conclusions about our confidence scores.
Does 1.0 always mean it's a 100% correct? Not necessarily. It will fill the criteria of the algorithm, but our perception, as we saw particularly with the sweet potato is quite different.
So, when you create an application that wants to take advantage of that, please keep that in mind. Think of it. If you would write a medical application and saying, "Oh, you have cancer," that might be a very strong argument where you want to probably soften that a little bit depending on like how sure you can really be on the results.
So, there are two techniques that you can use for this. As we saw in the robot shot example, I used a threshold on the confidence score because I really label the image and I you know, when it's actually filtering everything out that had a low confidence score. If on the other hand, I want to create a search application, I might actually use some of the images that I had and still show them, probably on the bottom of the search because there's still a valid choice, perhaps in the search results.
As usual, we find some more information on our website. And we have our lab tomorrow at 3 p.m. Please stop by. Ask your questions. We're there to help you. And with that, I would first of all like to thank you all for these great application that you create with our technology. I'm looking forward to see what you can do with this. And thank you all for coming to WWDC. Have a great rest of the show. Thank you.
[ Applause ]
Looking for something specific? Enter a topic above and jump straight to the good stuff.
An error occurred when submitting your query. Please check your Internet connection and try again.