Transcript of a talk given August 7th, 2017. Writeup and video here: https://academy.realm.io/posts/brett-koonce-cnns-swift-metal-swift-language-user-group-2017/

Thank you all for coming. I’m going to do a talk on convolutional neural networks, Swift and a little bit of iOS 11 using the new APIs. Does anyone have any experience with machine learning here? Maybe about half of you. Any computer vision related stuff? Just a couple. My background is a long story, but I went back to school to get a master’s degree and I ended up in the computer vision lab at the University of Missouri with Professor Pal there. That’s how I got into a lot of this stuff. But I’ve been recently playing with it again, and I’d like to save you all the trouble of learning all this stuff that I had to in order to get back up to speed.

Our goal is image recognition in some form or another on a mobile device. We’re mobile programmers, so we don’t have the cloud or JavaScript or any of that stuff to magically save us, we just work on real devices in the real world for real people. Towards that end, I’m going to do a quick background on machine learning and neural networks, then we’ll do a demo of how to combine Keras, TensorFlow, and CoreML which is a new part of iOS 11 to do image recognition on a device. Then from there, we’ll go into some more advanced convolutional neural networks, different models, discuss different training and production improvements you can make in order to speed things up, and hopefully, in the end, I’ll leave you with some shiny things to play with.

So, what is machine learning? I think this XKCD comic sums up the basic idea. You take some input data, do some magic with math, and then you get the right answer. And if not, then you wave your hands and say you need to redo your models.

More specifically, we take an input, which can really be thought of as a set of numbers. If you step back and look a little wider: what is a picture? A picture is just a very large collection of numbers, a large array. In the same way, you can take audio and convert it into a spectrogram, which basically means audio is a fancy form of a picture, which is a fancy version of a bunch of numbers. What is video? It’s a bunch of pictures plus some audio, so it’s just a fancy form of a bunch of numbers. So we take a bunch of numbers. We have known data: cat pictures versus dog pictures. We want to combine them together to form a magical black box, so that a cat picture comes out labeled cat and a dog picture comes out labeled dog. We don’t care how it figures it out, just so long as it actually works. To get there, you usually have to train your model with lots of samples. Finally, you want to run it on some unknown data, and hopefully your unknown picture of a cat ends up labeled as a cat. What we care about is the quality of our model, meaning how accurate it is on our data. We can also think about the size of our model, or how many CPU cycles it takes to run.

The first part of the demo we’re going to do is the MNIST data set. This is like Hello World for computer vision. The data set is from the ’90s, and each sample is a 28x28 array of grayscale pixels. This is what a hand-drawn 1 would look like, and these are basically what the actual number representations are. There are a bunch of different ways to tackle the MNIST problem. If you’re interested, I’ve literally tried doing this with a dozen different algorithms; it’s a good, easy way to get going. So we’re going to build a neural network. Basically, a neural network is going to try to learn an activation function. This is an example of an activation function for a very simple neural network that can only recognize 0’s and 1’s. The blue area would be the activation area, and the red area would be negative. The 0 overlaps a lot with the blue, so that bumps the score up, and nothing in it overlaps the red portion, so this will score high for a 0. If you take a 1, on the other hand, and put it into the same function, it overlaps slightly with the blue region, so it gets a plus there. But it also falls in the red region, which knocks the score back down. So, this is probably not a 0. For 0’s and 1’s, it’s fairly easy to follow what’s going on in the activation maps.
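To make the blue/red picture concrete, here is a minimal sketch of that kind of scoring. The 5x5 grids and the hand-written weights are made up for illustration (a real network learns its weights, and MNIST digits are 28x28); the names `ZERO_DETECTOR` and `score` are hypothetical.

```python
# +1 marks the "blue" activation area, -1 the "red" negative area, 0 is ignored.
# These hand-written weights are illustrative; a real network learns them.
ZERO_DETECTOR = [
    [0, 1, 1, 1, 0],
    [1, 0, -1, 0, 1],
    [1, 0, -1, 0, 1],
    [1, 0, -1, 0, 1],
    [0, 1, 1, 1, 0],
]

# Tiny stand-ins for a hand-drawn 0 and 1 (1 = ink, 0 = blank).
ZERO = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
ONE = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]


def score(image, weights):
    """Sum of pixel * weight over every cell of the grid."""
    return sum(
        pixel * weight
        for image_row, weight_row in zip(image, weights)
        for pixel, weight in zip(image_row, weight_row)
    )


zero_score = score(ZERO, ZERO_DETECTOR)  # all ink lands on blue cells
one_score = score(ONE, ZERO_DETECTOR)    # a little blue, mostly red
print(zero_score, one_score)             # prints: 12 -1
```

The 0 scores high because every inked pixel sits on a positive weight, while the 1 picks up two positive cells and three negative ones and ends up below zero.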

But if we wanted to do all ten digits, the numbers 0–9, all of a sudden our activation functions start to look much more complicated. So the basic neural network that we’re going to build simply takes our input digit from the MNIST data set, feeds it through two fully connected neural network layers, and finally there’s a classifier layer which categorizes it as a digit from 0–9.
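The shape of that network can be sketched in plain Python. This is not the Keras code from the demo: the 512-node layer width is taken from a later remark in the talk, while the ReLU activation, the random untrained weights, and all function names are assumptions for illustration.

```python
import math
import random

# 28x28 input pixels, two fully connected layers, 10 digit classes.
LAYER_SIZES = [784, 512, 512, 10]


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(inputs, weights, biases):
    """Fully connected layer: every output sees every input."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def forward(pixels, layers):
    """Run the hidden layers with ReLU, then return 10 raw class scores."""
    activations = pixels
    for weights, biases in layers[:-1]:
        activations = relu(dense(activations, weights, biases))
    weights, biases = layers[-1]
    return dense(activations, weights, biases)


rng = random.Random(0)
layers = [init_layer(rng, n_in, n_out)
          for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:])]
param_count = sum(len(w) * len(w[0]) + len(b) for w, b in layers)

fake_digit = [rng.random() for _ in range(784)]  # stand-in for a real image
scores = forward(fake_digit, layers)
print(param_count, len(scores))
```

With these layer sizes the network has 669,706 parameters. If they were stored as 32-bit floats, that would be roughly 2.7 MB.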

So for the first part of this, we’re going to build a neural network that does all this. Our goal is to do MNIST recognition on our device, or more broadly, just on iOS. We’re going to use Keras, which is a high-level library in Python, to build our network. We’ll use TensorFlow, which is a matrix math library from Google, to do our actual calculations, and then we’ll deploy our model using CoreML. I wrote this code a couple months ago to teach myself the new APIs after they came out at WWDC. You can download it on the internet if you want, but I’ll walk you through it now.

This is the basic script right here. We have some Python imports and some parameters. This part prepares our data set for doing our actual math. This converts our categories into vectors, and finally, we build a model right here. We use Keras to train it on our input data set, and then we have a little bit of code right here: Apple released a tool to convert Keras and other models to CoreML, so this does nothing more than save the model as a file. We’ll run the script here… In the demos I’m doing today, everything can be done on a laptop. You don’t need a full-blown GPU or anything.
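The two data preparation steps mentioned here can be sketched without any libraries. This is an illustration of the idea rather than the script from the demo; it assumes 8-bit grayscale pixels (0 to 255), and the names `normalize` and `to_one_hot` are hypothetical.

```python
def normalize(pixels):
    """Scale 8-bit pixel values (0-255) into the range 0.0-1.0."""
    return [p / 255.0 for p in pixels]


def to_one_hot(label, num_classes=10):
    """Turn a category such as the digit 3 into a vector with a single 1."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


scaled = normalize([0, 51, 255])
encoded = to_one_hot(3)
print(scaled)   # [0.0, 0.2, 1.0]
print(encoded)  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The one-hot vector has the same length as the classifier layer's output, so the training step can compare the two position by position.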

So we built the model, and it has about 98% accuracy on the MNIST data set, which is pretty good. Finally, we did this trick here which exported the file. Now we have this model right here, it’s like 2.7 MB. Then I built a very simple demo project right here. Basically, we have a couple of input digits from the MNIST data set right here. We just run the loop, it runs the CoreML model against the input data, and this is basically how you set up a CoreML model using iOS 11. It inputs the data, this step normalizes it and prepares it, then there’s the actual prediction step, and then an ugly function to do a SoftMax to pull out the predicted digit. If you do this yourself, you should be able to get it working. I don’t want to hunt this down right now. But it does work, I promise.
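The last step, turning the model's ten output scores into a single digit, might look like the sketch below. It is written in Python rather than the Swift used in the demo project, and the example scores and the name `predict_digit` are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_digit(scores):
    """Return the most likely digit and the probability assigned to it."""
    probabilities = softmax(scores)
    digit = max(range(len(probabilities)), key=probabilities.__getitem__)
    return digit, probabilities[digit]


# Hypothetical raw outputs for the digits 0-9; index 2 is the largest.
example_scores = [0.1, 0.3, 4.2, 0.0, -1.0, 0.5, 0.2, 1.1, 0.4, 0.0]
digit, confidence = predict_digit(example_scores)
print(digit, round(confidence, 3))
```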

Just a recap: we built a neural network using Keras, we trained it with TensorFlow, we exported the CoreML file, and then we reimported it and used it in iOS 11 to actually perform the recognition step on the device.

From here, we get into convolutional neural networks.

So what is a convolution? It’s basically a fancy word for matrix math. You can think of it as being nothing more than a[x] + b, where the a[x] is a matrix. You can play all sorts of different games with these things and do some interesting tricks. For example, this is an example of a convolution — the center pixel is taken out, it pulls a little bit from the two neighbor pixels and a little bit from some further pixels which results in a blur. This is another convolution over here — one is applied, then the other. Basically, the result of these two convolutions is to become an edge detector. These are just examples, you don’t have to know them.
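A one-dimensional sketch shows the same effect. The kernels below are common textbook choices and not necessarily the ones on the slides; the sliding dot product is what deep learning libraries usually call a convolution.

```python
def convolve(signal, kernel):
    """Slide the kernel along the signal and take a weighted sum at each step."""
    width = len(kernel)
    return [
        sum(signal[start + offset] * kernel[offset] for offset in range(width))
        for start in range(len(signal) - width + 1)
    ]


# A row of pixels with a sharp jump from dark (0) to bright (8).
row = [0, 0, 0, 0, 8, 8, 8, 8]

BLUR = [0.25, 0.5, 0.25]  # mostly the center pixel, a bit of each neighbor
EDGE = [-1, 0, 1]         # right neighbor minus left neighbor

blurred = convolve(row, BLUR)
edges = convolve(row, EDGE)
print(blurred)  # [0.0, 0.0, 2.0, 6.0, 8.0, 8.0]
print(edges)    # [0, 0, 8, 8, 0, 0]
```

The blur smears the jump across neighboring pixels, while the edge kernel is zero in flat regions and lights up only where the brightness changes.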

There are two operations you need to know. The first is called striding. It basically takes an image and breaks it up into a bunch of smaller little chunks. There’s some good documentation from this online course that you can read. It’s hard to explain what’s going on.
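As a rough sketch of the chunking, the snippet below cuts a small image into 3x3 windows. The 6x6 image, the window size, and the name `sliding_windows` are assumptions for illustration; the stride is how far the window moves between chunks.

```python
def sliding_windows(image, size, stride):
    """Cut a 2D image into size x size chunks, moving `stride` pixels each step."""
    rows, cols = len(image), len(image[0])
    windows = []
    for top in range(0, rows - size + 1, stride):
        for left in range(0, cols - size + 1, stride):
            windows.append([row[left:left + size]
                            for row in image[top:top + size]])
    return windows


# A 6x6 image whose pixel values are just 0..35, so chunks are easy to read.
image = [[r * 6 + c for c in range(6)] for r in range(6)]

overlapping = sliding_windows(image, size=3, stride=1)
separate = sliding_windows(image, size=3, stride=3)
print(len(overlapping), len(separate))  # 16 4
print(overlapping[0])                   # [[0, 1, 2], [6, 7, 8], [12, 13, 14]]
```

With a stride of 1, a 36-pixel image turns into 16 overlapping chunks of 9 pixels each, so there is more data after this step than before it.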

Maxpool is a pretty simple function. The problem with striding is that it produces a lot of samples, so we want to squash it back down. Maxpool takes this group of 16 pixels, goes through it one block of four at a time, and decides which is the biggest pixel in each block. That’s what makes it to the next generation. So we take 16 data points and we reduce them to 4.
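A minimal sketch of that 16-to-4 reduction, assuming non-overlapping 2x2 blocks and an image whose sides divide evenly by the block size (the pixel values are made up):

```python
def max_pool(image, size=2):
    """Keep only the largest value from each size x size block."""
    return [
        [
            max(image[r + dr][c + dc]
                for dr in range(size) for dc in range(size))
            for c in range(0, len(image[0]), size)
        ]
        for r in range(0, len(image), size)
    ]


grid = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [7, 1, 3, 8],
]
pooled = max_pool(grid)
print(pooled)  # [[6, 2], [7, 9]]
```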

If you take these two ideas of striding and maxpool, you can combine them to form your first real convolutional neural network framework for image recognition. This is the network called VGGNet (2014). It starts out with two 3x3 strides, a maxpool, two more 3x3 strides, another maxpool, three 3x3 strides, maxpool, three 3x3 strides, maxpool, three 3x3 strides… For the final two layers, in my demo, we used a 512 node fully connected layer. VGG just uses a larger number, 4096, or about 4k. You have two of those, and the last one: instead of classifying the digits from 0–9, which would be 10 categories, now we have 1000 categories because we’re applying this to the ImageNet data set. This picture is a visual of what’s going on with the data in each step. It’s getting squashed down, and then finally we have the two layers that do our neural network tricks, and then we perform our prediction.

So here’s a demo of a vgg network running on a phone. I’m just using an input image on here, so it’s a picture of a cat. Basically, it runs and identifies that we have a cat picture here. The problem is that it took about 3–5 seconds to actually run. It’s because of a couple things, but one of them is that the vgg network has about half a gigabyte of weights in it. So we have to crunch through half a gigabyte of data in order to make that prediction, and it uses a lot of CPU cycles in general. It’s not very efficient in how it uses the CPU. It works, and we’ve done image recognition on the phone, but it’s kind of slow.
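The half-a-gigabyte figure can be checked with some arithmetic. The sketch below counts the weights in the layer sequence described for VGG earlier in the talk (two, two, three, three, three 3x3 convolutions with a maxpool after each group). The channel widths, the 224x224 RGB input, and the 4 bytes per weight are assumptions taken from the usual VGG-16 configuration, not something stated in the talk.

```python
# Numbers are output channels of a 3x3 convolution; "M" marks a maxpool.
VGG16_LAYOUT = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [4096, 4096, 1000]  # two fully connected layers plus the classifier


def count_vgg16_weights(input_size=224, input_channels=3):
    """Count weights and biases, assuming convolutions keep the image size."""
    size, channels, total = input_size, input_channels, 0
    for layer in VGG16_LAYOUT:
        if layer == "M":
            size //= 2  # each maxpool halves the width and height
        else:
            total += 3 * 3 * channels * layer + layer
            channels = layer
    features = size * size * channels  # flattened input to the first FC layer
    for width in FC_SIZES:
        total += features * width + width
        features = width
    return total


weights = count_vgg16_weights()
megabytes = weights * 4 / 1e6  # assuming 4 bytes per weight
print(weights, round(megabytes))  # 138357544 553
```

Most of that total comes from the first fully connected layer, which connects 7x7x512 values to 4096 nodes.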

If you want to see the source code for this, it’s by a friend of mine named Matthijs Hollemans. He’s written a whole set, so if you go to GitHub, you can download a bunch of these. I’ll be demoing a bunch of his stuff.

So where do we go from here? We can think of what we’ll be doing next as one of three things. One: we can use different models, meaning different architectures. Two: we can use different training methods to speed up how we’re building our networks. Three: we can look at how the code is actually executed on the device. Finally, we can ask ourselves what our expectations are. Are we willing to trade a little bit of quality for speed or performance here or there as needed?

Next, we’re going into the inception networks. The first important concept of inception is the idea of parallel execution. We’re running convolutions, but we’re doing them in parallel. The basic inception nodes are these four nodes working together. The problem with this is that it doesn’t actually work that way out of the box. In order to make the inception modules work, you have to add this 1x1 dimensionality reduction step. Here’s a website where you can read more about that.
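To see why the 1x1 reduction step matters, it helps to count multiplications. The sizes below (a 28x28 feature map with 192 channels, reduced to 16 channels before a 5x5 convolution with 32 outputs) are illustrative assumptions, and `conv_multiplies` is a hypothetical helper.

```python
def conv_multiplies(height, width, kernel, channels_in, channels_out):
    """Multiplications for one convolution layer that keeps the image size."""
    return height * width * kernel * kernel * channels_in * channels_out


H = W = 28

# A 5x5 convolution applied directly to all 192 input channels.
direct = conv_multiplies(H, W, 5, 192, 32)

# Squeeze 192 channels down to 16 with a 1x1 convolution, then run the 5x5.
reduced = conv_multiplies(H, W, 1, 192, 16) + conv_multiplies(H, W, 5, 16, 32)

print(direct, reduced, round(direct / reduced, 1))
# 120422400 12443648 9.7
```

Under these assumptions the reduced branch does almost ten times less work, which is what makes running several branches in parallel affordable.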

Basically, you take these inception nodes, and you start to go deeper. This is the original architecture, and it’s a whole bunch of different nodes working together. We’re doing more math, but the practical upshot is that we can have far fewer weights in our network. If vgg is about half a gigabyte worth of data, inception is a little over 100 MB. Here’s another good guide for what to do there.

While we’re doing this, we’re going to go through a couple of production improvements. Transfer learning is an important technique for you to understand. You can take a model that’s been trained on one task and modify it slightly to perform a different one. So we can skip a week or so of GPU time if we’re just trying to build a quick model.
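Here is a toy sketch of the transfer learning idea: the early layers stay frozen and only a new final layer is trained. Everything in it is a stand-in. `frozen_features` plays the role of a pretrained network, the two-class data is made up, and the perceptron-style update is just one simple way to train the new layer.

```python
def frozen_features(point):
    """Stand-in for a pretrained network's layers; never updated below."""
    x, y = point
    return [x + y, x - y]


def head_output(weights, bias, features):
    """The new final layer: a single linear unit with a threshold."""
    return 1 if sum(w * f for w, f in zip(weights, features)) + bias > 0 else 0


def train_head(samples, labels, epochs=20, learning_rate=0.1):
    """Train only the new final layer on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for point, label in zip(samples, labels):
            features = frozen_features(point)
            error = label - head_output(weights, bias, features)
            weights = [w + learning_rate * error * f
                       for w, f in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Made-up data for the new task: two classes of 2D points.
samples = [(1, 2), (2, 1), (3, 3), (-1, -2), (-2, -1), (-3, -3)]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_head(samples, labels)
predictions = [head_output(weights, bias, frozen_features(p)) for p in samples]
print(predictions)  # [1, 1, 1, 0, 0, 0]
```

Because only the small final layer changes, training touches a handful of numbers instead of the whole network, which is where the savings in GPU time come from.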

Then finally, we can take our model and optimize it for actual execution. We can take an inception v3 graph model and we can prune it, or remove extra nodes. We can reduce it, which is combining any nodes together that we can. Then, we can quantize: most models are built in double math, which is 64 bits. If we convert everything down to 16-bit integers, we can take an inception model that’s 100 MB and go to 25 MB just by reducing the precision of our math. We lose a little bit of quality, but it gives us a much smaller model. Finally, we can align the data, which is just memory mapping the data so that it’s laid out on the device in a way that runs optimally at run time.
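A minimal sketch of the quantization step, using the bit widths mentioned in the talk (64-bit floats down to 16-bit integers). The simple linear scaling scheme and the random stand-in weights are assumptions. Many toolchains instead start from 32-bit floats and store 8-bit codes, which gives the same 4x reduction.

```python
import random
from array import array


def quantize(weights):
    """Map floats linearly onto signed 16-bit integers plus one scale factor."""
    limit = 2 ** 15 - 1                                # largest int16 value
    scale = max(abs(w) for w in weights) / limit       # assumes a non-zero weight
    codes = array("h", (round(w / scale) for w in weights))
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from the stored integers."""
    return [c * scale for c in codes]


rng = random.Random(0)
weights = array("d", (rng.uniform(-1, 1) for _ in range(1000)))  # 64-bit floats

codes, scale = quantize(weights)
restored = dequantize(codes, scale)

size_ratio = (weights.itemsize * len(weights)) / (codes.itemsize * len(codes))
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(size_ratio, worst_error <= scale)  # 4.0 True
```

The storage drops by a factor of four, and each restored weight is off by at most about half of one quantization step, which is the small loss in quality described above.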

We’ll do inception now. If you want to do this, the best tutorial is called TensorFlow for Mobile Poets, by Pete Warden. You can go download the tools, and basically, you can do live video image recognition at a little under a second a frame. Not quite real time, but we’re under a second for each image recognition step just by modifying our architecture. The other nice thing about this is that TensorFlow is on iOS and Android, so you can do this on the other platform as well. All this together is roughly the state of the art as of last year (2016).

So what do we do next? The first concept is residual networks. This is a paper that came out of Microsoft at the end of 2015. The basic idea is to have skip connections: layers whose input bypasses them and gets added back in as we go down. This first model is a vgg 19, which is the cousin of the vgg 16 that we were looking at before. Basically, if you were computing this graph and you got to the last step, you’re throwing away all the data that came before. Residual networks make the final score actually be a combination of all the prior layers. This allows you to train much deeper networks. As a result, people have built 200-layer networks and 1000-layer networks with this methodology. This is probably the current state of the art for high-end stuff if you want to build an image recognition system of your own that isn’t running on mobile. I have another demo of resnet. This one is boring as well because it’s just running on an input image, the same concept. It runs at roughly the same speed as inception but has slightly higher quality.
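The skip idea can be sketched with a toy layer. In the residual version, each layer's input is added back to its output, so information from earlier layers survives even when one layer contributes nothing. The single-weight layers and the values below are made up for illustration and are far simpler than a real residual block.

```python
def layer(values, weight):
    """Toy layer: scale every value, then apply ReLU."""
    return [max(0.0, weight * v) for v in values]


def plain_stack(values, weights):
    """Each layer sees only the previous layer's output."""
    for w in weights:
        values = layer(values, w)
    return values


def residual_stack(values, weights):
    """Each layer's input skips around it and is added to its output."""
    for w in weights:
        values = [v + f for v, f in zip(values, layer(values, w))]
    return values


inputs = [1.0, 2.0]
layer_weights = [0.5, 0.0, 0.5]  # the middle layer outputs nothing useful

plain = plain_stack(inputs, layer_weights)
residual = residual_stack(inputs, layer_weights)
print(plain)     # [0.0, 0.0]
print(residual)  # [2.25, 4.5]
```

In the plain stack the dead middle layer wipes out everything that came before it, while the residual stack still carries the earlier values through to the end.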

Finally, what I’m excited about is called mobilenets. This is a paper released by Google back in April (2017). It makes heavy use of a concept called depthwise separable convolutions. It’s a relative of the 1x1 dimensionality reduction part of inception. If you look at our graph, the large gray blob is our vgg 16 model from before, and these little networks here are all the mobilenets. You can think of this graph as showing three things. This axis is how many CPU cycles it takes to run your model, the other one is the quality or accuracy on these various image recognition tasks, and the last is the size of the circle itself. The smaller the circle, the smaller the model. The smaller models over here are really cool to me. These are about 7 MB or so, and they’re getting over 50% accuracy on ImageNet. It’s punching way above its weight, we’ll phrase it that way.

The second cool thing about this is that if you’re running off of TensorFlow and you point it at the latest master, you can rerun the inception training script from before just by adding this parameter at the bottom. That will produce a mobilenets version of your model. By the same logic, you can modify the TensorFlow iOS and Android libraries in order to run mobilenets on the TensorFlow libraries on your device. So that’s one way you can run mobilenets today. Another way is that if you look through Matthijs’ code, he has a fully working version of mobilenets that’s just pulling data off the internet. The third one is that I’ve downloaded a Caffe model and converted it into a CoreML model, and I’m now running it on the phone against the live input stream. That looks like this. The quality of this one probably isn’t as high as the other one. We’re running over 10 frames per second right now without actually having to do anything too cool on top.

This other slide is showing the relative performance of the higher end circles. The large gray one corresponds with the large green one, and GoogLeNet over here is a little blue dot, just to give you an idea. These are probably the current cutting-edge technologies: Resnet and a couple different inception models there. That’s all I have for convolutional neural networks per se.
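The savings from the depthwise separable convolutions used by mobilenets can be estimated with a little arithmetic. A standard convolution mixes space and channels in one step, while the separable version first filters each channel on its own (depthwise) and then mixes channels with a 1x1 convolution (pointwise). The 14x14 feature map with 512 channels is an illustrative assumption, and the helper names are hypothetical.

```python
def standard_conv_cost(size, kernel, channels_in, channels_out):
    """Multiplications for an ordinary convolution on a size x size map."""
    return size * size * kernel * kernel * channels_in * channels_out


def depthwise_separable_cost(size, kernel, channels_in, channels_out):
    """Multiplications for a depthwise pass followed by a 1x1 pointwise pass."""
    depthwise = size * size * kernel * kernel * channels_in  # one filter per channel
    pointwise = size * size * channels_in * channels_out     # 1x1 channel mixing
    return depthwise + pointwise


standard = standard_conv_cost(14, 3, 512, 512)
separable = depthwise_separable_cost(14, 3, 512, 512)
print(standard, separable, round(standard / separable, 1))
# 462422016 52283392 8.8
```

For a 3x3 kernel the ratio approaches 9 as the number of output channels grows, which goes a long way toward explaining how such small models stay competitive.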

If you’re interested in this subject, there’s a closely related one, which is object detection in general. There are a bunch of different tools for this. YOLO is popular. Here’s a demo of YOLO running on a device. I don’t have any cool stuff in the background for it to be finding, but trust me, that one works. SSD is another important algorithm in this area. SLAM is technically part of ARKit, so Apple has already put that into iOS for us. If you’re familiar with RNNs, you can combine RNNs with convolutional neural networks in order to produce R-CNNs, and you’re getting into crazy town there, let’s just put it that way. If you’re interested in this subject, I would advise you to not only do neural networks. You should be familiar with some traditional machine learning techniques. Very specifically, Random Forests and Support Vector Machines are very powerful old school tools, and at the very least you should know how they work. What I’ve been doing the last couple months is just downloading random models from the Caffe Model Zoo, converting them into CoreML models, and then running them on the device that way.

That’s about all I’ve got — thanks for coming!

Any questions?

*If I want to build an iOS app that can identify objects from the live camera feed, and in my first attempt I find that it’s going too slowly, and I’m using the optimization techniques that you’re showing — am I doing all my optimizing in the python code before I export the model?*

Yeah, or the model itself would be the optimization. There are some tricks you can do to run things faster on the device, but we didn’t go into them here.

*Can you run this on iOS 10 or is it iOS 11 only?*

The inception demos can all be run on iOS 10 or before. The ones that are using CoreML are iOS 11 tricks. You need iOS 11 for that.

*Facebook released a library called Caffe2Go — have you played with it, do you have any samples?*

I’m aware of it and I’ve messed with it a little, but I haven’t done anything too exciting with it yet.

vggnet https://arxiv.org/abs/1409.1556

resnet https://arxiv.org/abs/1512.03385

inception https://arxiv.org/pdf/1409.4842.pdf

mobilenets https://arxiv.org/abs/1704.04861