A presentation for Scale by the Bay about different approaches to bringing machine learning to mobile devices, and how to build an end-to-end pipeline using Swift for TensorFlow and MLIR to train and deploy models to a phone.
Video • Slides • Code • Interview
Hello, I’d like to thank you all for coming and thank Alexy for inviting me here. Today we’re going to talk about machine learning and mobile.
At a high level, we’ll look at the specific problem of image recognition on mobile or edge devices. We’ll review the current state of the art in this field, and then we’ll zoom out and look at where things are going. I’ll do a demo to tie these concepts together, and then at the end, we’ll do a quick recap.
Edge devices (0:45)
What is the edge? Phones are a really good example of this. It’s something everybody has in their pocket, more or less. But I think of it more broadly than that. Any moving computing platform, we’ll say. Autonomous cars are an example, but really, any sort of sensor out in the real world is part of the edge, I think. We might even think of a satellite up in space as being one, some remote server that we can ping.
Whenever we deal with edge devices, we work under the following assumptions. Fundamentally, our computing power is significantly less than the cloud’s. Our bandwidth is limited at best, and usually unreliable – we might only be able to talk to a server once a day, and on a random schedule, such as whenever a person pulls a phone out of their pocket. In general, we have to be very efficient with power; we can’t run things at 100%. And generally speaking, we need to make decisions quickly – there’s some concept of bounded or interactive decision time. To use the example of a self-driving car: your car is at an intersection and it has to pick a direction to go. It can’t consult an oracle.
So given these limitations, why do we do things on edge devices? People talk about security or privacy and things like that, but I think they’re walking past what they really should be saying. Take ResNet or any of the other classical ImageNet computer vision networks – their input is a small swatch, 224x224 pixels. Last year, Google published a paper called GPipe, where, using a cloud supercomputer, they were able to push this input size up to almost 500x500 pixels. Meanwhile, the latest generation of iPhones shoots 4K video at 60 frames a second. So to me, why do we do things on the edge? The answer is very simple – that’s where the data is. People talk about machine learning and how data is everything, you’ve got to have the data. So in theory, working on edge devices, we have access to orders of magnitude more data than the cloud people do. And by extension, we should be able to do things that they can only dream about.
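To put some rough numbers behind that: a 224x224 input is about 50 thousand pixels, while a single 4K frame is 3840x2160 ≈ 8.3 million pixels, and at 60 frames a second the camera is producing on the order of 500 million pixels every second. However you slice it, the sensor in your pocket sees vastly more raw data than the crops that get fed to cloud-scale models.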
Existing approaches (3:20)
First, we’ll go through the existing solutions in the field. CoreML – this came out of Apple a couple of years ago, and they’ve worked very hard on it. iOS 13 came out last month, and CoreML 3 is a very solid update to the whole platform. Apple has a Python tool called CoreML Tools that lets you export models from XGBoost, scikit-learn, and Keras. They also have another set of tools that a lot of people haven’t seen called Turi Create, which takes the concepts of those packages, throws NumPy in there, and rewrites it all into a package that runs on top of Metal, Apple’s GPU programming framework. The basic limitation of CoreML is very simple: it’s iOS only. On the flip side, Apple has spent a lot of time optimizing it, so it runs really fast. If you’re new to this whole field and you have a little bit of Swift experience, I think this is the best place to get started.
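To make that concrete, here is a minimal sketch of what running an exported CoreML classifier looks like on iOS using the Vision framework. `MyClassifier` is a stand-in for whatever class Xcode generates from your .mlmodel file; the rest is the standard Vision request pattern.

```swift
import CoreML
import Vision

// MyClassifier is a placeholder for the class Xcode generates from a .mlmodel file.
func classify(_ image: CGImage) throws {
    let model = try VNCoreMLModel(for: MyClassifier().model)

    // Vision handles resizing and cropping the image to the model's expected input.
    let request = VNCoreMLRequest(model: model) { request, _ in
        guard let results = request.results as? [VNClassificationObservation],
              let top = results.first else { return }
        print("\(top.identifier): \(top.confidence)")
    }

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
}
```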
I did a talk a couple of years ago on how to build CoreML models using MNIST and Keras as a basic example, so if you’re interested in seeing how that approach works, you can check that out.
TensorFlow Lite – if cross-platform compatibility is your biggest issue, this is what you should look at. There are libraries for iOS and Android, there’s a Raspberry Pi library, and in theory you can get it working on pretty much anything you’d like. At a high level, the process of building a TensorFlow Lite model is: we start with some sort of TensorFlow model and its graph definition, and then we try to convert that graph into TensorFlow Lite operations. If there’s a one-to-one correspondence between your TensorFlow ops and the TensorFlow Lite ops, this process is easy – if not, you’re probably going to have to start making tradeoffs and looking at the neural network operators themselves.
I was going to run one of the TensorFlow Lite demos, but my Xcode is acting up, so I’ll save it for the end. But basically, we take the video feed off of the phone and run the MobileNet computer vision network from a couple of years ago directly on the device.
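As a rough sketch of what that demo boils down to in code, here’s the shape of the TensorFlow Lite Swift API (the `TensorFlowLiteSwift` pod). The model file name and the preprocessing are placeholder assumptions; the real example app does the same steps with more care around camera frames and quantization.

```swift
import TensorFlowLite

// Assumes a float MobileNet .tflite file bundled with the app and an input image
// already resized to 224x224 and converted to normalized RGB floats.
func runMobileNet(on rgbData: Data) throws -> [Float] {
    guard let modelPath = Bundle.main.path(forResource: "mobilenet_v1", ofType: "tflite") else {
        fatalError("Model file not found in bundle")
    }

    let interpreter = try Interpreter(modelPath: modelPath)
    try interpreter.allocateTensors()

    // Copy the image bytes into the input tensor, run the graph, read the output.
    try interpreter.copy(rgbData, toInputAt: 0)
    try interpreter.invoke()
    let output = try interpreter.output(at: 0)

    // Convert the raw output bytes into class scores.
    return output.data.withUnsafeBytes { Array($0.bindMemory(to: Float.self)) }
}
```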
Last month, PyTorch came out with version 1.3, which finally adds iOS and Android libraries that you can run yourself. At a high level, you take your PyTorch model, run a JIT trace on it, and it produces a .pt model that you can run on the device using their native libraries. I have not played with this a whole bunch, but I got it working and was really impressed with the speed. If you’re open to the land of PyTorch, I would definitely recommend trying this. We’re just running a simple image classification network on the phone, the standard thousand ImageNet categories, but the speed is really impressive – around 30 milliseconds, and this is a pretty old phone.
This is less about the world of phones, but embedded Linux should always be on your radar. Basically, you can bring whatever hardware you want and then run whatever software you need. This is nice for a lot of reasons – the simplest is that if you have very specific RAM or CPU requirements, you can pick hardware that’s fast enough for what you need. OpenCV – if you’re going to give me one computer vision trick, I would say just being able to type `import OpenCV` is a really good one. Basically, everything under the sun is in there. People have made all these various libraries – dlib, and I still see people compiling MATLAB down to custom C++. Basically, you can run whatever you need on the device.
In general, I also use this as a prototyping platform. I’ll write a demo – say, a few hundred lines of Python – to make sure I have a high-level understanding of the problem, and then I’ll start to think about how to actually get it to run on the phone. A lot of times you have two problems: one is, can we do this? Two is, how do we make this fast? Figuring out the “can we do this” part first can simplify your life significantly, because you avoid going down false paths. Then finally, more and more custom hardware is starting to come to market now. People are able to target arbitrary integer and floating-point bit widths, and custom ASICs and MAC arrays are coming to market. We might even view Google’s quantum computing announcement from last month as the dawn of probabilistic processors becoming a reality. A lot of people in this space have the mentality that if they can build the fastest chip, developers will flock to their platform. I don’t think that’s what’s going to happen. Fundamentally, on day one, whatever cool new hardware you bring to market, you need a way for people to get their existing workflows and patterns onto your device. To me, while this hardware is cool, it’s really just exposing the new limit, which is software.
Here’s a picture illustrating the differences between CPU, GPU, and TPU style devices. I think this caption is hilarious – “CPU, GPU, and TPU-like accelerators require different on-chip memory architecture and compute primitives. This divergence must be addressed when generating optimized code.” There it is, that’s all it is.
Tensorflow today (9:40)
Now let’s look at the whole software piece. Here’s where TensorFlow is at today. We have various high-level languages – Swift is just one – we generate a graph, and then we have these various routes to get the model out and actually run it on some sort of device. This is just me speculating, but if you look at this star, it’s my belief that Google’s long-term vision is for everything to go through the LLVM ecosystem. So TPUs will have some sort of LLVM backend to generate code for them, TF-Lite will have some LLVM path, and we could maybe even think of WebGL as a potential output of an LLVM runtime. In the short term, this is going to be a painful transition, because they’ll basically have to redo years’ worth of work to get there. But in the long term, it will make the TensorFlow ecosystem extremely flexible, able to work with whatever new devices come to market.
Next, take a look at Grappler. Grappler is a tool for speeding up TensorFlow runtimes, and these pictures illustrate how it works at a high level. We have data conversion steps in the graph; Grappler looks at the graph as a collective whole – roughly a minimum-spanning-tree-style approach – and produces a simpler graph. The practical upshot is that things run much faster. So it’s a high-level, top-down approach.
On the other end, if you go down to the LLVM level, you’re trying to optimize your memory accesses – a lot of machine learning isn’t so much compute-bound as it is bandwidth-bound. If the compiler can make sure the data is getting there on time, it can speed up results significantly. You might think of this as a bottom-up style of approach.
We might try to meet in the middle somehow, and we end up with polyhedral compiler techniques. I talked about this last year as “here’s something for them to implement,” but I was underselling how hard a problem this is. We could easily spend a decade on this particular slide; it’s very much a hard, unsolved problem. There are other people in this field looking at this stuff. I talked a little bit about Gantt charts last year – I like this picture, which came out of the Glow compiler paper. Effectively, they take the graph as a whole and use it to optimize their runtime, figuring out what order to schedule things in. I think that’s an interesting high-level style of approach.
To continue that thread: we think of scheduling at the graph level, but it’s also very important to be able to schedule things at the device level. This is from the Mesh TensorFlow paper, which came out last fall. Basically, they expose these GPU primitives to the programmer so the programmer can manually manage where their code runs. It’s a powerful approach because it lets you scale your code up to large clusters, but the flip side is that you’re making the programmer do work that the compiler, in theory, should be able to do.
People are also experimenting with evolutionary algorithms to find the optimal layout – where the pieces of the graph run on which device. This is from a paper where they used reinforcement learning to try to split things up across four GPUs. The upshot of the paper was, loosely, that a domain expert realized they could put everything onto GPU 3, say, and that actually ran better. But to me, that’s not a limitation of reinforcement learning – the reinforcement learning just isn’t being given a fair shake here. I could imagine a scenario where we told the algorithm that the only constraint is that each device has to be doing something different, and it could figure out that same end result.
This whole area of compiler exploration, or using machine learning to speed up machine learning is very interesting to me in general. Here are a couple of graphs from the TVM paper that came out last year. Effectively, they are running this thing about 500 times we’ll say, in order to start producing significant speedups. For a one-off program, this is probably overkill and you wouldn’t want to do this, but if you’re doing large machine learning jobs and you have a very static problem, then it would be well worth it to spend this time and let the compiler figure out a way to get your code to run 50% faster for effectively free, or the cost of doing a bunch of tests.
This second graph is a roofline plot. The thick blue line is the theoretical maximum performance of this particular video card – a Titan X, I believe. What the plot is showing is that this ResNet-18 architecture fits neatly into both the memory and compute budgets of the device – it’s not hitting the roof, so to speak. On the flip side, you can see all this whitespace between the ResNet operators and the actual roof, which is to say it’s not running optimally. So here’s an area where the evolutionary algorithms, these EfficientNet-style search approaches, can bin-pack the problem into those little holes, and by extension get optimized runtimes.
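For reference, the roofline itself is just a plot of the formula attainable FLOP/s = min(peak FLOP/s, memory bandwidth × arithmetic intensity), where arithmetic intensity is the FLOPs an operator performs per byte it moves. Operators to the left of the ridge point are bandwidth-bound, operators to the right are compute-bound, and the point of all this scheduling and packing work is to push each one up toward that line.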
The second part of this is that in the future, this roof is going to go up and to the left. So run times that we have today, the networks that we’re using today, are going to change over time. The assumptions that are popular in networks today may not hold tomorrow. So, we don’t know what the future is going to look like. But I think that if you can have everything in one programming language, whatever that language may be, you will be able to adapt to it.
Here’s a picture from the MLIR demos – the Toy language tutorial – but you might think of it as being able to take whatever domain-specific language you’d like to have, get it into MLIR, at which point you can start layering all sorts of optimizations on top, and then finally output it to whatever device you need. To me, this is where things are going.
Demo (18:20)
Now we’ll try to get out of the realm of theory and into something a bit more practical. Last year, I did a presentation on Swift for TensorFlow and did a CIFAR demo that notably had no Swift or TensorFlow in it. My goal for this year was to actually use these tools on a device. So for the demo, we’ll use Swift to build and train a CIFAR network, and then we’ll actually run it on a device – that is to say, my phone here. However, in order to do so, we’ll have to jump through a few hoops.
We’ll use Swift for TensorFlow to train a CIFAR model. From there, we’ll use Swift for TensorFlow’s NumPy bridge to export our model to Keras, where we can save it as an H5 file. From there, we can load the H5 file into a TensorFlow session, freeze the graph, and save it as a protobuf. From there, we can convert the protobuf into a TF-Lite file using MLIR, and finally run that TF-Lite file on device. At a high level, I’ll demo how we get from Swift to Keras, H5 to protobuf, protobuf through MLIR to TF-Lite, and then finally the modifications needed to actually run it on device.
Here’s my CIFAR model. If you haven’t seen much Swift, this is how you’d write a simple convolutional network in it. It’s a basic VGG-esque architecture: two layers of 3x3 convolutions, a maxpool, another two layers of 3x3 convolutions, a maxpool, two densely connected layers, and finally a categorization layer at the end.
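The slide has the actual code; roughly, it looks like the sketch below, written against the Swift for TensorFlow layer API of the time. The exact layer sizes here are my reconstruction from the description above, not the slide verbatim.

```swift
import TensorFlow

// A small VGG-style CIFAR-10 classifier: conv/conv/pool, conv/conv/pool, dense layers.
struct CIFARModel: Layer {
    var conv1a = Conv2D<Float>(filterShape: (3, 3, 3, 32), padding: .same, activation: relu)
    var conv1b = Conv2D<Float>(filterShape: (3, 3, 32, 32), padding: .same, activation: relu)
    var pool1 = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))

    var conv2a = Conv2D<Float>(filterShape: (3, 3, 32, 64), padding: .same, activation: relu)
    var conv2b = Conv2D<Float>(filterShape: (3, 3, 64, 64), padding: .same, activation: relu)
    var pool2 = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))

    var flatten = Flatten<Float>()
    var dense1 = Dense<Float>(inputSize: 8 * 8 * 64, outputSize: 512, activation: relu)
    var dense2 = Dense<Float>(inputSize: 512, outputSize: 64, activation: relu)
    var output = Dense<Float>(inputSize: 64, outputSize: 10)

    @differentiable
    func callAsFunction(_ input: Tensor<Float>) -> Tensor<Float> {
        let features = input.sequenced(through: conv1a, conv1b, pool1, conv2a, conv2b, pool2)
        return features.sequenced(through: flatten, dense1, dense2, output)
    }
}
```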
There’s a little bit of magic next. Swift for TensorFlow doesn’t support saving models right now, so we have to add that ourselves. Swift for TensorFlow has a Python bridge, so any Python trick is in theory available to us. Our first line is simply `import keras`. In the top half, we construct a Keras model identical to the one we built in Swift; in the second half, we manually set the weights of that Keras model from the results of our Swift training run. Then at the end, the magic happens: a little `model.save`. From there, we can load our H5 file into the TensorFlow Keras importer and combine the variables and the graph together – this is called freezing the graph – and save the result as a protobuf. The top is high-level TensorFlow code to do this; the bottom is basically just A to B to C.
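Here’s roughly what that bridge-and-save step looks like from the Swift side – a sketch, not the slide’s exact code: the Keras model construction is abbreviated and names like `cifar.h5` are placeholders. The only real tricks are `Python.import` and `makeNumpyArray()`.

```swift
import Python
import TensorFlow

// Stand-in for the CIFARModel trained earlier.
let model = CIFARModel()

let keras = Python.import("keras")

// Rebuild the same architecture on the Keras side...
let kerasModel = keras.models.Sequential()
kerasModel.add(keras.layers.Conv2D(32, [3, 3], padding: "same", activation: "relu",
                                   input_shape: [32, 32, 3]))
// ... remaining Conv2D / MaxPooling2D / Flatten / Dense layers, mirroring the Swift model ...

// ...then copy the trained weights across, layer by layer, via the NumPy bridge.
kerasModel.layers[0].set_weights([
    model.conv1a.filter.makeNumpyArray(),
    model.conv1a.bias.makeNumpyArray(),
])
// ... repeat for each remaining layer ...

// The magic at the end: write out an H5 file that the TensorFlow tooling can load.
kerasModel.save("cifar.h5")
```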
This is probably not a trick you’ve seen, but MLIR is not just theoretical – you can actually build and use it today. If you go to the TensorFlow source repository and run the Bazel command on the first line, it builds the actual MLIR tooling. From there, we have the command-line tool on the second line. We have to do some bookkeeping for our network – manually specifying the input and output layers and so on – and then finally, at the end, it exports a TF-Lite file.
I’m doing all of this on top of the CIFAR demo from the Swift models repository – I basically just replaced their model file with my own, and I also removed the normalization step from the data. Then finally, what I have here is the TensorFlow Lite image classification demo from their example source code, with just the following modifications: we put our model file in here, and I manually made my own CIFAR labels file here…
I point the model-reading code at my file, change the input width and height to 32 to match our CIFAR model, and then finally I added this ugly little line right here, which prints out what the network is seeing. We’re now running our CIFAR model on device – it’s spitting out the results in the console here, as you can see. I’ll hold it up to this dog picture, and it should say dog. It’s saying dog, with 70% to 80% confidence.
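Concretely, the handful of edits above amount to something like the following sketch. The names here loosely follow the TensorFlow Lite example app and are a paraphrase, not a verbatim diff.

```swift
// Sketch of the changes to the TF-Lite image classification example.

// 1. Point the example at our converted model and hand-written labels file.
let modelFile = (name: "cifar", extension: "tflite")        // hypothetical file names
let labelsFile = (name: "cifar_labels", extension: "txt")   // airplane, automobile, bird, ...

// 2. CIFAR inputs are 32x32 RGB, not 224x224 like MobileNet.
let inputWidth = 32
let inputHeight = 32

// 3. The ugly little debugging line: dump what the network thinks it's seeing.
struct Inference { let confidence: Float; let label: String }
func logResults(_ results: [Inference]) {
    print(results.map { "\($0.label): \($0.confidence)" }.joined(separator: ", "))
}
```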
Recap (24:40)
To recap, our goal was to explore image recognition on edge devices. I showed you the current high-level approaches in the field, talked about how I think hardware and software are going to get closer together in the future, and then finally demonstrated using Swift and MLIR to build a TF-Lite model which we then ran on device.
That’s all I have for content. I’d like to thank the MLIR team, who have been working very hard on this all summer. The Swift for TensorFlow team has been making steady progress the last few months as well. I haven’t demoed any fastai tricks, but there’s a new version of that library due out shortly, and I’m looking forward to it. I’d like to thank two people in particular. Meir Rosendorff, a gentleman in South Africa, wrote a really nice blog post on how to do this NumPy bridge with Swift for TensorFlow, which I found really useful, so I wanted to thank him. And a gentleman over in Taiwan, Koan-Sin Tan, has done a number of interesting technical presentations – he did one on MLIR and TensorFlow earlier this summer, which is what made me realize that doing things this way was even possible. So thank you to him.
If you’re interested in more on this subject, there are a number of interesting presentations from the LLVM conferences earlier this year. If you’re interested in polyhedral compilation techniques, I stole a number of today’s slides from a presentation by Albert Cohen, and you should look at more of his papers – he’s been working in this field for more than a decade. Then finally, LLVM is not the only game in town; there’s interesting work going on in TVM and Glow as well that’s worth keeping an eye on.
With that, thank you for coming!