convolutional neural networks with swift (and python) [4x]

Presentations at the Google ML Summit in Seattle, StlDevFest in St. Louis, Google ML Summit in Pittsburgh, PA, and the Columbia, MO DevFest on building convolutional neural networks to perform image recognition using swift and python.

Sep 18, 2019:

Seattle Video

Oct 02, 2019:

Pitt Video, Slides

Pitt Transcript


Thank you all for coming, and thank Google for having me. Today we’re going to talk about convolutional neural networks with Swift, and a little bit of Python. We’re going to explore the problem of image recognition. The purpose of my presentation is to take you from zero, about as basic as you can get in this field, all the way up to the current state of the art. So towards that end we’re going to do a quick review of neural networks, how they work, and we’ll look at a one dimensional version of the MNIST problem, which is a well understood problem in computer vision. From there, we’ll introduce convolutions, and then we’ll tackle MNIST again using a 2-D approach. From there, we’ll look at how we can introduce color and start to stack our convolutions in order to tackle larger problems. We can take the same basic approach and add even more layers to build up to VGG, which is our first state of the art approach from 2014 or so. From there, we can modify the basic VGG network to produce residual networks, which are a very powerful modern approach in this field. Then at the end we’ll look at efficientnet, which is a very recent paper in this field. We’ll do a quick demo of running that on an edge TPU device.


These are the four big categories of computer vision that I think you should be aware of. I’ll convert them into the international standard of cat and dog units. Image recognition, or “Is this a cat or dog picture?” Object detection, or “Where is the cat in this picture?” Image segmentation, or “Which pixels are cat pixels?” And finally instance segmentation, or “How many sets of cat and dog pixels do we have?” Today, we’re just going to be focused on the upper left quadrant here, so just cats and dogs.

Neural networks – machine learning has historically been focused on reducing problems down to the simplest dimension, trying to figure out if there’s just one variable that changes things. So neural networks are kind of like an outgrowth of computer science, they were kind of a curiosity for the longest time. The basic trick that a neural network does is it can learn how to separate high dimensional data. We might think of images as being simple, but to a machine, they’re actually kind of complicated. You have a red channel, a green channel, blue channel, a height and width component, and then you’re trying to map it to some category at the end. So if you could actually just imagine for each input picture, you mapped it to a specific category, that’s literally what a neural network learns. In order to do this, we often end up having to do a lot of math where it’s like, A applied to B, B applied to C, C applied to D, and so on. In order to do this, then we use back propagation and the chain rule from calculus. Everybody hates to talk about the chain rule, so someone said, “Why don’t we have a computer keep track of all this stuff?”
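The "A applied to B, B applied to C" bookkeeping can be sketched in a few lines of Python. This is a toy illustration, not how any particular framework implements it: the three one-variable "layers" and their names are made up for the example. The forward pass remembers each layer's local derivative, and the backward pass multiplies them together from the output back to the input, which is the chain rule.

```python
import math

# Three toy "layers" (hypothetical, for illustration only).
# Each returns (output, local derivative of output with respect to its input).
def scale(x):
    return 3.0 * x, 3.0

def shift_square(x):
    return (x + 1.0) ** 2, 2.0 * (x + 1.0)

def tanh_layer(x):
    y = math.tanh(x)
    return y, 1.0 - y * y

def forward(x, layers):
    """Run x through every layer, recording each local derivative."""
    local_derivatives = []
    for layer in layers:
        x, d = layer(x)
        local_derivatives.append(d)
    return x, local_derivatives

def backward(local_derivatives):
    """Chain rule: multiply local derivatives from the output back to the input."""
    grad = 1.0
    for d in reversed(local_derivatives):
        grad *= d
    return grad

layers = [scale, shift_square, tanh_layer]
value, recorded = forward(0.1, layers)
grad = backward(recorded)

# Sanity check against a central finite difference.
eps = 1e-6
numeric = (forward(0.1 + eps, layers)[0] - forward(0.1 - eps, layers)[0]) / (2 * eps)
print(value, grad, numeric)
```

Automatic differentiation does the same bookkeeping for you, only over tensors and without anyone writing the derivative of each layer by hand.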

Auto differentiation is not really a new subject in this field, it’s actually from the 1970s or so. What is new is combining auto differentiation with the compiler in order to model these neural networks at the language level. Swift isn’t really magically special in and of itself. The upper right part of this slide is a slide I stole from Chris Lattner’s keynote presentation at the LLVM conference earlier this year, but it’s basically demonstrating how all these worlds are moving together. Swift’s real secret power is that it’s the first language of LLVM, which is a modern compiler that’s used almost everywhere nowadays. Basically all these worlds are coming together where you can write your high level neural network code in your particular programming language, it will be converted to an intermediate language, and then finally LLVM will spit out code for whatever device you actually need to run it on. So right now, people are building stuff for CPUs and GPUs and TPUs, but a new area that’s coming out is running stuff on devices, say on your mobile phone or even on edge TPU devices as we’ll see later.

So the whole theory of this project then is that by getting everything to follow this path, you’ll be able to target all these different run times. So the same code that you’re writing can run up in the cloud, or on the device in your hands. The second level of this – and this is the really new area – is this whole MLIR. So rather than having each of these languages implement their own abstract syntax for doing this neural network stuff, they’re trying to model it at a cleaner level so all these languages will generate MLIR code, and then from there, we can go through LLVM to your device.

Here at the bottom, we have different forms of basic neural networks. Over here, we have the perceptron – so if you can imagine a line dividing all your cat and dog pictures, that’s literally what a perceptron is. This is from 1958 – this is not as new as you may think. The basic problem is that you can’t actually reduce the data down to one dimension that easily. So they have hidden layer approaches where you run things through a set of neurons, and that’s how you get your actual result. So pay attention to our Deep Feed Forward neural network, and then we can add some convolutions on top, because that’s what we’re going to do for our next two steps.


So the MNIST data set is a well understood data set in computer vision. It’s a collection of hand drawn digits; they’re all grayscale, so the pixel values run from 0 to 255. Each image is 28 pixels by 28 pixels, so this 8 is just what one of the digits in the data set would look like. We’re not even going to treat this as actual image data. We’re going to unroll it – we’re going to start at the top and pull off one row at a time, until we have a really long vector. So this second picture right here is demonstrating unrolling a 4x4 image of, say, an imaginary 1. We can imagine this same concept across the 28 by 28 pixels to produce an input vector that’s 784 pixels long. So next, we’re just going to take our input vector of 784 pixels, and we’re going to run it through two fully connected layers of 512 neurons, and then we’re going to map it to an output layer of 10 categories, the numbers 0-9. I originally set out to write this demo, but there’s a gentleman named Huan out in Beijing, he’s a GDE out there. He wrote this code, so I simply took his code and modified it slightly in order to produce these results.
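As a small illustration of the unrolling step, here is a Python sketch. The 4x4 "1" below is a made-up stand-in for the picture on the slide, and `unroll` is a hypothetical helper name.

```python
def unroll(image):
    """Flatten a 2-D image (a list of rows) into one long row-major vector."""
    return [pixel for row in image for pixel in row]

# A made-up 4x4 drawing of the digit 1.
one = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
vector = unroll(one)
print(vector)       # 16 values, one row after another

# The same idea on a blank MNIST-sized image.
mnist_sized = [[0] * 28 for _ in range(28)]
print(len(unroll(mnist_sized)))  # 784
```

Notice what gets thrown away: after unrolling, the network no longer knows that pixel 1 and pixel 5 were vertical neighbours.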

This is what our very simple neural network is going to look like. It’s nothing more than our input layer, 784 to 512, 512 to 512 again, and then 512 to 10 output at the end. The reason we’re using these swift data sets is that now we can just define our differentiation function in this simple line right here – the compiler will take care of all the magic of actually making that happen. Let’s see what this would look like…

import TensorFlow

// Multilayer perceptron for MNIST: 784 -> 512 -> 512 -> 10.
struct MLP: Layer {
  typealias Input = Tensor<Float>
  typealias Output = Tensor<Float>

  var flatten = Flatten<Float>()
  var dense = Dense<Float>(inputSize: 784, outputSize: 512, activation: relu)
  var innerLayer = Dense<Float>(inputSize: 512, outputSize: 512, activation: relu)
  var output = Dense<Float>(inputSize: 512, outputSize: 10, activation: softmax)

  // @differentiable asks the compiler to derive the backward pass for us.
  @differentiable
  public func callAsFunction(_ input: Input) -> Output {
    return input.sequenced(through: flatten, dense, innerLayer, output)
  }
}

Here’s all the code. He got it all the way down to about 40 lines, which is quite elegant. All I did was modify this bit. Now we’ll run his basic MNIST demo across the MNIST data set. I’m running this on my computer back in Missouri, but I’m SSH’d in here. This simple neural network was able to get about 94% accuracy on the MNIST data set. We’re kind of cheating because we’re using large fully connected layers, but bear with me, and I hope this approach will make sense.
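To put a number on "large fully connected layers", here is a rough parameter count in Python for the 784, 512, 512, 10 network. It assumes every dense layer has one weight per input-output pair plus one bias per output; `dense_params` is a hypothetical helper, not part of any library.

```python
def dense_params(inputs, outputs):
    """Weights (inputs x outputs) plus one bias per output neuron."""
    return inputs * outputs + outputs

layer_sizes = [(784, 512), (512, 512), (512, 10)]
counts = [dense_params(i, o) for i, o in layer_sizes]
total = sum(counts)
print(counts)  # [401920, 262656, 5130]
print(total)   # 669706
```

Roughly two thirds of a million parameters for 28 by 28 images, and most of them sit in the very first layer.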

Convolutions – I would love to throw one slide up here and explain convolutions to you all in one slide, but I don’t think it’s possible. But I think this slide right here – which I stole from an NVIDIA deck a year ago – is about the best way I can tackle this subject. We have our input image on the back, and what we’re going to output is a blurred version of our input image. So we have this 3x3 convolutional kernel in the middle, and every entry in it is the number 1. That means that for each 3x3 window of input pixels, our output is simply going to be the sum of those pixels. It’s literally 2 + 1 + 2 + 1 + 1 to get 7. We take this whole window and move it over one pixel and repeat the process again. We keep on going until we reach the end of the row, and then we repeat, moving everything down one row. So this process of going over the image is called striding, and it’s a very important concept for you to understand. The other concept you need to understand is maxpooling. All we’re going to do is take this group of 16 pixels, and convert it to a set of 4. For each colored region, we’re going to find the largest pixel and make that be our output.
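Both operations are short enough to write out by hand. The following is a minimal pure-Python sketch, assuming no padding; the 4x4 input is made up so that its top-left 3x3 window sums to 7, like the slide, and the function names are hypothetical.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image (no padding); each output value is the
    sum of the elementwise products inside the current window."""
    k = len(kernel)
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            total = 0
            for i in range(k):
                for j in range(k):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out

def max_pool(image, size=2):
    """Keep the largest value in each non-overlapping size x size region."""
    return [
        [max(image[r + i][c + j] for i in range(size) for j in range(size))
         for c in range(0, len(image[0]) - size + 1, size)]
        for r in range(0, len(image) - size + 1, size)
    ]

image = [
    [2, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 1, 3],
    [0, 4, 0, 1],
]
ones = [[1] * 3 for _ in range(3)]

blurred = convolve2d(image, ones)
pooled = max_pool(image)
print(blurred)  # [[7, 8], [8, 12]]
print(pooled)   # [[2, 1], [4, 3]]
```

With a kernel of all ones the convolution is just a moving sum; a trained network learns different numbers for those nine kernel entries.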


If we take these two concepts together and revisit the MNIST problem, we can actually significantly improve our quality just by changing how we’re modeling our data. We’re going to take our same 784, but we’ll treat it as an actual image, so it’ll be 28 x 28 pixels now. We’ll run it through two layers of 3x3 convolutions, a max pool operation, and then we’ll keep our same densely connected layers and output of 10 categories.

import TensorFlow

// Two 3x3 convolutions and a max pool in front of the same dense layers as before.
struct CNN: Layer {
  typealias Input = Tensor<Float>
  typealias Output = Tensor<Float>

  var conv1a = Conv2D<Float>(filterShape: (3, 3, 1, 32), padding: .same, activation: relu)
  var conv1b = Conv2D<Float>(filterShape: (3, 3, 32, 32), activation: relu)
  var pool1 = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))

  var flatten = Flatten<Float>()
  // 28x28 stays 28x28 through conv1a (.same), shrinks to 26x26 in conv1b, then pools to 13x13.
  var inputLayer = Dense<Float>(inputSize: 13 * 13 * 32, outputSize: 512, activation: relu)
  var hiddenLayer = Dense<Float>(inputSize: 512, outputSize: 512, activation: relu)
  var outputLayer = Dense<Float>(inputSize: 512, outputSize: 10, activation: softmax)

  @differentiable
  public func callAsFunction(_ input: Input) -> Output {
    let convolved = input.sequenced(through: conv1a, conv1b, pool1)
    return convolved.sequenced(through: flatten, inputLayer, hiddenLayer, outputLayer)
  }
}

Here’s what the actual Swift code for this looks like. I’ve taken the example from before, and added a stack of convolutions on top. Then we take our input, run it through our convolutional layers, and then send it to the same densely connected output layers as before. This goes a little bit slow; I didn’t quite install everything in the optimized manner. But eventually it will finish, and we’ll get up to about 97% accuracy on the MNIST data set. So by simply changing how we model the data and using convolutions, we’ll be able to cut our errors in half on this toy problem.
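The 13 * 13 * 32 input size of the first dense layer falls out of simple shape arithmetic. Here is a Python sketch, assuming stride-1 convolutions where "same" padding keeps the size and unpadded ("valid") convolution loses kernel - 1 pixels; the helper names are hypothetical.

```python
def conv_out(size, kernel=3, padding="valid"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - pool) // stride + 1

size = 28
size = conv_out(size, padding="same")  # conv1a: 28 -> 28
size = conv_out(size)                  # conv1b: 28 -> 26
size = pool_out(size)                  # pool1:  26 -> 13
flattened = size * size * 32           # 32 feature maps from the last convolution
print(size, flattened)                 # 13 5408
```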


Where do we go from here? Let’s take on a slightly larger, more complicated problem. This is a data set called CIFAR. It’s a collection of color pictures. So we have pictures of cats, dogs, animals, as well as human vehicles, so cars and trucks. We have ten categories. Now we’re going to be working with color data, so we have an RGB component. Our same basic approach that we used before, we can scale it up to tackle this problem. So we’ll simply take our input data – 32 by 32 by 3 channels, we’ll run it through two sets of convolutions, a max pool, two more sets of convolutions, a max pool, our same two densely connected layers and then we’ll have ten categories for our outputs.

import TensorFlow

// Two conv-conv-pool stacks followed by the same dense layers as before.
struct CIFARModel: Layer {
    typealias Input = Tensor<Float>
    typealias Output = Tensor<Float>

    // The 3 in (3, 3, 3, 32) is the RGB channel count.
    var conv1a = Conv2D<Float>(filterShape: (3, 3, 3, 32), padding: .same, activation: relu)
    var conv1b = Conv2D<Float>(filterShape: (3, 3, 32, 32), activation: relu)
    var pool1 = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))

    var conv2a = Conv2D<Float>(filterShape: (3, 3, 32, 64), padding: .same, activation: relu)
    var conv2b = Conv2D<Float>(filterShape: (3, 3, 64, 64), activation: relu)
    var pool2 = MaxPool2D<Float>(poolSize: (2, 2), strides: (2, 2))

    var flatten = Flatten<Float>()
    var dense1 = Dense<Float>(inputSize: 6 * 6 * 64, outputSize: 512, activation: relu)
    var dense2 = Dense<Float>(inputSize: 512, outputSize: 512, activation: relu)
    // identity activation: the model returns raw logits.
    var output = Dense<Float>(inputSize: 512, outputSize: 10, activation: identity)

    @differentiable
    func callAsFunction(_ input: Input) -> Output {
        let conv1 = input.sequenced(through: conv1a, conv1b, pool1)
        let conv2 = conv1.sequenced(through: conv2a, conv2b, pool2)
        return conv2.sequenced(through: flatten, dense1, dense2, output)
    }
}

Here’s what this model looks like. We’ve done nothing more really than add another stack of convolutions. If you look at the very first line, 3 by 3 by 3 by 32, that’s where we introduce color. It didn’t make our network that much more complicated. So for this one, I took the CIFAR demo from the tensorflow/swift-models repository, swapped my model in over theirs, and then we ran it. That will look like this. I’ll let it run – it will take a while to run. Eventually we’ll end up with a network with around 70% accuracy. That’s not going to allow you to write a paper any time soon, but this approach does technically work.
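Two numbers in this model are worth checking by hand: how little the color channels cost, and where 6 * 6 * 64 comes from. This is a Python sketch with hypothetical helper names, assuming each convolution has one bias per output channel and that unpadded 3x3 convolutions lose two pixels per side length.

```python
def conv_params(height, width, in_channels, out_channels):
    """Kernel weights plus one bias per output channel."""
    return height * width * in_channels * out_channels + out_channels

def trace(size, layers):
    """Follow the spatial size through a list of layer kinds."""
    for kind in layers:
        if kind == "conv_valid":
            size -= 2          # a 3x3 kernel with no padding loses 2 pixels
        elif kind == "pool":
            size //= 2         # 2x2 max pool with stride 2
        # "conv_same" leaves the size unchanged
    return size

gray_first_layer = conv_params(3, 3, 1, 32)   # the MNIST model's first convolution
color_first_layer = conv_params(3, 3, 3, 32)  # the CIFAR model's first convolution

stack = ["conv_same", "conv_valid", "pool"]
final_size = trace(32, stack + stack)         # 32 -> 32 -> 30 -> 15 -> 15 -> 13 -> 6
flattened = final_size * final_size * 64

print(gray_first_layer, color_first_layer)    # 320 896
print(final_size, flattened)                  # 6 2304
```

Going from one channel to three only grows the first layer from 320 to 896 parameters, which is why color did not make the network much more complicated.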

We might look at this thing and think, “Let’s keep stacking more convolutions.” I think if you could jump in a time machine and go back in time five years, you could then be the world’s foremost expert in computer vision. This is a VGG network from 2014 or so, and it’s nothing more complicated than the things I’ve shown you so far. We’re dealing with the ImageNet data set, so we have a slightly larger input of 224 by 224 pixels. We take our input, two layers of 3 by 3 convolutions, a maxpool, two more layers of 3 by 3 convolutions, a maxpool. We’re looking at VGG 19, so then we have four layers of 3 by 3 convolutions, maxpool, four layers of 3 by 3 convolutions, maxpool, four layers of 3 by 3 convolutions, maxpool. Then we use a slightly larger dense layer: we were using 512 for the two demos before, and this one is simply 4k, 4096. Then ImageNet has a thousand categories, so we have a thousand output nodes at the end.
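The 2, 2, 4, 4, 4 layout can be written down as a short list and checked in Python. The channel widths below (64 up to 512), the "same" padding, and the two 4096-wide dense layers are assumptions taken from the original VGG design rather than from this talk.

```python
# Assumed VGG 19 layout: numbers are 3x3 convolution output channels,
# "M" is a 2x2 max pool with stride 2.
VGG19 = [64, 64, "M",
         128, 128, "M",
         256, 256, 256, 256, "M",
         512, 512, 512, 512, "M",
         512, 512, 512, 512, "M"]

conv_layers = sum(1 for layer in VGG19 if layer != "M")
dense_layers = 3  # assumed: two 4096-wide layers plus the 1000-way output

size = 224
for layer in VGG19:
    if layer == "M":
        size //= 2  # "same" padded convolutions keep the size; only pools shrink it

last_channels = [layer for layer in VGG19 if layer != "M"][-1]
flattened = size * size * last_channels

print(conv_layers + dense_layers)  # 19
print(size, flattened)             # 7 25088
```

Sixteen convolutions plus three dense layers is where the 19 in VGG 19 comes from.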


So we take this, and let’s say we apply one more mental leap on top. Rather than think of this as 2, 2, 4, 4, 4 – let’s think of this as one set of 2 layers, one set of 2 layers, two sets of two layers, two sets of two layers, and two sets of two layers. If you can do that step, then we can jump over here to resnet, which is our first solid modern approach in this field. On the left side, we have the same VGG network that we were looking at before. So 2, 2, 4, 4, 4. In the middle, we have the backbone of what’s called resnet 34. But it’s conceptually no more complicated than anything we’ve looked at thus far. We have three sets of these two 3 by 3 layers, four sets of these two 3 by 3 layers, six sets of these two 3 by 3 layers, three sets of these two 3 by 3 layers, and then we have our output layer. The magic of residual networks is this dotted line that’s being drawn down here on the side. Basically, the problem with the VGG approach is that these convolutional approaches are not very resistant to noise. VGG 19 is actually about as big of a network as you can make this way. The problem is, if each layer only introduces 0.1% noise, by the time you get through 19 layers, that’s going to significantly affect your results. So resnet introduced this concept of skip connections. Neural networks are kind of extremely lazy, so if they can find an answer, basically they’ll shortcut everything else. The power of these residual networks is that you can stack layers and layers of convolutions until you find something that sort of overfits your problem, and then you can dial it back to produce a simplified best case answer, in theory.
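The dotted line is just an addition. As a minimal Python sketch (the names are hypothetical, and a plain list of numbers stands in for a feature map): a residual block returns its input plus whatever its convolutions computed, so a block that has learned nothing passes the signal through untouched.

```python
def residual_block(x, transform):
    """Return x + F(x): the skip connection adds the input back onto the output."""
    fx = transform(x)
    return [a + b for a, b in zip(x, fx)]

def lazy_transform(x):
    """Stand-in for a pair of 3x3 convolutions whose weights have gone to zero."""
    return [0.0 for _ in x]

signal = [1.0, -2.0, 0.5]
out = signal
for _ in range(40):  # stack as many "lazy" blocks as you like
    out = residual_block(out, lazy_transform)

print(out == signal)  # True: the stack behaves like the identity
```

That is why extra residual blocks are cheap to add: in the worst case a block can fall back to doing nothing, instead of degrading the signal on its way through.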

That is resnet 34. We need to do one more trick – we need to move away from our 3 by 3 convolutions. If we look here in this other quadrant, we’re going to replace our two 3 by 3 layers with a 1 by 1, 3 by 3, 1 by 1 style approach. So 3 and 4 and 6 and 3 is 16 blocks. Times two layers is 32, plus a head and an output layer, and that’s resnet 34. The same 16 blocks times three layers is 48, plus the head and output layer makes 50, so this is resnet 50. So let’s do a quick demo of training resnet 50 on the ImageNet dataset using a cloud TPU.
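The naming arithmetic, written out in Python (`resnet_depth` is a hypothetical helper; the stage counts are the 3, 4, 6, 3 from the slide):

```python
STAGES = [3, 4, 6, 3]  # residual blocks per stage

def resnet_depth(stages, convs_per_block):
    """Convolutions inside the blocks, plus the head layer and the output layer."""
    return sum(stages) * convs_per_block + 2

print(resnet_depth(STAGES, 2))  # 34: two 3x3 convolutions per block
print(resnet_depth(STAGES, 3))  # 50: 1x1, 3x3, 1x1 per block
```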

ctpu up --name=resnet-imagenet

export STORAGE_BUCKET=gs://imagenet-models
export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"

python \
  --tpu=${TPU_NAME} \
  --data_dir=gs://imagenet-tfrecord-data \
  --model_dir=${STORAGE_BUCKET}/resnet \

What we’ll need to do first is create a cloud TPU. That’s simply running this command. I did this ten minutes ago, so we won’t have to watch it get started. Here we have a cloud TPU running up in the cloud. We’ll start this process and it will spit out a whole bunch of line noise, and it will print a lot of warnings about Tensorflow 2… But if we wait about 10 to 12 hours, this will output a resnet 50 trained on the ImageNet data set.

So where do we go from here? Don’t let the 2015 on this slide fool you. This resnet 50 is still more or less your best first bet for most computer vision problems today. Many people have come up with different networks, and some of them technically produce slightly better results, but more often than not, you should come back to this basic model for your first approach.

Let’s look at these bottleneck blocks a little bit more. I would argue that this 1x3x1 approach is not as powerful as the 3 by 3 approach that we’ve looked at so far. The reason this bottleneck layer has better results is hidden in this 256 that’s shown on the last layer. This 1x3x1 – the last layer is technically four times as large as the other stuff. So basically, I would argue that this bottleneck layer is not as powerful as the 3 by 3 approach. However, it’s cheaper. Because it’s cheaper, we can run more of it. Because we can run more of it, that’s ultimately why this approach produces better results. So in order to replace resnet, we need something that’s not necessarily better – we need something that’s actually cheaper. Or to use a slightly different word, we’ll say more efficient.
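To make "cheaper" concrete, here is a rough weight count in Python. It assumes both block styles take in and put out 256 channels and that the bottleneck squeezes down to 64 (the "four times" ratio mentioned above); biases and batch normalization are ignored, and the helper name is hypothetical.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels

WIDE, NARROW = 256, 64  # assumed channel widths

# Two 3x3 convolutions at full width.
basic_block = 2 * conv_weights(3, WIDE, WIDE)

# 1x1 down to 64 channels, 3x3 at 64, then 1x1 back up to 256.
bottleneck_block = (conv_weights(1, WIDE, NARROW)
                    + conv_weights(3, NARROW, NARROW)
                    + conv_weights(1, NARROW, WIDE))

print(basic_block)       # 1179648
print(bottleneck_block)  # 69632
print(round(basic_block / bottleneck_block, 1))  # 16.9
```

Under these assumptions the bottleneck does the expensive 3 by 3 work on a quarter of the channels, so a block costs roughly one seventeenth as many weights, and you can afford to stack many more of them.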


This is a paper that came out in May of this year. It’s a culmination of several years of research by the Google team. Effectively, people have tried to build wider networks, people have tried to build deeper networks, and people have tried to build larger networks in terms of the size of the inputs. But no one has found the perfect combination. What they did in this paper is they took the mnas approach from mnasnet from last year, they added in some different layer types from other cutting edge networks (e.g. mobilenetv2), and they effectively left the computer alone to search across all this parameter space in order to find an optimal set of networks. They’ve done similar things before, notably with nasnet and then the AmoebaNet papers from last year – but what’s interesting about this paper is that they are applying human intuition and logic on top. They’ve come up with a formula whereby if you come up with one network, they can basically multiply the parameters in your network in order to produce larger versions of it. This is really cool because a lot of times with the reinforcement learning stuff, you end up with networks that only computers understand – whereas this is like humans adding another layer of intuition on top, so it’s working even better.
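The "multiply the parameters" formula can be sketched in Python. The constants below (1.2 for depth, 1.1 for width, 1.15 for input resolution) are the base multipliers as I recall them from the efficientnet paper, so treat them as assumptions; `compound_scale` is a hypothetical helper, not the paper's code.

```python
# Assumed base multipliers: depth, width, and input resolution.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Scale depth, width and resolution together with one knob, phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# Compute cost grows roughly with depth * width^2 * resolution^2,
# so each step of phi costs about this factor more.
cost_per_step = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(cost_per_step, 2))  # 1.92

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(phi, round(depth, 2), round(width, 2), round(resolution, 2))
```

The point is that one number drives all three dimensions at once, and each step of that number roughly doubles the compute budget.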

That brings us to efficientnet-edgeTPU. We can think of our search space as being accuracy, or quality of our models – but we can also model our search space differently. We can say, “What is our latency? How long does this network take to run? How large is our network? How many different operations are we using? How many individual parameters?” So they have these edgeTPU devices, which Google has been shipping; you can buy one for $75 or so. Then they gave the search this edgeTPU hardware as a target, and asked it to produce the best type of network for this particular device. So we have a 1 by 1 convolution combined with a 3 by 3 convolution, and the search found that by combining these two together into a larger 3 by 3 convolution, you can actually produce better results in less time. We have up here our resnet 50 model, and you can see we have what we call the holy grail of image recognition search – we have a network that’s smaller, faster, and more accurate, which is all you can really ask for.

ctpu up --machine-type n1-standard-8 --name=efficientnet-edgetpu-s-v3 --tf-version=nightly --zone=us-central1-a --tpu-size=v3-8

export STORAGE_BUCKET=gs://imagenet-models
export PYTHONPATH="$PYTHONPATH:/usr/share/tpu/models"

cd /usr/share/tpu/models/official/efficientnet/

python \
  --tpu=${TPU_NAME} \
  --data_dir=gs://imagenet-tfrecord-data \
  --model_dir=${STORAGE_BUCKET}/efficientnet-edgetpu-s-v3-mlsummit \
  --model_name='efficientnet-edgetpu-S' \
  --skip_host_call=true \
  --train_batch_size=4096 \

We’re going to demo running efficientnet-edgeTPU-S on an actual edgeTPU device. First we have to train it. We’re going to use a TPUv3 instead of a TPUv2, so that’s in this command here. The second trick we need is that this is all a little bit bleeding edge, so we have to use a nightly build of Tensorflow, so we tell the computer to do that right here. Next we have a bunch of parameters, but it’s basically very similar to our resnet command. Here’s my cloud TPUv3 running, and we’ll copy and paste the command in here and give it a few seconds to get going. Now we’re training efficientnet-edgeTPU-S in the cloud on a TPUv3. This will take about 30 hours to run, but at the end, we’ll have produced the checkpoint.

gsutil cp -r gs://imagenet-models/efficientnet-edgetpu-s-v3-fixdir/* .

export MODEL=efficientnet-edgetpu-S
python --model_name=$MODEL --ckpt_dir=/home/skoonce/edgetpu/test-v3/ --data_dir=/home/skoonce/edgetpu/validation/ --output_tflite=/home/skoonce/edgetpu/${MODEL}_quant-v3-demo.tflite

Next we’ll just literally copy this checkpoint from our remote server down to my local machine. The edgeTPU device uses int8 math, whereas the cloud is using floating point. So we need to quantize our model – so convert from floating point into int8. So for this, we’ll use another script that the edgeTPU people have provided. The only fun part of getting this working is that this relies on the Tensorflow XLA ops, which are not installed by default in the Tensorflow builds, so you have to compile it from source. This takes about a minute or so to run. We’ll have a quantized checkpoint of our efficientnet-edgeTPU build.
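To give a feel for what "convert from floating point into int8" means, here is a generic affine quantization sketch in Python. It is an illustration of the idea only, under my own assumptions, and not what the edgeTPU conversion script actually does internally; the function names and the sample weights are made up.

```python
def quantize(values, num_bits=8):
    """Map floats onto signed integers using a scale and a zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo = min(min(values), 0.0)  # make sure 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integers."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-0.5, 0.0, 0.25, 1.5]  # made-up float weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)

print(q)
print(restored)
```

Every float gets snapped to one of 256 integer levels, so the restored values are close to the originals but not identical. That small loss is the price of running on int8 hardware.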

cd /usr/local/lib/python3.6/dist-packages/edgetpu/demo

python3 \
--model /home/skoonce/edgetpu/efficientnet-edgetpu-S_quant-v3-demo.tflite \
--label ~/edgetpu/edgetpu/imagenet_labels_mod_one.txt \
--image ~/edgetpu/edgetpu/panda.jpg

Then, we just need to run our model locally using an edgeTPU device. I got a panda off of Wikipedia, I’m using that for input. As you can see, it thinks we have a panda with 60% probability, but it might also be a fox with a 12% probability.


Our goal was to explore the concept of convolutional neural networks to perform image recognition. Towards that end, we built a one dimensional neural network, we added convolutions, and then we approached the MNIST problem again using a 2D approach. From there, we looked at how we could stack blocks up in order to tackle larger and more complicated problems in this field. Then we looked at how we could introduce residual layers, and then finally began to actually modify our different block types in order to produce a state of the art approach in this field.

I’ve talked a lot about images up here, but many of the more interesting applications of CNNs are in completely different fields. So we can add another dimension on top of our 2D CNN in order to get a 3D CNN, and we can use this to start to tackle depth data, like LIDAR. People have taken language models and converted them into CNN style approaches, so QANet was an interesting paper from last year where they did that. Planet detection – they can take a 1D CNN approach and do some other tricks on top in order to begin to detect exoplanets. AstroNet was a really interesting paper in this field. The AlphaFold paper came out earlier this year. They use a combination of 1D, 2D, and 3D neural networks together in order to significantly advance the state of the art in protein modeling. Then finally the ever popular AlphaGo and AlphaZero engines. Originally I tried to put up a little bit of each of these papers up here, but this slide got a little bit busy. I reduced it back down to this one picture. What you’re looking at is the inner layer of the AlphaZero engine, which is composed of 40 of these residual blocks that you’re looking at right here. This AlphaZero block is composed literally of a residual layer, the same approach we looked at before, with two pairs of 3 by 3 convolutions. So the same approaches that we’ve used to do our image recognition can, in a completely different domain and with a whole bunch of reinforcement learning on top, be used to solve the game of Go.

That’s all I got. Thanks for coming!