Convolutional Neural Networks with Swift for TensorFlow: Overview


I gave an overview of Convolutional Neural Networks with Swift for TensorFlow to the swift-sig group, talking about how I structured the book and approached writing it.


Thank you for letting me talk today, and I thank everyone else for joining today. I’ve written this book, Convolutional Neural Networks with Swift for TensorFlow – I’ve been working on it for about a year now. I recently shipped it off to the publishers. It’s not quite over the finish line, but later this year, you should be able to get your hands on a copy. I thought today I would walk you all through the primary sequence of the book, just how I structure things. Then we’ll look at some stuff I ran into along the way, and maybe some pain points we might look at next.

So at a high level, we’ll look at image recognition as a single problem, through the lens of Swift for TensorFlow. Towards that end, we’ll look at neural networks and convolutions and how they can be used together to solve this problem. From there, we’ll look at some simpler datasets like MNIST and CIFAR, and how we can build upon this base and tackle networks like VGG and ResNet, which are full-blown ImageNet-level image recognition networks. Then we can start to look at the MobileNet family of networks, and the research that leads up to EfficientNet, which is pretty close to the current state of the art in this field.


Before that, though, I thought I would talk about what was going through my head as I was trying to do all of this. I sit at a weird sort of intersection of these various technologies. I was at WWDC when they announced Swift a while back. I remember there was this weird energy in the air – everybody knew that something had changed, but nobody quite knew what had happened. I didn’t actually start doing Swift back then, but I think later on that year, we had an app where we wanted extensions in it. As a result we moved to a new version of iOS, so I snuck my first Swift file into the project, and since then I’ve been doing it pretty regularly.

I’ve done a lot of Unix; I used to help maintain an open-source project called Homebrew, which some of you may have used. I think normal people see a wall of shell scripts and they run away, whereas, for me, I’m like, “Oh, good! Here’s something I can play with and get working on my end.”

Then I’ve been really interested in this whole field of deep learning in general. I studied a little bit of AI in school, but it was mostly traditional techniques, you could say. Markov models, SAT solvers and things like that. So this whole deep learning thing, I’ve had to rebuild my understanding of things from the foundations. Towards that end, I kind of like going through a lot of the online courses out there. I think I tried to have a beginner mindset. There’s a lot of things in writing this book where I thought I knew what I was talking about, but whenever you try to convert it into words, you start to realize that it’s not quite as clear as you actually thought it was. So I think that it’s important to always be asking why.

The second piece of this, I think, is the Swift for TensorFlow project itself. I love this idea that we’re rethinking how all this stuff works together, the foundations, and everything like that. People say that X framework is the future. It’s like, well, this field is barely even a few years old. I think it’s way too early to be predicting where things are going to be even a decade from now.

A lot of technical books that I’ve found take what I would call a shotgun approach. They have you do a GAN, then an RNN, then a convolutional network, and then voilà, you suddenly understand neural networks somehow. That’s not really the style that I’ve had much luck with. I just like to really understand one problem, and understand it well. So towards that end, like I said, I just took this image recognition problem and I tried to go into it as deeply as possible.

One of the crucial things in general in Deep Learning or even in other fields is being able to play with things, dabble with cheap experiments – that, to me, is the process by which you gain mastery. By continually messing with things to understand how they actually work together. The cool thing to me about computer vision in particular and image recognition is that you don’t really actually have to have a fancy computer to do it. You don’t really have to have everything working 100% in order to be able to play with these things. So even though Swift for TensorFlow is still in a 0.X release, we’ll say, for what I was trying to do, I thought it was more than sufficient in solving all my problems there.

Then the other thing I feel like I’ve found is that convolutional neural networks in general are a really good foundational technique. People will learn them quickly and then try to jump off to other things. But I’ve found that the more I study convolutional neural networks, the easier other techniques become for me to understand, like reinforcement learning or NLP. So I think it’s a really good spot for people to get started.

These are the four traditional areas of computer vision. We’re just going to focus on image recognition, which is basically just deciding if something is a cat or a dog picture. I’ve used this slide in a whole bunch of my presentations, but I really think there is a lot that can be unpacked here. Perceptrons are from the 1950s or so; neural networks are not really as new as many people think. But basically, they found that trying to literally go from an input to an output didn’t really work because the data is too complicated. So then they added this layer of indirection, this layer of hidden nodes, which is how you make dense networks – Feed Forward networks.

Convolutional neural networks

Then if one layer of indirection is good, then two layers of indirection is better. So you have this deep Feed Forward pattern of an input, two layers of dense nodes, and then an output. I find that you’ll see this pattern a lot. Like in NLP, oftentimes there will be an input, two layers of LSTM modules and then an output. So I think this basic pattern is a good one to know. Then from there, we can just add convolutions on top, and then you have a full-blown deep convolutional neural network. So look at this Deep Feed Forward, and then this Deep Convolutional Neural Network, because that’s literally what we’re going to do next.

We can take the MNIST problem, which is just a very simple black-and-white dataset of handwritten digits. We can take each image and unroll it, just convert it to a long string of numbers. If we then run that through two layers of densely connected nodes, we can actually make a really good categorizer for our digits. From there we can get into convolutions.

This is the best slide I’ve found for trying to explain convolutions to people. We literally take our source data, this batch of pixels, and for each group of 3 by 3 pixels, we multiply each one by a weight, add them all together, and then we output the result. Then we just simply step over the picture, patch by patch, until we have a new output image. You don’t even really have to understand this additive stuff, because the weights are what the neural network is going to learn for us. Then the other trick we can put on top is this max-pooling stuff. I think this is reasonably easy to understand as well. We just take a group of pixels, say these red pixels, pick the largest one out, and then send that one out to the new layer.
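As a rough illustration of the two operations described above, here is a minimal sketch in plain Swift (standard library only, not the Swift for TensorFlow API). The names `convolve3x3` and `maxPool2x2` are made up for this example, and the kernel is fixed to all ones so the arithmetic is easy to follow; in a real network the kernel weights are learned.

```swift
// Plain-Swift sketch; `convolve3x3` and `maxPool2x2` are illustrative names,
// not Swift for TensorFlow APIs.
typealias Image = [[Float]]

/// Slides a 3x3 kernel over the image (stride 1, no padding).
/// Each output pixel is the weighted sum of the 3x3 patch under the kernel.
func convolve3x3(_ image: Image, kernel: Image) -> Image {
    let rows = image.count - 2
    let cols = image[0].count - 2
    var output = Image(repeating: [Float](repeating: 0, count: cols), count: rows)
    for r in 0..<rows {
        for c in 0..<cols {
            var sum: Float = 0
            for kr in 0..<3 {
                for kc in 0..<3 {
                    sum += image[r + kr][c + kc] * kernel[kr][kc]
                }
            }
            output[r][c] = sum
        }
    }
    return output
}

/// Keeps the largest value in each non-overlapping 2x2 block.
func maxPool2x2(_ image: Image) -> Image {
    let rows = image.count / 2
    let cols = image[0].count / 2
    var output = Image(repeating: [Float](repeating: 0, count: cols), count: rows)
    for r in 0..<rows {
        for c in 0..<cols {
            output[r][c] = max(
                max(image[2 * r][2 * c], image[2 * r][2 * c + 1]),
                max(image[2 * r + 1][2 * c], image[2 * r + 1][2 * c + 1]))
        }
    }
    return output
}

// A 6x6 "ramp" image and an all-ones kernel, which simply adds up each patch.
let image: Image = (0..<6).map { r in (0..<6).map { c in Float(r * 6 + c) } }
let ones = Image(repeating: [Float](repeating: 1, count: 3), count: 3)

let convolved = convolve3x3(image, kernel: ones)  // 6x6 shrinks to 4x4
let pooled = maxPool2x2(convolved)                // 4x4 shrinks to 2x2
print(convolved.count, convolved[0].count, pooled.count, pooled[0].count)
```

Without padding, each 3x3 convolution trims one pixel from every edge, and each 2x2 pool halves the height and width. Frameworks usually offer a "same" padding option that keeps the convolution output the same size as its input.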

Then we can go back to our MNIST problem and revisit it using convolutions. Now we just simply take our input and run it through two layers of 3x3 convolutions, a maxpool, then we have our same two densely connected layers, and then our output layer. Now we’ve built ourselves a very simple convolutional neural network with which to perform image recognition.
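To make the layer stack above concrete, the following plain-Swift sketch walks a 28x28 MNIST image through it and counts the weights. The talk does not give filter counts, padding, or dense-layer sizes, so the 32 filters, "same" padding, and 512 hidden units here are assumptions chosen only for illustration, as are the helper names.

```swift
// Shape bookkeeping only; no training happens here.
// Filter counts, padding, and hidden sizes are assumptions, not from the talk.
struct Shape {
    var height: Int
    var width: Int
    var channels: Int
    var count: Int { return height * width * channels }
}

/// 3x3 convolution with "same" padding: height and width are preserved.
/// Parameters: one 3x3 kernel per (input channel, filter) pair, plus one bias per filter.
func conv3x3(_ input: Shape, filters: Int) -> (shape: Shape, parameters: Int) {
    let shape = Shape(height: input.height, width: input.width, channels: filters)
    return (shape, 3 * 3 * input.channels * filters + filters)
}

/// 2x2 max pool with stride 2: halves height and width, has no parameters.
func maxPool2x2(_ input: Shape) -> Shape {
    return Shape(height: input.height / 2, width: input.width / 2, channels: input.channels)
}

/// Densely connected layer: a weight per (input, output) pair, plus one bias per output.
func denseParameters(inputSize: Int, outputSize: Int) -> Int {
    return inputSize * outputSize + outputSize
}

let input = Shape(height: 28, width: 28, channels: 1)  // one MNIST digit
let conv1 = conv3x3(input, filters: 32)
let conv2 = conv3x3(conv1.shape, filters: 32)
let pooledShape = maxPool2x2(conv2.shape)
let flattened = pooledShape.count                      // unrolled into one long vector

let dense1 = denseParameters(inputSize: flattened, outputSize: 512)
let dense2 = denseParameters(inputSize: 512, outputSize: 512)
let output = denseParameters(inputSize: 512, outputSize: 10)  // ten digit classes

let total = conv1.parameters + conv2.parameters + dense1 + dense2 + output
print("flattened size:", flattened)
print("total parameters:", total)
```

Under these assumed sizes, almost all of the parameters sit in the first dense layer, while the two convolution layers together account for well under one percent.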

The thing is, if you’ve made it this far, I actually think that’s the hardest part of understanding this field. CIFAR is a slightly larger data set that’s composed of real-world pictures of animals and vehicles, but we can tackle it using just a slightly larger version of the network that we just looked at before. So now we have two layers of the 3x3 nodes, a maxpool, two layers of 3x3 nodes, a max pool, and then our same densely connected layers to an output. This isn’t a state of the art approach, but conceptually, it works.

From there, we can just start to build bigger and bigger networks. This is the VGG network, but it’s literally no more complicated than the things that we’ve looked at before. We’ve just added more and more layers of these 3x3 convolutions and maxpools. I did VGG16, which has sets of up to three 3x3 nodes, but I actually like this VGG19, because then we can start to go to the next level of thinking of each layer as perhaps being a set of layers. So we might have one set of two 3x3 nodes, another set of two 3x3 nodes, and then several more sets of 3x3 nodes.

The power of this approach is that we can jump up to a network like ResNet. The backbone of the ResNet 34 network is nothing more than sets of 3x3 convolutions, the same as the techniques I’ve shown before. We put three layers, four, six, and three, down the side. Then the other trick that the residual networks had is this idea of skip connections going in between layers. Then finally, we have to go away from the 3x3 nodes, so if you look over at the left we replace our two 3x3 nodes with this 1x1, 3x3, 1x1 approach and you replace all the nodes in our network like this – and that produces ResNet 50, which is a really important network for people to know.
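The "three, four, six, and three" layout and the skip connection can be sketched in a few lines of plain Swift. The layer arithmetic below uses the usual counting convention for ResNet names (convolution and dense layers only, including one stem convolution and one final dense layer); the `residual` helper is a made-up name, and it leaves out the activation that normally follows the addition.

```swift
// Blocks per stage, as described in the talk.
let blocksPerStage = [3, 4, 6, 3]
let totalBlocks = blocksPerStage.reduce(0, +)

// ResNet 34: each block is two 3x3 convolutions.
// ResNet 50: each block is a 1x1, 3x3, 1x1 stack of three convolutions.
// Both add a stem convolution at the start and a dense layer at the end.
let resNet34Layers = totalBlocks * 2 + 2
let resNet50Layers = totalBlocks * 3 + 2
print(resNet34Layers, resNet50Layers)

/// A skip connection: the block's output is added back onto its own input.
/// `block` stands in for the stack of convolutions (hypothetical helper).
func residual(_ x: [Float], through block: ([Float]) -> [Float]) -> [Float] {
    return zip(block(x), x).map { $0 + $1 }
}

// If the block contributes nothing, the input passes through unchanged.
let passthrough = residual([1, 2, 3]) { x in x.map { _ in 0 } }
// If the block doubles its input, the result is the input times three.
let tripled = residual([1, 2, 3]) { x in x.map { $0 * 2 } }
print(passthrough, tripled)
```

Because the input is always added back, a block only has to learn a correction to its input rather than a whole new representation, which is a large part of why very deep stacks of these blocks remain trainable.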

So everything I’ve shown you so far is the first six chapters of the book or so, we’ll say. Just trying to get people up to speed with this sort of technique.

Mobile networks

After that, we start looking at mobile networks – trying to do things more efficiently with our data, which ultimately will allow us to produce even better networks. I did a talk three years ago on MobileNet V1. If you’re interested in that, you might look at that. The key concept of MobileNet is this idea of these depthwise combined with pointwise convolutions. It’s kind of a tricky concept to explain, but the basic idea is that since we’re doing these sets of operations on layers, we can kind of group them together in order to make them run faster, and by extension, run them on a mobile device. This is a Medium article I’ve found that I thought was really a good explanation of what’s going on, so you might check that out.
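One way to see why the depthwise plus pointwise split runs faster is to count multiply-adds. The sketch below is plain Swift with made-up function names, and it assumes stride 1 with "same" padding so the output has the same height and width as the input. The 14x14x256 layer size is an arbitrary example, not a figure from the talk.

```swift
/// Multiply-adds for a standard convolution: every output channel looks at
/// every input channel through its own kernel-by-kernel window.
func standardConvCost(kernel: Int, inChannels: Int, outChannels: Int,
                      height: Int, width: Int) -> Int {
    return kernel * kernel * inChannels * outChannels * height * width
}

/// Multiply-adds for a depthwise separable convolution:
/// a depthwise pass (one kernel per input channel) followed by
/// a pointwise 1x1 pass that mixes the channels together.
func separableConvCost(kernel: Int, inChannels: Int, outChannels: Int,
                       height: Int, width: Int) -> Int {
    let depthwise = kernel * kernel * inChannels * height * width
    let pointwise = inChannels * outChannels * height * width
    return depthwise + pointwise
}

let standard = standardConvCost(kernel: 3, inChannels: 256, outChannels: 256,
                                height: 14, width: 14)
let separable = separableConvCost(kernel: 3, inChannels: 256, outChannels: 256,
                                  height: 14, width: 14)
let speedup = Double(standard) / Double(separable)
print(standard, separable, speedup)
```

For a 3x3 kernel the saving works out to a factor of 1 / (1/outChannels + 1/9), so it approaches 9x as the number of output channels grows.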

I did a talk a couple of years ago about MobileNet V2, but I didn’t really break down how it worked. Basically, it builds on the MobileNet V1 architecture and adds this concept of linear activations and inverted bottleneck layers. I think the linear activations piece is actually pretty easy to understand. Basically, they discovered that they were running the network through a ReLU at the last spot, and it didn’t actually need to be done that way. So they simply deleted that ReLU and that’s what they’re calling a linear activation.

Now, the inverted bottleneck layers are a little bit harder to explain. But basically the ResNet we were looking at before, we go from the top of the layer to the bottom. This network goes from right below the top to above the bottom. This has the side effect of maybe a little bit less data goes through the network, but the flipside then is that it’s a little bit more computationally cheap, so we can add in some more layers. So as a result, this MobileNet V2 architecture can actually produce even better results than our MobileNet V1 just by slightly tweaking how we put layers together. Once again there’s a really nice blog post on this subject if you’re interested in more than that.
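A small sketch of the channel widths may help with the "inverted" part. In plain Swift below, each array lists the channel count at the block input, after the first 1x1 convolution, after the 3x3 convolution, and after the final 1x1 convolution. The function names and the specific widths (256 with a reduction of 4, 24 with an expansion of 6) are assumptions picked for illustration.

```swift
/// ResNet-style bottleneck: wide -> narrow -> narrow -> wide.
/// The skip connection joins the two wide ends.
func bottleneckChannels(width: Int, reduction: Int) -> [Int] {
    let narrow = width / reduction
    return [width, narrow, narrow, width]
}

/// MobileNet V2-style inverted bottleneck: narrow -> wide -> wide -> narrow.
/// The skip connection joins the two narrow ends.
func invertedBottleneckChannels(width: Int, expansion: Int) -> [Int] {
    let wide = width * expansion
    return [width, wide, wide, width]
}

let resNetStyle = bottleneckChannels(width: 256, reduction: 4)
let mobileNetStyle = invertedBottleneckChannels(width: 24, expansion: 6)
print(resNetStyle)     // [256, 64, 64, 256]
print(mobileNetStyle)  // [24, 144, 144, 24]
```

In the inverted form, the tensors carried between blocks (and across the skip connections) are the narrow ones, and the expensive wide part in the middle is handled by a cheap depthwise convolution.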

Two years ago I talked about MnasNet and how I thought that these evolutionary strategies were going to become more prevalent in the future, but even I underestimated that. It’s wild to me that this slide here is comparing MnasNet with MobileNet V2 – but the most important part of the MnasNet paper turned out to be the search strategy they were using to find these networks. So they literally took the MnasNet search and the MobileNet V2 blocks together, and then threw in some concepts from the SENet paper which also came out a couple of years ago – and with these concepts together, they were able to produce EfficientNet, which is in my eyes the current state of the art in this field.

But conceptually what we do is we just take a baseline network that we know works and we can give the computer a whole bunch of variants that are either wider, deeper, or higher resolution, we’ll say – and then we can let the computer search over the whole space of these different network pieces in order to find the best set that all work together. So this is a really important paper to understand that came out last year. Then from there, we can go back down to another mobile network. The core of MobileNet V3 is that there was some stuff in EfficientNet that used a bit too much memory and processing time to actually be doable on a mobile device, so they added simplified versions of that. But the core part of MobileNet V3 is very much just EfficientNet with an even more constrained search space.
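The "wider, deeper, higher resolution" idea is formalized in the EfficientNet paper as compound scaling: a single coefficient φ scales depth by α^φ, width by β^φ, and input resolution by γ^φ. The plain-Swift sketch below uses the constants reported in the paper (α = 1.2, β = 1.1, γ = 1.15); the type and function names are made up for this example, and the cost estimate is the paper's rough rule that compute grows with depth times width squared times resolution squared.

```swift
import Foundation

/// Multipliers applied to a baseline network's depth, width, and input resolution.
struct Scaling {
    let depth: Double
    let width: Double
    let resolution: Double
}

/// Compound scaling: one knob (`phi`) grows all three dimensions together.
/// Default constants are the ones reported in the EfficientNet paper.
func compoundScaling(phi: Double, alpha: Double = 1.2,
                     beta: Double = 1.1, gamma: Double = 1.15) -> Scaling {
    return Scaling(depth: pow(alpha, phi),
                   width: pow(beta, phi),
                   resolution: pow(gamma, phi))
}

/// Rough compute cost relative to the baseline:
/// linear in depth, quadratic in width and in resolution.
func relativeCost(_ s: Scaling) -> Double {
    return s.depth * s.width * s.width * s.resolution * s.resolution
}

let baseline = compoundScaling(phi: 0)
let oneStep = compoundScaling(phi: 1)
print(oneStep.depth, oneStep.width, oneStep.resolution)
print(relativeCost(baseline), relativeCost(oneStep))
```

The constants were chosen so that each step of φ roughly doubles the compute, which gives a family of networks at predictable cost levels from a single searched baseline.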

This is the output logic of MobileNet V3, and I thought this was interesting. Basically, they took this original last stage up here, which is kind of like more of the MobileNet V2 style of approach we might say, but they found that this simplified set of 1x1 convolutions was able to produce extremely good results, and it’s also extremely performant. This network is cool to me because we have this super tiny, super-fast network, and then we can start to use it to build other things. This is another slide from the paper here at the bottom where they took the MobileNet V3 as a base, and then they added a full-blown segmentation network and had that on top in order to produce a state of the art segmentation network that can actually be run on phones.

State of the art

Then from there, you can get into the full-blown state of the art approaches. Algorithms are nice, but at the end of the day, it’s also really good to have lots of data. So this Facebook paper from last year, where they took like a billion images from Instagram and used that to build a really large image recognition network, is an interesting paper for you to look at. We can go from big data into data augmentation strategies, so this RandAugment paper from last fall, where they use basically evolutionary strategies to find an optimal set of augmentations, is another interesting new paper. Basically, by adding pre-processing layers to their data, they were able to improve the accuracy of their networks by 4 or 5% just by tweaking the data that went in. I didn’t talk much about object detection networks, but conceptually we can take an EfficientNet and put an object detection head on top of it, and voilà, we’ve produced a state of the art object detection network. So this is another interesting paper to look at as well, EfficientDet.

Then over in the field of NLP, model distillation is an interesting technique to try to make big networks small. But this Noisy Student paper is a really interesting one where they used a whole bunch of TPU time in order to make small networks large. So they were able to train even larger versions of EfficientNet just by using this sort of strategy, and I thought that was a really interesting paper to look at as well.

Anyway, that’s mostly what the primary sequence of my book is. It’s kind of like how I structured it, so I hope this all conceptually makes sense to you. I’d like to thank these people at Apress for helping to guide me through the process. A friend of mine named James Maki read an early version of things and gave me a bunch of feedback on tone and how I was structuring things, so I want to say thank you to him. I had some questions about TPUs and Brennan answered those, so thank you to Brennan. And I want to say thank you to my parents, and Quarkworks – my company – for keeping the lights on, so to speak. Then, at the risk of making a bad joke, I’d like to put a shout out to the Covid virus for keeping me locked up for a few months this year, which did wonders for my productivity.

Next steps

Last year I did a presentation with you all, and I had a wishlist of some stuff I thought would be interesting to tackle next. Checkpointing – I want to say thank you to Brad for banging on that. I know that model serialization isn’t super fun, but it’s a really important piece of plumbing, so it’s cool that we have that now. I think this whole TPU training piece is an interesting power of this whole Swift for TensorFlow approach. This idea that you can code something for a single TPU and then run it on a pod, to literally have thousands of cores on demand, is something that’s really interesting, and I think that’s where we should be looking in the future.

PyTorch has recently been working on adding mixed precision to their libraries, and I think it would be interesting to think about the best way to add it to Swift for TensorFlow. I think bfloat16 in particular is a really interesting data type. I think part of the reason why people have not adopted TPUs as much as would be desired is that they don’t have a way to run TPU code locally to build and test it. So with NVIDIA getting Ampere to market here shortly, I think it would be super interesting that you could build bfloat16 code on your home workstation, and then ship it off to the cloud to run there, and in theory the numerical stability and whatnot should all be the same. Intel was supposedly going to ship Cooper Lake this year, but I think that’s been back-burnered. With their next-generation architecture next year, we should be able to have bfloat16 hardware at the CPU level as well. So I think this is a really interesting thing to try and get on top of in the upcoming year.

Then in general, I think the existing data pipelines work really well. But as you start to scale something up to a TPU pod, there’s probably going to be a whole bunch of profiling and optimization that can be done in order to increase the speed there. So I think it would be interesting to start thinking about the best way to get that piece going.

Then with that, I’ll say thank you all for listening!

Q + A

Ewa: Thank you, Brett. There are some questions for you in the chat, I’m just going through them… Rhett asked, “Are all the papers mentioned here gone over in the book?”

I mentioned most of them, but I didn’t implement them all. But I did implement the primary sequence of the MobileNet networks, EfficientNet, VGG and ResNet.

Ewa: Here’s a question… “You’ve added a number of models to Swift models like VGG16, VGG19, MobileNet, EfficientNet, probably more… How was your experience at adding these to Swift for TensorFlow and writing them in Swift, versus Python?” and, “What other models do you think are missing from the Swift models repository?”

Sure. I was doing a lot of PyTorch a year ago, so when I initially started trying to write stuff for Swift, I was trying to bring PyTorch models and logic over. But basically, I found that I ran into a lot of speed bumps there. The two frameworks would make subtly different assumptions about how things worked, and then things would break. So somewhere halfway through the year, I switched to working off Keras-based models, and that made my life much simpler. If Keras makes an assumption about how TensorFlow works, then Swift for TensorFlow by extension will oftentimes make the same assumption. So after that, my life got a lot easier. As far as other networks that would be interesting to implement, there are some other smaller networks that are kind of interesting. SENet would be kind of cool to sneak in there. But in general, I think the thing that would be interesting to me would be to get an ImageNet training demo working to where we could literally be training these networks and bootstrapping the scaling process that way.

Ewa: Thank you. Althaus has some nice complimentary words, “Can’t wait to receive the book once it’s out.” He also said he hopes there will be a GitHub repository with the models and Swift notebooks. Is that your plan for the book?

Yeah, I have all the code together. I haven’t published it, but it’s definitely on my list of to-dos. But yeah, first we have to get the book out the door, so to speak.

Ewa: Alright. Brad answered your question about mixed precision. He said that mixed precision with bfloat16 is present under the hood in x10/TPU right now, but we don’t have support for GPUs yet. Then also again, they’re asking you to compare your experience in writing in Python versus Swift? Maybe you already answered that?

No, with respect to Python and Swift… I’ve done a lot of Python, and I’ve done a decent amount of Swift at this point in time. To me what’s cool about using Swift is that you get this type safety, and so it’s made me willing to make changes. With Python code, you can change one line, and then you have to go through a whole bunch of steps before something crashes, whereas with Swift for TensorFlow, I’ll oftentimes find myself changing things and refactoring them slightly, and then relying upon the compiler to throw errors, and I’ll fix things until finally, the compiler is happy – at which point oftentimes I’m extremely confident the code is going to work. Making the compiler happy is the hard part, but that little bit of cost upfront is way better than having the uncertainty about some Python gremlin giving me trouble because I didn’t do everything perfectly over there.

Ewa: Some follow-up questions about that – speaking of the type safety benefits of Swift, do you build your models on Xcode or on Linux?

I’ve been doing everything with Ubuntu 18.04, CUDA 10.2, and some NVIDIA GPUs. It would be cool to have Ubuntu 20.04 we’ll say, but I know there’s a whole bunch of stuff that has to happen for that to happen. Then the same thing, it would be cool to have CUDA 11, but a whole bunch of stuff has to happen for that to happen as well. It’s my understanding that’s part of the TensorFlow 2.4 roadmap, and so hopefully once that gets sorted, Swift for TensorFlow can be brought into parity with that and we can solve that problem. I have a very wacky workflow where I use Cyberduck to open files in TextMate, which is just an old Mac editor that some of you probably know. So then I save stuff, it saves over the network, and then I have a tmux session open, and I make it run again over there. But every now and then, things will get out of sync, and your brain will hurt.

Ewa: Thanks. We have a couple of questions from Michael… He asks, “Will you be providing checkpoints of trained models on there?”

No, most of my demos are just based around the Imagenette dataset, which you can run locally in an hour or two on your home machine. So eventually yeah, if we had a full-blown ImageNet proper dataset, that would be an interesting thing to do. But I did not make checkpoints, no.

Ewa: Another question from Michael, “Do you cover tasks like object detection and segmentation in your book?”

No, I only covered image recognition. I mention these other areas, though: once you build up to a state of the art image recognition network, there are ways to jump into those other fields.

Ewa: When will the book be released?

They’re going to send me the proofs, supposedly any day now. Then I need to make a few tweaks, and I ship it back. Then hopefully maybe next month, but cross fingers, knock on wood, et cetera, it will actually be out the door.

Ewa: Robin asks, “Are there any particular resources you recommend for building Swift skills quickly?”

You can load up the playgrounds in Xcode. I think that’s just kind of interesting to get this idea of having a full-blown REPL working. But I think really, I’m an iOS developer, and I think just doing a very basic iOS app and running some Hello World tutorials over there is a really good way to understand how the whole iOS system works together, and I think a lot of these paradigms for Swift, in general, come out of the NeXTSTEP libraries. So if you can structure things in a way that sort of matches that, that’s a really good solid foundation for Swift in general.

Ewa: I was just going to say, you have Swift guides and there’s a “Getting Started” notebook that you can launch in Xcode in Playgrounds, and Playgrounds is like an interactive environment in Xcode that works for Swift. So that’s a great place to get started as well.

Yeah, the Colab notebooks are really nice too, because if you’re a beginner, they really simplify the process of provisioning an environment, so you can just focus on the code.

Ewa: Some more questions here asking for a few more details about the Linux IDE that you’re using. Are you using VS Code or Emacs, or anything?

Like I said, TextMate on a Mac. Sorry. Or Nano, maybe. No, I do a little bit of vi every now and then, but I’m by no means an expert for sure.

Ewa: I personally use LLDB. I know some of the people on our team also use VS Code. I don’t know if anyone wants to speak to their experience with VS Code on this thread. But yeah, I think the tools on Linux could be improved for sure. Then Brad asks, “Swift for TensorFlow evolves regularly, how does Apress manage updates if the syntax needs to be changed in examples? Have you found it difficult to keep up with the changes as you’re writing the book?”

Yeah, I think that’s an open problem. It’s definitely going to evolve underneath it. Most of my demos, I did them around the epochs APIs, so it should be future proof for the near future. I think part of the idea of posting some code on the internet is that I can periodically update it to match whatever the latest Swift for TensorFlow version is. So hopefully I can mitigate that to some degree, but definitely, that’s going to be an issue.

Ewa: Thank you. Did anyone else have any questions they would like to bring up for Brett? By the way, Brad says he uses VS Code regularly. It works great as a remote editor via the SSH extension. Dave uses Linux. So I think that’s it… One more question, here we go – “Do you explain transfer learning in your book?”

No, I didn’t do transfer learning either. I’m just building generic image recognition networks.

Ewa: Okay. Thank you so much, I can’t wait to see the book. That’s awesome – and let us know if there is anything we can do to help, of course.

I think it’s a cool project, I enjoy banging around with it, and I think doing all of this has made me go back to the basics, so to speak. So I actually feel like it’s dramatically improved my neural network knowledge in general after messing with this for the past year or so.

Ewa: Thank you for writing those models and contributing them to the project!