Introduction to Swift for TensorFlow (2020)

Oct 21, 2020 in INTRO • S4TF
30 min read

An introduction to Swift for TensorFlow for Google ML Summit 2020 (10/21) and GDG DevFest Cloud Kampala, Uganda (10/24)

Abstract

We will use Swift for TensorFlow to build a simple neural network and use it to categorize MNIST digits, then look at how we can extend our approach to run on different hardware using Google’s cloud. Along the way, we will look at how Swift works with the LLVM compiler and automatic differentiation to make it easier to reason about our code.

Video • Slides

Transcript

Hello, thank you for coming, and thanks for having me. Today we’re going to talk about Swift for TensorFlow.

We’re not going to do anything more technically complicated than run through an MNIST demo a few times. The purpose of this presentation is to help you understand what exactly is happening under the hood whenever we run our code. Swift for TensorFlow is composed of a few different pieces – the Swift programming language itself, the LLVM compiler, neural networks, automatic differentiation, and TensorFlow’s XLA library. My hope is that you come away understanding these pieces and how they fit together.

Our demos will run on several kinds of hardware – CPUs, GPUs, and ultimately a TPU in the cloud. Then at the end, we’ll do a quick recap and I’ll leave you with some next steps – where to go from here.

Overview

At a high level, these are some of the components that make up Swift for TensorFlow proper. A large portion of the project was started as a branch of the Swift programming language – about half of that code has been upstreamed into Swift proper, so it’s part of the 5.3 release. You can download that part today. The team is working on getting the rest merged eventually. We’ll also use some pieces of the Swift Package Manager so we can pull down some other code and useful tools from the internet.

The LLVM compiler is a large piece of Swift’s power, so we’ll look a little bit at that and how it works. Then the Swift APIs library is the core of Swift for TensorFlow. This is a set of neural network operators, although perhaps more broadly, you might think of them as being able to do math in general. The big thing that Swift for TensorFlow gives you under the hood is this idea of automatic differentiation. This is a really powerful tool that allows us to simplify building complicated neural networks, so we’ll try to break down how exactly that works.

Then we’ll look a little bit at the accelerated linear algebra library, or XLA as it’s called. This is part of the larger TensorFlow project, and so by using it, we can bridge into the larger TensorFlow world, and by extension, we can use this to run our code on new and exciting types of hardware.

Before we even talk about Swift though, I think we have to talk about the motivations for its existence. So at one point in time, all programming was done in Assembly – out of this Cambrian era we’ll say, there were a number of interesting ideas that came out. But one of the big ones was the C programming language. This let people code in a slightly higher-level language, but then they could recompile their code to run on different platforms. This really in turn begat the whole Unix explosion and many of the concepts that we consider to be the primitives of computing today.

There are a number of high-level languages that came out of this era. One of the interesting ones is a programming language called Smalltalk. In order to simplify building larger and larger programs, Smalltalk popularized this idea of message passing – loosely coupled actors with a shared protocol for talking together. A company named NeXT really liked this Smalltalk concept, so they wanted to try and code with it all the time. But in the process of trying to bring Smalltalk-style programming to their computers, they discovered that there were some limitations to how fast they could make things go. So they adopted a set of Smalltalk-style extensions to the C language called Objective-C, and this was [post acquisition] the backbone of Macintosh programming in general for many years.

The common theme, whenever we look at this sort of history of programming languages, is the idea of speed versus safety. We have languages that can go basically as fast as possible, but they have sharp edges. Things can break, programs can crash, and all sorts of interesting states can be reached. We can try to add safety checks and layers of abstraction in order to make it easier to build and reason about our code, but oftentimes, these sorts of tools have a cost.

On personal computers, oftentimes we can just build bigger and faster machines and we don’t have to think as much about this. But increasingly in the real world, with embedded and edge devices, the usage of resources is a really big deal. We might think that the difference between one megabyte and two megabytes of RAM is not really that big of a difference. But for a product that’s going to be shipped out on a million devices that need to run on their own for years on end, this could be the difference between profitability or not. So as a result, whenever you start shipping projects to the real world, you oftentimes have to become extremely concerned with conserving this kind of space and resources.

I think really the big driver of Swift is simply Apple bringing the iPhone to market. The phone is basically the ultimate embedded device. It’s basically a computer that fits in your pocket. But in order to do this, to run firmware updates in the field, or to work on networks where things don’t always work perfectly, all of a sudden these sorts of limitations really become a big deal. So I think there was a strong internal interest at Apple in finding something better than Objective-C to work with going forward, and I think that was a big part of the push for why they introduced the Swift programming language in 2014.

Swift is actually in many ways a fairly pragmatic language. It’s designed to work with existing C and Objective-C codebases side by side. So this is nice because we don’t literally throw away 50 years of work that’s got us here. After about a year, Apple converted Swift into an open-source project. So in 2015, they published it on the internet. There’s a lot of stuff that happened in the first couple of years out there, but we’ll say that out of the chaos, an order emerged.

Now there is a strong and steady cadence of new releases, new features, and proposals in order to improve things. You might think of Swift as being only for iOS, or perhaps just a Mac programming language, but you can actually run it on Linux as well. A number of our demos today are going to be on Linux, and then as of the recent 5.3 Swift release, Windows now has its own port. So basically any sort of programmer around the world can be brought into the fold.

But the big key concept that Swift brings to this whole ecosystem is functional programming – and it does so in a fairly pragmatic way. It doesn’t force you to go all-in on functional programming as some other tools like perhaps Haskell might do. Alongside that we have type safety, which is an incredibly powerful concept that we can use to build neural networks, as we’ll see shortly.

The LLVM compiler is what Swift uses under the hood in order to generate its code. LLVM is really cool in how a whole ecosystem has blown up around it, and so we have all these other programming languages that are using LLVM as well. As a result, there are all these different groups with different ideas and programming styles that are pushing the forefront of what is possible. We might think of TensorFlow as being a neural network library, but in reality, if you get deep into it, it’s basically a domain-specific language for generating TensorFlow graphs, we’ll say – and so by extension, it can go directly into LLVM because it’s basically its own programming language as well.

This top part of the slide I stole from Chris Lattner’s presentation from a couple of years ago. The big recent update in the latest version of LLVM is that Fortran has now become a part of this community. So going forward, the arcane powers of the Fortran wizards will perhaps become a part of this whole ecosystem as well.

Multi layer perceptron

So let’s look at some actual Swift code. What we have right here is literally going to be the first two lines of the demo: import TensorFlow and import Datasets. We get the TensorFlow import from our Swift for TensorFlow toolchain itself, so we get this one for free. But the second import is actually a really important one. We get it by way of the Swift Package Manager. This gives us a bunch of code from the internet, so from the very first lines, our code is going to reach out to the internet and download the MNIST dataset for us – it will give us a nice little dataset to build our models on top of.

So what we’re looking at here is the MNIST dataset – it’s simply a collection of handwritten numbers 0-9. The whole purpose of the network that we’re going to build is simply to categorize. We’re going to look at this and it will say, “Is this a 0? Is this a 1? Is this a 2?” Et cetera.

So towards that end, a very simple basic first type of neural network that we’re going to build is called a multi-layer perceptron. This slide right here is showing us how the math for MLP actually works, and it looks a little bit scary at first, I’ll admit. But I think the key concept to look at whenever you’re seeing this is that all these arrows are pointing in one direction. What this means is that this network is directional.

So basically, as a result, we can have a deterministic pattern if we feed data into it. So right here, this is going to be our input nodes. This will be our first actual neural network layer. These are the weights and biases that feed into this neural network node. The second one has a similar set down here as well. This H piece is going to be the activation function. Then we have a second layer of neural network nodes, with a set of weights and biases as well, in order to produce our loss function over here. Don’t worry if you don’t get all this at first, because we’ll look at it more here in a bit.

So this is what a very simple multi-layer perceptron would look like in Swift for TensorFlow. We have this flatten layer right here, then two densely connected layers, and then our final output layer. The flatten operation is kind of a tricky one to wrap our heads around, but I think this slide here at the bottom does a good job of explaining it. We have right here what might look like a number one in a simple 4x4 matrix. So what we’re going to do is unroll it. We just take one row at a time and make a long string of numbers.

The MNIST dataset itself is composed of 28 x 28 black and white pixel sets, and so by extension… 28x28 is going to produce an input layer of 784 neurons. Then for our two middle layers, we’re going to use 512 neurons apiece, and then finally our output layer is going to be composed of 10 neurons. So this right here is the code that does that. We go from 784 to 512. Then 512 to 512 again. And then finally 512 to our output of 10 categories to represent our 10 digits.
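The flatten step and the layer sizes described above can be sketched in plain Swift. This is only an illustration using the standard library, not the Swift for TensorFlow API; the helper names `flatten` and `denseParameterCount` and the 4x4 example grid are assumptions made up for this sketch.

```swift
// Plain-Swift sketch (no TensorFlow import): flattening and layer sizing.

/// Unrolls a 2-D grid row by row into one long array.
func flatten(_ image: [[Float]]) -> [Float] {
    return image.flatMap { $0 }
}

/// Number of trainable values in a dense layer: one weight per
/// input/output pair plus one bias per output (hypothetical helper).
func denseParameterCount(inputSize: Int, outputSize: Int) -> Int {
    return inputSize * outputSize + outputSize
}

// A made-up 4x4 grid that looks roughly like the digit one.
let one: [[Float]] = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
let unrolled = flatten(one)  // 16 values, one row after another

// Layer sizes from the talk: 784 -> 512 -> 512 -> 10.
let layerSizes = [28 * 28, 512, 512, 10]
var totalParameters = 0
for (inputSize, outputSize) in zip(layerSizes, layerSizes.dropFirst()) {
    totalParameters += denseParameterCount(inputSize: inputSize, outputSize: outputSize)
}
print(unrolled.count, totalParameters)  // prints "16 669706"
```

Even this small network carries several hundred thousand trainable values, most of them in the first dense layer.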

The second piece right here is our forward function. This is the part of our neural network that actually gets differentiated. But conceptually, we simply take an input – whatever was given to this network – and we sequence it through our flatten operation, our two hidden layers, and then finally our output layer, and we return the result. So this is actually a very large piece of Swift for TensorFlow and how it works under the hood. So I think it’s important to look at it a little bit deeper.

Our multilayer perceptron conforms to this Layer protocol, which in turn requires us to implement this forward function. So let’s look at that right now. Here’s the actual Layer protocol definition from the Swift APIs codebase. There are a few different ways to think about protocols, but I like to think of them as being type safety for functions. We require functions to talk together in a certain way, and by extension, the compiler can look at your code and guarantee that it works a certain way.

Here’s the actual callAsFunction method, which simply maps our input to an output. Then this forward is a piece of syntactic sugar that does the same thing, but that was recently added. I simply like it because it’s more in tune with how a number of other neural network projects work. So let’s do a very simple demo of running our one-dimensional multilayer perceptron on the MNIST dataset.
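To make the "type safety for functions" idea concrete, here is a minimal sketch of a layer-style protocol in plain Swift. It is a simplified stand-in, not the definition from the Swift APIs codebase: `SimpleLayer`, `Scale`, `Shift`, and `chain` are hypothetical names, and there is no differentiation involved.

```swift
// Hypothetical, simplified stand-in for a layer protocol.
protocol SimpleLayer {
    associatedtype Input
    associatedtype Output
    /// Lets a conforming value be called like a function: `layer(input)`.
    func callAsFunction(_ input: Input) -> Output
}

extension SimpleLayer {
    /// Sugar: `layer.forward(input)` does the same thing as `layer(input)`.
    func forward(_ input: Input) -> Output {
        return callAsFunction(input)
    }
}

struct Scale: SimpleLayer {
    var factor: Float
    func callAsFunction(_ input: [Float]) -> [Float] {
        return input.map { $0 * factor }
    }
}

struct Shift: SimpleLayer {
    var offset: Float
    func callAsFunction(_ input: [Float]) -> [Float] {
        return input.map { $0 + offset }
    }
}

/// Chains two layers. This only compiles when the first layer's
/// output type matches the second layer's input type.
func chain<A: SimpleLayer, B: SimpleLayer>(
    _ first: A, _ second: B, on input: A.Input
) -> B.Output where A.Output == B.Input {
    return second(first(input))
}

let result = chain(Scale(factor: 2), Shift(offset: 1), on: [1, 2, 3])
print(result)  // prints "[3.0, 5.0, 7.0]"
```

The `where A.Output == B.Input` clause is the point of the sketch: a mismatch between two layers is rejected by the compiler instead of surfacing at run time.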

Most people have not seen this, but Swift for TensorFlow has toolchains that you can download, and so you can download them and simply run them in Xcode directly. Let’s look at what that would look like right now. Here’s our very simple network, we have our two inputs and outputs. Here’s the same code we were looking at before, and we’ll look at all the rest of this here in a second. At a high level, it’s a project. We can hit the play button and after a few seconds, we’re going to actually start training our MNIST network right here on my laptop. So a very simple network – the training is getting about 94% accuracy. It’s simply running locally here on my computer.

Convolutions

Next, I think we should introduce the concept of convolutions. This is kind of a tricky concept to explain, but I think this slide is as good a way of doing so as I have found. Conceptually, we’re going to take an input picture – this background thing – and we’re going to map it to an output picture. In the process, we’re going to go through this small little 3x3 convolution, or kernel as it’s sometimes called. This convolution that we’re looking at right here is all 1’s, so it produces what’s called an additive blur. I think this is about as simple a convolution as you can find in practice.

What that means is that literally our result over here, this 7, is going to be the sum of these numbers. One + two + one + one + two produces this output. So we calculate this number, and then we move our convolutional window over a column, and we grab our next set of 9 pixels, and we repeat the process for the next output pixel. We continue this across the image, and then we go down a row and we repeat this process over and over again until at the end, we’ve mapped everything from the initial picture to the output picture.
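The sliding-window process just described can be written out in a few lines of plain Swift. This is a sketch under stated assumptions: the `convolve` helper and the 4x4 input are invented for illustration (the slide's actual pixel values are not reproduced here), there is no padding, and the stride is 1.

```swift
/// Slides `kernel` over `image` (no padding, stride 1) and sums the
/// element-wise products under the window at each position. As in most
/// deep learning libraries, the kernel is not flipped.
func convolve(_ image: [[Float]], with kernel: [[Float]]) -> [[Float]] {
    let kernelHeight = kernel.count
    let kernelWidth = kernel[0].count
    let outHeight = image.count - kernelHeight + 1
    let outWidth = image[0].count - kernelWidth + 1
    var output = [[Float]](
        repeating: [Float](repeating: 0, count: outWidth), count: outHeight)
    for row in 0..<outHeight {
        for col in 0..<outWidth {
            var sum: Float = 0
            for i in 0..<kernelHeight {
                for j in 0..<kernelWidth {
                    sum += image[row + i][col + j] * kernel[i][j]
                }
            }
            output[row][col] = sum
        }
    }
    return output
}

// Made-up 4x4 input whose top-left 3x3 window contains 1, 2, 1, 1, 2 and zeros.
let picture: [[Float]] = [
    [1, 2, 1, 0],
    [0, 1, 2, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
// The all-ones kernel from the talk: each output is the sum of 9 pixels.
let onesKernel = [[Float]](repeating: [1, 1, 1], count: 3)

let blurred = convolve(picture, with: onesKernel)
print(blurred)  // prints "[[7.0, 7.0], [4.0, 4.0]]"
```

In a trained network the nine kernel values are learned rather than fixed at 1.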

You can do all sorts of interesting games with these convolutional kernels. This is a whole deep area of theory, but for our purposes, this layer is part of what the neural network is going to actually learn for us. So that is one of the key tricks that the neural network is going to do. Maxpool is a simple operator, I think you can wrap your head around it. The problem with this 3x3 convolutional step is that it often produces a lot of data. So this maxpool is a simple way to reduce the amount of data we’re working with.

So we literally take each group of 4 pixels, grab the largest one, and then map it to the next layer. Then, as a result, we can take a group of 16 pixels and reduce it down to 4. By extension, we reduce our data. So let’s look at what a convolutional neural network looks like using the same concepts in Swift for TensorFlow. Conceptually, it’s not really that much trickier than the stuff we were looking at before. We simply add a 2D convolution operator, which is given to us by the Swift APIs library. We have two 3x3 operators which slowly increase the filter depth.
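The max pooling step, taking each group of 4 pixels and keeping the largest, can be sketched in plain Swift as follows. The `maxPool2x2` helper and the sample values are assumptions for illustration only; it uses a 2x2 window with a stride of 2 and drops a trailing odd row or column.

```swift
/// 2x2 max pooling with stride 2: each non-overlapping 2x2 block of the
/// input is replaced by its largest value.
func maxPool2x2(_ image: [[Float]]) -> [[Float]] {
    var output: [[Float]] = []
    for row in stride(from: 0, to: image.count - 1, by: 2) {
        var outputRow: [Float] = []
        for col in stride(from: 0, to: image[row].count - 1, by: 2) {
            let window = [
                image[row][col], image[row][col + 1],
                image[row + 1][col], image[row + 1][col + 1],
            ]
            outputRow.append(window.max()!)  // window is never empty
        }
        output.append(outputRow)
    }
    return output
}

// Made-up 4x4 input: 16 values go in, 4 come out.
let features: [[Float]] = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
let pooled = maxPool2x2(features)
print(pooled)  // prints "[[4.0, 2.0], [2.0, 8.0]]"
```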

Then we have our maxpool, and we can just repeat our set of network layers from before. So in order to implement this, we simply have to add one more step to our forward function. We take our input and we sequence it through one convolution, two convolutions, and our maxpool – then we take the output of this convolutional stage and sequence it through our original set of nodes. Conceptually now, we’re taking our input MNIST picture, 28x28 pixels, running it through a layer of convolutions, then mapping down to the same two 512-neuron hidden layers, and finally our output of 10 nodes.

Let’s spend some time looking at the actual training process. The key idea is that we’re going to use a technique called stochastic gradient descent in order to train our model. This is simply showing how we would set up the outside of our loop before we do so. We create our model, we create an optimizer, stochastic gradient descent with a 0.1 learning rate, and then we get our dataset from the Swift models dataset handler. Then literally we’re just going to go through our data 12 times – that’s what this epoch count does. Each time through the loop, we update our model internally.

I found this picture on the internet and I really like it, because I think it really illustrates the concept of how we start with a randomly initialized network, we run through a loss update step, we repeat this process over and over again, and then finally at the end, we run it on some test data in order to produce our output accuracy and whatnot.

So let’s look at the actual gradient update step. I think this is the most important piece of code in this presentation, and by extension, I think this is where most of the magic is happening. Our first line is just a hint to the global context that we need to do things in training mode. Then literally we’re just stepping through each batch of data in our whole dataset. For each batch, we grab the raw data itself as well as the labels, and this gives us images and labels. Then we send the images to our model, and the model produces what it thinks the predictions are for those images.

Then we run this loss function called softmaxCrossEntropy. This is a nice and easy to evaluate loss function that produces good results, so we literally compare what the model gave us with what we know the answer should be. Out of all of this, we get a gradient – a difference between where we are and where we think we need to be. So then we simply take our gradients and we run our optimizer, which updates the model in the direction that the gradients say it needs to go. To me as a Swift programmer, the one little piece of code that pops out is this ampersand right here. This passes the whole model as what’s called an inout parameter, and by extension, it makes it so that the optimizer can modify the model in place.
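The loss, the gradient, and the role of the ampersand can be illustrated with a toy example in plain Swift. This is a sketch, not the Swift for TensorFlow implementation: the functions below are written by hand for a single example, the "model" is just three logits that are updated directly, and `lossGradient` and `sgdUpdate` are hypothetical names.

```swift
import Foundation

/// Softmax turns raw scores (logits) into probabilities that sum to 1.
func softmax(_ logits: [Double]) -> [Double] {
    let maxLogit = logits.max() ?? 0  // subtract the max for numerical stability
    let exps = logits.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// Cross-entropy loss for a single example whose correct class is `label`.
func softmaxCrossEntropy(logits: [Double], label: Int) -> Double {
    return -log(softmax(logits)[label])
}

/// Gradient of that loss with respect to the logits: softmax minus one-hot.
func lossGradient(logits: [Double], label: Int) -> [Double] {
    var gradient = softmax(logits)
    gradient[label] -= 1
    return gradient
}

/// One SGD step. `inout` lets the function modify the caller's values
/// in place, which is why the call site needs the `&`.
func sgdUpdate(_ parameters: inout [Double], along gradient: [Double],
               learningRate: Double) {
    for i in parameters.indices {
        parameters[i] -= learningRate * gradient[i]
    }
}

// Toy "model": the logits themselves are the parameters being trained.
var logits = [0.0, 0.0, 0.0]
let label = 2
let lossBefore = softmaxCrossEntropy(logits: logits, label: label)
for _ in 0..<12 {
    let gradient = lossGradient(logits: logits, label: label)
    sgdUpdate(&logits, along: gradient, learningRate: 0.1)
}
let lossAfter = softmaxCrossEntropy(logits: logits, label: label)
print(lossBefore, lossAfter)  // the loss shrinks as the correct logit grows
```

Without `inout` and `&`, `sgdUpdate` would receive a copy of the array and the caller's parameters would never change.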

This is the validation step. This slide looks a little bit scary, but if you stare at it, I don’t really think it should be. Basically once again we provide a global hint telling the Swift for TensorFlow back end that we just need to run inference now, we don’t need to update things. These are just some variables we create. Then like last time, we step through our dataset, but this time it’s our testing dataset, and we do the same sort of thing. We get the images and labels, we compare the model’s predictions to the labels, and then for this set, we simply add up the loss – the TestLossSum here – and then likewise, we get the number of correct predictions, and we add that to our running count. So we track how many answers we got right, and how many times we queried the model in total.

So then at the end, we can simply output accuracy, which is just the number of correct answers divided by the total number of guesses. Then all this down here simply prints out some of these training statistics for us to see in the command line. So let’s do a demo of running our two-dimensional convolutional neural network on the MNIST dataset in the cloud. For this, we’ll use Google Colab. Many of you may think that Colab only runs Python, but actually, we can use different runtimes, and so the Swift for TensorFlow team has worked with Colab in order to make it so that you can run Swift code on Google Colab, and by extension, you can get access to GPUs and even TPUs for free.
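As a small illustration of the accuracy bookkeeping, here is a plain-Swift sketch. The batch of scores and labels is made up, and `argmax` is a hand-written helper rather than a library call.

```swift
/// Index of the largest value, i.e. the class the model guessed.
func argmax(_ values: [Float]) -> Int {
    var best = 0
    for i in values.indices where values[i] > values[best] {
        best = i
    }
    return best
}

// Made-up model outputs for four test images with three classes each.
let batchLogits: [[Float]] = [
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.0, 0.4, 0.9],
    [0.8, 0.7, 0.1],
]
let batchLabels = [1, 0, 1, 0]

var correctGuessCount = 0
var totalGuessCount = 0
for (logits, label) in zip(batchLogits, batchLabels) {
    if argmax(logits) == label {
        correctGuessCount += 1
    }
    totalGuessCount += 1
}

let accuracy = Float(correctGuessCount) / Float(totalGuessCount)
print("\(correctGuessCount)/\(totalGuessCount) correct, accuracy \(accuracy)")
// prints "3/4 correct, accuracy 0.75"
```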

So here we are in Google Colab. If you look here, you can see we have all this going. This first line is a really important one because this goes out and fetches our dataset and whatnot off the internet so that we can run things. The only problem is that it takes a few minutes for demo purposes. Just be aware that this works, but it takes a minute or so to get going. I’m going to cheat and use a different notebook that I have here, I ran this command already ten minutes ago. So we can see that we have a live console right here and that I’m programming things directly.

So here’s our code that we were just looking at. Now we’ll just run our whole thing as a large call to Google Colab. We’ll give it a few seconds to get running here, but then we’ll be running our MNIST demo using a 2d convolutional neural network on a GPU in the cloud. As we can see, our Colab instance has reached out to the internet. It’s downloaded the MNIST data for us, and now it’s actually training our simple neural network, but in the cloud while using a GPU. I think this is a really cool trick to know because it enables you to play with a lot of this code for free.

Autodiff

Next, I thought it would be good to take some time to look at how autodiff actually works under the hood, because this is an important part of how the whole Swift for TensorFlow system works. Autodiff is from the 1970s or so. It’s not as new as you may think. But the recent work of adding it to a full-blown compiler is a really interesting way of extending its abilities to make it more powerful.

At a high level, you might think of differentiation symbolically: we start with a simple basic function, and if we know its derivative, we can write that down, and by extension the derivatives for the second step. Autodiff takes a different path, and it just sort of mechanically calculates these values. The problem with symbolic approaches is that for neural networks, in particular, they don’t scale very well. What we’re looking at here is a soft ReLU, which is a slightly different type of activation function.

But then if we apply a ReLU to a ReLU, which is a really common pattern, all of a sudden our second level of derivatives start to look a little bit scary. The thing is, we don’t actually have to have perfect derivatives. If you remember a couple of slides back, we’re using stochastic gradient descent, and so by extension, we’re already letting a little bit of uncertainty into our equations.

The flip side of this is that if we’re allowing a little bit of uncertainty in here, then instead of using absolutely correct methods, we can use approximately correct methods. So the way autodiff is often described to people is as the chain rule. That’s this concept: we have our input data, a layer of convolutions, a layer of neural network nodes, another layer of neural network nodes, and then our output. Then conceptually, in order to calculate this number here, we start over here and we calculate the derivative with respect to that. Then we can go here and track the derivative with respect to C, and then we go back to A, track the derivative with respect to B, and then finally get the answer that we’re looking for.

I think if your mental model of autodiff is the chain rule, then on a certain level, I think that’s technically accurate. But I think you’re missing out on some of the magic that’s going on under the hood. The way autodiff actually works is that if we reconstruct the second graph and then use our first graph as inputs to the second graph – then by extension, we can solve for the second graph to calculate this node over here, but the joy of this approach is that once we’ve calculated this, we can balance across these flows here, and by extension, we can get the original answer that we were looking for.

The compiler can literally introduce this whole second graph for us behind the scenes, and as a result, we can get the answer we’re looking for, but we have this whole second layer of stuff going on in order to significantly speed up and simplify our computations. The second level of this then is that as long as all of these input and output nodes obey certain rules, we can simply extend this approach by adding more and more layers.

So the rules for implementing one of these nodes are what are called vector-Jacobian products. Basically, we can convert this backprop equation into this mathematical formula, and if we can guarantee that our nodes follow this rule, we can autodifferentiate through them. So here’s what the actual vector-Jacobian product looks like. It’s conceptually very similar to the multilayer perceptron we looked at before. We have an input, a set of weights and biases, an activation of sorts, and then our final loss output function.
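The rule each node has to follow can be sketched in plain Swift with scalar functions, where the vector-Jacobian product reduces to multiplying the upstream gradient by a local derivative. This is a hand-rolled illustration of reverse-mode differentiation, not what the Swift compiler generates; `VJPStep`, `square`, `sine`, `triple`, and `valueAndGradient` are hypothetical names.

```swift
import Foundation

/// A differentiable step returns its value together with a "pullback":
/// a closure that maps the gradient of its output back to the gradient
/// of its input.
typealias VJPStep = (Double) -> (value: Double, pullback: (Double) -> Double)

func square(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    return (value: x * x, pullback: { upstream in upstream * 2 * x })
}

func sine(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    return (value: sin(x), pullback: { upstream in upstream * cos(x) })
}

func triple(_ x: Double) -> (value: Double, pullback: (Double) -> Double) {
    return (value: 3 * x, pullback: { upstream in upstream * 3 })
}

/// Runs the steps forward while recording each pullback (the "second
/// graph"), then replays the pullbacks in reverse order.
func valueAndGradient(of steps: [VJPStep], at x: Double)
    -> (value: Double, gradient: Double) {
    var value = x
    var pullbacks: [(Double) -> Double] = []
    for step in steps {
        let result = step(value)
        value = result.value
        pullbacks.append(result.pullback)
    }
    var gradient = 1.0  // derivative of the output with respect to itself
    for pullback in pullbacks.reversed() {
        gradient = pullback(gradient)
    }
    return (value, gradient)
}

// f(x) = 3 * sin(x^2), so f'(x) = 6 * x * cos(x^2).
let x = 0.5
let result = valueAndGradient(of: [square, sine, triple], at: x)
let expectedGradient = 6 * x * cos(x * x)
print(result.gradient, expectedGradient)
```

Any new step that supplies its own value and pullback can be appended to the list, which mirrors how more layers are added without changing the differentiation machinery.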

The second level of this is that we can sort of keep all these variables in memory and we don’t actually have to calculate these Jacobian forms. So as a result, this makes it very easy to make sparse forms of our matrices, and by extension, this allows us to build larger and larger networks.

So let’s look at what an implementation of a vector-Jacobian product would look like in the Swift programming language. Swish is another activation function, and it has recently become increasingly popular for neural networks and deep learning. The ReLU that we were looking at before comes in here as a black line, at zero, to this spot right here. Then it shoots off at a 45-degree angle. What the authors of this paper found is that ReLUs, as a result, sort of have a tendency to get over-indexed on zero. They have a tendency to try and minimize things, and so they end up down here.

As a result, this limits how far you can push ReLUs in practice. So Swish introduces this little penalty area right here when the function is near zero, and as a result, the neural networks end up moving a little bit away from zero – not trying to over-index on things. As a result, this makes the gradients smoother, and by extension, we can train larger and better networks by simply tweaking this little piece of math. This technique was popularized with the EfficientNet paper from last year, but we’re also seeing it increasingly in reinforcement learning where bumpy derivatives are a really common problem.

So right here what we’re looking at is what a simple implementation of the Swish function would look like – literally x times the sigmoid of x. On a certain level, we can actually have the computer work with this, and this will work reasonably well. It can differentiate it. We simply have two variables, and the compiler can figure out the rest for us there. The problem with this approach is that we’re basically making two copies – we’re taking an input of one x, and we’re outputting two x as a result. So in practice, every time we call the Swish function, we’re going to end up doubling the memory requirements of our equation.

So now we’re looking at how Swish is actually implemented under the hood in the Swift APIs. One of the joys of understanding all this math behind things is that maybe we can hope for the computer to make things simpler for us, but on the flip side if we understand the math, we can hop in and make things simpler for the computer because we know what’s actually going to happen. So what we’re literally doing right here is computing the gradient for the Swish function by hand, that’s what this line is doing. So then right here, we’re just simply returning our custom gradient for the Swish function. The joy of this is that we get rid of that previous step, and so now, this keeps our memory usage down, so this Swish function can be used in practice.
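The hand-computed gradient can be sketched in plain Swift on scalars. This is an illustration under stated assumptions, not the Swift APIs source: `swishWithPullback` is a hypothetical name, and the derivative is the standard product-rule result for x times sigmoid(x), checked here against a finite-difference estimate.

```swift
import Foundation

func sigmoid(_ x: Double) -> Double {
    return 1 / (1 + exp(-x))
}

/// The simple definition: x times the sigmoid of x.
func swish(_ x: Double) -> Double {
    return x * sigmoid(x)
}

/// Swish with a hand-derived gradient. With s = sigmoid(x):
///   d/dx [x * s] = s + x * s * (1 - s) = s * (1 + x * (1 - s))
/// The pullback captures only the one precomputed derivative value.
func swishWithPullback(_ x: Double)
    -> (value: Double, pullback: (Double) -> Double) {
    let s = sigmoid(x)
    let derivative = s * (1 + x * (1 - s))
    return (value: x * s, pullback: { upstream in upstream * derivative })
}

// Sanity check against a central finite-difference estimate.
let x = 1.5
let h = 1e-6
let numeric = (swish(x + h) - swish(x - h)) / (2 * h)
let analytic = swishWithPullback(x).pullback(1)
print(numeric, analytic)  // the two values should agree closely
```

Because the derivative is worked out once from the sigmoid value, nothing else from the forward computation needs to be kept around for the backward pass.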

Scaling hypothesis

All these sorts of little tweaks may seem academic, but they’re increasingly important in order to build larger and larger neural networks. What we’re looking at here is a TPU v3 with approximately 2048 nodes running in parallel. The Google team has been an important driver in this whole area of building larger and larger networks and exploring the limits of what is possible computationally.

Broadly speaking, there is this idea of the scaling hypothesis which says the trick to building advanced AI isn’t really having fancy algorithms or whatnot, but simply doing simple things at a large scale. We might take a language model and give it a gigabyte of data, and it might be able to output realistic-looking data responses. But we can take that same model and say we give it a petabyte of data, and all of a sudden, many interesting full form responses will just start falling out of it.

So this whole area of building things and trying to make them larger, in particular on TPUs, is a really important area of focus today. So how do you program TPUs? Well, a large part of it is a linear algebra library called XLA. TensorFlow proper supports XLA out of the box. But like we saw with LLVM earlier, there are a number of other interesting ecosystems that are exploring ways to use these tools as well.

JAX is a really interesting project. It takes a simple NumPy-style API and allows people who have NumPy code to convert it directly into XLA code, and by extension, run it on these large clusters. Since it’s almost a minimal implementation, the result is extremely performant, and it’s a really interesting tool to be aware of if you use NumPy.

The Julia project is working on implementing XLA in their machine learning workflows. I have not followed their research closely, but I know better than to underestimate Lisp people. As of PyTorch 1.6, the most recent release, the PyTorch XLA libraries have reached general availability. So if you’re coming from the PyTorch world, you can have access to these computing clusters on demand as well.

Then finally, last but certainly not least, Swift for TensorFlow, by virtue of using these compiler techniques, can also use these XLA libraries, as we’ll see here in a second. What we’re looking at here on the right is simply a graph from the recent MLPerf v0.7 results from a couple of months ago – they’re showing that Google has a TPU v4 waiting in the wings, and soon we’ll be able to train networks more than twice as fast.

gcloud compute tpus create s4tf-mnist-demo \
      --zone=us-central1-f \
      --accelerator-type=v2-8 \
      --version=nightly

export PATH=~/usr/bin:"${PATH}"
export TPU_IP_ADDRESS=$YOUR_TPU_ADDRESS_HERE

export XLA_USE_XRT=1
export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
export XRT_WORKERS='localservice:0;grpc://localhost:40934'
export XRT_DEVICE_MAP="TPU:0;/job:localservice/replica:0/task:0/device:TPU:0"

So, here are the modifications needed to run our simple MNIST network on a TPU using the XLA libraries. At the top, which is the key trick, we get a handle to a remote device running in the cloud. Then we literally take our model and we move it to our device, we get our optimizer and we move it to the device. Our training loop is more or less identical, but now we have to take our data and move it to our device in the cloud as well. But then there, we can literally run our model on our device data with our optimizer, and we can update it there in the direction of our gradients. So all the rest of our logic actually remains exactly the same.

This last little line is the only little piece of XLA specific stuff here. This simply puts a large hint to the XLA compiler that it needs to stop listening for new instructions, and execute this loop. So let’s now demonstrate running a 2d convolutional neural network on the MNIST dataset using XLA and a TPU in the cloud. This first line is quite simple – this is how I created my TPU before the presentation, so I have it running and standing by. These are some shell variables that you need so that the Swift for TensorFlow code knows what TPU it’s talking to, but other than that, everything else is roughly the same.

So here we are running our MNIST demo against the TPU in the cloud. We’ll give it a couple of seconds to get going, it takes a few seconds to wake up. It’s kind of difficult to actually observe a TPU at work, but what we can look at right here is the traffic over our network interface. So what you’re seeing right here is our virtual machine streaming information to the TPU, and then getting responses back. This is how this actually works under the hood.

Recap

So to recap – we built and trained a simple convolutional neural network, we talked a little bit about how Swift, LLVM, and neural networks work in general and how Autodiff can be combined with XLA in order to run our code on arbitrary hardware. So we ran our demo locally and in the cloud using the CPU, GPU, and then a TPU on demand.

That’s most of what I have for content. If you’re interested in playing around with this Swift code, I would highly suggest that you check out the swift-models repository – in Google Colab, you can get things running in a web browser pretty easily. I’ve written a book, it will be out shortly. It’s called Convolutional Neural Networks with Swift for TensorFlow, so you might check that out. At 9 am Pacific on Fridays, the Swift-sig group does video meetings. If you’re interested, I would highly suggest that you listen in. They usually have really interesting talks.

In particular, I would like to say thank you to Ewa for listening to a prior version of this presentation and giving me a bunch of feedback on how to improve it. If you’re interested in Autodiff, a number of today’s slides were stolen from a set of lectures by Roger Grosse, a gentleman up in Toronto, so you might go look at his work. And then perhaps more broadly than Swift for TensorFlow, but if you’re interested in learning how TPUs really work and the whole nitty-gritty details, I think the best way to do so is to go through the Cloud TPU tutorials.

I’ve done most of them at this point in time, and I’ve learned a lot in the process of doing so, so I think that’s a really good way to get started with TPUs in particular, and then perhaps deep learning in general. Then finally you might say that you can’t afford to run TPUs all the time, and I would highly suggest that you reach out to the TensorFlow Research Cloud. I’ve found that they’re quite reasonable in providing credits for you, and so I’d like to say thank you to them for providing these TPUs that I used to run today’s demos. With that, I will say thank you for your time.