Introduction to artificial intelligence

An overview of the history of artificial intelligence, some important milestones, and a look towards the future for the Medipol Summit with DSC group of Istanbul Medipol University in Turkey.



Good morning, thank you all for stopping in to listen, and thank you Vusal for inviting me. Today we’re just going to talk about neural networks and artificial intelligence. This is a very large and broad field. I thought I would give you my take on things, how I think about how it all fits together. So towards that end, we’ll look at some of the history, where they came from, we’ll look at neural networks, and some of the different types and interesting historical milestones in building this field.

Then we’ll talk about some of the interesting applications, some problems that have proven themselves to be approachable by these techniques. Then at the end, we’ll talk about the scaling hypothesis, just this idea that in the upcoming decade, we’re just going to be building things even larger and with more and more data and compute, and what the potential implications of that are. Then from that, broadly, we’ll try to look forward towards the future and make some predictions about the future of this field.

Historical context

Really, we have to go all the way back - this gentleman on the left here, this is an early picture of Alan Turing, who was an important British mathematician during World War II. He did a lot of theoretical work, but one of the things he developed was the Turing machine, this idea that by encapsulating state and following a certain set of rules, you can build machines, or more broadly computational machinery, to tackle larger and larger problems - the theoretical framework for computers as we know them today.

The gentleman on the right is William Shockley. He is one of the key people who invented the transistor - literally just a tiny switch that you can turn on and off. The combination of these two things, the theory of computation and then the practice of actual circuits and transistors, combined to produce integrated circuits, and then more broadly, computers. Really, a lot of it comes from these two gentlemen in the 1950s. What’s interesting is that a surprising amount of this field was really there already - these people back then were actually thinking about it and building it and working on it as early as the 50s and early 60s.

Symbolic logic, this idea that we can solve problems just using nothing but pure reason if we can create truth systems and things like that - it was a well-established field very early on. The statistics people were not to be left out, they understood very early that they could use these devices to approach problems using statistical methodologies, and so really, I would say that machine learning proper comes out of this era. This combination of computers and data and then a little bit of statistics, I think was actually established way back then.

This gentleman over here on the right is Frank Rosenblatt. He was a psychologist, and in 1959 he invented something called the perceptron. This was a way of taking an input, having a computer learn the differences in the inputs, and then having it output a result through an output neuron - really the birth of this whole neural network field.

Now, we’ll fast forward to not quite the present, but to 2010 or so. Honestly, I feel like a lot of this stuff was already there at that point, over 50 years earlier. Basically, computers got faster, we gathered more and more data, and we found that a lot of the basic algorithms from the early days did a reasonably good job of scaling up.

One of the joys of statistical approaches is that as you add more and more data, the models get larger and larger, but they still continue to work, and you can tweak them easily. You can keep adding more of these small models, and so machine learning dominated the field up until around 2010 or so. I put this graph here because I think it illustrates the key idea: the pieces of deep learning were there, but we needed more and more data, and more and more computational power, before these approaches actually started to make sense.

So we can argue about the point where, say, the green line crossed the red line, but probably the key moment was when people started doing this stuff on GPUs, somewhere around 2012. For most problems, long before you try something with deep learning or anything fancy like that, you should look into the bag of tricks that machine learning has picked up over the years and start there. If you have something working with machine learning, deep learning may make it fancier or better, but if you can’t get it working with machine learning, then deep learning probably isn’t going to help either.

Convolutional neural networks

So what we’ll try to do now is go through a whole bunch of different types of neural networks. I think one of the best places to get started is this field called convolutional neural networks - I’ve talked about this a decent amount over the years. On the upper left, we have our little bitty perceptron from 1959, literally just mapping an input to an output. The basic problem of perceptrons is that they tried to do too much: to go straight from the input to the answer. That’s asking too much of the neural network, and so the next big step was called feed-forward neural networks. They just add this set of green nodes, what’s called a hidden layer. This allows a layer of indirection between the input and the output, which makes it much easier for the neural network to learn the patterns and representations of the data.
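As a sketch of what Rosenblatt’s idea boils down to, here is a tiny perceptron in plain Python learning the logical AND function with his weight-update rule. The function names and constants here are mine, just for illustration:

```python
def perceptron_train(samples, epochs=20, lr=0.1):
    """Train a single perceptron: two weights, a bias, step activation."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            # Step activation: fire if the weighted sum crosses the threshold
            out = 1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0
            err = target - out
            # Rosenblatt's update rule: nudge weights toward the target
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Learn logical AND, which is linearly separable
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(data)
predictions = [1 if (w[0] * x1 + w[1] * x2 + b) > 0 else 0
               for (x1, x2), _ in data]
```

A single perceptron can only learn linearly separable functions like AND - famously, it cannot learn XOR, which is exactly why the hidden layers below matter.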

If one layer of indirection is good, then two layers are even better, so to speak. So we have deep feed-forward networks: an input, two layers of hidden or indirect nodes, and an output. This is a really common pattern, and you’ll see it in a whole bunch of other places in this field, so I think it’s a valuable one to pick up. Convolutions are just another area of computer vision theory, although they’ve expanded out to tackle a whole bunch of other things. The key idea of convolutions is basically that they’re a cheap way to break up data and work with it, and so, by combining them with neural networks, we have convolutions do some of the heavy lifting, and then we use the neural network at the end to do the actual pattern recognition step.
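To make the “layers of indirection” idea concrete, here is a minimal forward pass through a deep feed-forward network in plain Python. The weights are arbitrary made-up numbers; a real network would learn them:

```python
def relu(v):
    """Standard rectified-linear activation, applied element-wise."""
    return [max(0.0, x) for x in v]

def dense(x, weights, biases):
    """One fully connected layer; weights[j] holds the incoming
    weights for output neuron j."""
    return [sum(w * xi for w, xi in zip(wj, x)) + bj
            for wj, bj in zip(weights, biases)]

# Input -> hidden layer 1 -> hidden layer 2 -> output:
# the "deep feed-forward" shape from the slide
x = [1.0, 2.0]
h1 = relu(dense(x,  [[0.5, -0.2], [0.3, 0.8]], [0.0, 0.1]))
h2 = relu(dense(h1, [[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]))
y = dense(h2, [[1.0, 1.0]], [0.0])
```

Each hidden layer re-represents its input, which is exactly the indirection that lets the network learn patterns a bare perceptron cannot.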

This bottom diagram is simply trying to illustrate a deep convolutional neural network: an input, a set of convolutional layers, two hidden layers, and then finally an output layer. This may seem like a theoretical approach, but what we’re looking at right here is actually the state of the art for image recognition as of about 2014. This is called the VGG network, and it’s quite literally nothing more than convolutions: an input image, a set of convolutions, this red is a max pool, which is a step to squash the data down. Two more sets of convolutions, max pool, three sets of convolutions, a max pool, three sets of convolutions, a max pool, and then finally these blue nodes are literally a 4096-node hidden layer, a second layer of 4096 nodes, and then 1000 output nodes to map to the results we’re looking for.

ImageNet is a much larger data set - it’s about a million images or so - but it’s a really good practical way to test these convolutional neural networks at scale. ResNet is a paper that came out about a year later, and it’s a really important milestone in the field. The basic problem with the convolutional approach is that the more layers we put between the input and the output, the harder it becomes for the network to learn, and by extension, any noise in the process quickly creates limitations for that approach.

So ResNets are really a very simple thing: we just add this skip connection, an extra connection in our network that skips around a set of nodes. This is literally done at the programmatic level by simply adding our input back to the output of the block, but this simple little trick allows you to build larger and larger networks. It’s been proven to scale to thousands of layers. You’ll see on some of our later slides that it’s a really valuable paper to know.
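The skip connection really is as simple as it sounds. A sketch in plain Python, where `layer` stands in for any learned transformation:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def residual_block(x, layer):
    """A residual block: transform the input, then add the input back.
    `layer` is any function mapping a vector to a same-sized vector."""
    return [fx + xi for fx, xi in zip(relu(layer(x)), x)]

# Even if the layer contributes nothing (all zeros), the block passes
# the input through unchanged - signal and gradients can always flow
# through the skip path, which is what makes very deep stacks trainable.
identity_ish = residual_block([1.0, -2.0], lambda v: [0.0 for _ in v])
```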

Recurrent neural networks

Convolutions, we can say, go from an input image to a result. This works great for visual data, but many other problems in the real world need to be reasoned about over time. So recurrent neural networks are the other really important area of research you need to be aware of. This is an older idea, from a 1986 paper. Whenever people put this on a slide, they usually use the rolled form over here on our left: we have a set of nodes, and we have this loopback node. But I don’t really like that way of visualizing it, because it’s a little too mind-bendy for me. This unrolled form is a better way to think about it, I think. We have inputs for phase one, phase two, phase three, phase four: we map the initial input to a node and then to an output, then the second step takes some of that state into its output, the third takes some of the second’s, and so on and so forth.
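The unrolled form can be written as a simple loop: each step mixes the current input with the hidden state carried over from the previous step. A scalar sketch in plain Python, with made-up weights:

```python
import math

def rnn(inputs, w_in=0.5, w_rec=0.9, bias=0.0):
    """Unrolled simple RNN over a scalar sequence."""
    h = 0.0
    outputs = []
    for x in inputs:
        # New hidden state: current input plus the carried-over state
        h = math.tanh(w_in * x + w_rec * h + bias)
        outputs.append(h)
    return outputs

# A single pulse at the start: watch its influence decay step by step
outs = rnn([1.0, 0.0, 0.0, 0.0])
```

That decay is exactly the weakness discussed next: information from early steps fades, which is what LSTMs were invented to fix.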

This is a really powerful approach, because we can start to model time-series data and, crucially, language, as we’ll see. The basic problem with this recurrent approach is that it’s a little bit too simple - it can capture recursive forms and things like that, but it has a real tendency to get stuck repeating itself. So the next important paper in this field is what’s called long short-term memory, or the LSTM module. This is from Hochreiter and Schmidhuber in 1997. On the left, you’re seeing the architecture and how the actual math works underneath the hood. The LSTM adds this cell state, an internal history-tracking mechanism, and then the module itself can remember things or choose to forget things, assuming that another node down the line will remember them.
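To make the gating concrete, here is a single LSTM step with scalar state in plain Python. This is a heavily simplified sketch - real LSTMs use vectors and learned weight matrices, and the parameter names and values here are made up for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, p):
    """One LSTM step with scalar state. Each gate looks at the
    input x and the previous hidden state h."""
    f = sigmoid(p["wf"] * x + p["uf"] * h + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h + p["bg"])  # candidate memory
    c = f * c + i * g        # keep some old memory, write some new
    h = o * math.tanh(c)     # expose a gated view of the memory
    return h, c

# Arbitrary shared weights, just to run the cell over a short sequence
params = {k: 0.5 for k in ["wf", "uf", "bf", "wi", "ui", "bi",
                           "wo", "uo", "bo", "wg", "ug", "bg"]}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
```

The separate cell state `c` is the point: the gates decide what to keep and what to overwrite, so memories can survive many steps instead of fading the way they do in the plain RNN.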

So this dramatically improves the basic recurrent approach and allows you to start modeling long-range dependencies, which is to say things like text - though LSTMs have been used in other domains too, such as video. Then there’s seq2seq, a paper from 2014 that used LSTM modules to translate language. The key idea of seq2seq is that we augment our input data in order to significantly improve the results of our output network. We take the ABC in one language and pair it with the WXYZ in another language as a tied input, we map these two things together, and then we reward the network for producing the correct output.

So conceptually we might take a sentence in one language, “I am a student,” and then we compare it with its pair in another language. So the Spanish form, “yo soy estudiante,” but then combining these two together, we tell the network to learn that whenever it sees, “I am a student,” it’s supposed to output “yo soy estudiante.”

What’s interesting is that once we expose this network to lots of data and many samples, you can give it a new example, like “I am a teacher,” and the network will begin to correctly predict what the result should look like, even though it hasn’t really seen it before. The key idea is that the network is starting to learn long-range dependencies from the data it has seen, which brings us to the transformer model, which has become popular in the last couple of years.

Transformers came out in 2017, and conceptually, you might think of it as a larger, much more complicated version of our LSTM module from before. But basically, this approach has really taken the world by storm, so to speak. It’s proven itself in a whole bunch of different domains, as we’ll see here in a second. So what we’re seeing here on the left is just trying to illustrate how the transformer architecture looks, and how it’s modeled. These are the bits and pieces. The key concept right there that you might stare at is this Nx right here.

Basically, one of the key ideas of transformers is that instead of stepping through the sequence one item at a time the way an LSTM does, we can process things in parallel - and that Nx means we stack a whole collection of these blocks on top of each other. Over here on the right, we’re getting a little bit ahead of ourselves, but I’m just trying to show you that transformers perform better than LSTMs on larger and larger data sets, and even for the same amount of compute, they’re outperforming these LSTM modules. That’s part of why they’re a really important step forward for this field.
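The core operation inside those stacked blocks is attention. Here is a minimal scaled dot-product attention in plain Python - a single head with tiny made-up matrices and no learned projections, just to show that every query looks at every key simultaneously rather than stepping through the sequence:

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors."""
    d = len(K[0])
    result = []
    for q in Q:
        # Similarity of this query to each key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)
        # Weighted mix of the value vectors
        result.append([sum(w * v[i] for w, v in zip(weights, V))
                       for i in range(len(V[0]))])
    return result

# One query that matches the first key more than the second,
# so the output leans toward the first value vector
out = attention(Q=[[1.0, 0.0]],
                K=[[1.0, 0.0], [0.0, 1.0]],
                V=[[10.0, 0.0], [0.0, 10.0]])
```

Because each query is independent of the others, all of them can be computed at once - which is what makes the architecture so parallel-friendly on GPUs and TPUs.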

Hybrid approaches

MDETR is a paper that came out pretty recently from Facebook. I don’t know that you necessarily need to know it, per se, but I threw it up here because I think it’s really important to understand that a lot of modern networks use these different pieces together. So at the top, we have a cat picture. We run it through a convolutional neural network and take its outputs. At the bottom, we take a sentence describing the picture - “a cat with white paws jumps over a fence in front of the yellow tree” - and run it through RoBERTa, which is a variant of a transformer network. We take the outputs from that network, concatenate them - literally just squash the results together - and then we run a transformer on top of that.

But what’s really interesting is that we get a full-blown logical model. Literally, the network can say, “Oh, the cat part of the sentence refers to this - this cat that’s jumping. The yellow tree is this piece, the white paws…” it can literally pick out the individual paws of the cat, and then it can also note what else is not in the picture, so to speak. So this gives us a really powerful way of combining convolutional approaches with natural language approaches to produce an object detection network that can really start to almost reason about a picture, we might say.

Generative adversarial networks

Then, if we can think about building a larger network by combining pieces of networks, we might also make the next logical jump and think about having multiple networks working together to solve a problem. So GANs, or generative adversarial networks as they’re called, are an idea from 2014 or so. The key idea is that you have two networks competing against each other in a way that ends up producing a common result.

What we’re seeing here is what’s called StyleGAN. This is a paper from 2018 or so where the computer is literally hallucinating these faces - this is all just what the computer thinks we would think a face looks like. As you can see, these results have started to become really realistic looking, and so a lot of people are really interested in this area right now.

A full breakdown of how GANs work is maybe a bit beyond us, but the key idea is that we have two networks working together. The first network is training itself to recognize pictures, and the second network is training itself to fool the network that’s recognizing pictures. That’s what we’re looking at here: the training loss curves for a GAN as it works. The key moment is this little spot down here where I put a blue star, where the loss curves cross over each other.

What happens is that the first network learns, and eventually it becomes stable and starts generating consistent results. At that point, the second network can leapfrog it and start to learn on top. Then as the second network improves, the first network has to improve as well. Together, these two networks can reach a much better understanding of the problem than either would be able to on its own. That’s a really interesting concept, and one we’ll see again as we go forward from here.
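The alternating-update dynamic can be sketched with a toy 1-D example. This is not a real GAN loss - just an illustration of two learners leapfrogging each other, with all names and constants made up:

```python
import random

random.seed(0)

# Toy 1-D "GAN": real data clusters around 3.0. The "discriminator"
# tracks where real samples live; the "generator" chases the region
# the discriminator currently treats as real. This is only a sketch
# of the alternating update, not the actual adversarial losses.
real_mean = 3.0
disc_estimate = 0.0   # discriminator's belief of where real data lives
gen_value = -2.0      # generator's current output

for step in range(200):
    # Discriminator step: fit the real samples
    real_sample = real_mean + random.gauss(0, 0.1)
    disc_estimate += 0.1 * (real_sample - disc_estimate)
    # Generator step: move output toward what the discriminator accepts
    gen_value += 0.1 * (disc_estimate - gen_value)
```

Even in this toy, the structure is the same: each side improves against the other’s current state, and the generator’s output ends up indistinguishable (here, numerically close) to the real data.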

Reinforcement learning

Reinforcement learning is a whole large field in and of itself, separate from neural networks. But much like we saw with machine learning being overtaken by deep learning methods, as of a few years ago these neural network approaches have become really commonplace in this field and are increasingly where the new stuff is happening.

This is from a paper from 2015 illustrating what’s called a DQN, a Deep Q-Network. It takes an input image, or a set of images - here, a screen capture of an Atari video game - runs it through some convolutions and a set of fully connected layers like before, and then finally maps the result to actions, such as what a joystick should do. It’s a fairly basic-looking network if you look at it, but after a bit of training it can play a number of different classic Atari games better than a human.
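The “Q” in DQN comes from Q-learning. Here is the tabular version on a toy corridor world - the DQN paper replaces this lookup table with a convolutional network over pixels, but the reward-plus-discounted-best-next-value update is the same idea:

```python
import random

random.seed(1)

# Tabular Q-learning on a 5-cell corridor: start in cell 0, reward in cell 4.
n_states = 5
actions = [-1, +1]                      # move left or move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the table, sometimes explore
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q[(s, b)])
        s2 = min(max(s + a, 0), n_states - 1)
        reward = 1.0 if s2 == n_states - 1 else 0.0
        # Bellman update toward reward + discounted best next value
        Q[(s, a)] += alpha * (reward
                              + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        s = s2

# The learned policy should be "always move right"
policy = [max(actions, key=lambda b: Q[(s, b)]) for s in range(n_states - 1)]
```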

So this is an interesting area where the DeepMind team in particular has really been working to tackle these sorts of video games on the fly - to create agents that can see something and then reason out what the best thing to do next is.

This brings us to the AlphaGo family of engines. I’ve done a couple of talks on this, so if you’re interested in this particular subject, you can look those up. There is the AlphaGo engine itself, and then there are a few different variants, but the AlphaZero one in particular is really interesting as a milestone for reinforcement learning. If you look at the left, you’ll see how the actual AlphaZero network was put together. It’s interesting: it’s 40 of these residual layers - basically the same concept from the ResNet we were looking at before - combined with sets of 3x3 convolutions and batch norm, the same building blocks as our VGG network and our ResNet, and then custom output logic at the very top. Other than that, it’s just a very large network. Conceptually, there is not really a ton of magic here, so to speak. Or rather, the magic is in its size, not so much the methodology.

Over on our right, we have the training curves for AlphaZero as it learned to play by playing itself over time. The purple line is the traditional approach: the original AlphaGo Master engine, which learns from human games. It actually does really well at the start - it very quickly improves from the games it’s shown and is able to play at a high level. But by learning from human games, it is ultimately limited by human games. The dark blue line illustrates how the AlphaZero engine trains over time.

As you can see, AlphaZero starts out really poorly. It plays the game terribly for a long time compared to the other engine. But somewhere past 20 hours, it teaches itself how to play and improves to a level similar to our AlphaGo Master engine. After that, it’s doing its own thing: it steadily improves until it’s beyond where any human player has ever been, and by extension, it’s the greatest Go bot of all time. I think this is just a really interesting milestone for reinforcement learning and AI in general.

This is from the AlphaStar paper, which came out in 2019. They took the same concept of reinforcement learning and applied it to the domain of StarCraft - a slightly different and significantly more modern video game. What we’re seeing on the left is how the actual neural network for this thing is put together. It looks really scary, but I don’t think it has to be. I think of it as a really complicated set of inputs - we’re taking a bunch of different pieces of the StarCraft game and mapping them together - but the core is just an LSTM module, the exact same stuff we were looking at before. Then all the stuff on top is kind of a fancy output layer, just to give the model the ability to control units within the video game.

In the upper right corner, we’re seeing the progression of the different reinforcement agents as they played over time. The agents learn a strategy, and once they learn it, they quickly start doing it all the time. Then after a while, the other agents learn a counter-strategy. Through these strategies and counter-strategies, and so on and so forth, we can see over time that the group as a whole steadily improves until, at the end, it was able to play against human players and be competitive.

We might think of agent 600 over here on the bottom right as making moves that counter the attacks the very first generation came up with. It has never even seen that original plan, but it knows how to battle it, without ever having encountered it in actual combat.

This is AlphaFold2, another paper from DeepMind last year. Some of this is conjecture - it’s not entirely clear how they did it - but they were able to significantly advance the state of protein modeling using these neural network approaches. The core of this network is a transformer model as well, using some form of axial attention built around the protein MSAs (multiple sequence alignments). These methodologies are not just for games; we’re increasingly seeing them applied to real-world problems. It’s interesting to see these techniques migrate out into the broader world, and it will be interesting to see how many other fields can get massive improvements in our understanding of them by applying these neural techniques.

Scaling hypothesis

I thought we could switch to talking about building things larger and larger. This GPT-3 model came out of OpenAI last year. Basically, they took the approach from GPT-2, which was a transformer-based model - this one is fourth from the right down here at the bottom - and made it significantly larger. I think this chart is really just illustrating how much larger GPT-3 was than its various predecessors, and by extension why it was able to capture much more from the data it was shown and build a larger, much more complicated and powerful model.

There’s this essay on the internet called The Bitter Lesson, by Rich Sutton (h/t gwern). Basically, he says that the field of machine learning over the last few decades has seen really smart people try to come up with more and more clever ways of tackling different problems, adding more and more domain expertise. But at the end of the day, the approaches that have proven themselves haven’t been the complicated ones, but rather the simple ones that scale with more compute and data.

I like this guy on the bottom, he’s saying, “STACK MORE LAYERS.” It illustrates this mentality of let’s not be smart, so to speak, but rather let’s just see if we can make things larger and larger and we’ll see what happens as a result. If we can get better and better results by building things larger, then I think the question is why aren’t we trying to build things larger? Why try to be smarter than the machine, if this is the best way for them to learn problems at the end of the day?

Scaling Laws for Neural Language Models is an interesting paper from Jared Kaplan and collaborators, which illustrates that across all of these different domains, as we get more compute power - bigger and bigger computers - and more and more data, we can build larger and larger models. So if we have a larger computer, more data, and a larger model, we see this smooth power-law scaling: we just get better and better results as we continue to scale these things up.
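The shape of those curves is a power law: loss falls by a constant fraction for each constant multiple of compute, which is why it looks like a straight line on a log-log plot. A tiny illustration - the constants here are invented for the example, not Kaplan’s fitted values:

```python
def loss(compute, c0=1.0, alpha=0.05):
    """Power-law scaling: L(C) = (c0 / C) ** alpha.
    Made-up constants, purely to show the shape of the trend."""
    return (c0 / compute) ** alpha

# Each 10x increase in compute shaves off the same *fraction* of loss
ratios = [loss(10 ** (k + 1)) / loss(10 ** k) for k in range(4)]
```

The practical takeaway is the same one the paper draws: if the trend holds, you can predict roughly how much better a model gets before you spend the compute to train it.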

Take a look at the slide on the left, because there is a later version of the same thing. The key point down at the bottom is the axis running out to 10^4: even with orders of magnitude more compute power, we’re seeing these trends still hold, and by extension the loss on this language model continues to improve even further. Here is a slide from OpenAI that I think pulls together everything I’ve been talking about today. If we go to the bottom left, we have the Perceptron way back in 1959. As we see, over time compute basically continued to get faster and faster, and as a result, we were able to build larger and more powerful approaches.

Then 2012 is a reasonably good dividing line for the modern era. We can see that computers have grown drastically, we’ve been able to build these approaches more and more powerfully, and we can utilize them in much more interesting ways. We have AlexNet here, the paper that put deep learning on the map with the ImageNet competition of that year. There’s the VGG model we were talking about before. Our ResNet comes shortly thereafter. The Neural Machine Translation point is closely related to the seq2seq paper I showed you before, and the Dota work is similar to the AlphaStar paper. On the top we have AlphaGoZero, so you can conceptualize how much computing power it required. If we added the AlphaFold2 paper to this chart, it would be a little red dot somewhere around the “o” of the “Zero” up there in AlphaGoZero, if that makes sense. And our DQN is down here below, somewhere around the year 2015.

So to me, the first clear trend out of all this is simply that compute power is going to need to continue scaling up in order to build these systems. Google has been at the forefront of this with its TPU machines. What we’re looking at here is a TPU v3 cluster: a large collection of customized AI processors - computing devices designed specifically for running these neural networks, or really the matrix multiplication math - all working together as a whole.

I think this is an interesting area in general because the same designs that we use up in the cloud can also be individually split out and run on devices in the field. So we get a really interesting feedback loop where we can do things in the cloud, and we can also easily distribute that same logic to devices in the field, and run it where it’s actually needed.

A key piece of how all this works isn’t so much the compute processor itself, but the RAM, we’ll say - the amount of memory and state that the processors can hold and work with. The flip side of this is bandwidth, the ability for all these nodes to talk to each other. One of the key capabilities of the TPUs is that they have a dedicated network for all of these processors to talk together, and by extension, to work together on larger and larger problems.

The field of high-performance computing, or supercomputing, in academia and government resisted these neural network approaches for a few years, but in the last year or two they have really started to get on board with rethinking how we build supercomputers, trying to make these computations run better and larger at scale. The key concept here is what we’ll call systolic design: making sure that the compute, the network, and the RAM are all working together in harmony, so that no one part starves the others.

More and more data: it’s clear to me that datasets are only going to get larger, and we’re going to have more and more of them. These image recognition models I’ve shown you usually only use 200x200, or slightly larger, samples to reason about the world. So whenever you start talking about 4K video at 60 frames per second, I think there is still a long way to go before we can run these things in the real world, on a device, in real time.

Our ability to gather data has only gotten better over the years, so we’re able to build larger and larger data sets, and these data sets are starting to scale up to gigabytes or even terabytes in size. There are various internal datasets at some of these large companies which are petabytes, and there are rumors of even larger ones out there. A key piece of all this is annotating your data in general. I still think this is an open problem - just finding better ways to annotate data - and one field that I think is poised to become really popular in the next year or two is semi-supervised learning.

This is where you label a little bit of data, have the computer use those few labels to try to make sense of a larger unlabeled set, and then, by iterating between the two, build up larger and larger data sets without having to manually look at every single picture, which doesn’t scale. The picture on the right is from a large dataset called The Pile, which a group of volunteers put together. It’s nearly a terabyte in size, and their hope is that they’ll be able to train a GPT-3-style model using just this.

R&D in general is really interesting to me - our whole theoretical approach to all of this. The basis of all the neural network methods I’ve shown you today is a set of algorithms around what’s called automatic differentiation, or more specifically, the backpropagation algorithm. That is to say, everything I’ve shown you today is essentially very fancy matrix math done at larger and larger scales. But as we make these matrix multiplications larger and larger, we start to hit some fundamental limitations on how fast all the nodes can communicate with each other. That’s where a lot of the interesting research is going on right now.
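At its heart, backpropagation is just the chain rule applied backwards through the computation. A minimal hand-worked example for a one-weight “network,” with the derivatives written out explicitly:

```python
def forward_backward(w, b, x, t):
    """Forward and backward pass for y = w*x + b, loss = (y - t)^2.
    Reverse-mode autodiff applies the chain rule from the loss
    backwards through each intermediate value."""
    y = w * x + b                 # forward pass
    loss = (y - t) ** 2
    dloss_dy = 2 * (y - t)        # chain rule, working backwards
    dloss_dw = dloss_dy * x       # dy/dw = x
    dloss_db = dloss_dy * 1.0     # dy/db = 1
    return loss, dloss_dw, dloss_db

# One gradient-descent step shrinks the loss
w, b, x, t = 0.0, 0.0, 2.0, 1.0
loss0, gw, gb = forward_backward(w, b, x, t)
w, b = w - 0.1 * gw, b - 0.1 * gb
loss1, _, _ = forward_backward(w, b, x, t)
```

Real frameworks do exactly this, but over graphs of millions of parameters - which is where the node-to-node communication limits mentioned above come from.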

The slide on the right is from a paper about predictive coding that came out last fall, where they’re trying to make the neural network learn to do its differentiation itself on a node-by-node basis, so you don’t actually have to have a global update step. That’s really interesting because something like this could significantly reduce the number of interconnections required, and by extension allow us to rethink the backpropagation approach that is the foundation of where things are right now.

There are a couple of really interesting software projects trying to rethink how all of this scales so that we can build larger and larger networks. The JAX project is from a team at Google. They are rebuilding the foundations, generating code to run on TPUs, and it’s extremely performant and efficient. They’ve had a number of interesting successes in the MLPerf competition - notably, last year they trained a ResNet network in about 30 seconds, which was pretty wild.

Then the DeepSpeed team works out of Microsoft. They work in conjunction with the OpenAI team there, and they have done some really interesting work on improving parallelization and reasoning about how to build larger networks without necessarily having to have a supercomputer on demand to run all of this. So I think that’s a really interesting project to keep your eyes on as well.

AI Winter

AI is not a new field, as I said. In the 1960s, it emerged and was extremely hyped. Then at some point, somewhere around 1970, came the “AI Winter.” The claims had gotten more and more outlandish, and it became clear that the field wasn’t going to be able to deliver on all of its promises, so a lot of funding disappeared, and things went quiet for years - literally decades. A number of people look at AI as being pretty hyped right now, so people are wondering if that’s going to happen again.

I can't speak for the future, but I don't think the situation is quite the same this time around. To me, the big reason things fell apart in the 1970s is simply that the commercial applications hadn't been figured out at all. Today we have this whole Bitcoin mania, or just crypto in general, and I don't see that going away any time soon. We have all of these commercial applications of neural networks: computer vision and natural language processing, like I've shown you, have drawn a lot of interest to this field because they solve real-world problems on a day-to-day basis.

I have not talked much about recommendation systems, but this is a really large area of real-world production AI: if you've ever clicked on an ad, somewhere there is a recommendation system processing your input. And like I said before, the concept of autonomous devices is not really new, but in the last decade or so we've finally gotten the computing power to actually run things on a device and start to do work in the field. I really think this opens up a whole world of interesting real-world applications to be tackled with these AI approaches.

None of these things is just going to go away. They are driving revenue at large companies, which is only going to mean more computing for these sorts of problems, more data, and ultimately more R&D in the future. So I don't think the AI Winter is going to happen again, or at least not the way it happened the last time around.


That leads us to this broad question of whether or not we can actually develop some sort of general AI, or AGI as it's sometimes called. A tricky question is, what is intelligence? We have this ability to reason and interact with our environment to solve new problems, but can we define that precisely? What is it about us that makes us different from everything else around us? And by extension, are we special in this universe?

There is a school of thought that if a computer can do it, then it's not really AI, it's just a fancy math problem. And there is a certain truth to that: a super-specialized neural network that plays Go at a high level isn't the same as not tripping over your shoelaces whenever you walk down the street. But I would build on what I was getting at with AlphaStar earlier: you don't really think about how to convert the air you're breathing into energy. There are all these sorts of processes inside of us that have already been figured out, and you don't think about any of them. So we may think of ourselves as generalized, but that generalization really sits on top of a whole large pile of specialized subroutines; that's the process of life, we'll say. So as we get more and more of these pieces and build these systems larger and larger, I think it's entirely possible that we may find a certain amount of computing density is simply the key to actual reasoning.

All of the problems I've shown you today are closed problems: we have a mathematical objective function, and we're solving a math problem to minimize a loss or to win a game. But in the real world we have an open-world problem; we're just trying to interact with the environment and, by extension, maximize our results. So we may be thinking about the problem all wrong, and it may ultimately be something much more complicated than a pure math problem.
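To make "closed problem" concrete, here is a toy sketch of what training actually does: a fixed objective function, minimized step by step with gradient descent. Everything here is illustrative; real training is the same loop with millions of weights.

```python
# Toy closed problem: find the w that minimizes loss(w) = (w - 3)^2.
# The objective is fully specified up front -- that's what makes it "closed".

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)   # derivative of the loss with respect to w

w = 0.0
for _ in range(100):
    w -= 0.1 * grad(w)       # gradient-descent update step

print(round(w, 4))           # converges toward the minimum at w = 3
```

An open-world problem has no such fixed formula to descend on; the "objective" shifts as the environment responds to you.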

Having said that, I found an article on the internet from a group of people attempting to project when AGI might be possible; I'll link it at the end. They explore the question, "If these are the limits of intelligence, then when could we potentially build a computer that can do something similar?" At the end of the day, they're mixing a bunch of Gaussian models together to arrive at the projection, so the cumulative probability is guaranteed to converge, and it crosses 50% at some point in time.
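As a sketch of what that kind of mixed-Gaussian timeline forecast looks like: treat each expert's guess as a normal distribution over years, then blend the cumulative distributions with weights. The dates, spreads, and weights below are made up for illustration, not the article's numbers; the point is only that the blended probability necessarily climbs toward 1 and passes 50% somewhere along the way.

```python
import math

def normal_cdf(x, mu, sigma):
    # P(X <= x) for a normal distribution, via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical expert forecasts: (mean year, std dev in years, weight)
forecasts = [(2040, 10, 0.5), (2060, 20, 0.3), (2100, 30, 0.2)]

def p_agi_by(year):
    # Weighted mixture of the individual cumulative distributions
    return sum(w * normal_cdf(year, mu, s) for mu, s, w in forecasts)

for year in (2030, 2050, 2100, 2200):
    print(year, round(p_agi_by(year), 3))
```

Because each component CDF rises monotonically from 0 to 1 and the weights sum to 1, the mixture is guaranteed to converge to certainty eventually, which is exactly why such a projection always yields a 50% date.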

I think sometime in the next century we could know whether or not these approaches are going to scale to solve this problem. It's just a really interesting one to explore in the upcoming decade. To me, the next 5-10 years are really interesting because they're going to show where we are on this curve, and whether we're being too optimistic or there actually could be something there.


With that, I have to say that the future is you. None of the modern techniques I've shown you were possible 10 years ago, and by extension, nobody can really say what's going to be possible in this field 10 years from now. You may think that everybody has a gigantic advantage over you and that you can't possibly contribute, but I would say the reverse: there has literally never been a better time to get started with these techniques. Knowledge that once took years to learn can now be picked up in months, weeks, or sometimes days.

There are whole sets of tutorials and a gigantic community of people online who are interested in this stuff, so if you want to get started, you can very much jump in and get going. The best route is to go through the Fast.AI courses; Jeremy has done a lot of work to simplify these things and make them as approachable as possible. Get yourself a GPU if you can and start doing stuff on your own device, so you won't have to worry about cloud credits or any of that. Then try to find a problem, something that's interesting to you, and see if you can bring these techniques to bear on it; by extension, you can become the world's domain expert in that tiny little niche.

There is a group of people on the internet who have a Discord server you can join, and they host a number of interesting discussions. If you're looking for a source of new ideas, I would highly suggest you join them. With that, I will say thank you all for listening.


Vusal: All right, great. That was actually an amazing presentation. Did you make that from scratch?

Brett: I did one last year that I stole a little bit of the content from, but I updated it with some new slides and some things I’ve been thinking about recently. They’ve been rattling around in my head, and this gave me an excuse to gather them together.

Berkay: Could you please share the slides, if you can, because there was a request from the audience.

Brett: Yeah, I have a website, and I will post my slides there right after this.

Vusal: Okay, I’m sharing Brett’s site in the chat right now. Make sure to visit that. Any questions, feel free to ask in the chat.

Berkay: I see a question about cryptocurrency.

Brett: (Can we use AI to predict Bitcoin prices?) Hmm, probably. But when I see people trying to tackle financial problems with AI, I would go back to traditional methods first. There is a whole world of people who do stock trading with well-understood machine learning and the like, and I would start there before throwing deep neural networks at something like that. Learn the existing theory before trying anything new, I would say.

Vusal: Whenever I've asked that of AI experts before, without even letting me complete the question, they tell me it's impossible.

Brett: There is a company called Jane Street, and they do a lot of really interesting stuff with algorithmic trading, so you might want to research them and some of the things they do.

Vusal: Sure! We'll link that in the Discord, and I'll try to share that afterward. I have one more question myself regarding AI. As you know, scrolling through Instagram and the memes there about AI, they always say that everything standing behind AI is actually just math. So how accurate is that? Does AI really consist of just math?

Brett: The foundation is definitely math, but perhaps the mathematicians try to take credit for everything; I think there is a lot to be said for the computing on top. Then again, the computer people try to take credit for things too. Statistics is the key thing; it's not so much the pure math as understanding probabilities and reasoning about them. One of my personal frustrations is that sometimes you'll read a paper with a pile of math, then you look at the actual code, and it's just thing A plus thing B plus thing C, multiplied together. Oftentimes the code implementation is much clearer. That's something I often do: if I don't really understand a paper, I look for the code, and if they don't have code, that's a bad sign. Programmers often have a leg up in this field. People throw Greek symbols around; when you multiply a sequence of things together you get these capital-pi product functions that most people have never seen, but it's a formality that ends up scaring new people away.
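To illustrate that last point with made-up numbers: the capital-pi product notation that looks intimidating on paper is one line of very ordinary code.

```python
import math

values = [2, 3, 4]

# The Greek capital pi just means "multiply these together":
#   prod_i x_i  ==  x_0 * x_1 * x_2 * ...
product = math.prod(values)
print(product)  # 24

# Likewise the capital sigma (summation) is just a sum:
#   sum_i x_i
total = sum(values)
print(total)    # 9
```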

Vusal: Alright. Any other questions from the audience? We'll be covering certificates afterward, so don't worry about that. But if you have any specific questions regarding the AI section, feel free to ask. I liked the history of it, it was great. Thanks so much for taking part in this event.

Brett: Thank you all for having me, have a good day!