Stanford CS330: Multi-Task and Meta-Learning, 2019 | Lecture 10 – Jeff Clune (Uber AI Labs)

So let's get started. It is my pleasure to introduce Jeff Clune. Jeff will be giving a guest lecture in the course. Jeff is a Senior Research Manager at Uber AI Labs, which was formed after Uber acquired a startup that he helped lead. Before that, Jeff was the Loy and Edith Harris Associate Professor in Computer Science at the University of Wyoming. One of the things that I think has been really exciting about Jeff's work is that it has spanned multiple different subfields of machine learning and artificial intelligence, ranging from deep learning to evolutionary methods and evolving neural networks, as well as robotics. So let's welcome Jeff. [APPLAUSE] Hello everyone. Thank you very, very much to Chelsea for the invitation. It's an honor to be here and I'm excited to speak with you today. So, today I'm going to think really long-term about how we might achieve our most ambitious objectives as an AI research community. But before I begin doing that, I wanted to tell you about a little bit of work that my team and my collaborators and I have done that I won't have time to talk about, in case you're interested in looking it up or in case we have the opportunity to chat one on one some time. The first thing I wanted to mention is that we've done some work on automated ecological understanding. This is a nice opportunity for people who work in the deep learning community to use our skills to help the world, to make the world a much better place. Biologists really want to understand animals and their ecosystems: what they're doing, how many of them there are, etc., to manage endangered populations, combat poaching, and just to generally understand ecosystems. And if you think about it, motion-sensor cameras plus deep learning are a perfect marriage to help wildlife biologists do that. We were able to show in this paper that it works quite well, so I'm excited to see the future of that. My collaborators and I have also done about a six-paper arc in what we call AI Neuroscience. This is trying to understand how much deep neural nets understand about the images that they classify, including a lot of papers which you might find interesting. And then the final thing I'll mention is we've done a lot of work in reinforcement learning, particularly on the question of how to do intelligent exploration. We recently put out the Go-Explore algorithm, which solves Montezuma's Revenge and also Pitfall, on which no prior algorithm had scored greater than zero. And I think it's interesting and important to figure out how we could make these RL algorithms explore like humans do, so we can make them much more sample efficient. So you can check that out if you're interested. Okay. So here's the main subject of today's talk. The first bit is going to be wildly speculative. I hope you don't mind. But we're going to have some fun thinking about how we might get to the really, really far-off goals that we've set for ourselves as a research community. And I want to warn you that, because it's so ambitious and so big-picture, it's not grounded yet in a lot of experimental evidence; it's just a discussion.
So even if you completely disagree with everything that I say in that phase of the talk, it's also going to provide motivation and context for a lot of the techniques I am going to tell you about in detail, which, even if you disagree with the wild speculation in the beginning, I still think you'll find interesting from a meta-learning perspective. And then we'll go through those techniques in order. So let's begin with the big-picture stuff. I think we are all, in the back of our heads, interested in trying to produce artificial general intelligence, or you might call it human-level AI. Or at least you're wondering whether or not that's possible. And I think this is obviously the most ambitious quest in scientific history. It'll change everything: every aspect of our economy and culture, it'll revolutionize science, etc. So the question is how will we get all the way there? Not how we make incremental progress this year and next year, but how might we really achieve this thing in the back of our community's head that we want to accomplish. And I think that if you take a step back and ask what the traditional machine learning community is doing — you go to NeurIPS, you go to ICLR, you go to ICML, what are we doing as a community? — I think that we're implicitly committed to what I call the manual path to AI, and nobody ever talks about this; it's kind of like the fish that don't see the water. I think the community is effectively saying that what we're trying to do is identify the key building blocks of AI. So if you look at a paper in any given conference, what does it look like? Well, you say, oh, I think that maybe we need this building block here. It doesn't exist yet, so I'm going to propose that we add it. Or I'm going to take this existing building block, like a highway network, I'm going to replace it with a ResNet, and I'm going to show that that works slightly better. So we're finding all the pieces or improving the pieces. And that raises some interesting questions: how many building blocks would be required to build an actual, really complicated, powerful thinking machine? Are there hundreds? Are there thousands? And can we find them all one by one as a community, manually? I think that's an interesting question to consider. But even if you think that we can find all those building blocks, we're implicitly committed to some Phase 2, where we are eventually going to have to put all these building blocks together into some giant, complicated, Rube Goldbergian thinking machine, which is something I think we should say explicitly, because if this is the path we're committed to, we should stare at it clear-eyed and know how daunting a challenge it is. And I just want to be clear that I think this is a really Herculean task. You're talking about hundreds or thousands of nonlinearly interacting, complicated parts, each of which took a PhD or at least a paper to get right to begin with, and now they're all interacting. So how would you debug that system? And if it doesn't work, how are you going to fix it? I'm not saying that it's impossible, but I think it's really, really difficult and we should know that.
I also think that it doesn't really fit our scientific culture very well, because we typically have each of you working in a small team on a paper every couple of years — or in Chelsea's case, you know, 16 papers a year or more, I've lost count. But either way, what we don't tend to do is stop and have an entire CERN-like effort or Apollo-program-like effort to put all these pieces together. And that might be what's required to get all of those pieces I had on the last slide together into one functioning, working machine. So I think that if you look at the overall trend in machine learning recently, there's a clear trend, and that is that hand-designed pipelines give way to learned solutions over time, once we have sufficient data and compute. We've seen this over and over and over again. When we first try to solve a problem, we typically try to hand-code the whole thing. That doesn't work, so we say, I'm going to hand-code part of the pipeline and then sprinkle some machine learning in. And eventually, once we have enough data and compute, we realize we should have learned the whole thing from the beginning. This has happened with features, such as HOG and SIFT giving way to deep learning. It's happened with architectures: the best architectures now on CIFAR and ImageNet are learned — they've been searched for automatically, not designed by humans. We're seeing this with hyperparameters and data augmentation, and, increasingly a focus of this class, we're seeing that hand-designed learning algorithms are giving way to learned learning algorithms. This trend suggests an alternate path to our most ambitious goals as a research community, and that is what I call AI-Generating Algorithms. The idea here is that we would learn as much as possible. This is an all-in bet on learning. The idea is that we have one algorithm that starts off simple, with simple origins, and it bootstraps itself up from simplicity to ultimately being extremely intelligent — a highly capable AI. It's going to do this by having an expensive outer loop, just like meta-learning; that is, it requires a lot of compute, but in the inner loop what it does is produce a very sample-efficient learner, which should be familiar to this class. And we have an existence proof that this can work, which is Earth, right? The very dumb, computationally inefficient algorithm of Darwinian evolution produced you, and all of you are extremely sample-efficient learners inside the inner loop, which is your lifetime. So the question is, can we make such an algorithm? And I think that if we want to pull this off, then we need progress on three pillars. I call these the three pillars of AI-GAs. The first one is that we need to meta-learn the architectures — the neural net architectures, for example. The second one is that we need to meta-learn the learning algorithms themselves, which is the focus of this class. And the third one, which is not talked about very much and is the least researched and least understood, is automatically generating effective learning environments. And what I want to point out is that hand-crafting each of these things — if you had to design your architecture, the algorithm, and the environments — is very, very slow, and it's limited by our own intelligence.
In contrast, it's better to learn all of these simultaneously and let ML and compute do the heavy lifting. So in today's talk, I'm going to talk about work that we've been doing in each one of these pillars in turn. Quickly, I want to mention a little bit more about AI-GAs and then I'll get back to the meta-learning algorithms themselves. One thing I do want to point out is that AI-GAs, like any search algorithm, are not a building-block-free approach. You still have to decide what your search space is and what your search operators are. But the hypothesis is that the AI-GA path has fewer building blocks that need to be identified versus the hundreds or thousands that exist in the manual path, and therefore it's going to be easier to find them and get them to work together. I also want to admit right out of the gate that AI-GAs are going to be way more computationally inefficient. They're going to require tremendous amounts of compute. But if you look back at the history of machine learning, I think that's okay, because computation speeds up exponentially, and some of the best algorithms were built long before we had the compute to take advantage of them — deep learning, for example — and the early research allowed us to pounce once the computation was available. I also want to point out this top bullet, which is probably where the most important research will happen, and that is efficient abstractions of whatever it was that produced the miracle that happened on Earth; those will help us shave orders of magnitude off of the planet-sized computer that was required to produce us. So that's where the real interest is. So, are AI-GAs a faster path to AI than the manual path? I actually think that it's very debatable, but I ultimately have concluded, after thinking about this, that I do think AI-GAs are a faster path to AI — but I have high uncertainty. I do recognize that either one could win, but in terms of the arguments for why, I think it's in line with this trend in machine learning, I think it scales really well, and it doesn't require human genius. So I borrowed this slide from Pieter Abbeel — typically he only has these two pie charts and I've added this one over here — and the idea is that we don't have to identify and combine all these building blocks. I think one thing that's interesting, especially for an audience like this, which is mostly PhD students, is that if you want to think about where you want to spend your time, I would argue that it's really interesting to spend it on things like AI-GAs and meta-learning. It's kind of like going back in time and asking yourself, 15 years ago, would you rather work on HOG and SIFT? They were the dominant technique; they looked way better than neural nets. But ultimately, the learned solution took off, surpassed them, and has been much more general and much more powerful. So if I could go back in time, I'd want to work on neural nets. Actually, I was working on neural nets around then, so that's already happened. But you would too, if you had a time machine. Okay. So I do think the community should reallocate more effort, because even if I'm right that there's some probability mass on AI-GAs, virtually all of the probability mass right now is on the manual approach, and so we should reallocate some of our effort to this approach. There's more discussion in this paper here; I don't have time to go deep into AI-GAs today.
That's not what this talk is about. But I discuss at length who's going to win, the manual path or the AI-GA path; I argue that AI-GAs are intrinsically interesting even if they're not the fastest path to AI; that we really have to be worried about safety and ethics concerns with them that are unique to this type of algorithm; and also that this should be considered its own Grand Challenge of science. You can check all that out if you want in the arXiv paper. But now what I want to do is start telling you about some meta-learning techniques that get us down this road a little bit. I'm going to systematically go through these three pillars. So let's start with the first one, which is meta-learning the architectures, also known as architecture search. This is a project called Generative Teaching Networks. It's under review right now at ICLR, so fingers crossed. It's with this fantastic group of collaborators, and I really want to single out Felipe here, who has definitely done all of the heavy lifting on this project. The idea in general with architecture search is that architectures matter a lot. If you look at ImageNet, for example, a lot of the gains have come from architectures, and therefore we should search for them instead of trying to manually design them. So how might we do that? Well, a really common approach — an idea that's at the core of all these NAS methods, which stands for neural architecture search — is that you train on real data for some short amount of time, that gets you an estimate of how good an architecture is, and then you do something with that estimate. You could do something really simple, like just do random search and take the thing that looked the best, or you could do something really fancy, like modeling the search space, etc. But the point is that all of them at their core tend to have this idea of training for a little while on real data, getting an estimate after a moderate number of SGD steps, and then you're off to the races. So the question is whether we could speed this up, and that's what we try to do in this project. Instead of training on real data for a moderate number of steps, we're going to train with a very, very small number of SGD steps and see if that can perform better. And we're going to use few-step accuracy — the accuracy after a few steps of SGD — as an estimate of the asymptotic performance of the network if I trained it for, say, 600 epochs with a lot of compute and data. So why might that work? Well, if you think about a typical training set, it might look like this — this is a t-SNE plot. You might have lots and lots of this kind of zero and lots and lots of this kind of zero, for example. It might be the case that you could get away with just a few of these samples and a few of these samples. If you intelligently sampled these data points, then you could do better than just training on all of the dataset, because there might be redundancies. And there's been work that's basically shown that this does in fact work: by sub-selecting real data, you can do better. But if you think about how humans learn, we don't always learn a task by doing that task. I don't only learn basketball by playing basketball. In fact, sometimes I do drills, which do not exactly resemble the sport itself. For example, this basketball drill is really common: you dribble two balls at once, which never, ever happens in a basketball game.
But you can still learn that way. Additionally, you can watch videos of somebody playing basketball and learn how to play that way, via observation. And you can also read a book about basketball, which is really crazy because it actually does teach you a lot about the game. I think that's really interesting, because it means that we can generate different types of data that might speed up our learning more than the real data — and it doesn't have to look like the real data. I also want to point out that, over time, teaching methods improve: you get better drills, better videos, and better books. So the question is, can we meta-learn to generate training data that allows us to rapidly learn a new task? And if so, that could help speed up neural architecture search. There has been work already on this which is really interesting. The paper that really blew my mind is this hypergradients paper from Maclaurin et al. in 2015. They treat the data as a hyperparameter of the algorithm, differentiate through all of the SGD steps down to the pixels of the images for MNIST, and it learns what data it should produce so that a learner that trains on that data does well on MNIST. They only learned 10 samples, and these are the 10 samples; as you can see, they look like kind of platonic digits. And then this paper here did that with 100 samples. So one of the things that we wondered is: instead of generating data pixel by pixel, which leaves a lot on the table in terms of learning regularities about the data themselves, about the search space, etc., our idea is that we're going to learn a generator to generate data to accelerate this process, and we call that a Generative Teaching Network. Here's the general method: a GTN generates data, and a new, never-seen-before learner neural net — a new architecture, new initialization — trains on the synthetic data produced by the GTN. Then we optimize the GTN with meta-learning to produce good data, such that that learner performs well on the target task after a very small number of SGD steps. So here it is in picture form. We have the inner loop. You generate a noise vector z, pass it into the generator, and it produces data — like a big batch of data. The learner iterates over a few steps of SGD, maybe 32, for example, and gets its final weights; we then evaluate on real data, such as MNIST data, see how well it does, differentiate back through that entire process — like MAML, for example — to get back to the original weights of the generator, and change the weights of the generator so it generates better data. Then we throw the learner out and repeat the process over and over again. Does that make sense? Okay. A couple of things to note. There are only a few SGD steps here, so you're implicitly incentivizing this generator to create data that enables rapid learning by the learner. And another thing is that we're also going to meta-learn the inner-loop hyperparameters — like the momentum in SGD, for example — and we're going to do that for both the controls and the GTN so that it's a fair playing field. All right. Our domains are going to be MNIST and CIFAR, because this is really computationally expensive.
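To make that loop concrete, here's a rough sketch of what GTN meta-training might look like. The sizes, the label conditioning, and the use of a plain linear learner are my simplifications for illustration, not the exact setup from the paper; the paper also meta-learns inner-loop hyperparameters such as the learning rate and momentum, which are fixed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z_dim, img_dim, n_classes = 64, 28 * 28, 10
inner_steps, batch, inner_lr = 32, 128, 0.02

# Generator ("teacher"): maps a noise code plus a desired label to a synthetic example.
generator = nn.Sequential(nn.Linear(z_dim + n_classes, 256), nn.ReLU(), nn.Linear(256, img_dim))
meta_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

def new_learner():
    # A fresh learner (here just a linear classifier) with a new random initialization.
    return [(torch.randn(img_dim, n_classes) * 0.01).requires_grad_(),
            torch.zeros(n_classes, requires_grad=True)]

for outer_iter in range(10_000):
    w = new_learner()
    for step in range(inner_steps):                # inner loop: train only on synthetic data
        z = torch.randn(batch, z_dim)
        y = torch.randint(0, n_classes, (batch,))
        x = generator(torch.cat([z, F.one_hot(y, n_classes).float()], dim=1))
        loss = F.cross_entropy(x @ w[0] + w[1], y)
        grads = torch.autograd.grad(loss, w, create_graph=True)   # keep graph for meta-gradients
        w = [p - inner_lr * g for p, g in zip(w, grads)]
    # Outer loop: evaluate the trained learner on real data and backpropagate through the
    # whole inner loop, MAML-style, into the generator.
    real_x, real_y = torch.randn(256, img_dim), torch.randint(0, n_classes, (256,))  # stand-in for a real MNIST batch
    meta_loss = F.cross_entropy(real_x @ w[0] + w[1], real_y)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```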
The first thing we ran into when we tried this is that it's very unstable. This is what would happen when we tried to train. Then Felipe came up with a really good idea, which was to take this idea from Salimans & Kingma, which is weight normalization: just reparameterize the weight vector w into a vector v, normalize it, and multiply by a scalar g (so w = g · v / ||v||), and then learn both the vector v and that scalar g. That dramatically improves the stability of meta-learning. To show you that: these are different hyperparameters of the algorithm without weight normalization and with weight normalization. Before we did this, we were having a lot of trouble — we spent a lot of time and compute trying to find hyperparameters that would make this thing work. And after we did this, it was trivial; everything just kind of worked. As a side note, we hypothesize that this might be a good trick for all meta-learning; we're going to try to do a whole paper just on that. But if your project's not working, I recommend you try this one-line code change — it might really, really help. Here's a look at the performance difference between weight norm and not: this was the unstable one, and this is the performance curve with weight normalization. So now that you see that, I just want to stop and say: it worked. We didn't know, when we started this, whether it would be possible to generate data, have a learner consume it, and then do well on real data when the learner has never seen real data. But it turns out that it does work. This is performance with the GTN; it gets about 97.5% on MNIST. So my question to you is, what do the samples look like? What do you think — do they look realistic or unrealistic? Unrealistic. Unrealistic. We have one brave soul who's willing to speak up. Here's what the samples look like. I would mostly agree with you. I mean, you can tell that they're digits; you can tell this is MNIST. But they look pretty alien and weird, which I think is quite surprising. The interesting thing is that you can train a network on these digits and then it does just fine on real MNIST digits, which is kind of crazy. Some of them look pretty recognizable, like this three here, and some look totally alien, like this four there. And the idea that unrecognizable images can meaningfully affect neural networks is reminiscent of a finding from our 2015 paper, "Deep Neural Nets Are Easily Fooled," which I think is quite interesting. We have lots of hypotheses for why these GTN samples are unrecognizable, so if you're interested, you can ask me in the Q&A at the end of the talk — it's still pretty speculative, so I didn't put it in the core of the talk. All right. The next thing we noticed is that instead of randomly sampling this z code here, we can do even better, because if you really want to teach somebody fast, you shouldn't just give them random data; you should teach them with a curriculum. So instead, we can cross off the noise generator and put a learned tensor here. Now we're going to learn the block of z codes — a fixed block that we use throughout training. The z code is its length by the batch size by the number of inner-loop steps that you do, and you just learn each one of those numbers and pass that in to the network. We found that that greatly boosted performance.
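Here is a minimal sketch of those two tweaks as I understand them — the weight-norm reparameterization and the learned curriculum of z codes. Shapes and names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

# 1) Weight normalization (Salimans & Kingma): every weight is reparameterized as
#    w = g * v / ||v||, and both v and g are learned. In PyTorch it is a one-liner per layer:
layer = nn.utils.weight_norm(nn.Linear(64, 256))

# 2) Learned curriculum: instead of sampling z from noise, learn one fixed block of
#    z codes per inner-loop step and optimize it along with the generator.
inner_steps, batch, z_dim = 32, 128, 64
curriculum_z = nn.Parameter(torch.randn(inner_steps, batch, z_dim))
# At inner-loop step t, feed curriculum_z[t] to the generator instead of fresh noise,
# and include curriculum_z in the outer-loop (meta) optimizer.
```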
Here is learning a curriculum versus no curriculum, right here. And from here on out, all of the results I'm going to show are with the curriculum version of GTNs. Okay. So here's the first really fun comparison of real data to synthetic data. What you find is that if you only have 32 steps of SGD, it is way faster to train on GTN-produced, meta-learned synthetic data than it is to just take real data and train on it. Real data is this blue curve here. So, my first pop quiz for the class: why is that line going up? If it's just real data, why does it go up over time? Anybody? Yes. "Because we are learning hyperparameters?" Yes, thank you — I was going to call on you first. That's right: remember, we're learning the hyperparameters of the real-data algorithm too, so it's getting a better SGD momentum and things like that over time. Yes? "So did you backpropagate through the 32 steps?" Yes, that's right. And in fact, what we'll see later is that we can go even farther than that. I think we got up to 128, and Felipe — Felipe is amazing — was able to push it even farther than that, I believe, but those results are not in this paper. All right, cool. So the blue curve is real data: eventually it gets good hyperparameters and then it's stuck. This is dataset distillation — 100 samples learned directly, pixel by pixel. And this is when you have a generator, which can take advantage of all sorts of regularities across examples, deconvolutional priors, etc. This is outer-loop training, so this is meta-learning. This is inner-loop training. It's a little bit noisy, but this is real data and this is the GTN data, and what we basically found is that it is faster to train on this data. So what we wanted to do then is go back to the original motivation, which is neural architecture search, and ask: can we use this synthetic data to more rapidly figure out which architectures in the search space are good? We're going to do this on CIFAR, because that is the standard NAS benchmark that everybody has been working on. And if you look at CIFAR, the story is basically the same as with MNIST. This is real data over time, and this is GTN over time, and you can see that it performs way better. That was the outer loop. On the inner loop, you can see that you can basically get the performance of real data at 128 iterations four times faster with GTN, or for the same budget you can get to better performance. So this is way faster to learn on. Now, as with MNIST, the samples here are pretty weird, pretty unrecognizable, and pretty alien. I mean, you can kind of tell that it's CIFAR if you've been looking at CIFAR images for way too long, which I have. But you can't really recognize what any of these things are, versus real CIFAR images. Yet if you train on these, you end up performing really well. Now, the ultimate thing that matters in architecture search is not the actual performance of your learner after this learning, because we don't care about its real performance after 128 steps. What we care about is whether the estimated performance after those 128 steps of SGD — by the way, that's 128; I don't know if you noticed, but the CIFAR runs go for 128 steps, so even longer — tells us whether this architecture is pretty good or pretty bad.
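A rough sketch of how that few-step proxy might be used inside a simple random-search NAS loop. Helper names like make_net and the pre-generated gtn_batches are assumptions on my part, not the paper's code.

```python
import torch
import torch.nn.functional as F

def proxy_score(make_net, gtn_batches, val_x, val_y, inner_lr=0.02):
    """Train a fresh network for a few steps on synthetic GTN data, then report
    accuracy on a small batch of real validation data as a cheap quality estimate."""
    net = make_net()                                   # new architecture, new initialization
    opt = torch.optim.SGD(net.parameters(), lr=inner_lr)
    for x, y in gtn_batches:                           # e.g. 128 pre-generated synthetic batches
        opt.zero_grad()
        F.cross_entropy(net(x), y).backward()
        opt.step()
    with torch.no_grad():
        return (net(val_x).argmax(dim=1) == val_y).float().mean().item()

# Random search: cheaply score many sampled architectures, then fully train only the
# top few on real data.
# scores = [(proxy_score(a, gtn_batches, val_x, val_y), a) for a in sampled_architectures]
# best = sorted(scores, key=lambda s: s[0], reverse=True)[:10]
```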
Does that actually correlate with the true asymptotic performance if I take that same architecture and train it forever on real CIFAR data, or whatever my domain is? What we found is that there is actually a pretty good correlation. The top 50% of GTN architectures — the architectures that GTN thinks are the best — tend to also be the best when you train them much, much longer on real data. In fact, within the top 50%, that correlation is 0.56, and if you take the top 10 GTN architectures, a lot of them are also in the top 10 of true performance. So it doesn't matter that you pick this one, which isn't actually good, because you also pick this one, which is good; as long as you get enough of those, you're fine. We also found that you can get as good a rank correlation with the fake synthetic data as with real data, using 128 steps on GTN data versus 1,200 steps on real data. So to get the same rank correlation, you can either spend 128 fake-data steps or 1,200 real-data steps, which means GTN is 9x faster at giving you that estimated performance for a new architecture you've never trained before. So that's pretty exciting, and what that allowed us to do is produce something that is competitive with the state of the art for neural architecture search. Because GTN is a drop-in replacement for real data, we didn't do this with really fancy neural architecture search algorithms, since those are often really expensive; we just took random search, either with real data or fake data, and then added some tricks that people use, and — as you find in every paper in machine learning — our method is bolded, because it works. What's pretty cool is what it means: in just a few GPU days, you can basically find something that's near state-of-the-art on CIFAR using these sorts of tricks. What's also interesting is that there are other neural architecture search algorithms, like I mentioned, that do fancier things, but their innovations are orthogonal to using fake data versus real data. So you could hybridize this GTN technique with them, and they would perform even better. We haven't done that yet because we're not NAS experts, but we think that's an interesting direction for future work. The next thing I want to point out, which is really provocative — but also, fair warning, preliminary — is that this idea should work really, really well for reinforcement learning. Not a lot of people do architecture search on reinforcement learning, but I think that will change if you think back to the AI-GA paradigm. So here's pole balancing, which is admittedly very simple. What you see here is A2C over time, and this is the performance up to that environment step. The blue line here is taking a new neural net at every single iteration and doing one step of SGD with synthetic data. Which means that this red line point here took 100,000 steps, and for this blue line point, it took 100,000 steps to train up the GTN, but at that point I can zap the knowledge of how to do pole balancing into that new neural network in one SGD step. So if I stopped and launched architecture search right there — with a whole lot of caveats, like whether it is actually predictive out of distribution on new architectures, etc. — that's 100,000 times faster.
Now, I don't expect that to hold up on a much harder problem, but the same ideas that worked in supervised learning might also work here in RL, and actually I would argue that they're probably going to work even better. Why? Because in RL, the hard part is exploration. But once you've learned how to solve the problem, you can teach the next network how to solve that problem really, really fast and see how good it is, at least at performing the task — which is a subtly different question from whether or not it's good at learning the task. So it depends on what you want that architecture search for. But anyway, I thought that was interesting. All right. So this is the conclusion to the first mini-talk within this talk — or the second one if you count AI-GAs — and this is for GTNs. GTNs produce synthetic data that trains neural nets faster than real data. It generalizes to new architectures that it has never seen before — I should have said, by the way, that when we train a GTN we train on a distribution of architectures, but when we test it, we test on wildly different architectures, because the neural architecture search goes really far away from that initial distribution, and we found that the correlation works quite well even on those totally different architectures. It enables you to rapidly estimate the performance of new architectures, and we think this is a really generic approach: it could work with supervised learning, unsupervised learning, semi-supervised learning, and RL; we mostly focused on supervised learning and then showed you some preliminary results on RL. It also produced an architecture search algorithm competitive with the state of the art, but through a totally different means, which is an exciting new tool to have in the toolbox for neural architecture search. Cool. All right. So that is the first pillar, meta-learned architectures. Now we're going to get into the home turf of this class, which is meta-learning the learning algorithms. The way that I see the world — and I hope this doesn't conflict too much with Chelsea's summary of the field from the other lectures — is that there are roughly two large camps of meta-learning. One of them is that you meta-learn good initial weights and then you hand it over to SGD and let it go; I would say that's the MAML style that Chelsea is pioneering. The other camp is the idea that you meta-learn a recurrent neural network, and it itself invents its own learning algorithm within its activations; it's not going to use SGD at inference time, it's just going to use the activations within the network to create its own learning algorithm. This is the Learning to Reinforcement Learn paper by Jane Wang et al., and also the RL² paper that came out at the same time from Rocky Duan and others at OpenAI. I think they're both awesome, both really interesting — you'll see work in this talk in both camps — but I like to give that high-level picture. So let's quickly focus on the second camp — I know you know the first camp very well, and probably both of them — but this is the LRL camp. In the outer loop, you optimize a recurrent neural net with parameters theta for lifetime performance.
So you take this neural net, you deploy it in a world — say, Montezuma's Revenge — you let it play, you see how well it does, and then either you differentiate back to the original parameters theta, so the next time you deploy that net it's a little bit better, or you could use an evolutionary algorithm and just mutate the parameters and see if it does better, etc. So you have the outer loop optimizing that initial neural net, and then you deploy it, and there's no SGD going on within its lifetime — I like that metaphor of a lifetime. It does get the reward as an input, so it can actually implement its own reinforcement learning algorithm. Now, RNNs are Turing complete, so in theory this thing can implement any learning algorithm that's out there, which is kind of exciting. And what was nice is that in this paper here, they show that it learns on its own to explore and exploit, which is kind of cool. You put it in this maze here; it doesn't know where the reward is; it explores over here, then over here, then over here, then over here; and once it finds the reward, it changes the activations within its own network to stamp in that knowledge — to learn where to go — and then it goes back there over and over again, which are all the blue dots here. And you can see it vastly outperforms A3C, which doesn't have that capability. They also show, which is really interesting, that it kind of invents, maybe, its own model-based RL algorithm within its activations — they have some arguments in there, but there are also some caveats, so I encourage you to go check it out; I think it's provocative. So the thing that I'm going to put out there for this talk — some new work that's pushing in this direction — is that while it's great to think about an RNN as Turing complete, oftentimes just being Turing complete isn't enough. We're not going to actually search for AGI in the space of Turing machines; it's a very inefficient representation to have your program run in. So the idea here is that materials matter. You still have to choose those building blocks when you go to do meta-learning. With LRL/RL², those recurrent neural networks have to learn everything within their activations, and that's not how you learn. In your life, if you learn something in this class, for example, it is not just looping around in the firing of your neurons. That's what happens if I give you my telephone number and you have to remember it for a couple of seconds. But if you remember something for anything longer than about that, it's going into the weights of your brain, not the activations of your brain. So one of the things we could do is try to get learning to happen within the weights of the neural net, not just its activations. The first paper in this arc is Differentiable Hebbian Learning. This is work done by myself, Thomas Miconi, and Ken Stanley, and I really want to highlight Thomas here, who has really been pioneering this direction, both on this paper and the next paper I'm going to tell you about. He's a fantastic scientist. The idea here is that we're going to use Hebbian learning. How many of you are familiar with Hebbian learning? About 40%. Okay. The idea of Hebbian learning is that you can store information in the weights of the network in addition to the activations — you now get to use both.
But what's different about this work from anything you've heard of before is that we're going to train the Hebbian learning with SGD. So we're going to be able to take advantage of that very powerful tool to sculpt very carefully tuned Hebbian parameters. Just like LRL/RL², we're going to have a recurrent neural network that's deployed at inference time with no SGD inside it. But now it's going to be able to use Hebbian learning, not just change its own activations. The one-slide summary of Hebbian learning is that neurons that fire together, wire together — you've probably heard that phrase; it's how it's usually described. Here's how it works: the new weight in the network — the weight between i and j at time t plus 1 — equals the old weight, plus a little bit (this is just a learning-rate parameter) of the product of the pre- and post-synaptic neuron firings. So if the neurons both fire in the same direction, both positive or both negative, you get a positive value and the weight increases. If they fire in opposite ways — one fired positively, one fired negatively — you get a negative value and the weight decreases. So these will kind of reinforce themselves. Now, that might sound hopeless, like it will never do anything interesting. But in fact, people have been showing for a long time that this kind of unsupervised learning rule can do lots of really powerful unsupervised learning, including things like PCA and associative recall, where you give it a few digits of a phone number and it returns the full phone number, or you give it a few notes of a song and the whole rest of the song comes back — and this is happening in your brain, as neuroscientists have been showing for a long time. So what we're going to do here is Differentiable Hebbian Learning. The idea is that we're going to set up Hebb's rule inside of a neural net and then let SGD sculpt its parameters. Here's how this works. You take a recurrent — we're going to call it a plastic — network, because the weights themselves can change, and we train it end to end with gradients. In the inner loop, the network updates itself with no SGD; it's just going through its own motions and updating itself according to these Hebbian learning rules. Then in the outer loop, we differentiate through the entire episode to update the trained parameters with SGD. This is a little confusing because we're using SGD in the outer loop, but not in the inner loop. So here's how it works. The output of any given neuron is a nonlinearity applied to your typical weighted sum, where you have a weight — the whole parenthesis here — times the incoming activation. But the weight has a fixed part — this is like a normal weight, just learned by SGD — and then it has this plastic part. Inside the lifetime of the organism, this plastic component of the weight changes, and it gets added to the fixed part, so the weight can change over time. Does that make sense? This alpha here is a term that allows each weight to have a different ratio of the fixed part and the plastic part, so SGD can choose.
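In symbols — my notation, roughly following the differentiable plasticity paper — the pieces just described are:

w_{ij}(t+1) = w_{ij}(t) + \eta \, x_i(t) \, x_j(t)   (plain Hebb: fire together, wire together)

x_j(t) = \sigma\Big( \sum_i \big[ w_{ij} + \alpha_{ij} \, H_{ij}(t) \big] \, x_i(t-1) \Big)   (forward pass: fixed part w_{ij} plus plastic part \alpha_{ij} H_{ij}(t))

H_{ij}(t+1) = (1 - \eta) \, H_{ij}(t) + \eta \, x_i(t-1) \, x_j(t)   (the Hebbian trace, reset to zero at the start of each lifetime)

Here w_{ij}, \alpha_{ij}, and \eta are trained by SGD in the outer loop, while H_{ij} changes only within a lifetime.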
For the weights over here, where we always want them set to these exact values, SGD can just crank alpha to 0 and only the fixed, green part of the weight is used. These other weights here can be a mixture of both, and these weights over here can be entirely plastic, with no fixed component whatsoever — and everything in between. So SGD gets to control alpha. And then, what is alpha being multiplied by? Well, we call this the Hebbian trace, this h. This is a purely lifetime quantity; it gets initialized to zero every time. And the way it works is that the new Hebbian trace equals a fraction of the old Hebbian trace plus a bit of whether or not the neurons fired together, so that they wire together. Does that make sense? Cool. What we found on a variety of experiments is that this works remarkably well and often does better than LSTMs and other non-plastic recurrent networks. On this Omniglot task here, at the time, this was tied for state of the art with Chelsea's method and some other papers. Basically, it does about as well, but uses a totally different method, which we think is really interesting. You can also take this neural net and give it a really simple associative-recall, pattern-recognizing task. You might give it a series of patterns, then you give it a partial pattern, and its job is to fill in what it saw. This is kind of like I give you a bunch of phone numbers, then I give you a few digits from one of the phone numbers you saw, and you have to give me the rest of that particular phone number. What we show there is that a non-plastic recurrent neural net does not perform very well on this task. LSTMs actually do ultimately perform well, but look how much faster the Hebbian network learns this task — there's a pretty dramatic difference. And then we went to a much harder version of this. We said, all right, we're going to give the network a series of images — so this is now a pretty high-dimensional space. We flash these images, then we flash half of one of the images, and it has to fill in the rest of that image. And it has to be able to do that with an image it's never seen before. So this is a lot of high-density information storage, right? You get shown a couple of images and then you have to remember every pixel from every one of those, so you can reconstruct it when you only see half of it. Storing all that information in your activations might be very hard, but Hebbian learning makes it easier, because you can use the weights to do it. Sorry — I don't have a plot here. LSTMs couldn't even solve it, and our network actually does quite well at solving it. And that's with two million parameters, which is kind of interesting because, in the history of Hebbian learning, it's always been tiny little neural nets. So now we're able to scale Hebbian learning up to two-million-plus parameters, which is kind of a new era for Hebbian learning, which is pretty interesting. The next task that we have here is maze navigation. This is kind of cool because it learns, again on its own, to explore and exploit. You drop an agent — this yellow thing — in this maze. It can only see its local neighborhood. It has no idea where it is or what it's supposed to do.
What it eventually figures out over meta-optimization is, oh, I see — you want me to explore this maze, and then the second I get to the green thing, you want me to go back to it as many times as possible. That's the reward function. So the initial random network you see here just kind of bops around; it does nothing, as you would expect. But then look at this thing. The first time, it has to explore until it finds the green thing, and then it just shoots right back to it every single time. So it has remembered the maze: it knows how to explore that maze, it knows how to find the treasure, and then it knows how to return to it over and over and over again, which is pretty cool. This is Learning to Reinforcement Learn / RL² down here — no, sorry, the non-plastic one is in red; this is what a normal LSTM does, for example. And then, if you just have uniform plasticity, so you don't get to learn a per-weight alpha but just one alpha for the whole network, it also doesn't work. It's only once you give SGD the degree of freedom to set that per-weight alpha within the network that it can do much, much better on this task, which I think is pretty interesting. Okay, any questions about that? Yeah, there's one right there. [INAUDIBLE QUESTION] I'm now forgetting whether it saw them all at training. I'm pretty sure that at test time we show it new images and it can do this task — virtually sure we would have done that, but I'd have to check the paper; it's been a little while since I read it. Cool. Yeah? [INAUDIBLE QUESTION] Yeah, I think that's a fair thing to think about. Obviously, you can store information in activations, you can store it kind of to disk in differentiable storage, and you can store it in weights, and I think there are going to be pros and cons to the different approaches. The tension with differentiable plasticity is that the weights are part of the computation itself. With the DNC-style stuff, you kind of write it to disk and then it becomes an input to the program, whereas in the Hebbian case you get to change the program on the fly. So — and this is very hand-wavy — I think in some situations one is going to be easier than the other, and I don't think we know yet which type of problem one shines on versus the other. I think you could get a DNC to do the image-completion task, for example. But my guess is that there are other types of tasks where you really want to bend the program in different ways temporarily, where it's going to be hard to mold the DNC to have a generic function that takes different inputs and does wildly different things, as opposed to changing a few weights to change the program. That's just my instinct. But it's a really interesting area of work to compare these different approaches to storing information. Great, those are good questions. Okay, I'll push on. The next thing that we wanted to look at was a subject that's near and dear to my heart — I've worked on this in my lab in Wyoming on smaller neural networks, and now we're getting to see it at scale — and that's the idea of neuromodulation. This comes from another paper that was led by Thomas, with Rawal and others as co-authors.
But Thomas did the heavy lifting here. The idea is that we're going to do differentiable neuromodulated Hebbian plasticity, taking it one step further. The issue is that Hebbian learning is very local: every little connection is getting updated according to the data that's flying through it. It's a very difficult optimization process to harness and herd. What you might want is learning in a certain subset of the weights only in certain situations. Maybe once I've learned to solve the problem, I want all the weights to just stay put and not do anything — I don't want the Hebbian updates bouncing around, so I want to freeze learning. But then if something happens, I might want to crank learning up. And I might not want to crank it up everywhere; I might want to crank it up only in the part of the network that was responsible for the task I was just solving, for example. That's the idea behind neuromodulation. In neuromodulation, you can have one neuron in the network, like this one here, inhibit learning in another part of the network. So you could say, for example, if I'm playing chess, only turn learning on in the chess-playing part of the network and turn it off everywhere else, so I don't overwrite information in other parts of the network. That's a cartoon example. So how do we pull that off in practice? Well, we do this differentiable neuromodulated plasticity, or as we call it, Backpropamine — Thomas came up with that. The idea is that we have the same Hebbian formulation — the notation is slightly different, but it's basically the same idea as before — but the new part is that now the new Hebbian trace is the old Hebbian trace plus whether the neurons fired together or not, multiplied by the output of some other neuron M, and that other neuron can be an entirely learned function of the data. So that could be the thing that tells you whether you're playing chess or not, or whether reward just went way up or way down, or some other complicated function of the data. In addition to that, which is pretty interesting, there's also an eligibility-trace version in the paper. What that says is: don't just turn learning on and off in certain contexts, but store information about which neurons were involved in which situation — and don't do anything with it yet, just store it. Then, if something happens later — say, 100 steps from now I get a really big reward or a really big prediction error — go back and change the neurons that were involved in a certain way. If you read the Sutton & Barto book, there's a lot of work on eligibility traces in RL, which is pretty interesting. So this is an eligibility-trace version of differentiable plasticity. In a nutshell, this works even better than differentiable Hebbian plasticity, at least on some problems.
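Again in symbols — my notation, roughly following the Backpropamine paper, and glossing over details like exactly how the clipping is done:

H_{ij}(t+1) = \mathrm{Clip}\big( H_{ij}(t) + M(t) \, x_i(t-1) \, x_j(t) \big)   (simple neuromodulation)

E_{ij}(t+1) = (1 - \eta) \, E_{ij}(t) + \eta \, x_i(t-1) \, x_j(t)   (eligibility trace: just record who fired together)
H_{ij}(t+1) = \mathrm{Clip}\big( H_{ij}(t) + M(t) \, E_{ij}(t) \big)   (retroactive neuromodulation)

Here M(t) is the output of a meta-learned neuromodulatory neuron — a learned function of the network's inputs and state — so the network can dial plasticity up, down, or off depending on context.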
On a simple task here, where we were trying to recognize whether or not a certain symbol had been given to us a couple of time steps ago, what you see is that differentiable Hebbian plasticity, which is the green, and the non-plastic networks don't solve that problem at all, and both the eligibility-trace version, which we're calling retroactive, and the simple differentiable Hebbian neuromodulation version, which is the normal Backpropamine, solve that problem very well. And then here, on this maze task, they both solve the problem way better than non-modulated plasticity, which is — oh actually, no, it's here in blue. And the same thing down here on this Penn Treebank problem. Cool. So the final thing that I want to show you — ah, let me first ask, are there any questions on that? I think it's pretty cool: now you have something wildly different from SGD controlling the neural network at inference time, and it can turn learning on and off at particular connections, in different contexts, which is a lot of extra degrees of freedom for meta-learning. Okay, so the next thing I want to talk about is unpublished work — we've never presented this outside of DARPA meetings, so you're the first people to see it. This is a project that I'm very interested in, which is trying to learn to continually learn. It's done with this fantastic team here, and I really want to call out Shawn, who has been the lead author and has come up with a lot of the innovations on the project. In my opinion, one of the Achilles' heels of all of machine learning is catastrophic forgetting. How many of you are familiar with this term? Oh, everybody — that's really good. Then I will just cover this briefly, because I know at least we're putting it on YouTube. The idea is that when you're sequentially learning tasks, you first learn task A, then you learn task B. And typically, machine learning models, when they're learning B, have no incentive to hold on to any of the information for A, so they override everything they knew about A; it corrupts A, and they lose the skill for A. That's your classic catastrophic forgetting. Now, animals, including yourself, don't do this. You're able to study for this class, go to some other class and study for that, and then go play badminton, which you haven't played in 10 years, and just pick up where you left off, without corrupting your knowledge from all of these classes. And through our lives we get better and better at a variety of tasks, and if we forget, it happens gradually, not catastrophically, which is what happens in machine learning models. So I think we have to solve this if we want to make major progress in AI. I think it's kind of embarrassing how little progress we've actually made on this problem. But that's not for want of trying — we've been working on this problem for a very, very long time. There's a lot of early work that was really interesting, with all kinds of different techniques, but one thing that unifies them, in my opinion, is that they're all manually designed.
This is like the manual path to AI: I think I know how to solve catastrophic forgetting — what I need are pseudo-rehearsal patterns, or I need sparse representations, so I might add an auxiliary loss for that, etc. There has been recent work that I love, but I would also put it into the camp of stuff that's manually designed. EWC, which is a wonderful paper, is still a hand-designed technique: I think I know how to solve this problem, let's use some really cool math and Fisher information and try to solve it. Same with progressive nets. And there's been more and more and more, and it's all manually designed. But my proposal, our proposal, which comes from the AI-GA perspective, is: let's not try to figure out how to solve this ourselves. Let's just set the problem up and ask machine learning to figure out how to solve it — which is to say, let's learn to continually learn, and meta-learn the solution. The hypothesis is that we're not smart enough to build systems that can continually learn, so let's do the AI-GA thing. That is in contrast to the manual path, which might say: we might want sparse representations, so let's create an auxiliary loss trained for that and hope it works. I like meta-learning because you just get to ask the system to produce what you ultimately want. So going back to these two camps of meta-learning, now we're going to flip to the other one; this is more like the MAML school of meta-learning. I know that you're all very, very familiar with this kind of meta-learning and inner-loop learning, but I'm going to introduce some of the terminology I'll use, because this part gets really complicated and I don't want to lose you. The general gist is that with a typical MAML approach, you start with your parameter vector here, and then within one inner loop of training, you copy it and start doing inner-loop steps. At the end of all of that, you evaluate your meta-loss and differentiate back through that entire block to the original parameter vector, take a gradient step, and then repeat the process — standard for this group. So I want to introduce some terms. I think you use these terms — I think I've heard you and Sergey use them — so I like that we're maybe starting to create some standards in the community. We call this whole process meta-training, and in particular — I don't know if you use these terms? Oh, you do. Good. Then this will be old news for you, but we had to really commit to using these amongst our team, or we were constantly talking past each other; it's probably happening in your class projects too. We call whatever it's training on in here the meta-training training data, and the stuff that you evaluate on here the meta-training test data. You could also call this meta-training validation data, at your preference, but I'm going to call it test, for symmetry — so it's good to keep that in your head. Then, after meta-training, you have your final initial parameter vector theta_m here. You now pull that over for meta-testing, and you might want to test it on totally different data that it's never seen before, right? So now you have meta-testing training data, which is a bit mind-bendy.
Then, after meta-training, you have your final initial parameter vector, theta-m. You pull that over for meta-testing, and now you want to test it on totally different data that it has never seen before, so now you have meta-testing training data, which is a bit mind-bendy. But once you get used to this language, it really helps clarify things: the network has never seen this training data, so it's meta-testing training data. It adapts on that, but you still want to test the learner on something it has never seen, and you can't test it on that same data, so you go over here to meta-testing test data. Does that all make sense? Good.

Okay, with all of that language in mind, I'm going to make it a little more complicated, because we want to do continual learning. It can't just be IID chunks of data; it has to be sequential data. So now we do all of that, but on tasks in a row: Task 1, Task 2, up through Task T, and your meta-loss now asks how much you remember not just the most recent thing you trained on but all of it, so the meta-loss is computed over all of the tasks the learner saw. That's the general framework.

So, at ICML, Martha White gave a talk that I thought was really eye-opening. We had been working on this vision of trying to learn to continually learn for a long time and it wasn't working very well, and then her group put out an algorithm that did exactly that and did it really well. It was originally called MRCL, for Meta-Learning Representations for Continual Learning, but they changed the name in an updated version of the paper to OML. In our opinion this really validated the vision that you could learn to continually learn, so we chose to scrap what we were doing and build on top of their algorithm. Here's how OML works. They meta-learn a representation chunk, this red part of the network, across the outer-loop iterations, and in the inner loop they only train these blue inference layers, which they call the TLN. So after meta-training, when they go to meta-test, they freeze the red block and train only the blue layers. And what they show is that this performs really well. Historically, with catastrophic forgetting, you learn one task, you learn a second task, and you're hosed. They showed that after 150 classes trained sequentially on Omniglot, where each task is one class of Omniglot, you can still do pretty well, which for me was kind of mind-blowing: you can beat catastrophic forgetting across 150 tasks. On meta-test training, the stuff the learner actually saw, it remembers about 97 percent, so you have near-perfect memorization; you can still remember the first class after seeing the 150th class, which is pretty cool. When you go to generalize, to different instances of the first class or the 150th class, it's much worse, around 63 percent, but that's still way, way better than chance and way better than anything that came before, as far as I'm aware.
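To make that meta-test protocol concrete, here is a minimal sketch in the spirit of what was just described; it is an illustration with hypothetical layer and image sizes, not the OML code. The meta-learned representation network is frozen and only the small prediction head is trained, one class at a time, with plain SGD, and then you evaluate on everything seen so far.

import torch
import torch.nn as nn

rln = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU())  # meta-learned representation, frozen here
tln = nn.Linear(256, 600)                                              # prediction layers trained at meta-test time
for p in rln.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(tln.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def meta_test_train(class_streams):
    # class_streams: hypothetical list of (images, labels) tensor pairs, one per
    # class, presented strictly in sequence -- no revisiting, no IID shuffling
    for images, labels in class_streams:
        for x, y in zip(images, labels):          # a single pass over each class
            opt.zero_grad()
            loss = loss_fn(tln(rln(x.unsqueeze(0))), y.unsqueeze(0))
            loss.backward()
            opt.step()
    # afterwards, measure accuracy on everything seen (meta-test training accuracy)
    # and on held-out instances of those classes (meta-test test accuracy)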
What's also really cool is that OML learned on its own that sparse representations are a very good idea. Here, for example, is a method that was explicitly designed to produce sparse representations, and you can see they're pretty sparse; these are different instances of a training example, and this is an average across the whole dataset, but many of the neurons are dead: it simply learned to never use most of its neurons. If you do normal deep learning, you get highly active, non-sparse representations. But if you look at MRCL, it learned to be sparse while also learning to use all of its neurons across the representation, which is pretty powerful.

So OML gets a lot right. It stands for online-aware meta-learning, I believe; I forget the exact expansion. Anyway, it gets a lot right, but it's ultimately still subject to SGD, because it only meta-trains that representation block and then just hopes that, at meta-test training time, SGD won't mess up the weights of the network. So the question is, could we do better? Could we give the system more control, so that it can optimize, or at least modify, the SGD itself and not cannibalize its own information while it's learning at meta-test time? What we propose is a version of this based on neuromodulation: we're going to allow a neuromodulatory network to modulate SGD. But this is a different type of neuromodulation from the one I was just telling you about. Instead of directly gating the plasticity, turning learning on and off in the network, it gates the activations of the network. If you gate the activations, which we call selective activation, that also indirectly lets you control learning, which we call selective plasticity.

It gets a lot easier to see when you look at the architecture. Here is our system. You've got this red network, which is the prediction network; sorry, I've swapped the colors on you. It is not meta-learned; it's learned in the inner loop, during meta-test training. And this neuromodulatory network here is meta-learned, and the output of its final layer is a multiplicative weight on this layer of the prediction network. So it gets to turn these neurons on and off depending on what the neuromodulatory network thinks about that input: for some inputs it might only let, say, the first 30 percent of the neurons fire, and so on, depending on the context. In normal deep learning, the forward pass goes through the entire network, so you have inference everywhere and learning everywhere, because SGD goes back through the whole network, and that's how you get catastrophic forgetting. With ANML, and I don't know if I told you the name, but it stands for A Neuromodulated Meta-Learning algorithm, which is a play on both MAML and Reptile that we enjoyed, the system can gate these activations, which gives you selective activation: these neurons are not going to fire, so it's as if the blue weights were taken out of the network, and that in turn affects the backward pass of SGD, so you get selective plasticity. All of these purple weights will not be modified, because they weren't involved in the forward pass. So you're giving the network more degrees of freedom.
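Here is a minimal sketch of that gating, with hypothetical layer sizes rather than the actual ANML architecture: a meta-learned neuromodulatory network looks at the same input and produces a multiplicative mask on the prediction network's activations, so neurons gated to zero neither fire nor receive updates.

import torch
import torch.nn as nn

class ANMLSketch(nn.Module):
    def __init__(self, in_dim=784, hidden=256, n_classes=600):
        super().__init__()
        self.prediction_trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.prediction_head = nn.Linear(hidden, n_classes)     # trained in the inner loop / at meta-test time
        self.neuromodulator = nn.Sequential(                    # meta-learned in the outer loop
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden), nn.Sigmoid())

    def forward(self, x):
        gate = self.neuromodulator(x)          # values in [0, 1], per neuron, per input
        h = self.prediction_trunk(x) * gate    # selective activation: gated neurons do not fire
        # gradients flowing back through h are scaled by the same gate, so gated-off
        # neurons are also protected from being overwritten -- selective plasticity
        return self.prediction_head(h)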
So here is the training domain. It's back to Omniglot, and each Omniglot class is going to be its own class or task. What we would ideally do during training is differentiate through 600 tasks in a row, which is around 9,000 steps of SGD, all the way back to the initial weight vector, and be about our business. But that is hard; nobody has a practical way to backpropagate through 9,000 steps of SGD. Instead, the OML paper came up with a really nice innovation, which is this particular loss function. What they do is pick one particular class, the current task, and start doing SGD on different instances of that class, 20 of them for example, and then the meta-loss is computed both on instances of the class we just saw, so how well can you remember what you were just trained on, and also on other classes from the training set. That means you have to learn the thing you were just asked to train on without messing up information from other classes you learned in the past, but you only have to do it on a tiny sample, and then you keep doing that over and over again. That obviates the need to differentiate through 600 classes or tasks directly. Then you do it for a different task and repeat; those are the meta-optimization steps. Your inner-loop loss is made of those pieces, and you can differentiate all the way back to the initial parameter vector.

OML uses exactly that trick to meta-train the red representation network and then runs normal inner-loop SGD on the blue layers. What we do is use that meta-learning to train the red network, the neuromodulatory network, which then masks the activations of the prediction network, which is not meta-trained; it's a normal prediction network. Then at meta-testing, we take a new class, train on 15 instances of it, and then check it on five more examples of that class. At the end of those 15 steps you can ask how well it memorized the samples it saw, which is kind of like how well it fit the training set; that's meta-test training accuracy. You can also ask how well it does on the other samples, so how well it generalizes to unseen examples. Clear? If you do that for one task, you get one point on each plot; do it for two tasks, you get another; and you can keep going across however many classes you want to show the network, then stop and ask both how well it does on everything it saw during meta-test training and how well it does at meta-test testing. What we really care about is how well it does on the meta-test test set.

I want to remind you that in normal deep learning you have two things helping you out: IID training and the fact that you typically make multiple passes through the data, multiple epochs. In this sequential-learning setup, the data is not IID, so you have to contend with catastrophic forgetting, and we only make one pass through the data, so it's also kind of a few-shot learning problem.
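Just to pin down the settings being contrasted, here is a tiny sketch (an illustration only, assuming a hypothetical list-of-dicts data format with a "class" key): ordinary training shuffles everything IID and revisits it for many epochs, the continual setting here streams classes strictly in sequence with a single pass, and the oracle mentioned in a moment keeps the single pass but removes the sequential ordering.

import random

def iid_multi_epoch(examples, n_epochs=10):
    # the usual deep-learning setting: shuffled data, many passes
    for _ in range(n_epochs):
        random.shuffle(examples)
        yield from examples

def sequential_single_pass(examples):
    # the continual setting here: one class at a time, each example seen exactly once
    yield from sorted(examples, key=lambda ex: ex["class"])

def iid_single_pass(examples):
    # the "oracle" setting: no catastrophic forgetting to fight, but still low-shot
    random.shuffle(examples)
    yield from examples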
Both of those forces are going to hurt our performance. Here's what the OML paper showed. If you look at meta-test training, so how well you memorized the stuff you were just shown, even after 150 classes OML is doing really, really well, and so is ANML. We then extended that all the way out to 600 classes, and now you see a massive difference between OML and ANML. But we also started playing around with OML a little, and what we realized is that if you let it fine-tune only the last of those two layers, rather than both of them, it does much better; that's this green curve, and then the gap with ANML is actually pretty small. If you just train from scratch, you get this terrible performance; that's the catastrophic forgetting problem we've been dealing with in deep learning forever. And if you pre-train on the meta-training set and then do transfer learning, you still fall off a cliff. So that's the normal catastrophic forgetting story in deep learning.

The real big difference shows up when you look at meta-test testing, so how well the network generalizes to new examples it has not seen. During meta-testing it never sees anything from meta-training, but it did see 15 examples of each of 600 classes, and the question is how well it does on the five held-out examples of each of those classes. There you see a massive gap between ANML and OML, even compared to the better version of OML, as you go way out. But the bigger picture, setting aside the comparison between methods, is that this is just really cool: 600 classes in, 9,000 steps of SGD in, and the network is not forgetting terribly. The performance is not as good as you might hope for, but it's still remembering a lot about those really early classes, which is pretty incredible.

In terms of what you could actually hope for, we asked: what if we get rid of catastrophic forgetting entirely, what's the best you could do? The upper bound, or oracle, for ANML is this red line here, where you train IID: instead of seeing the 600 classes in order, you train on data sampled IID from those 600 classes, which eliminates catastrophic forgetting, but you still make only one pass through the data, so you still have a low-shot learning problem. What you see is that ANML is really close to that oracle. So to some extent the problem is no longer catastrophic forgetting; it might just be data efficiency, sample-efficient learning, that's harming us, rather than catastrophic forgetting being the Achilles' heel. That's really exciting. The final result I'll show from this project is that, like OML, ANML learns sparsity. Here is the activation of the prediction network, the forward-pass network that is not meta-learned, before it gets gated by the neuromodulatory network: about 50 percent of its neurons are active for these individual example inputs. This is the output of the neuromodulatory network, and after neuromodulatory gating, activation is down to about 6 percent.
So it is intelligently learning to figure out which small subset of the network should be on in order to learn without forgetting. Any questions? I hope I'm not losing you.

All right, to conclude this section: both OML and ANML show the promise of meta-learning for learning solutions to catastrophic forgetting, and I would say more broadly, whatever problem you're trying to solve, set it up as a meta-learning problem and see if meta-learning can solve it for you instead of trying to get too clever. ANML's advantage over OML also underscores that, even within the meta-learning paradigm, which is what everybody in this room is interested in, the materials still matter: should you use neuromodulation, should you use Hebbian learning, should you use a different architecture, and so on. And I think there's a lot of future work that could improve this even further, including trying to replace SGD entirely. ANML still uses SGD; it just modulates it. But you can imagine a version where you have a giant recurrent neural network with differentiable neuromodulation, and that thing invents its own learning algorithm that learns to learn without forgetting. That would be very exciting.

Okay, now we get to some non-meta-learning material. The third pillar of AI-GAs is automatically generating learning environments. All of you know that in meta-learning you can't just train on one environment; you have to train on a distribution of environments. So the question is, where will you get them? How are you going to provide the fuel for your meta-learning algorithm? And if you pick some, will they be the right ones to catalyze learning? I don't just mean learning on your particular problem, although that's interesting too; I mean learning that gets us really far, toward our most ambitious AI goals. We certainly wouldn't want humans picking all of those problems, because we're probably not good at it, yet that is what we tend to do in machine learning: we pick the challenges and then go try to solve them. The question I think is really interesting is whether algorithms can invent their own problems while they're learning how to solve them. One thing I find really inspirational here is what are called open-ended algorithms. This is an algorithm that you turn on and just let run forever, and it continuously innovates forever; it just runs and runs and runs. We do not currently have any algorithm you would want to let run for, I don't know, more than a couple of months, maybe a couple of years. I'm talking about an algorithm that could run for a billion years. Can anybody in this room make an algorithm that you'd want to run for a billion years, come back and check on, and find it still doing interesting things? We have seen one example of that, and it's Earth: the very simple algorithm of Darwinian evolution, plus a planet-sized computer, is, a billion years later, constantly innovating and doing really fascinating things. So can we make algorithms that do this? I think that's a fantastic and interesting challenge. And no, you do not have to be committed to using an evolutionary algorithm to do it; that's not at all the point.
The point is that, with whatever your favorite optimization algorithm is, we should be able to make algorithms that endlessly innovate. Evolution on Earth is not the only process that does this; human culture does it too. We are constantly innovating, making new kinds of novels and TV shows, science, art, dance, and so on. One thing underlying all of these open-ended processes is that they invent their own challenges and solve them, and the solutions to those challenges end up creating new problems that in turn get solved. For example, natural selection invented the problem of leaves high up in trees, and then the solutions to that problem came in the form of giraffes and caterpillars, two very different ways to go eat those leaves. And giraffes in turn are their own problem that lions and hyenas can prey upon, and so on.

So in this work we have a new paper called POET, the Paired Open-Ended Trailblazer, and I want to highlight Rui Wang, who has done all the heavy lifting on this project; he's done a wonderful job. The idea behind the approach is to endlessly generate increasingly complex and diverse learning environments and their solutions, all together, online, in one big algorithm. Here's how it works. We periodically generate environments, which means we have to parameterize environments so we can search over them: you have a parameter vector that specifies a particular environment, and changing those parameters gives you different environments. Newly generated environments are added to the population of environments if they are neither too easy nor too hard for the current set of agents, with novelty as a tiebreaker to incentivize the environments to spread out and be different. Each environment in the population has an agent paired with it, and once we generate new environments, we start optimizing the agents to solve their environments; that's the "paired" in the name.

Now, there's a paradox in life that I think is really interesting, which is that if you try too hard to solve a problem, you'll fail. For example, if I put you in this maze as this robot, starting here and trying to reach the goal, and your reward function says only to reduce your distance to the goal, then what you'll do is go up here and bang your head against this wall forever; that's what this optimization algorithm does when it only tries to reduce distance to the goal. If instead you just seek out new places in the maze, if you search for novelty, you trivially solve the problem, which is what happens in this maze over here. This is a more general lesson in life; it applies across science and technology. If you go back a few hundred years and say, I want to heat up food more rapidly, you will never invent the microwave if you only fund researchers who can heat food faster, because to invent the microwave you had to have been working on microwave radar technology and notice that a chocolate bar had melted in your pocket, which is how the microwave oven was actually invented.
And if you go back a couple of millennia to abacuses and say, that thing is cool, it does computation, I want more computation, and you only fund people who make better abacuses, you will never invent the modern computer, because to produce it you had to have been working on things like electricity and vacuum tubes, which had no immediate application to computation whatsoever. So the idea, which we call goal switching, is this (I'll skip that example): if you are optimizing for one thing, say this scientist wants a robot that walks, and suddenly one of your agents starts crawling really well, don't throw that out as a failure. Capture it and start optimizing for it too, because it might ultimately be a really good stepping stone to walking; the same goes for balancing on one foot. I do want to mention one thing about POET that I meant to say at the top: POET so far is not a meta-learning system, but it could be used for meta-learning, because the environments it generates can be fuel for your meta-learning algorithms. So this is goal switching: you start optimizing your agents on one goal, and if they suddenly start doing well on some other goal, you catch chance on the wing and start optimizing for that too. We've shown over and over, in a lot of different papers, that this can really pay off because it gets your agents off of local optima. In the paper we had on the cover of Nature, it produced state-of-the-art adaptation in robots, and in the Go-Explore paper we used it to help solve the two challenges I talked about earlier. The way goal switching works in POET is that you have this population of different environments, and periodically you transfer an agent from one environment to another to see if it does better there.

Okay, so here's the particular domain in POET. It's an obstacle course, and the agent tries to run through it really fast without falling over. Here are the degrees of freedom in the environment: it can make environments with different amounts of, say, stump height or terrain roughness. And here's how it works. We start with a parameter vector phi-one that specifies an environment, and a parameter vector theta that is paired with that environment, and we optimize theta to perform well on phi-one. After theta is pretty good, we copy the environment to make another environment, phi-two, which is a little bit different, and then we do transfer learning: we goal-switch this agent into the new environment and start optimizing it there. Optimization happens in parallel, so both of these pairs are learning to solve their problems together. Then we might copy an environment that turns out to be too hard, so we throw it out and generate a new one; we can test both existing agents in that new environment, see which one is better, and produce a new agent to optimize there. And it does not have to be a linear chain: we can produce this environment over here, whose best parent might be this one, and we go on and on.
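To make the moving parts concrete, here is a toy, runnable sketch in Python; it is entirely a simplification for illustration, not the released POET code. An environment is reduced to a small parameter vector and an agent to a single "skill" number, just to show the loop: mutate environments, keep the ones that are neither too easy nor too hard for current agents, pair each accepted environment with a promising agent, take an optimization step for every pair, and periodically transfer better agents between environments. The real system also ranks candidate environments by novelty and uses ES or RL for the per-pair optimization; those details are omitted here.

import random
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Env:
    roughness: float = 0.0
    stump: float = 0.0
    gap: float = 0.0
    def difficulty(self):
        return self.roughness + self.stump + self.gap

def mutate(env, step=0.2):
    # children tend to be a bit harder than their parent
    return replace(env,
                   roughness=env.roughness + random.uniform(0, step),
                   stump=env.stump + random.uniform(0, step),
                   gap=env.gap + random.uniform(0, step))

def score(agent, env):
    # toy reward: the closer the agent's "skill" is to the environment's difficulty, the better
    return max(0.0, 1.0 - abs(agent - env.difficulty()))

def poet(iterations=300, reproduce_every=20, transfer_every=5):
    population = [(Env(), 0.0)]                  # one (environment, agent) pair to start
    for it in range(1, iterations + 1):
        if it % reproduce_every == 0:            # periodically propose a new environment
            child = mutate(random.choice(population)[0])
            best = max(score(a, child) for _, a in population)
            if 0.1 < best < 0.9:                 # minimal criterion: not too easy, not too hard
                donor = max(population, key=lambda p: score(p[1], child))[1]
                population.append((child, donor))
        # one optimization step per pair: nudge each agent toward solving its own environment
        population = [(e, a + 0.05 * (e.difficulty() - a)) for e, a in population]
        if it % transfer_every == 0:             # goal switching between environments
            population = [(e, max((a for _, a in population), key=lambda a: score(a, e)))
                          for e, _ in population]
    return population

print(max(poet(), key=lambda p: p[0].difficulty()))   # hardest environment reached, and its agent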
Now imagine that eventually we produce this really, really hard challenge over here. Initially it looks like this environment is the best stepping stone toward it, so you start optimizing this agent in it, and it does okay but gets stuck on a local optimum. Because we're doing goal switching in this algorithm, though, this other agent might ultimately replace that one, and with a little more optimization on this problem it might produce the best solution to that problem over there, so we end up with a solution to this hard problem that we only reached by getting off of those local optima. In this particular case we have a three-layer neural network being optimized with evolution strategies, but it could be any RL algorithm, and here are some fun videos of it working. You have this little tiny robot moving through the terrain; the environments start off really simple, and over time the algorithm makes harder and harder versions of the problem. First it gives the agent individual challenges, like only small stumps, only small gaps, or only somewhat rough terrain. Over time it produces bigger gaps and starts combining things, like medium-sized gaps with medium-sized stumps, and eventually it produces these really hard environments with gaps about as big as the robot can clear, where the agent is using its lidar sensors to time its jumps and really pushing the limits of what this body is capable of. Here's another challenge that POET invented all by itself, which we didn't even think was possible.

One interesting thing we found about POET is that if you take the most challenging environments it produces and try to directly optimize solutions in them, it doesn't work, because there's no curriculum, whereas POET is automatically implementing its own curriculum. To be more fair, we took the hardest environments and created a direct-path curriculum, which is a pretty common technique: take the parameter vector for the hard environment and linearly interpolate along that parameter vector, optimizing down the path (there's a tiny sketch of this baseline after this example). It fails every single time, because curricula are hard: designing them is hard, you will usually get it wrong, and the harder the environment, the more often it failed. I want to show you one nice example. Here is our agent on the simplest version of the problem, and it's dragging its knee. Eventually the algorithm creates a slightly harder version of the problem where this knee-dragging behavior no longer works, so the robot stands up and gets a better score, and eventually this standing agent was better even when transferred back to the simple environment. So the algorithm automatically goal-switches it back to the original environment, where it is now standing up and performing better, and with further optimization it reaches a much better score of 349. We also ran the counterfactual: we kept running the original agent in that easy environment for a long time, and it never stands up; it just keeps dragging its knee. I think this exemplifies why curriculum learning is hard: here you had to go work on a harder problem to get better at the simple problem and get yourself off of that local optimum.
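Here is that direct-path curriculum baseline as a tiny sketch (an illustration only, treating an environment's parameter vector as a plain list of numbers): interpolate linearly from an easy parameter setting to the hard target and train the agent at each waypoint. As described above, this hand-built straight-line curriculum failed on the hardest POET environments, while POET's own, indirect curriculum reached them.

def direct_path_curriculum(phi_easy, phi_hard, n_steps):
    # yield a sequence of environment parameter vectors from easy to hard;
    # the agent would be trained at each waypoint before moving on
    for i in range(n_steps + 1):
        t = i / n_steps
        yield [a + t * (b - a) for a, b in zip(phi_easy, phi_hard)]

for phi in direct_path_curriculum([0.0, 0.0, 0.0], [0.8, 0.4, 0.9], n_steps=10):
    pass  # train the current agent on the environment specified by phi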
So if you don't do goal switching, if you don't do transfer, you never solve these really hard problems, and you never even generate them. There's been a lot of work on learning curricula that I don't have time to get into today, but I want to quickly fuel your intuition about future work. We'd like to do this in harder domains, and you could do it with multiple agents as well. I think it's really thought-provoking to imagine putting this in a world like this one, where there are other agents you can barter with, buy goods from, fight with, and cooperate with, where you can climb buildings and deal with aerial predators, and so on. If you had enough compute to run POET in a world like that, you can really imagine meta-learning some very sophisticated behaviors. This is also closely related to work from OpenAI. Currently, as I mentioned, POET has not yet been combined with meta-learning: each environment is a little deterministic environment, and you don't have to learn-to-learn to solve it. But you could easily take all of the tasks produced by POET as the distribution you meta-train on, or make each environment in POET its own little distribution that you meta-learn on, while still having goal switching among the learners. And this is all very related to OpenAI's automatic domain randomization, if you saw that paper, which worked really well and helped solve the Rubik's Cube. I don't have time to go through all of this, but I think POET is really exciting: it invents its own curricula, it hedges its bets, it endlessly innovates, it offers a path to meta-learning, and it opens up a lot of interesting research directions; most of all, it invents its own challenges, which is exactly what we need for meta-learning. If you're interested in more on POET and open-endedness, you can watch our ICML tutorial on population-based methods at the link there.

With that, I'll get to my last slide; thank you for your patience. I think it's fair and interesting to question whether the dominant paradigm in machine learning, the manual approach to AI, is really going to get us where we ultimately want to go, and I think that's worth discussing. I've proposed an alternative paradigm, the AI-generating algorithms paradigm, where you go all-in on meta-learning: you meta-learn the architectures, you meta-learn the learning algorithms, and you automatically generate the learning environments. You can imagine those three pillars coming together into one really interesting algorithm that could potentially innovate forever and take us all the way to extremely powerful forms of AI, perhaps even human-level AI. For each of these pillars I've also introduced some fairly exotic new approaches we've been working on, including generative teaching networks, Differentiable Hebbian Plasticity, Differentiable Neuromodulated Hebbian Plasticity, ANML, and POET, which automatically generates environments. I hope you found that interesting, and thank you again for your patience and for the invitation. [APPLAUSE] I also want to thank my collaborators, and there's at least a minute left if anyone has high-level or low-level questions. Yes?

What's the most pervasive argument that you come across [inaudible]?

Yeah.
So I have a lot of them in the paper that's on arXiv, and you can read a more detailed treatment there. But I'll give you two answers. One, at a high level: engineering is wonderful and has done really impressive things. We got to the Moon, we got to Mars, we built the International Space Station, we build giant Boeing planes, very complicated machines. So you could say that will continue and we'll just build something really big. But I think it's fair to question whether that will work when the thing we're building requires something more sophisticated than what we can understand, when maybe we're simply not smart enough to do it by hand. The second thing I'd say is that the manual path is making tremendous progress; most of what we've seen in AI over the last couple of years has come from the manual path, so it's really doing well. My guess is that the AI-GA path will underperform it for a while and then cross over and surpass it, but I don't know when that crossover point is. The crossover might even come after we've produced human-level AI, so one argument is that both paths will get there, but the manual path might get there faster. Yeah?

Can you comment, as a general question, on the relationship between the work your group does and Uber self-driving? What I'm thinking of is the thousands of edge cases, the long tail; can you elaborate on the relationship between this work and that?

Yeah. At the highest level, I'm in Uber AI Labs, which was formed after our startup Geometric Intelligence was acquired, and it was definitely set up as a basic research lab focused on all areas of machine learning, not just self-driving; within Uber there is an entirely separate organization that focuses on self-driving. So just from an organizational perspective, we have totally different mandates, which is why you see a lot of more basic machine-learning research here that isn't focused on that particular application. That said, we find those challenges interesting and we work on them too. I can't comment too much about internal applications of these particular ideas, but I can give you a flavor of how they might be useful. You could imagine, for example, POET automatically inventing exactly the long-tail edge cases you mention, cases that are very difficult and hard for the current policy to solve, and also inventing the solutions to those problems. That's useful in a couple of ways: one, you're collecting really hard corner cases you can add to your training set, and two, you have the solutions to those edge cases, so if your policy can consume imitation learning, or learning from observation or demonstration, it can use those individual solutions and distill them back into the main policy. So that's one way these algorithms might be useful.
But more broadly, the ideas of open-ended algorithms, getting off of local optima, better search techniques, improved memory usage, more efficient learning, and solving continual learning, which is definitely an issue for self-driving and many business applications, give you a wide opportunity to apply these sorts of ideas within any company that has a lot of challenging machine-learning problems, which Uber definitely has.

Good answer. Thank you. Yes?

It's really interesting to see how these algorithms work in these kinds of simulated environments, and, working in robotics, I was thinking about how I could apply this to what I do. There are a lot of problems in robotics that don't seem to be directly addressed by this, so I was wondering if you could comment on how we take these ideas to the physical world, where you can't afford to fail, where we're still optimizing the hardware and don't even know what our feedback is yet. What can we learn from these kinds of meta techniques in a more physical setting?

Yeah. One thing I would say is that I recommend this paper here. What I love about it is that it marries the power of stochastic optimization, deep reinforcement learning, evolutionary algorithms, any of these stochastic optimizers, which are really expensive and really sample-inefficient but can be run in simulation ahead of time, with very efficient learning in the real world. It takes what you learn in simulation and then, using Bayesian optimization, exploits that information really efficiently in the real world. I don't have time to explain the full method, but the idea is that you can use a lot of the algorithms I've talked about to explore the space of possibilities and learn a diverse set of high-performing strategies, and then in the real world, where you only have a limited number of experiments you can do or a limited time to learn, you can just switch between a whole bunch of really good solutions. That's one flavor of answer. A second flavor is the OpenAI Rubik's Cube strategy, which is to use things like meta-learning and POET-style environment generation to train in simulation and then immediately deploy in the real world. The OpenAI Rubik's Cube work combined two of the three pillars of AI-GAs: it automatically generates an increasingly large, diverse set of challenging environments to train on, and it uses the RL-squared, learning-to-reinforcement-learn style of recurrent neural network that invents its own learning algorithm. In simulation it kept waking up in a simulated world with, say, a different amount of gravity, a different amount of friction, different colors on the walls, so it was always in a slightly different world and had to rapidly figure out which world it was in, what the friction is, what the gravity is, and quickly solve the problem, over and over again, at great compute cost. Then when you drop it in the real world, the real world is just another new challenge, and it's like, oh, what's the friction here, and what's the weird lighting effect here?
And how do I flip the cube? And then it can be a rapid, sample-efficient learner that solves the Rubik's Cube, which is what they showed it did. So you can spend compute ahead of time to learn how to learn efficiently and then just deploy that in the world. Those are two different techniques, both of which can be used to great effect in robotics, and Chelsea has done a lot of great work showing that this works, and so has Sergey, and several other people in the field, and now OpenAI. So I don't think robotics has to wait for AI to get better and more sample-efficient; I think you can already start taking advantage of these techniques, and OpenAI is, I think, the best showcase of that.

There was one more question back there. I don't know when you have to call it, though. Just one more? Okay, we'll take the one more question that was back there.

Just a quick thought: have you considered applying POET to the Minecraft RL challenge or something like it? [inaudible]

Yeah, we have talked about that. I think that's a fascinating direction. I'm not familiar with that particular challenge off the top of my head, but the idea of POET plus Minecraft I think is really great, because it's such an open-ended world that it would be really fascinating to see what it does. Yeah, great question. Thank you all again. I appreciate it. [APPLAUSE]
