Week 1 – Lecture: History, motivation, and evolution of Deep Learning

OK, so first of all I have a terrible confession to make. This class is actually being run not by me, but by these two people: Alfredo Canziani and Mark Goldstein, whose names are here [points to slide]. They are the TAs, and you'll talk to them much more often than you'll talk to me. That's the first thing. The other confession I have to make is that if you have questions about this class, don't ask them at the end of class today, because I have to run right after the class to catch an airplane. They can wait until next week.

OK, so let's start right in. Some very basic course information. There is a website, as you can see. I will do what I can to post the PDF of the slides on the website, probably just a few minutes before the lecture, usually. But it should be there by the time you get to class, or at least by the time I get to class. There are going to be nine lectures that I'm going to teach, on Monday evenings. There is also a practical session every Tuesday night that Alfredo and Mark will be running. They'll go through practical questions, refreshers on the mathematics necessary for this, basic concepts, and tutorials on how to use PyTorch and various other software tools. And there are going to be three guest lectures. The names of the guest lecturers are not finalized, but they'll be on topics like natural language processing, computer vision, self-supervised learning, things like that.

There's going to be a midterm exam, or at least we think there is, and it will take one of those sessions, around March. The evaluation will be based on the midterm and on a final project. You can band together in groups of two. Did we say two or three, or just two? We didn't decide yet; we'll see. The project will probably have to do with a combination of self-supervised learning and autonomous driving. We are talking with various people about data and things like that.

Okay, let me talk a little bit about this first lecture. It is really going to be a broad introduction to what deep learning is, what it can do, and what it cannot do. So it will serve as an introduction to the entire thing: we'll go through the entire arc of the class, if you want, but in very superficial terms, so that you get a broad, high-level idea of all the topics we'll be talking about. And whenever I talk about a particular topic later, you'll see where it fits in this whole picture.

But before that: there is a prerequisite for the class, which is that you need to be familiar with machine learning, or at least with basic concepts in machine learning. Who here has played with PyTorch or TensorFlow, has trained a neural net? OK. Who has not done that? Don't be shy. OK, so the majority has, which is good. But I'm not going to assume that you know everything about this. In particular, I am not going to assume that you know a lot of the deep underlying techniques.

OK, so here is the course plan, and depending on what you tell me I can adjust it: go faster on certain sections that are too obvious because you've played with this before, and so on. So: intro to supervised learning, neural nets, deep learning. What deep learning can do, what it cannot do, what are good features. Deep learning is about learning representations.
That's what I'm going to talk about today. Next week will be about backpropagation and basic architectural components: things like the fact that you build neural nets out of modules that you connect with each other, and then you compute gradients with automatic differentiation. Then the various types of architectures, loss functions, activation functions, different modules. Tricks like weight sharing and weight tying, multiplicative interactions, attention, gating, things like this. And then particular macro-architectures, like mixtures of experts, Siamese nets, hypernetworks, etc. So we'll dive in pretty quickly, and that's appropriate if you've already played with some of those things.

Then there will be either one or two lectures (I haven't completely decided yet) about convolutional nets and their applications. One of them might end up being a guest lecture. Then, more specifically, deep learning architectures that are useful in special cases: things like recurrent neural nets with backpropagation through time, which is the way you train recurrent neural nets, and applications of recurrent nets to things like control, producing time series, and so on. Then things that combine recurrence with gating and multiplicative interactions, like gated recurrent units and LSTMs. And then things that really use multiplicative interactions as the basis of their architecture, like memory networks, transformers, adapters, etc., which are very recent architectures that have become extremely popular in NLP and other areas. And a little bit about graph neural nets, which I'm not going to talk about a lot, because there is another course you can take, by Joan Bruna, where he spends a lot of time on graph neural nets.

Then we'll talk about how we get those deep learning systems to work: various tricks to get them to work, and some understanding of the type of optimization that takes place in neural nets. Learning is almost always about optimization, and deep learning is almost always about gradient-based optimization. There are certain rules about optimization in the convex case that are well understood. But they're not well understood when the training is stochastic, which is the case for most deep learning systems. And they're not very well understood in deep learning, because the cost function is not convex: it has local minima, and saddle points, and things like this. So it's important to understand the geometry of the objective function. I say it's important to understand, but the big secret here is that nobody actually understands. So it's important to understand that nobody understands. Okay? But there are a few tricks that have come up through a combination of intuition, a little bit of theoretical analysis, and empirical search. Things like initialization tricks, normalization tricks, and regularization tricks like dropout. (Gradient clipping is more for optimization.) Things like momentum, averaged SGD, and the various methods for parallelizing SGD, many of which do not work. And then something a little exotic called target prop, and the Lagrangian formulation of backprop. Then I'll switch to my favorite topic, which is energy-based models.
So this is a general formulation of a lot of different approaches to learning, whether they are supervised, unsupervised, or self-supervised, and whether they involve things like inference: searching for the values of variables that nobody tells you, but that your system is supposed to infer. That could be thought of as a way of implementing reasoning with neural nets. So you could think of reasoning in neural nets as a process by which you have some energy function that is optimized with respect to some variables, and the value you get as a result of this optimization is the value of the variables you were trying to find.

There is the common view that a neural net is just a function that computes its output as a function of its input: you just run through the neural net, you get an output. But that's a fairly restrictive form of inference, in the sense that you can only produce one output for a given input. Very often there are multiple possible answers to a given input. So how do you represent problems of this type, where there are multiple possible answers to a given input? One answer is: you make those answers the minima of some energy function, and your inference algorithm finds values of those variables that minimize this objective function. There might be multiple minima, which means your model might produce multiple outputs for a given input. Okay? Energy-based models are a way of doing this. A special case of those energy-based models are probabilistic models: Bayesian methods, graphical models, Bayesian nets, and things like this. Energy methods are a little more general, so a little less specific. Special cases of this include things like what people used to call structured prediction.

And then there are a lot of applications of this in what's called self-supervised learning, which will be the topic of the next couple of lectures. Self-supervised learning is a very, very active topic of research today, and probably something that's going to become really dominant in the future. In the space of a year, it has already become dominant in natural language processing. And in the last few months, just three months, there have been a few papers showing that self-supervised learning methods actually work really well in things like computer vision as well. So my guess is that self-supervised learning is going to take over the world in the next few years, and I think it's useful to hear about it in this class. I'm not going to go through a laundry list, but there are things you may have heard of: variational auto-encoders, de-noising auto-encoders, BERT (those transformer architectures that are trained for natural language processing). They are trained with self-supervised learning, and they are a special case of a de-noising auto-encoder. So a lot of those things you may have heard of without realizing they can all be understood in the context of this energy-based approach. That also includes generative adversarial networks (GANs), which I'm sure many of you have heard of.

And then there is self-supervised learning and beyond: how do we get machines to really become intelligent? They're not superintelligent; they're not very intelligent right now.
They can solve very narrow problems very well, sometimes with superhuman performance, but no machine has any kind of common sense. The most intelligent machines that we have probably have less common sense than a house cat. So, how do we get to cat-level intelligence first, and then maybe human-level intelligence? I don't pretend to have the answer, but I have a few ideas that are interesting to discuss in the context of self-supervised learning and its applications.

Any questions? So that's the plan of the course. It might change dynamically, but at least that's the intent. Any questions? Okay. [Student Question: Will we also be having assignments in the course?] Yeah, there are assignments. [...Inaudible...] Okay, so for those of you who didn't hear Alfredo, because he didn't speak very loudly: the final project is actually going to be a competition between the teams. There is going to be a leaderboard and everything. And in preparation for this, the assignments will basically be practice to get familiar with all the techniques you need for deep learning in general, and for the final project in particular. [Also for the midterm.] Right, also for the midterm, obviously.

Okay, most of you probably know this, and it's probably going to be boring for some of you who've already played with those things, but let's start from the basics. Deep learning is inspired by what people have observed about the brain, but the inspiration is just an inspiration. The attempt is not to copy the brain, because there are a lot of details about the brain that are irrelevant, and we don't actually know which ones are relevant to intelligence. The inspiration is at the conceptual level. It's a little bit the same as airplanes being inspired by birds: the underlying principles of flight for birds and airplanes are essentially the same, but the details are extremely different. They both have wings, they both generate lift by propelling themselves through air, but airplanes don't have feathers and don't flap their wings. So it's a bit of the same idea.

The history of this goes back to a field that has almost disappeared, or at least changed names, called cybernetics. If you want a specialist on the history of cybernetics, he is sitting right there: Joe Lemelin. Can you raise your hand? Joe is actually a philosopher, and he has a seminar on the history of AI. In what department is this? [Joe: Media, Culture, and Communication.] Media, Culture, and Communication. So, not CS. He knows everything about the history of cybernetics.

So it started in the 40s with two gentlemen: McCulloch and Pitts. Their picture is at the top right here. People at the time were interested in logic, and neuroscience was a very nascent field. They got the idea that if neurons are basically threshold units that are on or off, then by connecting neurons with each other you can build Boolean circuits, and you can basically do logical inference with neurons. So they said: the brain is basically a logical inference machine, because neurons are binary. The idea was that a neuron computes a weighted sum of its inputs and then compares the weighted sum to a threshold. It turns on if it's above the threshold, and turns off if it's below.
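To make that concrete, here is a minimal sketch of a McCulloch-Pitts-style unit in Python. The weights and thresholds below are just illustrative choices, not anything from the original paper:

```python
# A schematic McCulloch-Pitts unit: binary inputs, a weighted sum, a hard threshold.
def mcculloch_pitts(inputs, weights, threshold):
    """Fires (returns 1) iff the weighted sum of the binary inputs reaches the threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# With suitable weights and thresholds, such units act as Boolean gates,
# which is what lets networks of them implement logical inference:
AND = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=2)
OR  = lambda a, b: mcculloch_pitts([a, b], [1, 1], threshold=1)
assert AND(1, 1) == 1 and AND(1, 0) == 0
assert OR(0, 1) == 1 and OR(0, 0) == 0
```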
That is a simplified view of how real neurons work, a very, very simplified view, but the model stuck with the field for decades. Almost four decades. Actually, a full four decades.

Then, quasi-simultaneously, there was Donald Hebb, who had the idea (it's an old idea) that the brain learns by modifying the strength of the connections between neurons, which are called synapses. He had the idea of what's now called Hebbian learning: if two neurons fire together, then the connection that links them increases, and if they don't fire together, maybe it decreases. That's not a complete learning algorithm, but it's a first idea, perhaps.

Then cybernetics was proposed by this guy, Norbert Wiener, who is here [bottom right]. This is the whole idea that by having systems with sensors and actuators, you can have feedback loops, and you can have self-regulating systems. What's the theory behind this? We take it for granted now, but think of something like driving your car: you turn the steering wheel, and a so-called PID controller turns the wheels of the car in proportion. It's a feedback mechanism that measures the position of the steering wheel, measures the position of the wheels of the car, and if there is a difference between the two, corrects the wheels of the car so that they match the orientation of the steering wheel. That's a feedback mechanism. The stability of this, and the rules about it, all come initially from this work.

That led a gentleman by the name of Frank Rosenblatt to imagine learning algorithms that modified the weights of very simple neural nets. What you see here at the bottom, the two pictures here [bottom left]: this is Frank Rosenblatt, and this is the Perceptron. This was a physical analog computer. It was not a three-line Python program, which is what it is now. It was a gigantic machine with wires and optical sensors, so you could show it pictures. It was very low resolution. It had neurons that could compute a weighted sum, and the weights could be adapted. The weights were potentiometers, and the potentiometers had motors on them so they could rotate for the learning algorithm. So it was electro-mechanical. What he's holding in his hand here is a module of eight weights (you can count them), with those motorized potentiometers on them.

Okay, so that was a little bit of history of where neural nets come from. Another interesting piece of history is that this whole idea of trying to build intelligent machines by simulating networks of neurons was born in the 40s, took off a little bit in the late 50s, and completely died in the late 1960s, when people realized that with the kind of learning algorithms and architectures being proposed at the time, you couldn't do much. You could do some very simple pattern recognition, but not much more. So between roughly 1968 and 1984, I would say, basically nobody in the world was working on neural nets, except a few isolated researchers, mostly in Japan. Japan is its own relatively isolated ecosystem for research funding; people there don't follow the same fashions, if you want.
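As an aside, both learning ideas mentioned above fit in a few lines each. This is a schematic sketch in modern NumPy, with made-up conventions, not the historical implementations:

```python
import numpy as np

# Hebbian idea: if pre- and post-synaptic units fire together, strengthen the
# weight between them (weight change proportional to co-activation).
def hebbian_step(W, pre, post, lr=0.01):
    return W + lr * np.outer(post, pre)

# Perceptron rule (error-driven): change the weights only when the threshold
# unit gets the answer wrong, nudging it toward the correct side.
def perceptron_step(w, x, target, lr=1.0):   # targets in {-1, +1}
    pred = 1 if w @ x >= 0 else -1
    if pred != target:
        w = w + lr * target * x
    return w
```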
And then the field took off again in 1985, roughly, with the emergence of backpropagation [backprop]. Backpropagation is an algorithm for training multi-layer neural nets, as many of you know. People were looking for something like this in the 60s and basically didn't find it. And the reason they didn't find it was because they had the wrong neurons: they were using McCulloch-Pitts neurons, which are binary. The way to get backpropagation to work is to use an activation function that is differentiable, or at least continuous. People just didn't have the idea of using continuous neurons, and so they didn't think you could train those systems with gradient descent, because things were not differentiable.

Now, there's another reason for this, which is that if you have a neural net with binary neurons, you never need to compute multiplications. You never need to multiply two numbers; you only need to add them. If a neuron is active, you just add its weight to the weighted sum; if it is inactive, you don't do anything. If you have continuous neurons, you need to multiply the activation of a neuron by a weight to get its contribution to the weighted sum. It turns out that before the 1980s, multiplying two numbers, particularly floating point numbers, on any non-ridiculously-expensive computer was extremely slow. So there was an incentive not to use continuous neurons for that reason. The reason backprop didn't emerge earlier than the mid-80s is pretty much that that's when computers became fast enough to do floating point multiplications. People didn't think of it this way at the time, but retrospectively, that's pretty much what happened.

So there was a wave of interest in neural nets between 1985 and 1995; it lasted about ten years. In 1995 it died again. People in machine learning basically abandoned the idea of using neural nets (for reasons that I'm not going to go into right now), and that lasted until the late 2000s. Around 2009/2010, people realized that you could use multi-layer neural nets trained with backprop and get an improvement in speech recognition. It didn't start with ImageNet; it started with speech recognition, around 2010. Within 18 months of the first papers being published on this, every major player in speech recognition had deployed commercial speech recognition systems that used neural nets. So if you were using any of the speech recognition features on an Android phone around 2012, that used neural nets. That was probably the first really wide deployment of modern forms of deep learning, if you want.

Then at the end of 2012 / early 2013, the same thing happened in computer vision: the computer vision community realized that deep learning, convolutional nets in particular, worked much better than whatever they were using before, started to switch to convolutional nets, and basically abandoned all previous techniques. So that created a second revolution, now in computer vision. And then around 2015/16, the same thing happened in natural language processing: language translation and things like this. And now, it hasn't happened yet, but we're going to see the same revolution occur in things like robotics and control, and a whole bunch of other application areas. But let me get to this. Okay.
So, you all know what supervised learning is, I'm sure. And 90-some percent of applications of deep learning use supervised learning as the main thing. Supervised learning is the process by which you collect a bunch of pairs of inputs and outputs: say, images together with a category (if you want to do image recognition), or audio clips with their text transcriptions, or text in one language with its translation in another language, etc. You feed one example to the machine, and it produces an output. If the output is correct, you don't do anything, or you don't do much. If the output is incorrect, then you tweak the parameters of the machine (think of it as a parametrized function of some kind) in such a way that the output gets closer to the one you want. This is, in non-technical terms, what supervised learning is all about: show a picture of a car, and if the system doesn't say "car", tweak the parameters. The parameters in a neural net are the weights that compute the weighted sums in those simulated neurons. The trick in neural nets is: how do you figure out in which direction, and by how much, to tweak the knobs so that the output gets closer to the one you want? That's what gradient computation and backpropagation are about.

But before we get to this, a little bit of history again. There was a flurry of basic models for classification. Starting with the Perceptron, there was another competing model called the Adaline, which is at the top right here. They are based on the same basic architecture: compute a weighted sum of the inputs and compare it to a threshold; if it's above the threshold, turn on; if it's below, turn off. The Adaline here, the thing that Bernie Widrow is tweaking, is again a physical analog computer. It's like the Perceptron, but much smaller, in many ways.

The reason I tell you about this is that the Perceptron was actually a two-layer neural net, a two-layer neural net in which the second layer was trainable with adaptive weights, but the first layer was fixed. In fact, in most experiments it was determined randomly: you would randomly connect input pixels of an image to threshold neurons with random weights, essentially. This is what they called the associative layer. And that became the basis for the conceptual design of pattern recognition systems for the next four decades. I want to say four decades. Yeah, pretty much.

That model is one by which you take an input and run it through a feature extractor that is supposed to extract the relevant characteristics of the input that will be useful for the task. You want to recognize a face? Can you detect an eye? How do you detect an eye? Well, there is probably a dark circle somewhere. Things like that, right? You want to recognize a car? Well, there are kind of dark, round things, etc. What this feature extractor produces is a vector of features, which may be numbers, or may be on or off. So it's just a list of numbers, a vector. And you feed that vector to a trainable classifier.
In the case of the Perceptron or a simple neural net, the trainable classifier is just a system that computes a weighted sum and compares it to a threshold. The problem is that you have to engineer the feature extractor. The entire literature of pattern recognition (statistical pattern recognition, at least), and a lot of computer vision (at least the part of computer vision that's interested in recognition), was focused on this part: the feature extractor. How do you design a feature extractor for a particular problem? Say you want to do Hangul character recognition. What are the relevant features for recognizing Hangul, and how can you extract them using all kinds of algorithmic tricks? How do you pre-process the images? How do you normalize their size? How do you skeletonize the characters? How do you segment them from their background? The entire literature was devoted to this [feature extractor]; very, very little was devoted to that [trainable classifier].

What deep learning brought to the table is the idea that, instead of this two-stage process for pattern recognition, where one stage is built by hand and the representation of the input is the result of some hand-engineered program, you learn the entire task end-to-end. You build your pattern recognition system, or whatever it is you want to build, as a cascade, a sequence of modules. All of those modules have tunable parameters, and all of them have some sort of nonlinearity in them. And then you stack multiple layers of them, which is why it's called deep learning. The only reason for the word "deep" in deep learning is the fact that there are multiple layers. There is nothing more to it. And then you train the entire thing end-to-end. The complication here, of course, is: how do you know how to tune the parameters in the first box so that the output goes closer to the output you want? That's what backpropagation does for you.

Okay, why do all those modules have to be nonlinear? Because if you have two successive modules and they're both linear, you can collapse them into a single linear module. The composition of two linear functions is a linear function: take a vector, multiply it by a matrix, and then multiply the result by a second matrix; it's as if you had pre-computed the product of those two matrices and multiplied the input vector by that composite matrix. So there's no point in having multiple layers if those layers are linear. (Okay, there's actually a point, but it's a minor point.)

So, since the layers have to be nonlinear, what is the simplest multi-layer architecture you can imagine that has parameters you can tune (things like the weights in a neural net) and is nonlinear? You realize quickly that it has to look something like this. Take an input. An input can be represented as a vector: an image is just a list of numbers, so think of it as a vector and ignore the fact that it's an image for now; a piece of audio, or whatever your sensors or your data set give you, is a vector. Multiply this vector by a matrix. The coefficients in this matrix are the tunable parameters. Then take the resulting vector (when you multiply a matrix by a vector, you get a vector) and pass each of its components through a nonlinear function.
And if you want the simplest possible nonlinear function, use something like what's shown at the top here [ReLU(x) = max(x, 0)], which people in neural nets call the ReLU; people in engineering call it half-wave rectification; people in math call it the positive part. Whatever you want to call it. Apply this nonlinear function to every component of the vector that results from multiplying the input vector by the matrix. Now you get a new vector, which has lots of zeros in it, because whenever the weighted sum was less than zero, the output of the ReLU is zero. Then repeat the process: take that vector, multiply it by a weight matrix, pass the result through the pointwise nonlinearity; multiply the result by another matrix, pass it through nonlinearities again. That's a basic neural net, essentially.

Now, why is that called a neural net at all? Because when you multiply a vector by a matrix, to compute each component of the output you actually compute a weighted sum of the components of the input with a corresponding row of the matrix. So this little symbol here: there's a bunch of components of the vector coming into this layer, and you take a row of the matrix and compute a weighted sum of those values, where the weights are the entries of that row. You do this for every row, and that gives you the result. So the number of units after the multiplication by a matrix is equal to the number of rows of your matrix, and the number of columns of the matrix, of course, has to be equal to the size of the input.

Okay. Supervised learning, in slightly more formal terms than what I showed earlier, is the idea that you compare the output the system produces with a target output. You show an input, you run it through the neural net, you get an output. You compare this output with a target output using an objective function, a loss module, that computes a distance, a discrepancy, a penalty, a divergence: there are various names for it. The output of this cost function is a scalar. It computes, for example, the Euclidean distance between a target vector and the vector that the deep learning system produces. Then you compute the average of this cost over a training set: a training set is composed of a bunch of pairs of inputs and outputs, and the function you want to minimize with respect to the parameters of the system (the tunable knobs) is that average. You want to find the value of the parameters that minimizes the error between the output you want and the output you get, averaged over a training set of samples.

I'm sure the vast majority of people here have at least an intuitive understanding of what gradient descent is. Basically, the way to minimize this is to compute the gradient. It's like you are lost in the mountains (and they are very smooth mountains), there is fog, it's night, and you want to get to the village in the valley. The best thing you can do is turn around, see which way is down, and take a step down in the direction of steepest descent. Okay.
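Putting these pieces together, here is a minimal PyTorch sketch of exactly this recipe: two weight matrices with a pointwise ReLU between them, an average squared error over the training set, and repeated downhill steps. All the sizes and the random data are made up for illustration. (Replacing the full-set average with one sample, or a small batch per step, gives the stochastic variant discussed below.)

```python
import torch

torch.manual_seed(0)

# A toy training set: 100 pairs of inputs (20-dimensional) and targets (5-dimensional).
X = torch.randn(100, 20)
Y = torch.randn(100, 5)

# Two weight matrices = a two-layer net: matrix multiply, pointwise ReLU, matrix multiply.
W1 = torch.randn(20, 64, requires_grad=True)
W2 = torch.randn(64, 5, requires_grad=True)

lr = 1e-3
for step in range(100):
    hidden = torch.relu(X @ W1)       # weighted sums, then the pointwise nonlinearity
    out = hidden @ W2                 # the network's output for every training sample
    loss = ((out - Y) ** 2).mean()    # average squared distance over the training set
    loss.backward()                   # backpropagation computes the gradient
    with torch.no_grad():             # a step "down the mountain": follow -gradient
        W1 -= lr * W1.grad
        W2 -= lr * W2.grad
        W1.grad.zero_()
        W2.grad.zero_()
```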
So, this search for the direction that goes down is called computing the gradient (technically, the negative gradient). Then you take a step down: that's taking a step in the direction of the negative gradient. If you keep doing this, and your steps are small enough that you don't jump to the other side of the mountain when you take a step, then eventually you converge to the valley, if the valley is convex. Convex means there are no mountain lakes in the middle: no local minimum where you get stuck even though the valley is lower, but you don't see it. That's why convexity is important as a concept.

But here is another concept: stochastic gradient, which again I'm sure a lot of you have heard of. I'll come back to it in more detail. The objective function you're computing is an average over many, many samples. You can compute the objective function and its gradient over the entire training set by averaging over all the samples. But it turns out to be more efficient to just take one sample, or a small group of samples, compute the error on that sample, compute the gradient of that error with respect to the parameters, and take a small step. Then a new sample comes in, you get another value for the error and another value for the gradient, which may point in a different direction because it's a different sample, and you take a step in that direction. If you keep doing this, you go down the cost surface, but in a noisy way: there are a lot of fluctuations. What is shown here is an example of this: stochastic gradient applied to a very simple two-dimensional problem, where you only have two weights. It looks semi-periodic because the examples are always shown in the same order, which is not what you're supposed to do with stochastic gradient. But as you can see, the path is really erratic.

Why do people use this? There are various reasons. One reason is that, empirically, it converges much, much faster, particularly if you have a very large training set. The other reason is that you actually get better generalization in the end. I assume you all know the concepts of training set, test set, and validation set. If you measure the performance of the system on a separate set, you generally get better generalization with stochastic gradient than with the real, true gradient descent.

Yes? [inaudible student question] No, it's worse. Computing the gradient over the entire dataset is computationally feasible; I mean, you can do it. [inaudible student comment] It will be less noisy, but it will be slower. Let me tell you why. This is something we're going to talk about again when we discuss optimization, but let me tell you now: I give you a training set with a million training samples. It's actually 100 repetitions of the same 10,000 samples: my actual sample is 10,000 training examples, I replicate it 100 times, I scramble the order, and I tell you, here is my training set with a million samples. If you do a full gradient, you're going to compute the same values a hundred times.
You're going to spend a hundred times more work than necessary, without knowing it. Okay, this example only works because of the repetition. But the same thing happens in more normal machine learning situations, where the samples have a lot of redundancy in them: many samples are very similar to each other, etc. If there is any kind of coherence, if your system is capable of generalization, then stochastic gradient is going to be more efficient, because if you don't use stochastic gradient, you're not going to be able to take advantage of that redundancy. So that's one case where noise is good for you.

Okay, don't pay attention to the formula, and don't get scared, because we're going to come back to this in more detail. But why is backpropagation called backpropagation? Again, very informally, it's basically a practical application of the chain rule. You can think of a neural net of the type I showed you earlier as a bunch of modules stacked on top of each other, and you can think of this as a composition of functions. You all know the basic rule of calculus for the derivative of a function composed with another function: the derivative of f composed with g at x is the derivative of f at the point g(x), multiplied by the derivative of g at the point x. You get the product of the two derivatives. Here it's the same thing, except that the functions, instead of being scalar functions, are vector functions: they take vectors as inputs and produce vectors as outputs. (More generally, they take multi-dimensional arrays as input and produce multi-dimensional arrays as output, but that doesn't matter.) So what is the generalization of the chain rule to functional modules with multiple inputs and multiple outputs? If you apply it blindly, it's the same rule as for regular derivatives, except that you have to use partial derivatives. What you see in the end is that if you want the derivative of your objective function (the difference between the output you want and the output you get) with respect to any variable inside the network, you have to propagate derivatives backwards and multiply things along the way. We'll be much more formal about this next week. For now, you just know why it's called backpropagation: the derivatives propagate backwards through the multiple layers.

OK, so the picture I showed earlier of this neural net is nice, but what if the input is actually an image? An image, even a relatively low-resolution one, is typically a few hundred pixels on a side. Let's say 256 x 256, to take a random example: a car image, 256 x 256. It's got 65,536 pixels, times three, because each pixel has R, G, and B components: three values per pixel. So that's roughly two hundred thousand values, and your input vector has two hundred thousand components. If you have a matrix that is going to multiply this vector, this matrix has to have two hundred thousand rows. Columns, I'm sorry. And depending on how many units you have in the first layer, it's going to be 200,000 by some possibly large number. That's a huge matrix, even if it's 200,000 x 100,000.
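Here is that arithmetic made concrete (the 100,000-unit first layer is just the number used as an example above):

```python
# A 256 x 256 RGB image, flattened into a vector, fully connected to a first layer.
inputs  = 256 * 256 * 3        # 196,608 values: roughly two hundred thousand
units   = 100_000              # an illustrative first-layer width
weights = inputs * units       # one weight per (input, unit) pair in a full matrix
print(f"{inputs:,} inputs x {units:,} units = {weights:,} weights")
# 196,608 inputs x 100,000 units = 19,660,800,000 weights: about twenty billion
```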
Even with compression in the first layer, that's already a very, very large matrix: billions of weights. So it's not really practical to think of this as a full matrix. If you want to deal with things like images, you have to make some hypothesis about the structure of this matrix, so that it's not a completely full matrix that connects everything to everything. That would be impractical, at least for a lot of practical applications.

This is where inspiration from the brain comes back. There was classic work in neuroscience in the 1960s by the gentlemen at the top here: Hubel and Wiesel. They actually won a Nobel Prize for this in 1981, but their work is from the late 50s / early 60s. What they did was poke electrodes into the visual cortex of various animals: cats, monkeys, mice, whatever. (I think they liked cats a lot.) And they tried to figure out what the neurons in the visual cortex were doing. What they discovered was... well, first of all, this is a human brain, and this chart is from much later, but all mammalian visual systems are organized in a similar way. You have signals coming into your eyes, striking your retina. There are a few layers of neurons in your retina, in front of your photoreceptors, that pre-process the signal, if you want. They compress it, because the human eye has something like a hundred million photoreceptors: it's like a hundred-megapixel camera. But you cannot have a hundred million fibers coming out of your eyes, because otherwise your optic nerve would be this big, and you wouldn't be able to move your eyes. So those neurons in front of your retina do compression. They don't do JPEG compression, but they do compression, so that the signal can be squeezed into one million fibers. You have one million fibers coming out of each of your eyes, and that makes your optic nerve about this big, which means you can carry the signal and still turn your eyes.

This is actually a mistake that evolution made for vertebrates; invertebrates are not like that. It's a big mistake because the neurons that process the signal, and the wires collecting the information from your retina, sit in front of your retina, blocking part of the view, if you want. And then the wires have to punch a hole through your retina to get to your brain. So there's a blind spot in your visual field, because that's where your optic nerve punches through your retina. It's kind of ridiculous, if you build a camera, to have the wires come out the front and then dig a hole in your sensor to route the wires back. It's much better if the wires come out the back, right? Vertebrates got that wrong; invertebrates got it right. Squid and octopus actually have the wires coming out the back. They're much luckier. But anyway.

So, the signal goes from your eyes to a little piece of brain called the lateral geniculate nucleus, which is actually under your brain, at the base of it. It does a little bit of contrast normalization; we'll talk about this again in a few lectures. And then the signal goes to the back of your brain, where the primary visual cortex, the area called V1, is. It's called V1 in humans.
And there's something called the ventral hierarchy: V1, V2, V4, IT, a series of brain areas going from the back toward the side. And in the infero-temporal (IT) cortex, right here, is where object categories are represented. When you walk around and you see your grandmother, you have a bunch of neurons firing in this area that represent your grandmother. And it doesn't matter what your grandmother is wearing, what position she is in, whether there is occlusion or whatever: those neurons fire if you see your grandmother. So it represents category-level things. Those things have been discovered through experiments with patients who had to have their skull open for a few weeks, where people poked electrodes, had them watch movies, and realized: this neuron turns on if Jennifer Aniston is in the movie, and it only turns on for Jennifer Aniston.

Along with the idea that the visual cortex can do pattern recognition and seems to have this hierarchical, multi-layer structure, there is also the idea that the visual process is essentially a feed-forward process. The process by which you recognize an everyday object is very fast: it takes about 100 milliseconds. That's barely enough time for the signal to go from your retina to the infero-temporal cortex. There's a delay of a few milliseconds per neuron you have to go through, so in 100 milliseconds you barely have time for a few spikes to traverse the entire system. There's no time for recurrent connections and so on. That doesn't mean there are no recurrent connections; there are tons of them, but somehow fast recognition is done without them. This is called the feed-forward ventral pathway.

And this gentleman here, Kunihiko Fukushima, had the idea in the 70s of taking inspiration from Hubel and Wiesel and building a neural net model on the computer. It had layers, but it also used another idea Hubel and Wiesel discovered: that individual neurons only react to a small part of the visual field. They poked electrodes into neurons in V1 and realized that a given neuron in V1 only reacts to motifs that appear in a very small area of the visual field, and the neuron next to it reacts to another area next to the first one. The neurons seem to be organized in what's called a retinotopic way, which means that neighboring neurons react to neighboring regions of the visual field. What they also realized is that there are groups of neurons that all react to the same area of the visual field, and they seem to turn on for edges at particular orientations. One neuron turns on if its receptive field contains a vertical edge, the one next to it if the edge is a little slanted, the one next to that if the edge is rotated a bit more, etc. So they had this picture of V1 as orientation selectivity: neurons that look at a local receptive field and react to particular orientations, with those groups of orientation-selective neurons replicated over the entire visual field.

So this guy Fukushima said: well, why don't I build a neural net that does this? I'm not necessarily going to insist that my system extracts oriented features, but I'm going to use some sort of unsupervised learning algorithm to train it. So he was not training his system end-to-end.
He was training it layer by layer, in some sort of unsupervised fashion, which I'm not going to go into the details of. He used the concept that those neurons are replicated across the visual field, and he used another concept from Hubel and Wiesel called complex cells. Complex cells are units that pool the activities of a bunch of simple cells, which are those orientation-selective units. And they pool them in such a way that if an oriented edge is moved a little bit, it will activate different simple cells, but the complex cell, since it integrates the outputs of all those simple cells, stays activated until the edge moves beyond its receptive field. So complex cells build a little bit of shift invariance into the representation: you can shift an edge a little bit, and it will not change the activity of those complex cells. That's what we now call convolution and pooling in the context of convolutional nets.

And that is basically what led me, in the mid-to-late 80s, to come up with convolutional nets. They are networks where the connections are local and replicated across the visual field, and where you intersperse feature detection layers, which detect those local features, with pooling operations. We'll talk about this at length in three weeks, so I'm not going to go into every detail. But it recycles this idea from Hubel and Wiesel and Fukushima that (if I can get my pointer...) every neuron in one layer computes a weighted sum over a small area of the input, and those weights are replicated: every neuron in a layer uses the same set of weights. This is the idea of weight tying, or weight sharing.

Using backprop, we were able to train neural nets like this to recognize handwritten digits. This is from the late 80s / early 90s. And this is me when I was about your age, maybe a little older; I'm about thirty in this video. This is my phone number from when I was working at Bell Labs. It doesn't work anymore; it's a New Jersey number. I hit a key, and there is this neural net running on a 386 PC with a special accelerator card, recognizing those characters, running a network very similar to the one I just showed you the animation of. And the thing could recognize characters of any style, including very strange styles, including even stranger styles. This was new at the time, because this was back when character recognition, and pattern recognition in general, still followed the model of: extract features, then train a classifier on top. This could learn the entire task end-to-end: the first few layers of the neural net play the role of a feature extractor, but they are trained from data.

The reason we used character recognition is that it was the only thing for which we had data. The only tasks for which there was enough data were character recognition and speech recognition. (The speech recognition experiments were somewhat successful, but not as much.) Pretty quickly, we realized we could use those convolutional nets not just to recognize individual characters, but to recognize groups of characters, multiple characters at a time.
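In code, the convolution-plus-pooling recipe just described looks something like the following minimal PyTorch sketch. The layer sizes here are illustrative, not the actual networks from the 80s and 90s:

```python
import torch
import torch.nn as nn

# Convolution: local weighted sums with the *same* weights replicated across the
# image (weight sharing, the "simple cells"). Pooling: merging nearby activations,
# so small shifts of a motif barely change the output (the "complex cells").
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5),    # 8 local feature detectors swept over the image
    nn.ReLU(),
    nn.MaxPool2d(2),                   # a little shift invariance
    nn.Conv2d(8, 16, kernel_size=5),   # features of features, on a coarser grid
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 10),         # a classifier on top, e.g. for the 10 digits
)

x = torch.randn(1, 1, 28, 28)          # one 28 x 28 grayscale image, digit-sized
print(net(x).shape)                    # torch.Size([1, 10])
```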
It's the convolutional nature of the network, which I'll come back to in three lectures, that allows those systems to be applied to a large image: they just turn on whenever a shape they can recognize appears in their field of view. Basically, you train a convolutional net that has a small input window, you swipe it over the entire image, and whenever it turns on, it means it has detected the object you trained it to detect. So here the system is capable of doing simultaneous segmentation and recognition. Before that, people in pattern recognition would have an explicit program that would separate individual objects from their background and from each other, and then send each individual object (each character, for example) to a recognizer. With this, you can do both at the same time. You don't have to worry about it; you don't have to build any special program for it.

In particular, this could be applied to natural images for things like face detection and pedestrian detection. Same thing: train a convolutional net to distinguish between an image with a face and an image without a face, train it with several thousand examples, then take that window and swipe it over an image: whenever it turns on, there is a face. Of course, the face could be bigger than the window, so you sub-sample the image (make it smaller), swipe your network again, make it smaller again, swipe your network again. Now you can detect faces regardless of size.

In particular, you can use this to drive robots. These are things that were done before deep learning became popular. This is an example where the network is a convolutional net applied to the image coming from a camera on a running robot. It's trying to classify every small window (40 x 40 pixels or so, even less) as to whether the central pixel of that window is on the ground or on an obstacle. Whatever it classifies as ground is green; whatever it classifies as an obstacle is red, or purple if it's at the foot of the obstacle. Then you can project this into a map, which you see at the top, do planning in this map to reach a particular goal, and use that to navigate. These are two former PhD students, Raia Hadsell on the right and Pierre Sermanet on the left, who are annoying this poor robot. They were pretty confident the robot was not going to break their legs, since they actually wrote the code and trained it. Pierre Sermanet is a research scientist at Google Brain in California, working on robotics. Raia Hadsell is Director of Robotics research at DeepMind. They did pretty well.

A similar idea can be used for what's called semantic segmentation. Semantic segmentation is the idea that, again with this kind of sliding window approach, you can train a convolutional net to classify the central pixel using a window as context. But here it's not just trained to separate obstacles from non-obstacles; it's trained to classify something like 30 categories. This is down Washington Place, I think. This is Washington Square Park.
And it knows about roads, and people, and plants, and trees, and whatever, but it finds desert in the middle of Washington Square Park, which is not... there's no beach that I'm aware of. So it's not perfect. At the time it was state of the art, though; that was the best system there was for this kind of semantic segmentation. I was running around giving talks, trying to evangelize people about deep learning back then. This was around 2010, before the deep learning revolution, if you want.

And one person, a professor from Israel, was sitting in one of my talks. He's a theoretician, but he was really transfixed by the potential applications of this, and he was just about to take a sabbatical and work for a company called Mobileye, which was a start-up in Israel at the time, working on autonomous driving. A couple of months after he heard my talk, he started working at Mobileye. He told the Mobileye people: "You should try this convolutional net stuff. It works really well." And the engineers there said: "Nah, we don't believe in that stuff. We have our own methods." So he implemented it and tried it himself, and beat the hell out of all the benchmarks they had. All of a sudden, the whole company switched to using convolutional nets. And they were the first company to come up with a vision system for cars that can keep a car in its lane on the highway, and can brake if there is a pedestrian or cyclist crossing. I'll come back to this in a minute. They were basically using this technique: semantic segmentation, very similar to what I showed for the robot before. This was a guy by the name of Shai Shalev-Shwartz.

You should also be aware that back in the 80s, people were really interested in implementing special types of hardware that could run neural nets really fast. These are a few examples of neural net chips that were actually built. I had to do with some of them, and they were built by people working in the same group as I was at Bell Labs in New Jersey. This was a hot topic in the 1980s, and then of course, with the interest in neural nets dying in the mid-90s, people stopped working on it, until a few years ago. Now the hottest topic in chip design, in the chip industry, is neural net accelerators. You go to any conference on computer architecture or chips, like ISSCC, the big solid-state circuits conference: half the talks are about neural net accelerators. And I worked on a few of those things.

OK, so then something happened, as I told you, around 2010, 2013, 2015, in speech recognition, image recognition, and natural language processing, and it's continuing; we're in the middle of it now for other topics. What happened (and I'm really sad to say it didn't happen in my lab, but with our friends): back in the early 2000s, with Yoshua Bengio and Geoff Hinton, we knew that deep learning was working really well, and we knew that the whole community was making a mistake by dismissing neural nets and deep learning. We didn't use the term deep learning yet; we invented it a few years later. So around 2003 or 2004, we started kind of a conspiracy, if you want. We got together and said we were going to try to beat some records on some data sets, and invent some new algorithms that would allow us to train very large neural nets.
And we would collect very large data sets, so that we could show the world that those things really work, because nobody really believed it. That succeeded beyond our wildest dreams. In particular, in 2012 Geoff Hinton had a student, Alex Krizhevsky, who spent a lot of time implementing convolutional nets on GPUs, which were kind of new at the time; they were not entirely new, but they were starting to become really high-performance. He was very good at hacking that, and so they were able to train much larger convolutional nets than anybody had before. They used it to train on the ImageNet dataset. The ImageNet dataset is a bunch of natural photos, and the system is supposed to recognize the main object in each photo among 1,000 different categories. The training set had 1.3 million samples, which is kind of large. So what they did was build a really large and very deep convolutional net, pretty much on the same model as what we had before, implemented on GPUs, and let it run for a couple of weeks. And with that, they beat the performance of the best competing systems by a large margin.

This is the error rate on ImageNet, going back to 2010. In 2010 it was about 28% error, top-5: basically, you get an error if the correct category is not in the top 5 among the 1,000, so it's kind of a mild measure of error. In 2011 it was 25.8%; the system that achieved this was actually very, very large. It was somewhat convolutional-net-like, but it wasn't trained; I mean, only the last layer was trained. And then Geoff and his team got it down to 16.4%, and that was a watershed moment for the computer vision community. A lot of people said: okay, now we know that this thing works. The whole community went from basically refusing every paper that had neural nets in it in 2011 and 2012, to refusing every paper that does not have a convolutional net in it in 2016. So now it's the new religion: you can't get a paper into a computer vision conference unless you use ConvNets somehow. And the error rate went down really quickly, as people found all kinds of really cute architectural tricks that made those things work better.

What you also see there is an inflation in the number of layers. My convolutional nets from the 90s and early 2000s had 7 layers or so. Then AlexNet had, I don't know, 12. Then VGG, a year after that, had 19. GoogLeNet had I don't know how many, because it's hard to figure out how you count. And the workhorse of object recognition now, the standard backbone as people call it, has 50 layers: it's called ResNet-50. But some networks have 100 layers or so.

Alfredo, a few years ago, put together this chart, where each blob is a particular network architecture. The x-axis is the number of billions of operations you need to perform to compute the output (those things are really big: billions of connections). The y-axis is the top-1 accuracy on ImageNet, so it's not the same measure of performance as the one I showed you before; the best systems today are around 84. And the size of each blob is the memory occupancy: the number of millions of floats you need to store the weight values.
Now, people are very smart about compressing those networks, quantizing the weights, and there are entire teams at Google, Facebook, and various other places that only work on optimizing those networks and compressing them so they can run fast. Because, to give you just a rough idea, the number of times Facebook, for example, runs a convolutional net on its servers per day is in the tens of billions. So there's a huge incentive to optimize the amount of computation necessary for this.

So one reason why convolutional nets are being so successful is that they exploit a property of natural data, which is compositionality. Compositionality is the property by which a scene is composed of objects; objects are composed of parts; parts are composed of sub-parts; sub-parts are combinations of motifs; and motifs are combinations of contours or edges, or textures. And those are just combinations of pixels. So there's this so-called compositional hierarchy: particular combinations of objects at one layer in the hierarchy form the objects at the next layer. And if you mimic this compositional hierarchy in the architecture of the network, and you let it learn the appropriate combinations of features at one layer that form the features of the next layer, that's really what deep learning is: learning to represent the world and exploit the structure of the world, the structure being the fact that there is organization in the world because the world is compositional. A statistician by the name of Stuart Geman, who is at Brown University, was kind of playing on the famous Einstein quote. Einstein said: the most incomprehensible thing about the world is that the world is comprehensible. Among all the things the world could be, it could be extremely complicated, so complicated that we'd have no way of understanding it, and it looks like a conspiracy that we are able to understand at least part of it. Stuart Geman's version of this is that the world is compositional... or there is a God. (Because you'd need supernatural help to understand it if the world were not compositional.)

So this has led to incredible progress in things like computer vision: from being able to unreliably detect people, to being able to generate accurate masks for every object, and then even to figure out the pose, and to do this in real time on a mobile platform. I mean, the progress has been nothing short of incredible, and most of those things are based on two basic families of architectures. There are the so-called one-pass object detection/recognition architectures, called RetinaNet or feature pyramid networks (there are various names for them), or U-Net. Then there's another type called Mask R-CNN. Both of them actually originated from Facebook, or the people who originated them are now at Facebook; they sometimes came up with them before they came to Facebook. Those things work really well: they can detect objects that are partially occluded, and draw a mask of every object. So basically, this is a convolutional net where the input is an image, but the output is also an image. In fact, the output is a whole bunch of images, one per category, and for each category it outputs the mask of the objects from that category.
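To make that input-image/output-images idea concrete, here is a toy fully convolutional net in PyTorch whose output has one score map per category at the same spatial resolution as the input. The layer widths and the 21-category count are arbitrary choices for the sketch, not the actual architectures from the slides.

```python
import torch
import torch.nn as nn

# Toy fully convolutional net: the input is an image, the output is
# one score map per category, with the same spatial size as the input.
num_classes = 21  # arbitrary choice for illustration
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, num_classes, kernel_size=1),   # per-pixel class scores
)

x = torch.randn(1, 3, 128, 128)   # one 128x128 RGB image
masks = net(x)                    # shape (1, 21, 128, 128)
labels = masks.argmax(dim=1)      # per-pixel predicted category
print(masks.shape, labels.shape)
```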
Those things can also do what's called "instance segmentation." So if you have a whole bunch of sheep, it can tell you not just that this region is sheep, but actually pick out the individual sheep and tell them apart, and it will count the sheep and fall asleep. That's what you're supposed to do, right: to fall asleep, you count sheep. And the cool thing about deep learning is that a lot of the community has embraced the whole concept that research has to be done in the open. So a lot of the stuff we're going to be talking about in the class, as you probably know, is not just published, but published with code. And it's not just code; it's actually pre-trained models that you can just download and run. All open source, all free to use. That's really new. People didn't use to do research this way, particularly in industry, but even in academia people weren't used to distributing their code. Deep learning, or the race it set off, has somehow driven people to be more open about research.

So there are a lot of applications of all this, as I said, like self-driving cars. This is actually a video from Mobileye, and Mobileye was pretty early in using convolutional nets for autonomous driving, to the point that in 2015 they had managed to shoehorn a convolutional net onto a chip that they had designed for some other purpose. And they licensed the technology to Tesla. So the first self-driving Teslas, I mean not really self-driving, they have driving assistance, right, they can keep in lane on the highway and change lanes, had this Mobileye system. And that's pretty cool. So that's a convolutional net. It's a little chip that looks out the windshield from just behind the rear-view mirror. Since then, that was four or five years ago, this kind of technology has been very widely deployed by a lot of different companies. Mobileye has since been bought by Intel, and they have something like 70 or 80 percent of the market for those vision systems, but there are a lot of companies and car manufacturers that use those things. In fact, in some European countries, every single car that comes out, even low-end cars, has a convolutional-net-based vision system. They call this advanced emergency braking, or automated emergency braking; AEBS is deployed in every new car in France, for example. It reduces collisions by 40%. Not every car on the road has it yet, because people keep their cars for a long time, but what that means is that it saves lives. So that's a very positive application of deep learning.

Another big category of applications, of course, is medical imaging. It's probably the hottest topic in radiology these days: how to use AI (which means convolutional nets) for radiology. This [slide image] is lifted from a paper by some of our colleagues here at NYU, where they analyzed MRI images. There's one big advantage to convolutional nets: they don't need to look at a screen to look at an MRI. In particular, to be able to look at an MRI, they don't have to slice it into 2D images. They can look at the entire 3D volume. That's one property this system uses: it's a 3D convolutional net that looks at the entire volume of an MRI image, and it uses a technique very similar to the one I was showing before for semantic segmentation.
And it basically turns on, in the output image, wherever there is, in this case, a femur bone, but it could be other things; this is the kind of result it produces. It works better in 3D than on 2D slices. Or it can turn on when it detects a malignant tumor in mammograms. (That one is 2D, not 3D.) And there are various other projects in medical imaging going on. Lots of applications in science and physics, bioinformatics, whatever, which we'll come back to.

OK. So there are a bunch of mysteries in deep learning. They're not complete mysteries, because people have some understanding of all this, but they are mysteries in the sense that we don't have a nice theory for everything. Why do they work so well? One big question that theoreticians were asking many years ago, when I was trying to convince the world that deep learning was a good idea, was: well, you can approximate any function with just two layers, so why do you need more? And I'll come back to this in a minute. What's so special about convolutional nets? I talked about the compositionality of natural images, or natural data in general; this is true for speech and other natural signals as well. But it seems a little contrived. How is it that we can train the system, despite the fact that the objective function we're minimizing is very non-convex, so we may have lots of local minima? This was a big criticism that people were throwing at neural nets back in the old days, people who'd never played with neural nets. They'd say: you have no guarantee that your algorithm will converge; it's too scary, I'm not going to use it. And the last one is: why is it that the way we train neural nets breaks everything that every textbook in statistics tells you? Every textbook in statistics tells you that if you have n data points, you shouldn't have more than n parameters, because you're going to overfit like crazy. You might regularize; if you're a Bayesian, you might throw in a prior (which is equivalent). But what guarantee do you have? Neural nets are wildly over-parameterized. We routinely train neural nets with hundreds of millions of parameters; they're used in production. And the number of training samples is nowhere near that. How does that work? But it works!

OK, so things we can do with deep learning today: we can have safer cars; we can have better medical image analysis systems; we can have pretty good language translation, far from perfect, but useful; stupid chatbots; very good information search, retrieval, and filtering. Google and Facebook nowadays are completely built around deep learning; take deep learning out of them and they crumble. And there are lots of applications in energy management and production, manufacturing, environmental protection, all kinds of stuff. But we don't have really intelligent machines. We don't have machines with common sense. We don't have intelligent personal assistants. We don't have smart chatbots. We don't have household robots. There are a lot of things we don't know how to do, which is why we still do research. OK, so deep learning is really about learning representations. But then we should know what representations are. So I talked about the traditional model of pattern recognition.
Representation is really about this: you have your raw data, and you want to turn it into a form that is useful somehow. Ideally, you'd like to turn it into a form that's useful regardless of what you want to do with it, "useful" in a general way, though it's not entirely clear what that means. But, at least, you want to turn it into a representation that's useful for the task you are envisioning. And there have been a lot of ideas over the decades on general ways to pre-process natural data so as to produce good representations of it. I'm not going to go through the details of this laundry list, but it includes things like tiling the space and doing random projections. Random projection is a kind of monster that rears its head periodically, like every five years, and you have to whack it on the head every time it pops up. That was the idea behind the Perceptron: the first layer of a Perceptron is a layer of random projections. What does that mean? A random projection is a random matrix, which has a smaller output dimension than input dimension, with some sort of non-linearity at the end. So think of a single-layer neural net with nonlinearities, but where the weights are random: you can think of that as random projections. A lot of people are rediscovering that wheel periodically, claiming that it's great because you don't have to do multi-layer training. It started with the Perceptron, then it came back in the 60s, and again in the 1980s, and again after that. And now it's come back once more: there's a whole community, mostly in Asia, that calls two-layer neural nets where the first layer is random "extreme learning machines." It's ridiculous, but it exists. They're not "extreme"; I mean, they're extremely stupid, but, you know.

Right, so I was mentioning the compositionality of the world: from pixels to edges to textons, motifs, parts, objects. In text you have characters, words, word groups, clauses, sentences, stories. In speech it's the same: you have individual samples, spectral bands, sounds, phones, phonemes, words, etc. You always have this kind of hierarchy.

OK, so here are many attempts at dismissing the whole idea of deep learning. First thing. These are things I've heard for decades, mostly from theoreticians, but from a lot of people, and you have to know about them because they're going to come back in five years when people say, "Oh, deep learning sucks." Why not use support-vector machines? So, here are support-vector machines, on the top left. I'm sure many of you have heard about kernel machines and support-vector machines. Who knows what this is? I mean, even a rough idea. OK, a few hands. Who has no idea what a support-vector machine is? Don't be shy. Yeah, it's OK if you don't. OK, most people haven't raised their hands for either. [Alfredo: Hands up, please, who knows support-vector machines.] OK, come on, all the way up. Cool, all right. Who has no idea what it is? Don't be shy, it's okay. [Alfredo: inaudible] All right, good. Right, so here's the support-vector machine. A support-vector machine is a two-layer neural net. It's not really a neural net, and people don't like it when it's formulated this way, but you really can think of it this way.
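An aside before unpacking the SVM: here is a minimal sketch of the random-projection recipe from a moment ago, in the "extreme learning machine" style, where the first layer is a fixed random matrix with a nonlinearity and only the second, linear layer is trained. The sizes, the tanh nonlinearity, the target function, and the closed-form least-squares fit are all illustrative assumptions of mine.

```python
import torch

torch.manual_seed(0)
n, d_in, d_hidden = 200, 10, 100
X = torch.randn(n, d_in)
y = torch.sin(X.sum(dim=1, keepdim=True))  # some arbitrary target function

W0 = torch.randn(d_in, d_hidden)       # first layer: random, never trained
H = torch.tanh(X @ W0)                 # random nonlinear features
w = torch.linalg.lstsq(H, y).solution  # second layer: solved in closed form
print(((H @ w - y) ** 2).mean())       # training error of the random-feature fit
```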
So: the support-vector machine is a two-layer neural net where, in the first layer, symbolized by this function K here, each unit compares the input vector X to one of the training samples X^i. You take your training samples, say you have a thousand of them, so you have a thousand X^i's, for i = 1 to 1,000, and you have some function K that compares X and X^i. A good example of such a function: you take the distance between X and X^i and pass it through something like an exponential of minus the square, so you get a Gaussian response as a function of the distance between X and X^i. It's a way of comparing two vectors; it doesn't matter exactly what it is. Then you take the scores coming out of this K function, which compares the input to every sample, and you compute a weighted sum of them. What you learn are the weights, the alphas. So it's a two-layer neural net in which the second layer is trainable and the first layer is fixed. But in a way you can think of the first layer as being trained in an unsupervised manner, because it uses the data from the training set, but only the X's, not the Y's. It uses the data in the stupidest way you can imagine: you store every X and use every single X as the weights of a unit, if you want. That's what a support-vector machine is. You can write a thousand-page book about the cute mathematics behind it, but the bottom line is that it's a two-layer neural net where the first layer is trained in a very stupid, unsupervised way, and the second layer is just a linear classifier. So it's basically glorified template matching, because it compares the input vector to all the training samples. And so it doesn't work if you want to do, say, computer vision with raw images. If X is an image and the X^i's are a million images from ImageNet, then for every image you're going to have to compare it with a million images, or maybe a little fewer if you're smart about how you train it. That's going to be very expensive, and the kind of comparison you're making is basically what solves the problem; the weighted sum you compute at the end is really the cherry on the cake. I use that analogy too often, actually. So... there are theorems showing that you can approximate any function you want, as closely as you want, by tuning the K function and the alphas. And so, if you talk to a theoretician, they'll tell you: why do you need deep learning? I can approximate any function I want with a kernel machine. But the number of terms in that sum can be very large, and nobody tells you what kernel function you should use, so that doesn't solve the problem.

Or you can use a two-layer neural net; this is the top right here. The first layer is a nonlinear function F applied to the product of a matrix W^0 with the input vector; then the second layer multiplies the result by a second matrix and passes it through another non-linearity. So this is a composition of two linear and non-linear operations. Again, you can show that under some conditions you can approximate any function you want with something like this, given that the vector in the middle is large enough. If the dimension of what comes out of the first layer is high enough, potentially infinite, you can approximate any function as closely as you want by letting that layer grow to infinity.
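Here is the same two-layer picture written out as a sketch: the fixed first layer compares the input X to every stored training sample X^i through a Gaussian kernel K, and the trainable second layer is the weighted sum with the alphas. The alphas below are random placeholders standing in for what SVM training would actually produce.

```python
import torch

def kernel_machine(x, X_train, alphas, gamma=1.0):
    # First "layer" (fixed): K(x, x_i) = exp(-gamma * ||x - x_i||^2)
    # for every stored training sample x_i.
    sq_dist = ((X_train - x) ** 2).sum(dim=1)
    K = torch.exp(-gamma * sq_dist)
    # Second layer (trainable): weighted sum of the kernel scores.
    return (alphas * K).sum()

X_train = torch.randn(1000, 16)   # 1,000 stored training samples
alphas = torch.randn(1000)        # placeholders for the learned weights
x = torch.randn(16)
print(kernel_machine(x, X_train, alphas))
```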
So again, you talk to theoreticians and they tell you: why do you need layers? I can approximate anything I want with two layers. But there is an argument against this, which is that it can be very, very expensive to do it in two layers. For some of you this may sound familiar; for most of you probably not. Let's say I want to design a logic circuit. When you design logic circuits, you have AND-gates and OR-gates, or NAND-gates; you can do everything with just NANDs, negated ANDs. And you can show that any Boolean function can be written as a layer of ANDs with an OR on top of it. That's called disjunctive normal form (DNF). So any Boolean function can be written in two layers. The problem is that for most functions, the number of terms you need in the middle is exponential in the size of the input. For example, take N bits and construct a circuit that tells me whether the number of bits that are on in the input string is even or odd. It's a simple Boolean function, 1 or 0 on the output, but the number of gates you need in the middle is essentially exponential, if you do it in two layers. If you allow yourself to do it in log(N) layers, where N is the number of input bits, then it's linear. So you go from exponential complexity to linear complexity if you allow yourself to use multiple layers.

It's as if, when you write a program, I tell you to write it in such a way that only two sequential steps are necessary to run it. So your program has two sequential instructions; you can run as many instructions as you want, but most of them have to run in parallel, and you're only allowed two sequential steps. And the kind of instructions you have access to are simple things like linear combinations and nonlinearities, not entire sub-programs. For most problems, the number of intermediate values you'd have to compute in the first step would be exponential in the size of the input; there's only a tiny number of problems for which you can get away with a non-exponential number of minterms. But if you allow your program to run multiple steps sequentially, then all of a sudden it can be much simpler. It will run slower, but it will take a lot less memory, a lot fewer resources. People who design computer circuits know this. You can design, for example, a circuit that adds two binary numbers. There is a very simple way to do this: you take the first pair of bits and add them; then, at the second pair of bits, you add them taking the carry into account, which gives you the second bit of the result; then you propagate the carry again, and so on, sequentially. The problem is that this takes time proportional to the size of the numbers you're trying to add. So circuit designers have a way of basically pre-computing the carry, called carry-lookahead, so that the number of steps necessary to do an addition is actually not N; it's much less than that. But that comes at the expense of a huge increase in the complexity of the circuit, the area it takes on the chip.
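The parity example is easy to make concrete. Computed layer by layer, each layer XORs adjacent pairs of bits, so about log2(N) layers of at most N/2 two-input gates suffice, whereas a flat two-layer (DNF) circuit needs on the order of 2^(N-1) AND terms. A tiny sketch:

```python
def parity_layered(bits):
    # Each pass XORs adjacent pairs, halving the list: log2(N) "layers"
    # of cheap two-input gates instead of an exponential two-layer DNF.
    layer = list(bits)
    while len(layer) > 1:
        if len(layer) % 2:
            layer.append(0)            # pad odd-length layers
        layer = [a ^ b for a, b in zip(layer[0::2], layer[1::2])]
    return layer[0]

print(parity_layered([1, 0, 1, 1]))        # 1: three bits on, odd
print(parity_layered([1, 1, 0, 0, 1, 1]))  # 0: four bits on, even
```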
So this trade-off between time and space, or between depth and computation, is well known. So what do we call deep models? A two-layer neural net, one with a single hidden layer, I don't call that "deep," even though technically it uses backprop, because it doesn't really learn complex representations. There's this idea of hierarchy in deep learning. SVMs definitely aren't deep, unless you learn complicated kernels, but then they're not SVMs anymore.

So what are good features? What are good representations? Here's an example I like. There is something called the manifold hypothesis, and it's about the structure of natural data. If I take a picture of this room with a camera at 1,000 x 1,000 pixel resolution, that's 1 million pixels, so 3 million values. You can think of it as a vector with 3 million components. Among all the possible vectors with 3 million components, how many correspond to what we would call natural images? We can tell, when we see a picture, whether it's a natural image or not; we have a model in our visual system that tells us this looks like a real image, and we can tell when it doesn't. The number of combinations of pixels that are things we'd think of as natural images is a tiny, tiny, tiny subset of the set of all possible images. There are way more ways of combining random pixels into nonsensical images than there are ways of combining pixels into things that look like natural images. So the manifold hypothesis is that the set of things that look natural to us lives on a low-dimensional surface inside the high-dimensional ambient space. And here's a good example to convince yourself of this. Imagine I take lots of pictures of a person making faces. The person is in front of a white background, her hair not moving, and she moves her head around, makes faces, etc. The set of all images of that person (say I take a long video of her) lives on a low-dimensional surface. So a question I have for you is: what's the dimension of that surface? Just the order of magnitude. Yes? [Inaudible student comment.] Yeah, you've probably heard my spiel before, but... [Speaker: What did the person say?] Huh? [Speaker: What did they say?] OK, so for whoever hasn't heard this, you have another shot at an answer. Any guess? No? Don't be shy. I want multiple proposals. Anyone. You can look down at your laptop, but, you know, I can point at you or something. Any idea? Yes. No idea? It's OK. You, any idea? Maybe you heard what he said. [Inaudible student comment.] Linear, what does that mean? [Inaudible student comment.] It's a 1D space. OK, a one-dimensional subspace. Any other proposal? Any idea? OK, the images I'm taking are a million pixels, so the ambient space is 3 million dimensions. [Inaudible student comment] They don't change, no. And the person can move her head, turn around, things like this, but not really move her whole body; you only see the face, mostly centered. [Student: A thousand.] A thousand, OK. Why? [Inaudible student comment.] OK, yeah, that's a good guess, at least the motivation. [Inaudible student comment.] Say again. [Student: The surface area of the person.] The surface area of the person. Right. So it's bounded by the number of pixels occupied by the person. That's for sure; that's an upper bound. Yes.
Those pixels, of course, are not going to take all possible values, so that's a loose upper bound. Any other idea? OK. So, basically, the dimension of that surface, as you said, is bounded by the number of muscles in the face of the person. The number of degrees of freedom that you observe in that person is the number of independently movable muscles in her face. Plus, there are 3 degrees of freedom from the fact that you can tilt your head this way, that way, or that way; that's 3 right there. Then there is translation: this way, that way, maybe up or down; that's 6. And then the muscles in your face: you can smile, you can pout, you can do all kinds of stuff, and you can do them independently; you can close one eye, you can smile on one side. So, counting independent muscles (not counting the tongue, because there are tons of muscles in the tongue), that's about 50, maybe a little more. Regardless, it's less than 100. So if you want to parameterize the surface occupied by all those pictures, to move from one picture to another, locally it's a surface with fewer than 100 parameters that determine the position of a point on it. Of course it's a highly nonlinear surface. It's not like this beautiful Calabi-Yau manifold here, but it is a surface nonetheless. Of course, the answer was in the slide, so, you know.

So what you'd like is an ideal feature extractor that can disentangle the explanatory factors of variation of what you're observing. The different aspects of my face, the fact that I move my muscles and move my head around: each of those is an independent factor of variation. I can also remove my glasses; the lighting could change; those are more variables. And what you'd like is a representation that individually represents each of those factors of variation. So if there is a criterion to satisfy in learning good representations, it's this: finding the independent explanatory factors of variation of the data you're looking at. And the bottom line is that nobody has any idea how to do this. But that would be the ultimate goal of representation learning. And we're basically at the end. OK. I'll take two more questions, if there are any. Yes. [Inaudible student question.] OK, so the question is: is there some sort of pre-processing, like PCA, that will find those vectors? Yes, PCA will find them if the manifold is linear. So if you assume that the surface occupied by all those example faces is a plane, then PCA (principal component analysis) will find the dimension of that plane. But no, it's not linear, unfortunately. Let me give you an example. If you take me and my oldest son, who looks like me, and you place us making the same face in the same position, the distance between our images will be relatively small, even though we're not the same person. Now if you take my face and my face shifted by 20 pixels, there's more distance between me and myself shifted than there is between me and my son. What that means is that the manifold of my face is some complicated manifold in that space, and my son's is a slightly different manifold which does not intersect mine.
Yet those two manifolds are very close to each other; in fact, they're closer to each other than some pairs of samples within my own manifold, or within his. So PCA is not going to tell you anything, basically. Here is another reason why that surface is not a plane. You're looking at me right now. Now imagine the manifold, a one-dimensional manifold, of me turning my head all the way around, 360 degrees. That manifold is topologically identical to a circle. It's not flat; it can't be a line. So PCA is not going to find it.
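As a small numerical footnote to that answer (the setup and sizes here are my own illustration): put points on a circle, embed them linearly in a 50-dimensional ambient space, and look at the singular values that PCA would use. Two of them come out large even though a single angle generates all the data, so PCA reports two dimensions for an intrinsically one-dimensional, closed manifold.

```python
import math
import torch

torch.manual_seed(0)
theta = torch.linspace(0, 2 * math.pi, 200)
circle = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)  # (200, 2)
embed = torch.randn(2, 50)
X = circle @ embed            # the circle, embedded in a 50-D ambient space
X = X - X.mean(dim=0)         # center the data, as PCA does
print(torch.linalg.svdvals(X)[:4])  # two large values, the rest ~0
```

OK, I gotta blast off. Thanks! See you next week.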
