[Music] hello everyone uh let's get started so uh first of all uh this is mainly for the residents so i'll be uh essentially uh giving uh they didn't have many of talks about neural networks before things like that so i'll have to give a an introduction about these things before uh so uh bear with me a few minutes um feel free to ask questions at any point when you have them and uh yeah let's get started so uh i'm iltes and uh hopefully you're here because you're interested in learning about graphical networks the first thing let's talk some high-level background machine learning uh in general and we'll get to graphical networks in five minutes what's machine learning well we designed the model we the humans designed the model and the model has parameters the parameters are these knobs that you turn slightly left or right and change the model's behavior the model is something you want to represent and if something in the world all the models are wrong some are useful so this is what machine learning hopes to do is to we design the models as humans we learn the parameters through data so supervised machine learning which is the most common and most widely used and the thing that works the most we have our model our spherical cow and we get the input data we try to predict something out our data set is this the data points which have input x and output y and these data points are independently come out from this for your population so that we assume there is no correlation among them in most cases what do we want to do is we have this loss function something that measures how good or bad our predictions are we start with our model f we give it x and we try to predict y and we judge how close our prediction f is close to y so that's supervised learning one flavor which is common in deep learning nowadays is the following we do we do the learning with gradient descent the idea is as follows while we have not converged somehow we compute an estimate of the derivative of the loss over our data set and then we update our parameters this theta thing based on how we're going to minimize the loss so we gradually move to a point which is hopefully good this is a computational method to do this not the only one but in any case that's it what do we want to hope to get out from from machine learning and these modeling things well we want to generalize to new previously unseen data if we had these data points here we don't want to have a very simple model that is under fitting we don't have a very complex model that overfits exactly in our points but doesn't model the thing we're trying to do so that's machine learning if you've uh haven't had this this seen this before hopefully this is good enough to get started so other things that we're going to use the notion of distributed vector representations that's common in natural language processing common in deep learning the idea is the following we have a set of discrete things the one way to represent this is with local representations we have a huge sparse vector of mostly zeros and a single one when we represent something like something like let's say a banana or a mango or a dog this however has the problem that we don't know how these two things uh correlate or combine and it's a very discrete representation what happens usually in deep learning somehow different models do different things learn a distributed vector representation distributed because the meaning is distributed across the components of this vector so maybe banana and mangoes share this yellowness thing whereas the dog is somewhat different so how do you do that how do you learn these things that you can compute them you can learn them as parameters of your model the high level idea is you can start from these representations the one hot representations multiply them with this embedding matrix and you get these distributed vectors and now your parameter is this e here which is an embedding maybe you've heard things like word to vac or things like that these are the distributed vector representations they are building blocks let's say of the things that we are trying to get at so how you compute them or how you learn them it's in the it's quite wide varying ways of doing this but uh you'll hear me talking about distributed vector representations uh a few times later d is a lot smaller than b exactly that's uh that's part of the point where the v is you have 10 000 elements or 10 000 words or something like that d would be 128 or 500 or something like that so it compresses down the space in some form that's another way of seeing these things so uh graph notation again uh for the residents most of you don't necessarily have a computer science background graphs look like this you have nodes or vertices you can see here you have edges which can be direct directed or undirected edges these are uh you can see them here you can have many different types of edges so here i'm coloring some edges with a different color to say this what does this graph say well in this specific one nothing but the main idea here is that you encode the nodes which are elements and the edges encode relationships among among those elements so this is the high level thing and you can encode quite a few things in graphs hopefully if the resins have any questions about notation or graphs let me know very simplified i think there are tons of course of things around graphs okay so we got to the point where i can start talking about graph neural networks so the idea of graphical networks is the following you start with a graph a representation of your problem what is this graph where does it come from this is a modeling decision that's something that you decide that this graph here encodes the problem you're interested in it could be a social network it could be a molecule it could be a program it could be anything that you're interested about so that's the first part you as a human have designed a graph representation of your problem the second thing is that for each of your nodes here you need some information this information is encoded in distributed vector representation and embedding if you wish and this is encoded for each node so each node gets somehow a representation you can compute it it could be an image so b could be an image so you compute it from a convolutional neural network it could be a word so it could be an embedding but it could be anything that uh that you wish it to be so this is where we start this is the input to a graph neural network now the question is how what we want to do well the idea is we start from here uh this graph representation will describe for most of the rest of the this hour the graphical networks the output here is going to be again the same graph but for each node you have a vector representation and now these vectors here are going to have information not about node a when you had initially here but how a belongs within the context of this graph so you're essentially saying given the features that i had initially here and the structure of the graph and the neighbors and what the neighbors had to say or do and what is the representation here so you compute a gnn computes these representations once it's trained and then what you do is you take this this representation you have of your graph the output of the gnn and then you can pass it to something that is specific to the task that you're trying to make so it could be the loss or something else but it's something that it's task specific and we'll discuss very briefly two tasks uh after i explain uh the graph uh graph journal networks are these funny things attached to each each vertex they're sort of somehow meaningful to the person yes starting that's not the internals of the gnn that's it's a meaningful thing it's well as meaningful as a vector can be right it's uh but it comes from outside it's not part of the mechanism of machine learning it's not part of the mechanism of graphene networks it could be a neural network somehow processes f let's say an image and the image is kind of you get a vector out of it which could be the features that there is an edge here or there is a face here there is a wheel here something like that it could be that this is a word so you kind of uh have something uh distribute an embedding of a word like a natural language processing world and and things like that so it's the gnn is transpiring to what the initial representation of things are they the same should i think of them as the same kind of thing and they're the same length for example you know they don't have to be the same length but they are uh they usually are the same length not the same vector of course otherwise this would be an identity function uh but they are a representation again it's like a feature vector of uh this con this uh node a is uh uh is central to this in this part or uh this uh if it were a variable in the program that we'll see later this variable here is acts like a counter people accumulate things into it things like that that are learned we don't explicitly say or know exactly what this is we can find by analogy in some cases but that's that's all do they have to be unique and identifying so for instance if you have the same type of nodes but with different edges you would have the same node identifier but the edges will be would make it different if i have two vectors that are identical in the same graph i can assume that they are the same type of node so here yes yes you can you which means if i have two the same like they don't clash like md5 thing you can assume that you can assume anything you want in this case yes so the the simple the simplest scenario here is that all these things all these vectors are identical and you want to learn about something about the graph itself that's perfectly fine okay there are other cases where you want to attach some information to them nick so i'm trying to map this diagram to kind of the first slide you showed about what machine learning is which is you know you have the model and then you have the the knobs right so is it accurate to say that the model here the human design part is the structure of the graph and then the knobs that you're learning trying to learn the parameters for it are the um the parameters that sorry the human element is in here we'll discuss this in the next slide and also you can think of it as a feature extraction right in the let's call it old-fashioned machine learning you extract features is this red or blue is this uh small or large so you have zero one things instead here you can think of it as a feature engineering feature engineer your problem into a graph and so this is part of what we humans do but also like there is components in here that are human related but the parameters most of them are here there might be parameters how to compute these initial vector representations got it thanks yeah uh should we assume that the graphs were perhaps different examples are variable size or not necessarily the same structure yes and if so okay then how after you get the final representation then you have like these vectors that you would like to combine into your logs right so then how do you like you have a variable size we'll get there eventually now does that output graph have the same structure as the input graph yes so we're not adding or removing any nodes or edges not in the flavor of the graphene letters we'll be explaining for the rest of the talk in principle you can do some of these things but let's ignore this for for all practical purposes yeah okay so hopefully that's uh that's good enough graphical networks here let me zoom in in this box what happens is the following let's take this small example here this uh f uh node is connected e and d connect to f through different types of edges at some point in time let's say uh e and d have some vector representations the same thing happens with f now the idea is this following we start from our uh for our neighbors and we're going to do something like computing a message that's the analogy the message here is essentially yet another vector the idea is that we get a function of the input the node the neighboring node the actual current node and the type of the edge and with this function which could be many things again that's design decisions about our model we'll discuss concrete a few concrete options later we get a message meaning again a vector the same thing for this uh for this edge here then we summarize all the messages we received from our direct neighbors and again summarizing can be many different things we combine the summarized information with the current state of the node and we update the node so now f essentially what happened is it had a state a time t minus 1 a time t now it has a new state that has information about itself and how and from its neighbors in some way this is combined there are many many different ways to combine them we'll get to that very soon yes uh when you say uh it's a function of both nodes and also the type of edge yes is that encoded somehow or is it just directed undirected so no the type of edge could be that d and f uh let's say are then fd is below f and e is above f let's say so it's a it's a relationship that you encode again the num the types of edges you have are is something that you decide it's again the model the modeling decision of of the human but then it's some call it directly in the center of the edges we'll discuss how and when it's encoded in a second so let's look at the at a function i'll add an equation here which is hopefully colored in a helpful way so prepare the messages this is a function uh which might be depend on the time step t that takes the representation of uh of your node that you're currently trying to to to predict to predict like f in this case uh it takes the type of the edge so this k here and the and optionally uh the type the uh and the information the state the vector if you wish of your uh of your input so you do this you compute a message for each edge for each edge that connects for each node that connects to nj now i'm using this fancy operator here mostly because i would i couldn't find another operator in office but the idea is as follows this could be any uh commutative operation any permutation invariant operation for combining the messages it doesn't matter if d and f the the edge from d to f comes first or the edge from e to f comes first the this is an important part of of graphical networks because they make them essentially invariant to however you may shuffle your graph and that's that's an important thing in some cases this could be a summation in some cases this would be a max in some cases this could be something else that's again a design decision of our model nothing uh nothing beyond that but this is a general class of operations you should include all the edges yes but you include all the edges that connect yes yes yes yeah and we're currently here updating only one node of course so we get all our neighbors we compute a message we summarize it and then we take our own state this state here and we combine this with some function q and we now have a new state over there okay so that's uh that's the high level thing for each node you can imagine we have a previous state you get messages from your neighbors and uh you update your state should just be going from or did i miss something in rotation uh depends on annotation but uh under the big union yeah yeah it's uh i mean you yeah i guess you want nj should have been here so this is the process for updating one node how does this uh most of the common graphical networks work well you have one of this the parameters are shared across all this for each node and there is a clock like very much like a cpu at each point a clock ticks when it ticks every node gets input from all its neighbors computed messages updates the state the clock ticks again and the process repeats again and again so how many times do you run this that's a hyper parameter a design decision however you want to call it there is no notion of convergence or other fancy things like that but that's uh that's how a graph neural network works but the new computed value has to be saved separately so you only update all of the values at the end because otherwise the order in which you open the the the nodes will change yes you do this in parallel so which brings me to the next slide you can think of this as unrolling the the graph multiple uh multiple thing ways in time and now each vector will get updated again and again and again so all nodes update their states simultaneously in parallel each slice here you see all the nodes in the graph have one state and as time goes on what are you how are these f functions involving as well by f functions you mean the ones that you're computing so once you have inferred your once you've trained your neural network uh just show the next slide yeah i'm trying to get there yeah yeah you you can i guess modeling decision you might decide that f is a different uh is a different uh for a it's dependent on t you might decide that q is dependent on t uh in many cases they don't in some cases they do that's entirely up to you uh but usually in some cases you can think of this as uh uh as something that depends on the time t but it's shared across all time slices so the thing i'm missing out in this expression is the parameters right so both uh f and q get theta as an input as well yeah so parameters can be in f and there's really ft of theta and there's other things and it's q of yeah and we'll discuss i think in two or three slides there are concrete choices that people make where the parameters will be more explicitly um okay so what this means is that at in the beginning each node knows about itself the next step each node learns about itself well we knew that already and it's a neighbor one uh distance one neighbors and the next step they also they their neighbors are have also been updated so now each neighbor know about itself about its neighbors and its neighbors neighbors so you increase essentially this field perceptive field of what you know about each node and by more time steps hopefully you learn more about how you belong in the graph is this like spanning tree protocol in networking i don't think so i i cannot say that i'm very familiar with that yeah okay no um so uh graphical networks you started from some state you got the output now you have all these vectors here that are about information about the node and how it belongs in there and now you can do anything you actually care about in your application you can say i want to pick one node from this graph that has some property so you can do that this is node selection you can label each node you can add some additional class for example and you take each representation here and classify it somehow or you can decide to summarize all these vectors you have here and do graph classification so there are different ways to do this it's very application dependent we'll discuss when we go to some applications okay so let's take a very very simple example not very practical but still it's the simplest possible you can get so you have this age where you have you got out from each node at the final time step and you multiply it with a single layer mlp and what you get out of it is a binary decision a zero one assuming that this makes sense for your application right uh so uh now you have a loss essentially saying that uh the classical binary cross entropy loss uh saying that i want my decision for the each node here to match my value y and whether it's true or false zero or one what's the application i don't know i just made they made this just to make this a concrete thing now what this means is that i have a loss i can compute its gradient and with automatic differentiation we compute the gradient through the graphical network and i can update the parameters here the w everything the gnn everything in the initial layer and this is how learning with graphical networks would work but so is this a step that's done so your graph neural network produces this output and from this output you then train a classifier based on it or is it something that you incorporate into the graphic it's all end to end right so you can think of this as a function uh the loss for a single example at least over the parameters of all the parameters you have in your model and you back propagate through to all the parameters in most cases it's the most common thing to do but you're doing the propagation it's over the entire stack so it's a pretty giant expression that you're differentiating yes you wouldn't want to do the differentiation manually that's that's that's for sure yeah also do you run into problems like like scrolling gradients and so on like this this sounds like a very hard rnn problem and our answer it depends on the architecture again there are many architectures that are coming up next some of them look more like cnns and you can think of them as a stack of cnn layers some of them are rnn's usually ds unfolding happens for 10 steps let's say so it's you never get into that big of a problem and again you use an rnn which is like a gr2 or something like that so that or you use residual layers like in cnns and hopefully this you don't run into this problem in most cases uh where's the update rule for the for these age states uh what do you mean the update rule is stock has to grade in this end or with momentum or uh what how do you compute how do you go from h t to h t plus one hd so the hd you mean hd is uh in an a value in the model right you don't update hd it's not a parameter w is a parameter no i i get that but um okay maybe you may say it in a second i'm not sure what you're asking that's the earliest you said there would be an update rule for the for each of these states for each of these h vectors so h here is the output h so maybe this t should have been capital d the final thing when you go over there right yeah yeah okay that's that depends on the graphene network architecture so let's look at the uh two popular graph neural network architectures so the first one is uh gated graphical networks it was first embedded in this building uh i wasn't here at that time but the idea is following we want to compute the messages we say that a message depends on the type of the edge k on the preview on the neighbor and the state of the neighbor d so age of node d at time t minus one this is comment based of course so that's equally wrong and it does not depend on f in this case so here this is e is a parameter of the model and you do a matrix multiplication you transform the state the input state of your neighbor with e your summarization your permutation variant method is just a summation so however you order this sum it doesn't really matter it's a sum well modular floating point but don't worry and then here you get your previous state yeah you have an rnn cell a recurrent neural network this one is called gru doesn't matter it has some parameters in it and it takes the previous state the message that you just computed that the summarized information and updates its state k is the type of the edge here so you can say that the white one is zero the red one is one something like that so you have a finite and usually small set of uh of edges and i guess your f depends on the type of patch the function is the function f yes so that's where the this is what previously we had an f as f here so we have essentially a different matrix for each k which is an n by n where n is the dimensionality of h oh there's no bias there for ek yeah you could add one uh people it depends on your application there are variants without this i think the original ggnn does not one you can say that a bias essentially will be helpful for counting how many edges come in in some place that's one option okay next oh nick uh what does the gate referring to sorry this what's the gated report using the gru so this is an rnn that says uh your previous state to next state so this goes back to you have as much exploding or imploding gradients as much as you would have for a jiru with 10 steps so not much usually another variant uh that is the graph convolutional neural networks and that's because they resemble convolutions and we'll see later how this this goes but it is you take the states from the all your neighbors you sum them up you multiply them with your own state and uh then a matrix multiplication and division this model originally at least in its original implementation doesn't support different edge types so you see there is no dependence okay okay but again you see a permutation invariant combination and a different way of of updating your state okay so uh the first thing we i didn't discuss so okay we these are design decisions all of them are different modeling decisions and there are tons of them you can go online and find more than 100 papers that change something in this equation but the original equation is still more or less always intact you have a way to prepare messages a way to summarize them a way to update your state so the first thing is now the following the there is one trick that i have omitted so far so let's take for example node d node is not connected no incoming edges come through here so node has no opportunity uh to learn about how it belongs in the graph so a practical trick is that you add explicitly uh backwards edges so you create you double the type of edges you have and you for each forward edge you add a backwards edge and now you still have a graph uh use the same equations apply but information is propagated through the whole graph that's just a trick uh that uh that exists isn't this related though to the assumptions that you're making about like surely if you said that d has a directed edge to the rest of the graph and nothing else connects to d then that's a pretty strong statement about how d relates to the rest yes right so yeah so if this is the input so you don't want to change it right you mean yes yes but it depends on your application if you want if you want to learn something about d how it belongs in there then information that has to somehow propagate back to d and this is what this tries to solve is the is the lg replicate the same type or well you can say it's a different type it's the same like if this is is the left of e this is the opposite of that so like basically what you want is getting feedback right can you make it like weaker or stronger or can you like change it you could people don't but yeah there's nothing necessarily stopping you will it blow up if i just had an edge type for no edge here and just basically fully connected around but the default there are some people that do something like that where essentially you imagine a node a god super node here that everything connects to that and back it works quite well in many cases there was a question i i thought someone had the same question earlier when you're introducing um the function that that prepares the message so could you also fix this by by uh you know changing that function that prepares the message to treat your forward nodes and backward nodes differently oh sorry edges the you this thing currently i mean the way i have written this is that you take the previous state and you update the state yeah so you still don't do anything to their neighbor to the neighbors you could imagine making this more symmetrical uh but that's that's essentially what by the backwards edges are doing okay i see so okay yeah okay so let's go one step further the equation is nice how do we actually implement it in a matrix operation again uh very background things an adjacency matrix if you haven't seen it let me know the idea is that you have nodes you assign them some id and you say that node node z1 node 0 is connected to node 1 and node 2.

And the same thing for node 1 is connected to node two and node two is not how does not have any outbound edges so now imagine that we have some variable a b and c just a number for now uh for each node here then if we multiply these two things we get uh we get this thing so here you can think of it very vaguely as you had a here and now you multiplied it and b got a so you see that the second element is a whereas c here uh is number two here was connected to a and b well it is connected a to b and you can see from the matrix multiplication that this is a plus b so this kind of implies how uh how message propagation will start uh will will start working and uh the the idea is you can repeat this and create these adjacency matrices easily it's it's it's something uh something deterministic so let's look at uh at uh some pseudocode like thing right you have a matrix for where each line is your initial vector representations at time t this is uh this is a matrix and here is the the size which is the number of nodes by whatever dimensionality age has now you want to compute the messages to be sent so you multiply you do a multiplication by ek for each type of edge k with ht and now you have all the messages to be sent outside from a node possibly not necessarily all of them and now in order to get the received messages you multiply this m with your adjacency matrix a for all edge types k and you have the messages you received and now you can apply your your update function so this is the simple mathematical kind of form uh that you use and uh if i wanted to use a vanilla iron and this gru function is essentially some matrix multiplication with hd plus w with the rt just to make to make this slightly more concrete although not very useful okay yes and then you have this matrix e k which should be uh d times m and then you have m and then this is also something by m and then here here they have you need to play around with w or the g or u input to get this work but you can make it work i mean usually m and d are the same in in most of our implementations but in theory there is nothing stopping us to make this difference do you repeat what a is in the uh receive messages a which a should be a k to be start with so here's a typo but this is the adjacency matrix four nodes of type k so this is this is where this comes in like the this adjacency matrix that says that uh node zero is connected to one and two and things like that so the k's are like a finite enumeration like green and blue it's not like 4.7 in most cases yes and in the cases we're saying here definitely yes okay so yeah you can also uh you you can get uh this thing here so again um doing the next step is again mostly for a very practical step but i think it's useful uh for the residents now let's make this next step we show the high level equation we saw the the matrix now let's see this is a pseudo code first uh first what's in some i'm going to use this i thought it would be useful for for the residents the idea is that you express operations over matrices with that a is of size b q b d b is of size q d and the output you want is b q so you explain express essentially this operation with this einstein summation uh here and you can do more more complex things you can vaguely think of this that you here what happens is that you erase the dimension of d right or you summarize sum over it marginalize whatever the word you want to want to use or here you remove some other dimensions so you remove a with this and and you can express this complex sum with a simple short notation so how does the code look like well you say my initial states are somehow given to me doesn't really matter we have a number of steps uh that we're going to unfold our graphical network for each time we compute messages for each type k which is this this message multiplication that we have here what's in the slide i think but so i am still stuck on the dimensions nd times dm makes it end times nd and um first message is k statement here's that one yes and d times dn but yeah yes yes also it seems that this should have been uh well they are swapped and this should have been indexed by k was slightly yeah okay i have noticed right we should fix this eventually anyway so yeah this is just the notation here and then you have the received messages and hopefully this also gets you there and you get the older you sum the received messages and you get the geouge the num steps is it the input parameter yes it should have been an in parameter a hyperparameter or a constant well for in most cases when you it's a constant for all your ads or something like that but yes you're right it should have been also in the input so this ties with this uh with this here so hopefully this makes the graphimal network a bit more concrete the one thing i'm not going to discuss today is this uh specific thing here here we have the adjacency matrix which is n by n but in einstein summation you cannot say this but this adjacency matrix may be very sparse but very large so how can you make this more efficient that's that's something you can just we can discuss at another point in time the shapes they're also going about right it should be m times we're here yeah and um no and this is an n but you cannot write that in einstein's summation so l equals and to for your purposes but you you needs to be different yeah i want to say it's more part of this yes so okay let's i wanted to go through two applications i think i don't have the time to so i'll go just very briefly over one with one slide and the idea is that molecules for example you have molecules and you want to predict some target quantity out of it the idea is you can see the molecules they look like graphs i mean they are graphs to some extent and now each node is uh is an atom you have different types of nodes like carbons or nitrogen or other kind of things i'm not a chemist and you have different types of edges so some are double bones some are single bonds and so on so now the question is can you get from your uh from your molecule up up to some target equation what you do is you take this graph you get this and you you try to predict uh your quantity uh your target quantity that's very high level but hopefully it's more concrete than before that's one option with the other thing which i'm not going to discuss is about source code uh how we do we find bugs but i don't think that we have sufficient time to discuss through the various uh design decisions yes so are you swapping atoms then so that the colors represent different types no no this is well it's taken from the paper this is how like features may be computed about this is the same atom repeating again and again the same graph but the graphical network updates the state essentially for each node for each atom which hopefully has information that is relevant to the to the target quantity you're trying to predict we can do this uh well let's see um okay maybe let me go first to something uh just to okay let's uh let's talk a bit with about the code related things so the idea is as follows we want to find bugs specifically we want to do the following tasks given a blank in a program here we want to save which of the things in in the code uh in which of the things in the code uh are actually uh should be placed in here so here you have first and class now this helps us find bugs you like copy paste someone copy and pasted this line but the correct thing would be first first thing we need to do we want to take a program and convert to a graph how do we do it well that's a design decision that's a feature extraction if you wish the idea is that we want to take things like how data flows around these things how control flows around these things and get there so one possible thing i'm going to show here is tokens well a very boring graph you can parse code so you can essentially deterministically find the tree of this and you can add extra nodes and extra edges again a graph let me skip hide these edges and take a more a simpler program here and we can also add data flow edges so how when was x last written well when i just entered the loop it was here if i'm looping it's itself so we can keep repeating adding more and more edges and more and more different things that we believe encode the problem we have so by the end of the day we have a graph for this simple snippet this is how our graph looks like it's it's it's not easy necessarily to parse but it's uh it's not an uncommon thing in programming languages so as a design decision would you have all of these graphs into one all of these edge types into one big graph and then run your model rather than have multiple graphs with different types of edges now you have one graph with multiple types of edges right and you do the propagation simultaneously right because this is effects right yes exactly to affect each other exactly so yeah we're just showing it for clarity in the picture because it will be too crowded but but yeah yeah so this is like the abstract abstract synthetic tree of the compiler but then edges are probably specific well some of them are the of the abstracts industry some of them are token based things some of them are data flow things things like that so how do we go around this uh around doing this uh thing well we started with this place here uh and what we do is we're going to say this whole thing i'm showing it you here is a text but it's a graph i'm going to add a few extra edges and nodes saying these two candid nodes here we want first for example to be how would data flow if first was in the slot here or how class would be data would flow around if class was here and now we have a big graph i'm not showing it all of it and now the idea is we started from our uh from our code we got our graph we somehow compute initial representations i'm flossing around that part and now what we want is our objective we want the output of our graphene network to have the h of the of the slot node here to be as close as possible to the first and as far up as possible to the class which is the wrong thing again graphical networks we started somehow we encoded our problem in the graph that was uh dependent on what we thought we needed at least a graph neural network says well this is something information about slot about first about class and now this is our objective this is well not exactly this but this is more or less our objective we can compute a loss we can back propagate learn and learn things from there so that's uh that's essentially a second application and let's take this simple example where the question is what is the variable that should be here well the answer as you would have not guessed probably is full path the neural network knew that in a few milliseconds once getting this graph so that's two applications hopefully you see this is i mean graphs are very broad you can do almost anything and represent almost everything with the graphs uh but this is hopefully makes the graphene network somewhat more concrete i wanted to just emphasize two special cases for graphene networks and this is convolutions imagine you have a grid like pixels and now you take each pixel and you create the following edges one which is top top left top right uh left right bottom and so on different types of edges and you do this of course for all nodes this in some formulation of graphene electric the graph convolutional neural network for example this is equivalent uh to convolutions of course it's much more efficient to do it in the current way that that convolutional neural networks work because you can batch this computation much more efficiently done than doing this thing the other thing is deep sets the idea is that you get a set of things it could be a variable size set of things in this case it's images for example and you want to predict something about this ad so this this set of images are people with black hair and rosy cheeks and so on and now you get these sets of things and you want to predict things about about them how do you encode this problem well this is essentially a set is a graph that is fully connected everything is connected to everything again you can do a set of graphical network and at the output of the graphical network you can predict something about the set so these are like two extremes if you wish about how this uh about graphical networks and some special cases that resort to something that is more or less similar to the things we had i had slides to talk about advanced some more advanced violence sorry uh not not a lot of time for that but again we can play around with how messages are propagated and not do the propagation like all messenger all nodes update all their at all points in time that's that's all so i wanted to conclude uh to conclude with the following uh more practical non-graph neural network things uh since i expected more that the residents would be the only ones attending here uh but let's talk about a few things how would um machine learning deep learning specifically works in practice you have your data and now you have to do the following steps first from your data you compute some form of metadata some information about uh that you get from your data that will inform how your model looks like so how many words in your tag appear in your text how many different types of edges you have in your problem things like that now you given your metadata you can convert your data into tensors into things that look like matrices and encode the problem the part of the metadata for example could be that the word is is number 53 and the word r is number 54.

So you now need to you can say that you can create the one hot vectors for a specific word for example once you have all your data in tensors i mean you can do this in the flight doesn't matter when you do it you can scroll batch those sensors into into different into small groups into mini batches and how you do it usually this is a separate like problem in most cases you just have an extra dimension which is your batch size and you can concatenate essentially these tensors then you get their mini batches and you pass them into your machine learning model so you get tensors tensors go in you get a loss a loss is a scalar a scalar you can compute differentiate the loss with try to minimize it propagate gradients update your machine learning model and repeat so this is a very high level uh thing of how let's say non-computer revision machine learning models work so you have this steps of metadata converting to tensors and creating mini badges the next thing is about debugging so it's notoriously difficult to actually these models because they're not terribly interoperable but the high level there are high level recipes about how to do this especially with with deep learning i'm listing some of things here these these things here probably will get access to the uh the slides at some point but common things are take a small sample of your data set try to overfit your model if your model cannot learn even your small data set you probably have a bag somewhere try synthetic data encode the problem you're trying in a very synthetic problem and try to make your model actually solve it can you actually can it actually solve it if not then maybe you can find from how it fails where the problems in in the code appears there are many cases where you have optimized optimization issues like the stock gradient descent doesn't work as as you expected one idea is you look at the learning curves how does learning curves how does your loss evolve for training or the test set this is usually what you see if you don't see exactly like that you know something like that not exactly you might have a problem it might be you you see that your model is never fitting or it's under fitting overfitting all these kind of things you can monitor the gradients do we propagate updates do we update zeros with every like do we propagate zero information about things that's probably indicating that something is wrong uh with your model and of course there are many other ways to do things like visualizing doing error analysis i had this uh someone who was telling me that error analysis in machine learning is what debugging is to actual programming and this is like you look at the predictions of your model and you try to see where does it fail and of course there are many other things you can do many things that are project specific but i thought that since it's the first i think deep learning lecture you get as a residence this is a good set of uh uh of suggestions so with that i think i am mostly out of time uh there are certainly questions feel free to ask and yeah check so you you have a you have for your nodes you come up with like a representation but for the edges you just have like some categories are there any or do people try and find representations for edges yes yes and there are cases where you have metadata in edges like a number or something like that you people do use this you can need to modify the equations in some form to get that but it's possible in most cases in most of the existing data we have the edges are just fixed there or you have a zero to one weight to which extent does this edge exist or not do you always need to process the entire graph something like a social network is it feasible to say have a focus node and only process within a certain number of links of it it depends what you need in that sense in many cases you don't need to but it's very convenient because the data processing gpus parallelization all these kind of things if you want to summarize something about the full graph then you definitely need to do this in some other cases you just need uh to focus on something so you can say that if we take the the variable misuse exam debug thing you're interested in a very specific place so you know that message propagation the last step far far away won't affect your result you could have dropped this computation now doing writing the code to do this and doing this faster than doing it everything is a hard problem i don't think that someone has achieved this but so you you said that uh convolutions and deep sets are generalizations of graph neural networks yes well special cases so yes um so what's the message passing pixel pixel and instance in those cases um sorry well maybe something has gone wrong i think with the uh so the convolutions thing so this is the in a cnn you multiply the state of each pixel let's say uh with with something that says with a kernel this is essentially the kernel you would have and you update the state of that note forget max pulley pooling and things like that for a moment but you do you repeat this so this the edges the weights of the edges that you use uh is essentially your kernel and of course this i i'm showing this in this node but the same exact edges would appear in this node in this node this node and so on i'm not sure i answered your question in the other example so here you get a cnn you each image is a vector now this is the initial state of each of those things the simplest graphene network is one without message passing so you sum up everything or you do a mean pulling or if you want to do something more complex then you still it depends on the flavor of the gin and the mystic is what sort of defines graphene yes yeah so here you can say that um essentially each for each image let's say you have this initial representation and you get representations from all the other images in the set and now you update your own representation as in what are my commonalities as an image be with everything else and the same thing happens with a with c with d and you can keep repeating this a few steps and then at the end of the day hopefully each image kind of knows how it's similar or how it relates with every other image then you can need to summarize the the output of these four in this case or all your images which could be an average pulling for example average on the vectors it depends on also deep sets have a different formulations depending on the paper you look so yeah yeah but yeah you can transformers for example are deep sets they don't explicitly say that and this is all to ultimate attention in some form okay cool okay thanks