Alerting best practices – the thin line between informing and over-informing (Google Cloud Next ’17)

Hi everyone, thank you for attending Google Cloud Next and coming to our session on alerting best practices. My name is Amir Hermelin, I'm a product manager on Google Cloud Platform, and specifically I focus on Stackdriver. With me here is Thomas, an SRE on one of the three teams that run Stackdriver. We're here today to talk about alerting best practices and what we've learned running and monitoring services at scale at Google, and also from helping our users monitor their services on Cloud Platform. A lot of the topics we cover are included in this book, the Site Reliability Engineering book, and we'll be giving away a couple of signed copies at the end of the session, so stick around — it's signed by the VP of SRE, which is kind of cool.

Let's go over the agenda. We'll start with a little history of how monitoring and alerting has evolved at Google, go over an overview and some quick definitions just so we're all on the same page, and then dive into the monitoring and alerting philosophy — the why, what guides us when we do the things we do. Then we'll go into best practices and lessons learned, which is the more practical part of this session. Towards the end we'll talk a little bit about suggested processes, and then we'll wrap up with Q&A. Before we start: the slides are not self-explanatory, so if you're just wrapping up that tweet about how wonderful this conference is, go ahead and do that, but then I suggest you focus your attention on us — we'd really appreciate it.

So, to frame this talk, a very brief history of monitoring. In about 2005 at Google we had a binary called Borgmon — it's actually in the book, but you can think of it as Prometheus in recent terms; Prometheus is sort of an external recreation of Borgmon. It's a single-binary collection-and-evaluation monitoring system, and every team would run their own copies of the binary. It does scale, but you had to scale it manually: you had to arrange it in tiers and set up aggregations to watch the global tier. There was already a movement towards centralization, so you'd have supporting infrastructure that initially was per team but migrated to be run by what then became the production monitoring SRE team, in about 2007 or so, which I'm now a part of. That infrastructure included things like meta-monitoring — you want to monitor that your Borgmons actually work, otherwise monitoring is kind of useless — long-term history storage, so you could serve queries over multiple years, and the outbound alerting part.

In 2012 prod monitoring took on Monarch as the new state of the art. Monarch is a large distributed monitoring system — if you're interested, there's a great talk by one of the Monarch TLs, John Banning, given at Monitorama 2016; you can find it online. It is Google scale in one system — we literally monitor all of Google in one monitoring system — and it is monitoring as a service, in that as a user you only have to provide a config. That still requires some knowledge, hence we're here giving talks, but it's much better than having to run your own Borgmons. And this gets us to Stackdriver.
In 2014 we acquired Stackdriver, and I want to emphasize that Stackdriver is a natural evolution for our cloud users — it does not replace what we've been using for years at Google. Instead, it runs on top of Monarch and on top of our alert management system, and it draws on the years of experience of running services at scale. But Stackdriver is geared solely towards users of the cloud: it was built for monitoring cloud applications, with a focus on an easy getting-started experience, a very user-friendly UI, and a mantra which is still true today and which we strive for — you don't have to be an expert in monitoring to monitor your services well. We prefer that you focus your time and energy more on innovation and less on the administration. That was true when we acquired Stackdriver and it's true today. Stackdriver has evolved to be more than just monitoring — it offers a wide range of other services, and it supports not just GCP but also AWS. I'll cover that more towards the end of our talk.

Now let's go over some definitions so we're sure we're all speaking the same language. These are commonly used terms in monitoring and alerting, and sometimes they're overloaded — especially the alerting part. When we talk about monitoring, we're talking about measuring performance and correctness, observing, building instrumentation — really the extraction of metrics from our applications and services into something useful that we can use for dashboarding and for alerting. Logging is another type of information extraction. Logs are a little different from regular metrics in the sense that they're immutable and they have a timestamp, but they can also serve monitoring — for example, extracting metrics from logs, putting them on dashboards, alerting on them, and so on. Alerting is probably the most overloaded term, and it can be a verb and a noun: alerting is the process of configuring your alert rules and maintaining them; it's also the process of detecting issues and opening incidents; and informing — the act of telling you that something is wrong with the services you're running — is sometimes also called alerting. Then we have debugging, figuring out the root cause, which usually happens after alerting: getting down to the problem and mitigating it. And at the end, when all is well again, we sometimes do post-mortems, to make sure that whatever happened does not happen again, or is handled more efficiently next time.

User impact will be an important term, and we gave it its own slide of definitions. To measure something like user impact, SREs especially set themselves service level objectives, SLOs. As an example, you might have an uptime objective — uptime here being sort of a health check: you just issue an HTTP request and see if it comes back — of, say, three nines, that is, 99.9% success. For every SLO you track the corresponding indicator; that's your SLI, which in this case is uptime. There's another term, SLA, which is the actual business agreement entered into with customers; the distinction between the A and the O is kind of irrelevant to this talk, so we'll skip it. One note here: if you were aiming for, say, five nines because you heard, I don't know, Google does it or something — be aware, with everything you hear later, that you only have about five minutes per year for your entire page response up to mitigation, if anything actually happens.
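As a rough sketch of that arithmetic (plain Python, nothing product-specific; the minutes-per-year figure ignores leap years):

```python
# Rough error-budget arithmetic for the availability targets mentioned above.
MINUTES_PER_YEAR = 365 * 24 * 60  # ballpark figure, ignoring leap years

for target in (0.999, 0.9999, 0.99999):
    budget_minutes = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} uptime -> about {budget_minutes:.0f} minutes of downtime per year")

# 99.900% uptime -> about 526 minutes per year (~8.8 hours)
# 99.999% uptime -> about 5 minutes per year, i.e. the whole page-to-mitigation
# response has to fit in roughly five minutes, as noted above.
```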
Okay, so before we dive into the monitoring philosophy, best practices and so on, it's very important to look at the process around alerting, and around monitoring at a higher level. Thomas mentioned SLOs — we start by actually defining them. You have to realize what you want to monitor before you start monitoring. It sounds trivial, but we find that sometimes people do it the other way around. So think about what your SLOs are, what you want to monitor, what you want to put on dashboards, what you want to instrument. Make sure you get those metrics, either from the system you're using — the system-level metrics — or as custom metrics. And when you have all that, you can start actually configuring your alerts, so you know what you're alerting on. Everything is well — and then an alert triggers.

One thing we want to emphasize with this diagram is that the process is iterative. You're never really at a point where you have the perfect configuration and there's nothing more to do, because things change: your SLOs change, your service changes. So it's an iterative process, and especially in the beginning you should be prepared to change your alerting rules and the way you alert. If an alert triggers, usually an incident is opened — we use that term for a problem that somebody needs to take care of, so there's a time span during which the incident is active — and a notification is sent, so somebody is handling the issue. The first thing to figure out is whether the issue is real. If it is real, you need to mitigate it, debug, maybe escalate. If it's not real, then rather than just getting back to work, look at the alerting policy that triggered and see what you might need to change, or maybe delete, so you don't waste time the next time it fires. When all is said and done the incident is resolved, and if it was real and serious enough, maybe a post-mortem is needed — and then we go back to this cycle of iterating on your alerting policies.
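A minimal sketch of that triage loop, with invented names for the incident fields and policies (no particular alerting product's API is implied):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    policy: str      # which alerting policy fired
    is_real: bool    # did it correspond to actual user impact?

def triage(incident: Incident, policies_to_review: set) -> str:
    """Sketch of the loop above: mitigate real issues, feed noise back into config."""
    if incident.is_real:
        # Mitigate first, debug later; then resolve, and post-mortem if it was serious.
        return "mitigate, resolve, maybe post-mortem"
    # A false positive is not free: flag the policy so it gets tightened or deleted,
    # instead of silently paying for it again the next time it fires.
    policies_to_review.add(incident.policy)
    return "revisit the alerting policy"

to_review: set = set()
print(triage(Incident("checkout-latency-slo", is_real=True), to_review))
print(triage(Incident("cpu-over-60-percent", is_real=False), to_review))
print("policies to revisit:", to_review)
```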
Now that we're done with the definitions, let's segue into a little bit about monitoring — what it is and what drives good monitoring. Why do we monitor? Why are you here, why do you use a monitoring solution? These are probably the main use cases. We monitor because we want to detect and analyze trends — it could be business metrics, it could be service metrics; we want to see whether latency is trending up or down, or whether some resource is trending towards depletion. We want to compare over time: if it's business metrics — say, cart checkouts — how do they compare to last season or last month; if it's latency, how do latency or the error ratio compare to last week. We build dashboards — we use metrics to build dashboards so we have at-a-glance introspection into our services, rather than having to do a lot of work each time we want answers to basic questions. Of course we use metrics for alerting, which is the main topic of this discussion. And there's debugging — debugging is not part of monitoring, but it's very strongly tied into it: sometimes we instrument metrics that will help us debug and drill down (we'll go into that later), and sometimes we instrument metrics just for the debugging part, just for the firefighting, so it's important to mention.

What you really want to make sure you're monitoring is what we call the four golden signals — and that's a painting of the four horsemen, so if you want to make sure your service doesn't get to Armageddon level, monitor at least these. The first two are errors (or error ratio) and latency. Why are these important? These are the outward-facing metrics of your service; this is how your users view your service. If your latency is bad, or if you have too many errors, that's immediately reflected in the quality of service you provide. The other two are traffic, which is self-explanatory, and saturation, which is really the level to which your resources are full — how heavily you're utilizing your resources. You can look at these two as the view of your service towards the inside: if your service were a person, this is what it would be seeing — it sees the traffic and it sees the saturation internally — and these two can eventually lead to errors or bad experiences that will reflect on your users.

So you're measuring these four signals, but then there's the question of how, and we have an important distinction here between black-box (or probing) and white-box monitoring. Black box is when you monitor from the outside, treating the system as a black box, hence the name. The advantage is that it's independent of the system in terms of failure modes — at least largely, usually — which makes it more reliable, and better for alerting. One example: you might have an uptime check; for a stateful system you may want a stateful probe — for, say, a file system, you write a new file, put some content in, read it back, see if it comes back, and then delete it again to clean up. The opposite is monitoring that works with the cooperation of the system, which is great because you get a lot more detail, and you'll need that detail to do the actual debugging — but the failure modes may be tied to the system's, so you usually don't want to rely on it for alerting. A good example of something you could not possibly get from black-box monitoring is the bottom example: if you compare the actual memory handed out by your malloc with the memory given to your process by the operating system, there's no way to get that without it being exported from your process.

Just to put this in context with the monitoring solution that we offer — a quick show of hands, how many of you are familiar with Stackdriver? Great, let's say eighty-three percent, that's good; the ones that aren't, you should be. The way we offer these in Stackdriver: black-box monitoring is our uptime checks — you should be using those, setting up uptime checks and alerts on them, and probing from the outside. White-box monitoring is, for example, the system-level metrics provided for your resources, the custom metrics that you as users instrument, and the logs available via the Stackdriver suite — all of that is white-box monitoring.

We have two slides here about a pet peeve of mine. A frequently asked question with monitoring is: what is the difference between these two graphs? They show traffic of a monitored system, and the difference is the sampling rate — I made artificial graphs for the example, but the left one has a one-minute sampling rate and the right one has a one-second sampling rate. If you have very short spikes — sub-minute, for example, which happens surprisingly often — you will never see them on a typical one-minute-sampled monitoring system. And one minute really is typical; one second has a significant overhead in terms of how often you need to go talk to your system and how much it costs to store that data.
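As a concrete illustration of the black-box probing described above, here is a minimal HTTP uptime probe using only the Python standard library; the URL is a placeholder, and this is just a sketch, not how any particular product implements its checks:

```python
import time
import urllib.error
import urllib.request

def probe(url: str, timeout_s: float = 5.0):
    """Black-box uptime probe: issue one HTTP request from the outside and record
    success and latency, knowing nothing about the server's internals."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        ok = False
    return ok, time.monotonic() - start

if __name__ == "__main__":
    up, latency = probe("https://example.com/healthz")  # placeholder endpoint
    print(f"up={up} latency={latency:.3f}s")
```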
The other one is distributions — latencies in particular. The x-axis here is latency. If you're just looking at the mean, and maybe, I don't know, the mean plus some sigmas, you're already missing a lot. Then maybe you've heard that percentiles are important. But the distribution that motivated this slide looked something like this: it's from a caching front end for a storage system, and the behavior you see is that the very low latencies — sub-10 milliseconds — are cache hits, while the bump at something like 600 milliseconds is when it has to go talk to its back end. This is exceedingly common, and you will fail to see it if you only look at the mean or at a couple of percentiles.
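To make that concrete, here is a small sketch with synthetic latencies shaped roughly like that cache-hit/cache-miss distribution (all numbers are invented for illustration):

```python
import random
import statistics

random.seed(1)
# Synthetic bimodal latencies: ~80% cache hits around 5 ms, ~20% misses around 600 ms.
latencies_ms = [random.gauss(5, 2) for _ in range(800)] + \
               [random.gauss(600, 50) for _ in range(200)]

def percentile(data, p):
    """Nearest-rank percentile, good enough for a sketch."""
    ordered = sorted(data)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

print(f"mean = {statistics.mean(latencies_ms):7.1f} ms")  # ~124 ms: matches no real request
print(f"p50  = {percentile(latencies_ms, 50):7.1f} ms")   # ~5 ms: the cache-hit mode
print(f"p95  = {percentile(latencies_ms, 95):7.1f} ms")   # inside the cache-miss bump
print(f"p99  = {percentile(latencies_ms, 99):7.1f} ms")
```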
And this now gets us to the philosophy you've been waiting for. We have three core points: one, you should really be alerting on symptoms, using SLOs; two, there is a set of tests — you'll see them on a slide in a moment — that every alert should pass, and it's really, really good to remember them; and three, there is a human aspect that we'll discuss.

This section, by the way, is about philosophy, so we're going to talk about the why; later we'll show best practices, more on the how. The philosophy of alerting on symptoms focuses you, as somebody monitoring their services, on the things that really matter to your users. By alerting only on symptoms — or having the vast majority of your alerts relate to symptoms — you're not wasting time on meaningless alerts, on notifications you could have avoided in the first place, and your alerts stay tied to the quality of service your users are seeing.

The second part of our philosophy is the set of tests that you should put every alerting policy through to make sure it's valid. We have a finite budget of time and don't want to waste it on meaningless alerts, so any alert that doesn't pass these tests should be a candidate for removal. The first one is judgment: an alert should require human judgment, otherwise why waste a person's time? It can really upset people when they get alerted frequently and feel like they don't have to think or do anything, and that causes churn. It should be urgent — either right now or next business day — because if something isn't urgent, why divert focus from other, more important things? It can wait for a week or even a month. It should be actionable. This one should be clear, but I want to explain it: there should be some action to take as a result of the alert other than reading the notification. That doesn't mean you can necessarily mitigate the problem yourself. Take, for example — we know cloud providers are always failure-free, but let's say your cloud provider has some sort of failure. You should still know about it, even though there's not much you can do except email me and others asking what's up. Even then, you may want to explain to your users what's going on through your status page or your Twitter account, be on high alert, maybe later divert traffic, and so on. So that's an example of something actionable even though the action isn't fixing the service yourself. And last but not least, definitely necessary: if it doesn't affect anything — users, service, business — why alert on it? Maybe there are other, more efficient ways.

The human aspect is, we think, the most important one: you have to consider the stress budget. Every alert triggers a stress response. This is good — you need that adrenaline boost to deal with the alert, especially if you're getting it in the middle of the night — but it has a cost to the human. It's a very biological thing, and you need to make that budget last. There are two basic ways to do this: you can keep the frequency of alerting down, just the pure number, or you can make the human more confident and thereby reduce the stress of each individual alert. We call this providing help, because if you support people especially at the very beginning of the response — better training, better documentation — they will experience less stress. The second aspect of humans is that they're good at pattern recognition and somewhat lazy — again, SREs love to automate themselves out of their jobs. For alerts this applies to "actionable" in particular: people realize that something is not actionable and start ignoring it, at least subconsciously. So those are the three core points again: symptom-based SLO alerts, the tests, and the human aspect. We'll now try to develop some actual best practices that follow from them.

First, SLOs: how do you choose an SLO? This is not strictly an alerting thing, but it's very important because it's the start of alerting. You should bound it between what your service can support and what your customers can tolerate. The service gives you the upper bound on how much you can ask: if you get that wrong and set the SLO too tight, your on-callers will burn out — that's the number one risk, you get too many pages and people burn out. If you get the lower bound wrong, your customers will leave. So you have to land between those. And just as a side note, three nines of uptime also means you have 0.1% of downtime that you can use to break stuff, take risks, do something.

We want to augment that — are you all familiar with the fail whale from Twitter's early days? We actually want to bring this up as a positive example. You can't always be a perfectionist, and you can't always meet an SLO that keeps your users happy. Sometimes you have to make business decisions, such as not meeting your SLOs because you're experiencing growth, and the size of the team versus the number of new users you're getting every day just doesn't allow it. That's fine, as long as you make a conscious decision: "you know what, a lot of my users are going to be seeing this, or high latency, or not being able to log in, because right now it's all hands on deck focusing on growth." But it has to be done in a controlled manner, and you have to know why you're not meeting your SLOs.

So let's look at an example matrix. We talked about the why; now let's take a few examples for each of the four tests. First, judgment. A good example is an SLO alert — you have to determine where the problem is — or a rollout where you have to determine whether the release is good or whether you have to roll it back; you have to exercise judgment there. A bad example is when automated actions such as server restarts take care of things and you still get alerted even though the automation handled it, or when something can simply be fixed by autoscaling: if autoscaling is doing its job, you shouldn't be alerted, even if resources are somewhat stressed.
Another test is urgency. SLO and security issues are definitely urgent — you should alert on all of them. But if you find yourself saying something like "well, if this doesn't resolve itself in 12 hours, then we'll look at it," that's a candidate for revisiting the alert — again, you're probably wasting time there. The reason we're emphasizing this is that you have a limited budget for your on-callers, and you want to spend it on the right things rather than on the wrong things that don't pass this test, so you don't get to alert fatigue.

Next is actionable: when you need to determine whether something needs a rollback, or whether you need to add capacity and it's not added automatically. A bad example, which is actually quite common, is alerting on CPU percentage even though autoscaling can handle it — if autoscaling scales when your CPUs are stressed, you don't need an alert for that. Another is when issues just go away: it alerted, it was red, now it's green, nobody looked at it, and we move on. That still costs time — even if nobody handled the alert, people look at it, and it does cost time. And of course, necessary: if it affects users, the service, or business metrics — fine; anything else is not necessary. Likewise if it's covered by other alerting rules — you see these alerts that always fire together with other alerts — those are great candidates for removal.

Next, some best-practice examples of how to think about symptoms. SLOs are a little more obvious: you have uptime, latency, errors or error ratios. But you sometimes also have to alert on internal components. If you look at an internal component and think of it as a service that's serving other internal components, then you can figure out what the SLOs of that component towards its peers are, and what the black-box-style metrics are that you want to look at for it. Take batch jobs, for example: if they don't complete on time, something is happening in there, and that might affect other services and eventually your users — so you do want to alert on that, and that is still alerting on symptoms.

We do concede that beyond your SLO alerts there's sometimes one more thing you really want, and we call these cliffs, because of what they measure: the system failing catastrophically, as in driving itself off a cliff. The typical case is the resource alert: if you write your user data to disk and you run out of disk, or disk quota, or whatever — you're not going to space today. The advantage of having, say, a ninety-percent disk usage alert, which is a very typical thing to do, is that it gives you early warning; the SLO alerts would only fire once you've actually exceeded your quota and things are actually broken. I want to call out the distinction here: if SLO alerts are over-paging you and it's hurting you, you need to go and revise the SLO. If a cliff alert is not giving you a good signal-to-noise ratio, just delete it — there's nothing else to adjust, just delete the alert.
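A minimal sketch of such a cliff check using only the Python standard library; the path and the ninety-percent threshold are just the example values from above:

```python
import shutil

def disk_usage_fraction(path: str = "/") -> float:
    """White-box cliff signal: how full is the volume that holds user data?"""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

THRESHOLD = 0.90  # the "ninety percent" early-warning line mentioned above

if __name__ == "__main__":
    used = disk_usage_fraction("/")
    if used >= THRESHOLD:
        # Early warning: the SLO alert would only fire after writes start failing.
        print(f"disk {used:.0%} full -- act before the service drives off the cliff")
    else:
        print(f"disk {used:.0%} full -- within budget")
```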
Now we want to spend the next few slides talking about the human factor, which is very important in the alerting equation, and really around two main things. One is what's called alert fatigue — alerting frequency, alerting overload — which is a big problem, especially for a service at scale. The other is remembering that at the other end of the alerting policy you configure today, another, completely different person might be paged. So how do we deal with that?

First, the frequency limits — Thomas, what have we learned over the years? SRE has a fairly firm guideline of at most two outages per 12-hour shift, on average. You'll note I said outages: if you have multiple pages firing for the same outage, it shouldn't be many, but it's frequently hard to get it down to exactly one per outage, so we don't insist on that — it should be a few. But the two outages per 12-hour shift are important. If you get significantly more than this, you'll have burnout. This can be at the individual level, where someone literally burns out; it can be at the team level, where morale sinks and people leave; and you will get what I would call quality decrease: people don't consciously ignore alerts, but subconsciously they no longer have the bandwidth — they do shallow investigations and shallow post-mortems, don't really root-cause anything, never really fix anything, and so they never dig themselves out of that hole.

You might also have far fewer. I have to say, in a sort of second-law-of-thermodynamics sense, this never happens, because the number of pages only increases unless you actively invest in decreasing it. But if it were to happen, consider two points. One: handling pages gives you a lot of practice with the system — it's about the most hands-on thing you can do. You can compensate for that with training; that's not a problem, but you need to make the effort. And you may also ask yourself whether the investment in having SRE support is warranted if they're not even getting any pages.

I have some examples here, and I want to go into some detail on the first one, the alerts-cried-wolf problem. We had a subsystem of Monarch that would, on average, once or a couple of times per week, have a problem — and the problem would go away. Sometimes you would tweak a setting, sometimes you wouldn't; eventually it would go away. That led to exactly the conditioning I mentioned — it's subconscious: "this isn't really a thing." Eventually, of course, the inevitable happened: we had a real problem, and it went unnoticed for a week, even though, if you knew where to look, you could tell it was a completely catastrophic failure — that system was not making any progress with a batch pipeline. And of course, in proportion to it being stuck in that state for a week, the magnitude of the problem grew. You really want to avoid this; you see this pattern when people say "the alert went away by itself" — push back on it. The other two examples I think are somewhat self-explanatory; I just want to mention that I observed all of these firsthand, so all of this does happen — watch out for it.

And this is what fatigue actually looks like. This is a screenshot from Stackdriver; the red lines and dots at the top are alerting incidents, and this is the time span from 5am to 8am. Now, how many of you want to be looking at this at 8am? Raise your hands. Yeah. You'd have to go and figure out which of these alerts are real, how many issues you have, and what to do with them. That's clearly too many at 8am — you want to be drinking coffee and reading the politics section of the newspaper, not looking at this.
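A small sketch of how you might keep an eye on that budget by counting incidents per 12-hour shift; the incident timestamps below are invented, and ideally you would first deduplicate pages belonging to the same outage:

```python
from collections import Counter
from datetime import datetime

SHIFT_HOURS = 12
TARGET_PER_SHIFT = 2   # the "two outages per 12-hour shift" guideline from the talk

# Hypothetical incident start times pulled from your alerting history.
incidents = [
    datetime(2017, 3, 6, 5, 12), datetime(2017, 3, 6, 6, 3),
    datetime(2017, 3, 6, 6, 40), datetime(2017, 3, 6, 7, 55),
    datetime(2017, 3, 7, 14, 20),
]

per_shift = Counter(
    (t.date(), t.hour // SHIFT_HOURS)   # (day, 0 = first shift, 1 = second shift)
    for t in incidents
)

for (day, shift), count in sorted(per_shift.items()):
    flag = "  <-- over budget, expect fatigue" if count > TARGET_PER_SHIFT else ""
    print(f"{day} shift {shift}: {count} incident(s){flag}")
```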
Now, the first thing you do after getting a page — especially an SLO-based page, which after everything we've told you is probably coming from a prober: in a simple case the probed service might have some back ends (for simplicity I put two on the slide), and you'll want to know whether the problem is with one of those, or somewhere behind them such that they uniformly get errors. We call this drill-down, and the drill-down might look something like the two graphs on the next slide. In the first case you'd probably assume it's something behind the back ends; in the second case clearly one of the services is responsible — you just need to hover over the line and restart that one. I mention this because frequently, in a real setup, you will have a load balancer in between, and you need to think about how you're going to export the signal of which back end the prober hit — how you're extracting that from the prober — before the pages happen, because otherwise you're blind and you cannot drill down when you actually have a page in hand.

The other thing you want to do to help the human is to use dashboards and what we call playbooks — in Stackdriver these are sometimes referred to as run books. You want to make sure that whoever responds understands what the alert is about, what was violated, what the SLO is, which subsystem of your service it relates to, where to go, and how to mitigate. You have to provide that up front, along with the escalation path. The most common mistakes are around mitigation and escalation, because when somebody responds — especially if you're aiming for a high number of nines — the tendency is to go and debug and fix. But actually, the first thing you want to make sure of is that the problem is mitigated: maybe it's currently mitigated but about to get worse very soon, or maybe you have to do something about it, like a rollback. That's the first thing you should do. Then escalation: whoever is in charge should provide clear escalation paths, and it's desirable to add escalation timelines too, so people don't spend too much time being stuck before they turn to their colleagues.

Those last few slides were about helping the human. Now on to another thing: notification methods. You want redundancy in your notification methods and should not rely on a single one, because problems can occur at the sending end, or at the receiving end, where the on-caller is not able to receive that type of notification. So the guideline is n+1. On the right side there's a Stackdriver screenshot — it's very easy to add additional notification channels. The free tier includes email and the Cloud Console app, so we recommend you use both of them for all your alerting policies. If you want additional methods, you can use on-call rotation software like PagerDuty, chat applications, a webhook, or SMS. I want to call out SMS here, as an SRE — not because it's especially reliable, it totally isn't, but because it's the most independent from all the others on this list. So if you pick three methods, I suggest you make SMS one of them, because it gives you independence from all of the others failing simultaneously due to some issue in the system.
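A sketch of that redundancy idea: fan each notification out over several channels and treat it as delivered if any one of them succeeds. The channel functions below are placeholders, not a real notification API:

```python
from typing import Callable, Dict

# Placeholder channel implementations; in practice these would call your mail
# provider, the Cloud Console mobile app, an SMS gateway, PagerDuty, and so on.
def send_email(msg: str) -> bool:
    raise ConnectionError("mail loop!")   # simulate the sending end being broken

def send_console_app(msg: str) -> bool:
    print("console app:", msg)
    return True

def send_sms(msg: str) -> bool:
    print("sms:", msg)
    return True

CHANNELS: Dict[str, Callable[[str], bool]] = {
    "email": send_email,
    "console_app": send_console_app,
    "sms": send_sms,   # the most independent of the others, per the talk
}

def notify(msg: str) -> bool:
    """Try every configured channel; one broken path must not lose the page."""
    delivered = False
    for name, send in CHANNELS.items():
        try:
            delivered = send(msg) or delivered
        except Exception as exc:
            print(f"channel {name} failed: {exc}")
    return delivered

notify("SLO alert: checkout error ratio above 0.1% for 10 minutes")
```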
We have an example of this actually happening. It's really embarrassing if your pager fails, especially if you're on prod monitoring and you're supposed to keep the paging running. We had a case where, due to a mail loop, every outgoing page would loop back through mail and become a new page, and we delivered about 100 pages per second to a single person. If you click and show the image — oh, there he is. I have to admit this photo is from a different outage, but that one was on a similar scale, and the poor guy really did receive a very large number of pages. What happened was that for the internal paging app we used an App Engine back end, and we simply used up the entire quota of that back end, so it became non-functional for all of SRE until someone managed to intervene, about two and a half hours in. Myself, I'd say he looks happy for someone who received that many pages — he got a t-shirt, and engineers are always happy when they get a t-shirt; that's totally worth it.

Something you might be more familiar with, because it was very public, is the DNS outage of late 2016. Many prominent services went down, and even though our services were still up and running, a lot of third parties were experiencing failures, so we got a lot of support calls — because users were relying on some service that had gone down, or the APIs were not available, or the email got rate-limited. So again: it's very important to have redundancy in the way your alerts get delivered.

Here are some additional best practices; each slide stands on its own. One is grouping. You want to reduce your alerts, especially similar alerts where the policies do the same thing for a bunch of resources that act in unison — group them together into a single policy. In Stackdriver you can use Stackdriver groups, a concept inherent in the platform. Think about it: when problems occur, you don't want to get paged a hundred times; you want one alerting policy to fire and to know that it relates to a group of resources. One thing to note, though, is that if you do this, you have to have efficient drill-down mechanisms, because in many cases when that alert fires you will need to investigate and see which of the resources or components of the group is misbehaving. And that hundred-page storm we saw earlier — it might sound funny, and that person was smiling after he got the t-shirt, but it does happen, even inside Google, every year or two. So be prepared for it, be ahead of the curve, and know how to act when a hundred pages arrive simultaneously.
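A sketch of that grouping idea: collapse per-resource signals into one page per group while keeping the per-resource detail for drill-down; the resource and group names are invented:

```python
from collections import defaultdict

# Hypothetical per-resource check results, e.g. one entry per VM in a group.
failing = [
    {"resource": "frontend-a1", "group": "frontend", "error_ratio": 0.12},
    {"resource": "frontend-a2", "group": "frontend", "error_ratio": 0.15},
    {"resource": "frontend-b7", "group": "frontend", "error_ratio": 0.09},
    {"resource": "batch-worker-3", "group": "batch", "error_ratio": 0.30},
]

by_group = defaultdict(list)
for r in failing:
    by_group[r["group"]].append(r)

# One page per misbehaving group instead of one per resource...
for group, members in by_group.items():
    print(f"PAGE: group '{group}' unhealthy ({len(members)} resources affected)")
    # ...but keep the member list for drill-down, so the on-caller can see at a
    # glance whether it is one bad resource or the whole group.
    for m in sorted(members, key=lambda m: m["error_ratio"], reverse=True):
        print(f"    {m['resource']}: error ratio {m['error_ratio']:.0%}")
```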
We have two more slides on process — just some brief notes. If you organize an on-call rotation, our guideline in SRE, for that sort of quick response — 24/7 coverage, five-minute response time to being at the laptop working on the problem — is that you want two times six people in different time zones, or eight people in the same time zone. Again, this is just about burnout: with fewer people they'll be on call too much and burn out over time. Similar to how you have n+1 notification methods, you should probably have n+1 people who get notified — the primary might be on the New York subway, where they don't have any reception — so we recommend you have a secondary. This could be someone from some other rotation, or someone on the team. And if that still doesn't work, you escalate to the entire team, because you just need a panic mode, I guess. What really works for us, if you're doing weekly shifts anyway, is having a weekly on-call handoff meeting; it should only take about half an hour.

And the post-mortem process — you probably know the basic drill. One thing to call out here is that we generally write post-mortems when either the outage was announced widely, even just internally, or there was user impact. This is somewhat circular, because in the other direction the saying goes: you should really announce widely, at least internally, if you think you're going to write a post-mortem — people learn and develop experience from this. One thing to remember in the context of alerting: it's somewhat tempting in a post-mortem to put down "add an alert" as an action item. That should not usually be your only action item. It's justified to add an alert if you could have detected the problem sooner by having it, but you should still go after the root cause — post-mortems that say "add an alert" and nothing else are how you get into a situation where you have page overload.
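A sketch of that primary, secondary, whole-team chain with explicit acknowledgement timeouts; the names and timeouts are placeholders, and the function only simulates the walk rather than actually waiting:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EscalationStep:
    target: str        # who gets notified at this step
    ack_minutes: int   # how long to wait for an acknowledgement before moving on

# Placeholder chain mirroring the talk: a primary, a secondary (possibly borrowed
# from another rotation), and finally the whole team.
CHAIN: List[EscalationStep] = [
    EscalationStep("oncall-primary", ack_minutes=5),
    EscalationStep("oncall-secondary", ack_minutes=10),
    EscalationStep("entire-team", ack_minutes=15),
]

def escalate(incident: str, acked_by: set) -> None:
    """Simulated walk down the chain: stop at the first acknowledgement, so a
    primary stuck on the subway doesn't silently sit on the page forever."""
    for step in CHAIN:
        print(f"notify {step.target} about {incident!r}")
        if step.target in acked_by:
            print(f"acknowledged by {step.target}; stop escalating")
            return
        print(f"no ack from {step.target} within {step.ack_minutes} min; escalating")
    print("nobody acknowledged -- time for the panic mode mentioned above")

escalate("SLO alert: uptime check failing", acked_by={"oncall-secondary"})
```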
services you’d from the outside traffic and saturation how your service you know looks from the outside inside one every alert policy should be tested to meet these criteria judgment every alert should be urgent should be actionable and definitely necessary you should be alert being mostly if not entirely only on symptoms and these akmal include s loz and you can also alert on cliffs that will you know induce symptoms don’t forget the human factor so limit alert fatigue limit the amount of pages that somebody can get in a shift building debug ability building the ability to mitigate and help the on-call or mitigate and provide a clear escalation path for when people get stuck more than one notification method is the next thing use email use the cloud console app SMS whatever but use more than one and lastly remember that the process is iterative right so you have to respond efficiently you have to configure your alert alert sufficient you have to mitigate and also post mortem even when the problem is gone let’s make things efficient the next time around you
