Smart Alerting and Service Level Objectives With Yuri Grinshteyn

Posted on Monday, Dec 9, 2019
In this episode, host Julie Gunderson talks with Google’s Yuri Grinshteyn, Customer Engineer at Google Cloud and specialist in reliability, about the importance of smart alerting and Service Level Objectives.

Transcript

Julie Gunderson: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. Today your host is Julie Gunderson, @Julie_Gund on Twitter. This episode was recorded during the PagerDuty Summit in 2019. Julie got a chance to speak with Yuri Grinshteyn, customer engineer at Google Cloud. Yuri specializes in ensuring system reliability. So their chat focuses on the importance of smart alerting and SLOs.

Julie Gunderson: Welcome to PagerDuty studio. Coming to you live from PagerDuty Summit. I’m your host today, Julie Gunderson, DevOps Advocate at PagerDuty. And with us, we have Yuri Grinshteyn, customer engineer at Google. And we’re going to be talking to you today about smart alerting and SLOs. Yuri.

Yuri Grinshteyn: Hi Julie, thank you so much for having me. It’s a real pleasure to be here.

Julie Gunderson: Well, thank you for coming here. Tell us a little bit about yourself, Yuri.

Yuri Grinshteyn: Sure. So currently I’m a customer engineer at Google, actually in Google Cloud. I’m their specialist on reliability, which means that it’s really my job to help customers architect reliable systems, but also to help them do things like figure out alerting, which obviously is very relevant given that we’re here at the PagerDuty Summit and we’re generally talking about all things PagerDuty related.

Julie Gunderson: Well, I know you told me the alerting is really near and dear to your heart, which is a fun thing to say, but tell me a little bit more why alerting is so important to you.

Yuri Grinshteyn: Yeah, of course. It’s interesting that you say that. It’s not even that alerting in itself is near and dear to my heart. It’s actually one of the soap boxes that I find myself on on a fairly regular basis when talking to customers. It so often happens that customers will look for help with alerting. And they’ll say, “Hey, how do I know if I have an issue with my infrastructure? How do I alert on things like high CPU utilization or the high memory consumption?” And especially lately with lots of customers interested in Kubernetes observability, a lot of the questions that I get are around, how do I know when the pods are running out of memory? And this inevitably sets me off on a long and rambling rant about why you should never alert on things in your infrastructure. Why you should only alert on those things that directly impact user happiness.

Julie Gunderson: So when customers come to you, what is something that you see as a common mistake that they made? You mentioned alerting on your infrastructure. Can you tell us a little bit more about that?

Yuri Grinshteyn: Yeah. So it’s funny because I feel like it’s not even that folks are really making this mistake. It’s just something that I think we in the industry have maybe even trained our customers to do, maybe now for probably two decades. Right? I’ve spent most of my career in monitoring diagnostics, troubleshooting, and lately observability. And I remember having conversations with folks, I don’t know, 15 years ago. And they would say, “Hey, we’re using this monitoring system, but we get all these alerts from it. And so we’ve kind of ended up basically ignoring them. So can you help us figure this out?” And a lot of these systems, 15, 20 years ago would ship with just everything turned on, right? So all the monitoring would be on, but then you’d have all these alerts that would be configured out of the box with all these sort of thresholds that people consider to be the right thing to set. And so the second that you would enable your monitoring, you’d also be just flooded with alerts, right? And so you would kind of sit with the customer and they would pull up their email client and you’d just see like literally 1,000 unread messages in their inbox from the monitoring tool. And then folks got a little bit smarter and they started creating email filters. So they would just filter out all of their alerting to go into a folder where those are reviewed at their leisure. And then I think this is where folks like PagerDuty came along and said, “Hey, you’re getting all these alerts. You’re not really using them to drive action. You’re either ignoring them or you’re suffering from alert fatigue, which I’m sure you’ve heard before. Let us do the job of your email filter essentially.” And I apologize if I’m misconstruing what PagerDuty really started as, but that’s the way I’ve always thought of it as like, people just had all these alerts coming in and PagerDuty was basically the filter. It would really tell you which one of them was the important one.
And then they would escalate it to your pager. And so, because of the great work that y’all have done, we never really taught customers that, “Hey, you shouldn’t actually do that. You should not generate 1,000 alerts in minutes and then let PagerDuty figure it out. You should actually go back to the root of it and figure out what is it that you should be alerting on. And then when that’s the case, you can certainly use PagerDuty as the delivery mechanism. You can make sure that there’s proper escalation behind it because maybe the monitoring toolkit that you’re using doesn’t really have clever escalation paths.” But the fundamental problem that you’re trying to solve, which is like, hey, I get 1,000 alerts an hour and I don’t know which ones are important. That’s not the problem that we should be buying PagerDuty to solve. That’s a problem that we should be addressing at the root.

Julie Gunderson: Yeah. And PagerDuty started in 2009 as an alerting system. It was just alarms going off. But we quickly realized that it became about modern incident response and real-time operations. So we’ve combined event intelligence and visibility, but I’d say one of the big things too is alert grouping. Because as you mentioned, with all those alerts going to an email, people putting that in a filter, they miss the big ones.

Yuri Grinshteyn: Yeah. That’s exactly right. Right? So if you have a filter in your inbox which filters out everything from the monitoring system, then what happens when the important one comes in? What happens when you have a real outage? You’re going to miss it, and you’re basically training people to ignore alerts, which is really bad if you’ve got folks that really should be on call and you’re trying to deliver on your highly reliable systems.

Julie Gunderson: Now it’s easier with greenfield, right? You can set up your alerting and your notification with the best practice right there. But let’s talk about those monolithic apps. What do you do? How do you go back and retool those alerts?

Yuri Grinshteyn: Yeah. Good question. And I am going to just continue to go on my soap box here and to say like, I don’t actually care what the application is. I don’t care how it’s architected. I don’t care if it’s monolithic, if it’s microservices, if it’s a web app or something else. What I really care about is who is the service owner, right? Who is ultimately accountable for reliability of that system? Who’s the one that receives the notifications when there’s a problem? And what does user happiness mean in this context? Right? Whatever the service is, there are users of it. Those users may be people, they may be other services in the case of an API. We may be talking about a data processing pipeline, in which case you have inputs and outputs, and then you have data producers and data consumers. But in all of these cases, the thing that we really need to understand is what is the actual service we’re providing and what metrics can we use to determine whether or not our service is doing the job that we built it to do. And so obviously everything I’m saying is well documented in the SRE books. And I’m really referring to the service level indicators, which are the things that tell us whether or not a service is doing the job that it’s supposed to do. And then service level objectives, which is what are the things that we measure our indicators against to determine whether or not things are well or there’s a problem.

Julie Gunderson: Yeah. And tell us a little bit more about how you’ve used SLIs and SLOs to maximize the customer satisfaction at the end of the day.

Yuri Grinshteyn: Right. And so we use SLIs as a proxy for user happiness, right? For user-facing services, the things that you’re really going to care about are availability and performance, right? Availability is like, can the user do the thing they’re trying to do? And performance is, how long does it take them to do the thing they’re trying to do?

Julie Gunderson: So when you set up your alerting, you’re always keeping that in the back of your mind.

Yuri Grinshteyn: Yeah, exactly. So I would actually recommend that the only things we should get alerts on, and by alerts, I really mean things that we consider to require immediate human intervention, right? Those are the things that go to our pagers. Those are the things that wake people up. Those are the things that then result in escalation if the initial page is not acknowledged, right? Whatever the case may be. If we have a good indicator in place that tells us that a service is healthy or is not healthy. And then if we also have a good service level objective in place, which tells us like, based on the value of the indicator, what do we compare that against to determine whether we have a problem or not? Then we can very confidently configure alerting in a way that says, okay, here’s the value of our indicator as compared to our objective, we’re either meeting our objective or we’re not. It’s also important to understand the concept of error budgets, which is essentially just sort of the inverse of our objectives, which is, let’s say that I talk about a service that has 99% availability over a rolling 30 day window. That means over the last 30 days, 1% of the requests to that service can fail. And so I track that error budget. Obviously if that’s 1% over 30 days, my average error budget burn rate is going to be, what is it? One 30th of a percent per day. And then if I have a clever enough monitoring and alerting system, I should actually be able to alert on things like error budget burn rate, and to kind of know when the burn rate is looking like we’re actually going to exceed our error budget at the end of the period that we’re comparing that against. And those are the things that should then be sent to people to take immediate human action. Anything else, infrastructure issues, things that are actually not directly contributing to or impacting user happiness, those should be created as tickets in our ticketing system, right?
Those are the things that can be responded to in a matter of hours or days, right? They do not need to be responded to in a matter of minutes. So those things should go into the queue. They should be addressed using good engineering practices, but there’s certainly no need to wake someone in the middle of the night because one server crashed. Now, the question you asked, right? Is like, what do we do when we have a monolith, right? In the case of a monolith, a server crash may actually be enough to bring down our system depending on which server it is. Right? If we’re talking about sort of a typical web, app, database three-tier application, and we just have one big database server, if that server crashes, that’s bad. But our monitoring should let us know that the server has crashed because we’re immediately going to see user requests start to fail. Right? Now, the server being down obviously is important in that case. But if we have a clustered database, then like one of the database nodes going down is theoretically something that the application is resilient to, and hopefully can continue to work. And maybe someone can wake up in the morning, have their coffee, and then get to work rather than be woken up in the middle of the night and sort of groggily trying to figure out what’s going on.
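
The error budget arithmetic Yuri walks through (a 99% objective over 30 days leaves 1% of requests that may fail, or one 30th of a percent per day on average) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WindowCounts` type, the function names, and the paging threshold of 1.0 are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts seen so far in the SLO window (hypothetical type)."""
    total_requests: int
    failed_requests: int


def error_budget(slo_target: float) -> float:
    """Inverse of the objective: a 99% target leaves a 1% error budget."""
    return 1.0 - slo_target


def average_daily_budget(slo_target: float, window_days: int) -> float:
    """Budget spread evenly over the window, e.g. 1% / 30 days."""
    return error_budget(slo_target) / window_days


def burn_rate(counts: WindowCounts, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate.

    Assuming steady traffic, a value of 1.0 spends the budget exactly
    by the end of the window; anything higher is on course to exceed it.
    """
    if counts.total_requests == 0:
        return 0.0
    observed = counts.failed_requests / counts.total_requests
    return observed / error_budget(slo_target)


def should_page(counts: WindowCounts, slo_target: float,
                threshold: float = 1.0) -> bool:
    """Page a human only when the burn rate exceeds the threshold.

    The threshold is an assumed policy; anything below it would go
    to a ticket queue instead of a pager.
    """
    return burn_rate(counts, slo_target) > threshold
```

With a 99% objective, 20,000 failures out of 1,000,000 requests is a 2% error rate, so `burn_rate` comes out at roughly 2 and `should_page` returns `True`; 5,000 failures gives roughly 0.5 and no page.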

Julie Gunderson: Well, and you’ve brought this up and we hear a lot about alert fatigue, which leads to burnout and making sure every alert that wakes a human up is human actionable. Sounds great, but isn’t always easy and it comes down to fine tuning. Do you have recommendations?

Yuri Grinshteyn: So tell me why you think it’s not always easy. You’ve obviously been working with customers a lot around alerting. Why do you think it’s not easy?

Julie Gunderson: Well, I think when people aren’t using alert grouping or smart features like that and they’re doing it on their own, they still, well, I think that people want to see everything. They think that if there’s not an alert, they don’t know about something, and they’re worried because they’re not receiving that alert. And it’s about trusting your system, trusting your monitoring tools. What are your thoughts?

Yuri Grinshteyn: Yeah, that’s a great point, right? I think people often feel like if they don’t have an alert for it, it’s not happening. You have to invest in the monitoring enough to know that this information is still being captured, that it’s being recorded, and that if you do need it for troubleshooting or debugging, the metrics are there. Just because it’s not subject to an alert doesn’t mean it’s not being captured. And so it’s really important to have this good primary indicator of service health, and then everything else is essentially something that you should only rely upon when you’re doing troubleshooting or debugging. Like really when there’s either an incident in flight or like maybe your monitoring is predicting that you will have an incident. You’re starting to see your error budget burn go up. You haven’t yet consumed all of it. You are not yet in violation of your service level objective, but you can see that things are trending in a bad way. Okay. Then it’s like, great. What’s going on? Where do I see corresponding trends in my system? But you don’t need an alert on all of that stuff because again, the only thing that you really need to take immediate action on is increased error budget burn rate, SLO degradation or SLO violation.

Julie Gunderson: Well, and the metrics are there. You don’t have to have an alert to understand your metrics, but let’s decrypt the metrics a little bit more. I’d love to go a little bit deeper for those who are not using SLIs and SLOs the way you are. And let’s go deeper on how do we really dig deep into what the customer experience means when you’re looking at service level indicators and objectives.

Yuri Grinshteyn: Yeah. Great question. So I think customer experience is usually encapsulated in just a few fairly straightforward things. If we’re talking about your typical web user facing service, right? The first thing you have to start with is like, what are the actual user journeys, right? What do people need to do in order for your service to deliver the value that it’s supposed to deliver? So you have to have a good understanding like, what are people actually trying to do? And then some way of quantifying, how do you measure the success of those things? Right? And there are some very easy examples, right? If we sell stuff online, you just have to be able to pull up the homepage, do a search, get results, pick a product, add it to the cart, complete the purchase, right? You can split each one of those up. You can figure out how you measure both the availability and performance of each of those steps, essentially whether the pages that support all of those actions are actually delivering on the user experience that you’re trying to deliver on. And then it’s a matter of, okay, now that we know what are the things that we are trying to measure that represent user happiness, what are the metrics we can then use to actually quantify those? So if we start at sort of the very first step, right? Like loading the home page, a very straightforward way to measure that is going to be basically the availability, or what percent of our requests do not result in an error and are delivered successfully? And then performance, how long does that actually take? From there, it’s a matter of figuring out what is the instrumentation method that I’m going to use to collect this information? Generally speaking, the closer that we’re able to collect this information to the customer, the more accurate it is going to be. That’s not always realistic, right?
We’re not always going to want to invest in client side browser instrumentation, or that may not even necessarily be an option for us to add telemetry to the user’s JavaScript to then report back to our monitoring system. So perhaps the best we can do is look at our load balancer logs that hopefully are logging every request, that are logging the response code, that are logging latency on that request that’s delivered back to the user. And then from there we have a pretty good representation of user experience for the page in question. From there we can create metrics from those logs, extract those numeric payloads and turn them into metrics. And then hopefully we can convert them into service level objectives. So we can decide that like, hey, our homepage needs to be, whatever, four nines available over a rolling 28 day window, then there we go. We’re basically done with that part of it. We can create sort of the corresponding thing for latency as well. So we can say that like the 99th percentile latency has to be under whatever, 300 milliseconds, yeah, over a 20 day window or something like that.
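
As a rough illustration of turning load balancer logs into the two indicators Yuri mentions, the Python sketch below computes availability and 99th percentile latency from a list of parsed log records and compares them with the example targets from the conversation (four nines, 300 milliseconds). The `LbLogEntry` shape, the choice to count only 5xx responses as failures, and the nearest-rank percentile method are assumptions made for the example; real log formats and SLI definitions vary.

```python
import math
from typing import List, NamedTuple


class LbLogEntry(NamedTuple):
    """One parsed load balancer log record (hypothetical shape)."""
    path: str
    status: int
    latency_ms: float


def availability(entries: List[LbLogEntry]) -> float:
    """Fraction of requests that did not end in a server error (5xx)."""
    if not entries:
        return 1.0
    good = sum(1 for e in entries if e.status < 500)
    return good / len(entries)


def percentile_latency(entries: List[LbLogEntry], pct: float) -> float:
    """Nearest-rank percentile of request latency in milliseconds."""
    latencies = sorted(e.latency_ms for e in entries)
    if not latencies:
        raise ValueError("no log entries to measure")
    rank = max(1, math.ceil(pct * len(latencies) / 100))
    return latencies[rank - 1]


def meets_slo(entries: List[LbLogEntry],
              availability_target: float = 0.9999,
              p99_target_ms: float = 300.0) -> bool:
    """True when both example objectives hold for the given window of logs."""
    return (availability(entries) >= availability_target
            and percentile_latency(entries, 99) <= p99_target_ms)
```

In practice the entries would first be filtered to the page in question (for example the homepage path) and to the rolling window being measured before calling `meets_slo`.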

Julie Gunderson: Fantastic. Thank you for that. Let’s talk a little bit about monitoring-

Yuri Grinshteyn: Sure.

Julie Gunderson: … and really how monitoring relates to your alerting. Let’s talk about the technical debt that monitoring can really have. You want to talk about that a little bit?

Yuri Grinshteyn: Yeah. That is a good question. I think technical debt is often expressed in this area as essentially almost a gap in knowledge, often as a result of the accumulation of time. It’s like we started building this thing three years ago. It’s sort of morphed and evolved since then, and our understanding of it, which in my opinion is incredibly important to good monitoring, to really have a very clear understanding of how your system works and what it does and how the components interact with each other, has just not kept up with the system as it’s evolved. And so you end up in these situations where people essentially have to treat their system as a black box. And so you’ll hear the term black box monitoring used very often because they don’t actually know how it works. So all they know is, hey, user requests go in, user responses come back out. There are some database queries that happen on the other end of it. But what happens in the middle is really hard to understand. And I think that’s one way in which technical debt can really manifest itself in monitoring: it just limits your ability to really understand what’s happening within the system beyond what are the inputs and what are the outputs.

Julie Gunderson: Okay. So when it comes down to it, it’s about having the alerting set up correctly so that it’s tying to your SLIs and your SLOs, understanding how your monitoring system is setting off those alerts. And at the end of the day, not sending everything to the spam file in your email.

Yuri Grinshteyn: Yeah. I think, on one hand, I completely understand how it’s very easy for me to sit here like, hey, I work at Google, you work at PagerDuty. Our companies literally write the books on this stuff. And so it’s easy for us to sit here and say, “Hey, everyone should have good SLOs and SLIs and should only alert on these things.” But the reality is that when we talk to our customers, part of the problem honestly is that there’s a real organizational inertia around how this stuff is done, and it falls to us to help our customers understand that even in a situation where you do not have the freedom to really have true service ownership, where you’re not empowered to have error budgets that are meaningful in that you use them to actually enforce behavior and not just alert on, there’s still really, really good value in understanding what is the service that you’re supporting supposed to do? How do each of the components of the service work? And what are the things that you should be alerting on because they require immediate human intervention rather than just something that can go into a ticket queue and be dealt with tomorrow?

Julie Gunderson: And I want to follow up on something. I’ve written a couple of articles on full service ownership. You’ve mentioned service ownership and the empowerment that it has to engineers. I’d love to hear what your definition of service ownership is.

Yuri Grinshteyn: Yeah. And to be honest, I don’t know that I have a great definition of service ownership. I think of it more in terms of engineer empowerment, which is if I am part of the team that owns reliability for a particular service, I have to be able to control the things that actually have an effect on that reliability. Right? If I have the accountability, I’ve got to have the power. And that means I have to have the power to stop changes from happening to the service when the reliability is compromised. And this is where error budgets are the way that we recommend folks actually implement this. Right? You measure error budget, you measure error budget burn. And if the burn is exceeding what you’re comfortable with, you do things, like you freeze changes, you freeze feature releases and you really work on reliability until your service is restored and you’re regaining error budget rather than burning it. And that’s what I’m talking about when I say that folks aren’t really empowered to do that, right? If you’re not running an SRE team, a site reliability engineering team, you may very well be in a situation where development and ops are still siloed. They’re not really friends. They’re not talking to each other. They’re working at cross purposes because those incentives aren’t aligned, right? We’re talking about developers being incentivized by velocity and speed of delivering features, and operations is largely incentivized by stability, which often happens as a result of resisting change. It’s totally understandable how that then drives human behavior and how developers want to push code and ops wants to do nothing because things are working.

Julie Gunderson: So what’s your advice for those teams though that are still in that space?

Yuri Grinshteyn: Try to make friends with folks on the other side, join forces and see that it’s actually better to be working together. I think one of the biggest things that I’ve realized fairly recently, and we talk about this pretty publicly, is that shared error budgets are actually a way to drive velocity rather than impede it. I think folks are often resistant to this idea of having an error budget because they think that’ll prevent them from doing stuff. But what we’ve actually seen is when error budgets are implemented properly and people actually use the error budgets, it gives them more leeway to do stuff too. It gives them more leeway to push features faster, to experiment even, to do things like chaos engineering, for example. Because they know that, hey, we actually have this error budget available. Let’s use it to drive change in the system. Let’s use it to make our system more reliable. Let’s use it to test how reliable our system is. Let’s use it to do things like test whether our backups are working, whether our fail-over is working, whether our replication is working. And that’s the kind of stuff that over time makes your system more reliable, which will then again free up more error budget for you to do further experiments and increase velocity. I think if you think about how folks are operating when they don’t have a good error budget, you just sort of have this constant state of tension over pushing, like, do we push something or do we not push something? Because we actually don’t know whether we should take a risk right now. And having a good error budget in place, you can look at it at any point and say, “Yes, we have error budget. Let’s do the thing.” Or, “No, we don’t. Let’s wait until things are in a better place.” And so error budgets very much drive innovation and velocity.
And I think I would take that message to the folks that are sort of having a kind of conflict over this and tell them to actually join forces, come together, agree on an error budget that’ll really help both sides.
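
The “Yes, we have error budget, let’s do the thing” check can be expressed as a small release gate. The Python sketch below is a minimal example under assumptions: the function names, the idea of measuring the budget against traffic seen so far in the window, and the `freeze_below` cutoff are all hypothetical, and a team would agree on its own policy.

```python
def remaining_error_budget(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Fraction of the error budget still unspent for the traffic seen so far.

    1.0 means nothing has been spent; 0.0 means the budget is used up;
    negative values mean the objective has already been violated.
    """
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1.0 - failed_requests / allowed_failures


def release_decision(remaining: float, freeze_below: float = 0.0) -> str:
    """Return "ship" while budget remains above the cutoff, else "freeze"."""
    return "ship" if remaining > freeze_below else "freeze"
```

For a 99% objective, 2,500 failures out of 1,000,000 requests leaves about three quarters of the budget, so the decision is `"ship"`; 12,000 failures overspends it and the decision is `"freeze"` until reliability work restores the budget.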

Julie Gunderson: That’s great advice, Yuri. And it looks like it is about time for us to get back to PagerDuty Summit. So any parting words for our listeners today?

Yuri Grinshteyn: Well, first thank you for having me, I really appreciate it. It’s a real honor to be here. But also, yeah, think about, what is your service doing? How can you tell if your users are happy? Start with that and then everything else will follow.

Julie Gunderson: Fantastic. Well, thank you. And thank you for being here. And this is Julie Gunderson from the PagerDuty studio, wishing you an uneventful day.

George Miranda: That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show at pageittothelimit.com. And you can reach us on Twitter @PageIt2theLimit using the number two. That’s, @PageIt2theLimit. Well, let us know what you think of the show. Thank you so much for joining us. And remember uneventful days are beautiful days.

Show Notes

The History of Alerting

Yuri discusses the history of alerting and why alerting should be on things that impact the customer experience.

Yuri: “It’s one of the soapboxes that I find myself on when talking to customers. It so often happens that customers will look for help with alerting; ‘How will I know if I am having an issue with my infrastructure, how will I know if I have high memory consumption?’ You should never alert on things in your infrastructure, you should only alert on things that impact your customer experience.”

Yuri and Julie discuss common mistakes customers make dating back to the beginning of alerting and how products like PagerDuty have changed the way alerting should be done.

A little bit about PagerDuty

Yuri and Julie discuss where PagerDuty came from and where it is today.

Yuri: “The fundamental problem that you are trying to solve, which is like hey I get 1,000 alerts an hour and I don’t know which ones are important. That’s not the problem we should be buying PagerDuty to solve, that’s the problem we should be addressing at the root.”

Julie and Yuri continue to discuss the issues with email filtering and essentially “training people to ignore alerts.”

Service Level Indicator and Service Level Objective Usage

Yuri talks about how it’s about the service owner and who is ultimately accountable for reliability of that system and what user happiness means in this context. He continues to discuss SLIs and SLOs.

Yuri: “We use SLIs as a proxy for user happiness.”

Yuri and Julie discuss setting up alerting with SLIs and SLOs in mind along with alerts needing to be humanly actionable, with a little bit of error budgets sprinkled in.

Yuri: “Things that are not directly contributing to or impacting user happiness, those should be created as tickets in a ticketing system… there is no need to wake someone in the middle of the night.”

Continued discussion around when alerting should wake someone up in the middle of the night.

Julie: “Making sure every alert that wakes a human up is humanly actionable sounds great but isn’t always easy, and it comes down to fine tuning. Do you have recommendations?”

Yuri: “People often feel that if they don’t have an alert for it, it’s not actually happening.”

Decrypting the Metrics

Julie: “Let’s go deeper on how do we really dig deep into what the customer experience means when you are looking at service level indicators and service level objectives.”

Yuri: “You have to have a good understanding of what are people actually trying to do, and then some way of quantifying.”

The discussion continues on what metrics we use to quantify customer success and performance.

Yuri: “The closer we are able to collect this information to the customer, the more accurate it is going to be.”

Technical Debt

Shifting topics to how technical debt is expressed as a gap in knowledge, and how people treat their systems as a black box.

Yuri: “You’ll hear the term ‘black box monitoring’ because they don’t actually know how it works.”

Continued discussion on how technical debt manifests itself in monitoring.

Service Ownership

Yuri discusses what service ownership means to him and how it is really “engineer empowerment”, and what that means to reliability.

Additional Resources

Guests

Yuri Grinshteyn

Yuri Grinshteyn is a Customer Engineering Specialist at Google Cloud, where he works with customers to help them design reliable architectures and advocates for SRE practices and principles.

Hosts