Quintessence Anx: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Quintessence, or QuintessenceAnx on Twitter.
Today we’re talking about Liz’s upcoming talk, Cultivating Production Excellence, at PagerDuty Summit at the end of this month. We’re joined by Liz Fong-Jones, principal developer advocate at Honeycomb, where she makes developers, operators and workers on the whole more productive and empowered. Her background is in site reliability engineering, with over 16 years of experience, and she has worked on a few products and services you may have heard of, like Google Cloud’s Load Balancer and Google Flights. Liz, welcome to the show.
Liz Fong-Jones: Thanks for having me.
Quintessence Anx: Getting right to it. We’re looking forward to your talk at Summit. Can you tell us a little bit about what inspired it?
Liz Fong-Jones: Yeah. I spent over 11 years at Google as a site reliability engineer, and over the course of my time there, I realized that there’s this gulf between what Google was saying was best practice, what Google was actually doing, and then what people were interpreting based off of the SRE book and based off of people who had left Google. There’s kind of this spectrum of experiences with production. I really wanted to kind of distill down those 11 years of experience at Google into something that was more concrete and practicable. Four quick things that you can do, or not so quick things, four things to work towards as you develop your culture. That’s kind of where this idea of production excellence and four principles of production excellence came from.
Quintessence Anx: That’s awesome. Related to that, what common myths or misconceptions about production did you find yourself answering from that experience?
Liz Fong-Jones: I think the most common misconception that I run into is people thinking, “Oh. Production has to be a nightmare. Production has to be a mess. This is why we pay ops people,” right? It’s this culture of masochism on the part of ops people or a culture of, almost like abusive behavior or just tolerance of bad behavior where people assume that production has to be bad. Therefore, that production has to be run by people that you pay to sit in the trenches. I think that that misconception was really, really thoroughly debunked for me at Google when I saw that there were teams that were just running their own services.
Liz Fong-Jones: That you didn’t need to have a network operation center. That you could have services that behaved perfectly fine and weren’t generating a lot of noise and that, as long as you had the right setup and you equipped people correctly, that people could actually be on call and not have it ruin their lives. I think that’s the number one misconception, is that production is scary, production is going to ruin your life. I’m here to tell you that no, it’s not going to ruin your life, but you have to clean up your service a little bit.
Quintessence Anx: That all makes a lot of sense. I could definitely see how painful experiences train your brain to think that it’s just going to be more pain. Speaking of pain resolution, what do you feel are the biggest mistakes people make when adopting SRE practices?
Liz Fong-Jones: I think the biggest mistakes people make when adopting SRE practices are, number one, they try to do everything that Google says exactly by the book. That clearly is not going to work, because your company is not Google and therefore different things are going to apply. Secondly, I tend to see that people try to adopt SRE all at once within their organizations. Especially in large enterprises, they have a leader who’s like, “We’re going to do SRE. We’re going to do SRE. Everyone do SRE.” It doesn’t really work that way. You have to start small. You have to expand out from there. You have to build up that model of what does a good SRE team look like in your organization before you try to scale it out.
Quintessence Anx: That makes sense. When people are trying to scale it out, is there anything that you want to expand on for that last point?
Liz Fong-Jones: I think with regard to scaling it out, you have to develop what I think of as a community of practice within your SRE org. That doesn’t necessarily mean that everyone has to report to the same manager or that everyone has to be on a dedicated team. Instead, you have to think about how you get these insights and knowledge shared between the various people who are fulfilling that SRE job role, even if they’re not necessarily called that in terms of their job title. How do you make sure good production practices are spread throughout your organization and people have somewhere to start, if they’re just getting started?
Quintessence Anx: That makes sense. It sounds like leaning in a little bit, lots of collaboration, how does this collaboration help improve observability? Basically the end result, right?
Liz Fong-Jones: Yeah. So we haven’t yet touched on what observability is. So let’s quickly define it for our listeners here. Observability is the ability to debug your system. The ability to ask any question about your system and understand what’s happening inside of your system without necessarily having to push new code. So observability is not just the tooling, it’s a socio-technical capability. It’s the ability of the people and the existing tooling working together in order to solve these problems. I think that’s kind of one dimension of collaboration, which is something that I’ve definitely heard described by my friend, Jessica Kerr, as this idea of people and systems working together. Then the other dimension of collaboration is collaboration across team boundaries. You shouldn’t have someone else’s service be a complete black box to you. You should have tracing or some kind of other capability to really tie your services together, both in terms of having your RPCs flow to each other, yes, but also being able to debug when those RPCs go wrong. I think that’s really how collaboration improves observability, is erasing some of those boundaries between services that would otherwise be opaque or hard to understand, or unwieldy when it comes to automation and systems working together.
Quintessence Anx: That all makes a lot of sense. Going back to your definition of observability, I know that you spend maybe a little bit of time talking about it, but when people are talking about asking questions of their system, do you have anything off the top of your head that can help people kind of understand what that means? It’s not just a human-dictated question, it’s a human-dictated question converted to a machine question somehow, where you can actually get the right data.
Liz Fong-Jones: It’s this stream of consciousness. It’s this dialogue with your system. So when I get paged, the first thing that I try to figure out is how bad is it? Who’s impacted? Then from there, I start going into questions like why is this happening? Is it because this component has failed? Is it because we pushed a new version? Is it because this one user is sending us this really weird combination of traffic? Is it some combination of these things? Those are some of the questions I might be seeking answers to that help me debug and narrow down what’s happening inside of my system, so I can better mitigate and solve the issue.
So I think that when we’re trying to formulate these things, often our systems tend to hard-code us into only being able to analyze it along specific dimensions. Only being able to analyze things by service or only being able to analyze things by machine. It turns out that user behavior is a much more interesting spectrum of behavior. So you have to be able to understand is it because user one, two, three, four started issuing more queries? Can you break down by users? I think that that’s kind of where we start getting into the differences between traditional monitoring-based approaches and observability, where you can actually be in this dialogue with your system and iterate on your queries.
Quintessence Anx: That makes sense. When we’re talking about user behavior and things like that, how do you make sure that the questions are appropriately, I guess, granular? You can think you’re pulling data that helps you, only to find out maybe it’s too broad, too specific. You’re not getting out of it what you thought you were going to.
Liz Fong-Jones: Yeah. I think that there’s two answers to this. One of which is that, first of all, privacy has to be paramount. You cannot have unaudited access to production. People should be able to make sure that you’re only using your access for appropriate purposes. That means not inappropriately digging into users’ data. I think that’s one dimension. I think the other interesting dimension here, with regards to making sure you’re getting the right data, is always being able to check and verify your assumptions. Can you cross check your hypothesis and make sure that it holds up from a different angle?
For instance, if you think that it is only one user that’s having a problem, then you might cross check to make sure is the increase in error rate actually correlated with this user sending more traffic? Can you actually specifically, not just correlate it in time, but specifically see that user’s error rate spike? Can you make sure that it’s happening across multiple data centers and that’s not just an artifact of, “Oh, this user’s traffic happens to be going to this one data center that’s having a problem”? You have to approach it from multiple different angles to really get that kind of three dimensional shape.
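To illustrate the kind of cross-check described above, here is a minimal Python sketch. It assumes request events are available as simple records; the field names (`user_id`, `datacenter`, `error`) and the sample data are hypothetical and not taken from any particular tool. The idea is to compute the error rate along one dimension, then another, then both together, to see whether a hypothesis holds up from more than one angle.

```python
from collections import defaultdict

# Hypothetical request events; field names and values are assumptions
# made purely for illustration.
events = [
    {"user_id": "u1", "datacenter": "dc-a", "error": True},
    {"user_id": "u1", "datacenter": "dc-b", "error": True},
    {"user_id": "u1", "datacenter": "dc-a", "error": False},
    {"user_id": "u2", "datacenter": "dc-a", "error": False},
    {"user_id": "u2", "datacenter": "dc-b", "error": False},
    {"user_id": "u3", "datacenter": "dc-b", "error": False},
]


def error_rate_by(events, *dimensions):
    """Return {key_tuple: error_rate}, grouped by the given event fields."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        totals[key] += 1
        if event["error"]:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# First angle: is one user's error rate standing out?
by_user = error_rate_by(events, "user_id")
# Second angle: or is one data center the real culprit?
by_datacenter = error_rate_by(events, "datacenter")
# Combined: does the user's error rate hold up in every data center?
by_user_and_dc = error_rate_by(events, "user_id", "datacenter")
```

In this made-up data, the two data centers look the same as each other while one user fails in both, which is the shape that would support the "it is this user" hypothesis rather than the "it is this data center" one.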
Quintessence Anx: That makes sense. Getting into how people can design around that, people often treat their alerts around all these things as a safety blanket. How do you convince people it’s okay to remove something that’s too noisy, speaking of scope and things like that?
Liz Fong-Jones: Yeah. People tend to assume that just because a dashboard or an alert caught something the last time, it’s going to be good indefinitely into the future. That’s not necessarily true. Alerts and dashboards definitely have a shelf life. They’re good for catching things that are specific to one thing that bit you in the past, but your next outage is not going to look like your past outage necessarily. As these things accumulate and accumulate, eventually you wind up overpaging, where people get woken up for things that aren’t actually impacting system behavior. So you kind of have to pull the analytics out of PagerDuty. You have to say, like, “We got paged a hundred times over the past month. That’s three times per day, and of those, 50 were not actionable. We didn’t do anything. We just turned it off and went back to sleep.”
That’s not necessarily healthy. There’s a limited number of pages that you can cognitively interpret before your brain starts becoming overloaded and you just start ignoring or doing a bad job on everything. I think that that’s the lesson to learn: over-alerting is equivalent to not alerting at all, because you start treating all the alerts as noise. That’s kind of how I encourage people to approach it, this idea that observability helps you debug from first principles. You kind of only need a few basic alerts and you don’t need a granular alert for everything in the system. That enables you to have that cognitive bandwidth.
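The page review described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page records are hypothetical dictionaries with made-up `alert` names and an `actionable` flag that someone filled in by hand, not the output of any real PagerDuty API. The sample data mirrors the numbers from the conversation: one hundred pages in a month, fifty of them not actionable.

```python
from collections import Counter


def summarize_pages(pages, days):
    """Summarize paging load and which alerts produce non-actionable pages."""
    total = len(pages)
    non_actionable = sum(1 for page in pages if not page["actionable"])
    noisy = Counter(page["alert"] for page in pages if not page["actionable"])
    return {
        "pages_per_day": total / days,
        "non_actionable_fraction": non_actionable / total if total else 0.0,
        # Alerts that most often woke someone up for nothing.
        "noisiest_alerts": noisy.most_common(3),
    }


# Hypothetical month of pages; alert names are invented for this example.
pages = (
    [{"alert": "disk_usage_warning", "actionable": False}] * 30
    + [{"alert": "cpu_high", "actionable": False}] * 20
    + [{"alert": "high_error_rate", "actionable": True}] * 50
)

summary = summarize_pages(pages, days=30)
```

The alerts at the top of `noisiest_alerts` are the natural first candidates to tune or remove at the next review.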
Quintessence Anx: That makes a lot of sense. I know that one of the things that we talk about at PagerDuty, of course, is alert fatigue. To that same point, you have too much of a good thing and now it’s not a good thing anymore. Something that comes to mind is how frequently, or not, to flush out those alerts. Not necessarily outright delete them, but how to iteratively edit them, and sometimes it depends what the setup is. Does it make sense to review it after an incident’s resolved and say, “Hey. This helped me or not,” or maybe on a sprint cadence or something? Do you have any thoughts or recommendations on how often to edit, entropy balance your alerts?
Liz Fong-Jones: I think every two to four weeks as appropriate, because past that you lose the context of, “Hey. That alert was super spammy, because the person who is on call doesn’t even remember being paged for it.” It has to be relatively fresh. I think that sprint cadence is entirely appropriate, but I think along that same line, as you were saying, if you have a major incident, you might have a retrospective. Your retrospective might say we think we should add an alert for X. Revisit in six to eight weeks, make sure you still feel that way.
Quintessence Anx: That makes sense. Pivoting a little bit, but relevantly, is there anything you like or dislike about alerting platforms as they are now in general?
Liz Fong-Jones: One of the favorite things that I had access to when I was at Google was the alert manager system at Google, which was very much infrastructure as code as the primary way of interacting with it. So you could kind of see who is going to be on call at a given time. You could edit the text file and submit a pull request to put someone else on call. I think that programmatic access is something that I wish were a lot easier outside of Google. Overall, I think that having people start dispatching alerts, having people have easy ways of escalating is an enormous change that has enabled a lot of more legacy enterprises to move towards a more DevOps-y model. I’m really excited about it.
Quintessence Anx: That makes sense. Okay. Talking a little bit about Honeycomb. How is Honeycomb adapting to COVID? Since that’s on everyone’s mind.
Liz Fong-Jones: One of the funny stories is that last year we started doing these disaster drills where we would all pick a week where everyone worked from home and the company would be full remote. This was in July, August, September of last year. When COVID came and we all had to work remote, everyone had at least some exposure to working remotely, which was really, really interesting and made our lives a lot easier, because all of the remote access challenges that people were struggling with outside of Honeycomb were not really giant issues for us. It meant that we knew how to talk to each other in Slack.
Liz Fong-Jones: We knew how to debug incidents without sitting in the same room together. Definitely I think as this has wound longer and longer and longer, I am definitely starting to miss seeing my colleagues in person. I’m starting to miss physically sitting over the same whiteboard and creating something with them. I think that’s been the main challenge for us, but it’s definitely been an exciting time for us at Honeycomb, given that we really have had customers that are doing critical work related to either COVID research, including the Chan Zuckerberg Initiative, as well as productivity apps like Slack that are using Honeycomb. It’s been very, very important, even more important for us to be available to help people debug their systems so that the whole world can stay productive during this.
Quintessence Anx: That makes sense. You mentioned having a good short-term plan, to your point about disaster scenarios and practice, but since this became a long-term disaster, rather than a short-term one, is there anything that you’d like to talk about with how you manage things that changed beyond a week or a day?
Liz Fong-Jones: Yeah. The scenario we were all imagining in August and September originally was like, “Hey. We might lose the lease on the office because we didn’t necessarily have enough money to pay a new security deposit when our lease ran out.” Then it was like, “Hey. This might be helpful if there was an earthquake, or the Bay Area gets taken out of commission,” but instead it’s everyone that’s working from home. I think that, as you were saying, as this has gone on longer and longer, work/life balance is increasingly hard. It’s hard to actually peel myself away from the computer when I’m just working, working. Time zones also. When you work three hours apart from the rest of your colleagues, the temptation is there to just sit and work and work and work. That’s not necessarily the best idea in the world. Yet it’s a thing that, without clear boundaries between work and not, it’s kind of hard to navigate. Then I think the other thing is, personally, I’m in the middle of moving in a pandemic, which is also fun and hilarious.
Quintessence Anx: Yeah. I actually recently did my own move and I think mine was a bit shorter in terms of distance than yours, but it’s definitely a logistical thing.
Liz Fong-Jones: Our training as engineers who focus on operations is really, really helpful for that. In that it allows you to break things down into pieces. Delegate tasks, have an incident commander. My spouse and I were planning to move and we were like, okay, this is how we’re parceling out the roles. This is who’s on call for doing what. That’s made life a whole lot easier. So that bleed over between work skills and personal skills is really, really helpful to have.
Quintessence Anx: Yeah. It is amazing how we can tie in our work life into our home life a little bit. Getting back to the boundaries, I guess, of work/life balance. It can be hard. I mean, I know some people, when I go on Twitter and doomscroll a little bit, people talk about, “Oh. Well, I have a home office or I don’t have a home office,” but some people, they’re in a smaller living space. They have multiple people in that living space. They have small kids in that living space or whatever, and it interferes with that boundary. Is there anything in your experience that maybe you do that can help people separate work from home when work is home?
Liz Fong-Jones: Yeah. I think that the number one thing that I’ve done there is that for the computer that I use for my gaming and personal stuff, I have taken work Slack off of my Slack client there so that when I go to play computer games, when I go to chat with friends, that I’m not seeing that white notification light on the Slack. If I really need to access work Slack from that computer, I’ll open it up in a web browser. I think that’s the number one life hack there. Similarly, we all are pretty much at our desks, or sorry, we’re pretty much all at home anyways. We’re not on the go.
So why is it that you have your work Slack installed on your phone at this point? If you’re working, you’re going to be at your desk. If you’re not working, if you’re, as you were saying, doomscrolling on Twitter, you don’t want that work notification to pop up. Seriously, take work Slack off of your phone. You do not need it right now. If there’s something important, you’ll get a PagerDuty notification because someone needs you. So having that trust that someone can get hold of you when they really need to will enable you to decouple and not pay attention when you want to go do something else.
Quintessence Anx: Awesome. Real quick, before we wind down, speaking of people getting a hold of you, establishing that trust, is there just a bit of advice you have for people for next steps or steps they can take to build that trust if they’re in a situation where they’re cleaning up their alerts or they’re in the process of making change, but they’re not there yet, they’re just walking towards it, that can help them get there faster?
Liz Fong-Jones: I think that as you’re working towards cleaning up alerts, you have to apply the Marie Kondo thing. Does this bring me joy? If it doesn’t bring you or your users joy, you should get rid of it, and set those top-level alerts to know when real users are having pain. High error rate alerts, for instance, are super helpful at the top level, because then you will know, even if you eliminate some lower-level alert, that you would get alerted if there were a serious problem affecting your end users.
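A top-level, user-facing alert of the kind described above can be as small as the following Python sketch. The function name, the 5% threshold and the minimum traffic floor are all assumptions chosen for illustration; real values would come from your own service's goals, and this is not how any specific alerting product implements it.

```python
def should_page(total_requests, failed_requests, threshold=0.05, min_requests=100):
    """Decide whether a window's overall error rate should wake someone up.

    threshold and min_requests are illustrative assumptions, not recommendations.
    """
    # Avoid paging on a handful of requests, where one failure skews the rate.
    if total_requests < min_requests:
        return False
    return failed_requests / total_requests > threshold


# 8% of user requests failing in the window: page.
page_on_outage = should_page(total_requests=1000, failed_requests=80)
# 1% failing: below the assumed threshold, no page.
page_on_blip = should_page(total_requests=1000, failed_requests=10)
# Too little traffic to judge: no page.
page_on_low_traffic = should_page(total_requests=20, failed_requests=10)
```

Because the check looks only at what end users experience, it keeps covering you when a lower-level, cause-specific alert is removed.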
Quintessence Anx: Okay. Was there anything else you really wanted to highlight from your talk, since everyone’s super excited to learn from you in a couple of weeks?
Liz Fong-Jones: No. I think that we teased elements of the talk quite well. You’ll hear more from me about the subjects of observability, service level objectives and collaboration during my talk.
Quintessence Anx: All right. Well, that’s awesome. For everyone listening in, please make sure to catch Liz’s talk at PagerDuty Summit. It’s going to be at 9:50 AM Pacific Time on the 23rd of September. Also make sure to check out Liz’s office hours, and we’re going to be posting links to these in our show notes, as well as her blog post relevant to her talk, which is Sustainable Operations in Complex Systems with Production Excellence at InfoQ. Before we head off, we have two questions that we ask every guest. Are you ready?
Liz Fong-Jones: Sure. Hit me.
Quintessence Anx: All right. What is one thing that you wish you would’ve known sooner when it comes to running software in production?
Liz Fong-Jones: Making mistakes is normal. Having your first outage is completely normal. The sooner you get your first, “I was responsible for that outage,” out of the way, the more comfort you’ll get in production, so don’t be afraid to make mistakes.
Quintessence Anx: All right. Awesome. Is there anything you’re glad that we did not ask you today?
Liz Fong-Jones: I am glad you didn’t ask me about what specifically Google says it does that it doesn’t do. I think that that’s a fascinating topic with lots of skeletons in that closet.
Quintessence Anx: Fair enough. Liz, thanks again for joining us.
Liz Fong-Jones: It was a pleasure. Looking forward to speaking at PagerDuty Summit.
Quintessence Anx: All right. Awesome. We’re looking forward to hearing from you. This is Quintessence wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @Pageit2Limit, using the number two. That’s Pageit2Limit with the number two. Let us know what you think of the show. Thank you so much for joining us and remember, uneventful days are beautiful days.
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.