Julie Gunderson: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson, @Julie_Gund on Twitter. Welcome. Today we are talking with Jason Yee at Gremlin. And we’re going to talk about all things chaos engineering, making sure your systems are reliable, and also being prepared for disaster. So Jason, to get us started, we’ve been having a lot of conversations, especially in the community and in our roles, about this new coronavirus that’s going around. And there’s quite a lot we can draw from it, both for how we prepare for something like this and for how we make sure our systems are reliable. But what are your thoughts with everything that’s going on?
Jason Yee: Yeah, absolutely. This whole coronavirus, COVID-19, it’s a really interesting thing for me. As you mentioned, within what we do, the whole developer relations world, we often travel a lot. We go to conferences. There’s definitely been a huge impact. Aside from people wondering if travel is safe, we’ve seen conferences cancel. Just this morning KubeCon, which was supposed to be held in Amsterdam at the end of March, March 30th through April 3rd, I believe, or April 2nd, they decided to postpone. So KubeCon Europe will now be happening sometime in July or August, which hopefully gives enough time for this virus to die down or for a vaccine or a cure to be found, so there’s definitely impacts there. So aside from impacting what we do as a job, I think it brings up, as you mentioned, other implications: how do you operate as a company when you’re impacted by outside forces, such as viruses and outbreaks, and what does that mean for things like disaster recovery or resiliency, not only for your systems, but for your people systems?
Julie Gunderson: Yeah. And one of the things with the conference cancellations that I find interesting is that while some of the conferences are completely canceling, we’re seeing a lot of them give people an online option, and it will be very interesting to see what the future looks like after people get used to being able to work in a remote environment or attend some of these events in a virtual way. I think it may change a little bit about how we do business. When we talk about disaster recovery, I know of a couple of companies that are looking to order laptops for their folks to help them be remote in case they need to. And right now, because of where things are manufactured, it’s hard to get your hands on those. So how do we translate that into future planning? What are your thoughts?
Jason Yee: That’s a great question. One of the things when it comes to future planning is: what do you imagine failure will look like? I’ve been thinking about this and I’ve been writing about it as well, just taking notes and trying to gather my ideas, but there’s this phrase called a failure of imagination. Failure of imagination means that we often fail to think about ways that things could break. And in hindsight, they look very obvious. So I think one of the interesting things that comes about because of something like coronavirus is the notion of: what happens to my processes when people can’t be in the same room together, when people are remote? What happens to my processes when laptops can’t be ordered because they’re all coming from China and they won’t be able to be exported or imported? And so, I think that brings about an interesting question of how you think about failure and how you imagine failure states. Because that’s really crucial to testing: without imagining what failure might look like, you have no idea what to test. And so, I’m curious from your point of view, PagerDuty obviously being a great service to let people know when things fail, how do you test for failure at PagerDuty? What are the things that you’re looking at, and commonly, what are the failures that you imagine?
Julie Gunderson: Well, one of the things that we do at PagerDuty is Failure Fridays, which is our version of chaos engineering, and it’s something that occurs on Fridays, which a lot of people have strong opinions about. They get nervous over the fact that we choose to do this on a Friday, but at the end of the day, you should be prepared for failure to occur any day of the week. You should also be prepared to understand, when you’re going into a chaos engineering experiment, what your expected outcomes are. And I think when you talk about imagining what that failure looks like, I don’t think anybody imagined the situation that’s occurring now with the virus, but it could change how we address chaos engineering in the future. I participated in one of your workshops at Gremlin that was more of a thought experiment, and we had some exercises we were able to do on the trial version. Can we translate those types of learnings into less technical learnings, more learnings for something like what’s going on now? What are your thoughts?
Jason Yee: Absolutely. So at Gremlin, we run a conference called Chaos Conf. Hopefully we will be one of the conferences that’s still running, but given the current situation, I think everything is up in the air; nobody knows the future. But at Chaos Conf last year, Dave Rensin, who’s with Google, talked about chaos engineering with people. And he talked about various experiments that they run at Google, such as putting someone on surprise PTO. Essentially someone would be told, “Hey, don’t show up for work today. You have a vacation day.” And then the team would see how they operate without that person, to see if they were a single point of failure, if they were a critical blocker, and to test whether the knowledge that that person had was distributed. Similar to any distributed system, having knowledge, having data in multiple places is useful for resiliency. They would use that to test: is this person a single point of failure? Are they the only person that knows how to do a certain thing? And you really quickly uncover that when you send people off on vacation that’s unplanned. He talked about a few other things, but yeah, really testing what you know and what you have documented. That was actually one of the other things: having new team members or people from other teams actually run through your documentation to see if your documentation works. So, those were some fun ways that he mentioned you could essentially chaos engineer with people. And as we think about failing to imagine, if we actually spend time imagining what our processes would look like by messing with the people in it, we could probably come up with some more interesting ones as well.
Julie Gunderson: Well, and I think that brings up an important part of chaos engineering: it’s about practicing. And so that person on vacation, just like you said, that can lead to an understanding of single points of failure. And I think one of the things that’s really important and that can come out of it is practicing for that failure, practicing for that disaster, because the more you practice, the more comfortable you are if something happens, right?
Jason Yee: Absolutely. What you’re doing with chaos engineering is really building up those skills. And so I love that you use that word practicing, because that’s exactly what it is. If you had a firefighter and all they did was respond to real incidents and never practiced their skills and never got the training they needed, they would be less effective in an actual emergency, and it’s similar with really any emergency responder or any critical role or occupation. And so if we think about that, I love that idea of practicing because it really is building up those skills so that when the real outage or real disaster hits, we actually are prepared to deal with it rather than just responding.
Julie Gunderson: Well, and that’s another thing that we talk about too. Imagine that you have the ideal system and you never have any incidents, so you go months and months between incidents. If you’re out of practice, then when it happens, do you know what to do? And do people know how to handle it in an environment that’s been free of the chaos we intentionally practice with?
Jason Yee: Yeah, absolutely. I think the other thing that’s interesting, to tie this back to how we started with coronavirus, is not only the idea of having, or not having, perfect systems, but the idea of being open and transparent about them. Because for those who haven’t been following the coronavirus drama, the virus first appeared in China and the doctor who first identified it raised a flag. He said, “Something is really wrong here. This isn’t a small issue. This is actually very serious.” And due to politics and things with China, he was essentially muted. His message was taken down. There were a lot of negative ramifications. And so information didn’t spread as freely as it should have in order to quickly address the issue. And I think that as we think of incidents, oftentimes in certain organizations we have this notion of trying to pretend that things are just fine. That there isn’t an incident. That our systems are okay, when in fact, they may not be. And understanding that systems are usually in some mode of failure, and being transparent about that, helps us actually deal with them.
Julie Gunderson: Jason, that is such an excellent point. Right now, a lot of what’s being talked about with corona is trust in institutions. Trust that the information we’re given is accurate and that people aren’t hiding anything. Interestingly enough, when that doctor originally alerted people back in December, it made the news for a couple of days; it was a blip. This is an area of expertise of mine, and I generally follow epidemics a lot. I actually bought my masks back in September, just a couple, I didn’t stock up, but it was also about reading the early warning signs. And I think that again, that translates into how we operate as businesses. You came from Datadog, right? You’re very familiar with the monitoring world. And what can we learn from our systems? What can we learn from the past and what we’re seeing currently to prevent a major incident going forward? Do you want to talk about that a little bit?
Jason Yee: Sure, yeah. I mean, at Datadog it’s interesting because we break metrics, or the signals, up into three categories. One we called work metrics, and these were the higher level business goals, things that you actually want to throw alerts for, things that affect your customers or, if you’re using service level agreements or service level objectives, things that would violate those. And then we had a second group, and that second group was the things that supported those higher level goals. We called those resource metrics, so these are things like: are the services that my application is dependent on available? And depending on what team you’re on and what level within the organization you’re at, these may drill all the way down to things like CPU, disk I/O, things that we from the ops world were traditionally used to monitoring. The third was events. And so those are the things that you can use to correlate changes. But I think when you talk about early signals, it’s really those resource metrics that you would use as early signals. And so you have this idea of: what are the things that I can monitor that would essentially be what we now call SLIs, service level indicators? What are those early things that I can monitor and take a look at that contribute to the overall objectives? And if I can monitor those indicators and get advanced warning on those, to see if something is potentially wrong, then I could potentially head off issues before I violate my objectives or my agreements. That said, I think with the virus, our objectives are to keep people healthy, to keep mass populations unaffected. And so it is interesting: what are the early indicators that doctors and epidemiologists and so forth are actually monitoring? Those are probably quite a bit different from what we’re looking at in systems, but I suspect they have the same sort of idea of early indicators feeding into larger objectives. I’m curious though, from your background and expertise on this, what your thoughts are.
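To make the early-warning idea concrete, here is a minimal sketch of an SLI check, assuming a hypothetical service where recent request outcomes are available as a sliding window. The thresholds and the sample data are invented for illustration, and the “alerting” is just a print statement standing in for whatever monitoring tool you use.

```python
from collections import deque

# Hypothetical sliding window of recent request outcomes (True = success).
# In a real system these would come from your metrics/monitoring pipeline.
recent_requests = deque([True] * 993 + [False] * 7, maxlen=1000)

SLO_TARGET = 0.99       # objective: 99% of requests succeed over the window
EARLY_WARNING = 0.995   # stricter indicator threshold: warn before the SLO is breached


def success_rate(window) -> float:
    """Service level indicator: fraction of successful requests in the window."""
    return sum(window) / len(window)


sli = success_rate(recent_requests)

if sli < SLO_TARGET:
    print(f"SLO violated: success rate {sli:.3%} is below {SLO_TARGET:.1%} -- page someone")
elif sli < EARLY_WARNING:
    print(f"Early warning: success rate {sli:.3%} is trending toward the SLO -- investigate")
else:
    print(f"Healthy: success rate {sli:.3%}")
```

The point of the two thresholds is exactly what Jason describes: the indicator fires well before the objective or agreement is actually violated.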
Julie Gunderson: My thoughts are, honestly, a little conflicted. I think, first of all, I never imagined a conversation where we would be talking about viruses and incident response and chaos engineering all at the same time. But it’s very interesting how it all ties together. I think that it’s important for people to not panic, but to be aware that we’re at a tipping point. If you are out there reading the news, make sure that you’re reading multiple sources and reliable sources, because there’s a lot of misinformation going on. And I think that it’s important that we all take some personal accountability, and if we’re not feeling well, make sure that we’re staying at home. When we talk about incident response at PagerDuty, it was actually born out of emergency management practices for firefighting in California. And it really is remarkable how it all ties together. I’m curious what their alerting process looks like when a new case comes forward. How are they paging the right people, getting the right people on the call? I don’t love to make the correlations, but at the same time, when dealing with an incident, when dealing with a medical or environmental emergency, the practices are somewhat similar. You need to mobilize the right people at the right time and make sure that they have the context for what’s going on. And there are oftentimes those practices that you go through for emergency situations, practice drills, but we’re not in a practice situation anymore. However, I think that we can already learn from other countries’ responses to this and tailor ours, and make sure that we’re using that data to keep more people safe.
Jason Yee: Yeah, I think it’s interesting when you talk about not panicking and having these practices. One of the interesting things to me has been the way that people have responded. I was out at dinner last night, chatting with some friends, and they were talking about hand-washing and how you’re supposed to wash your hands for 20 seconds. And so, one way that you can do this is apparently, as you’re washing your hands, you sing the happy birthday song twice, and that’s roughly 20 seconds, or She’ll Be Coming ’Round the Mountain; if you sing one verse of that, it’s roughly about 20 seconds. And the idea is that we’ve modified our behaviors to do these things, when in fact, to properly wash your hands, you should always be doing 20 seconds. If you go on YouTube, there are people that show you proper hand-washing technique, because we often miss the backs of our hands or the outer edges or our fingertips and things like that. So it does come back to this: practicing correctly should be the same process as what you do in real life. It shouldn’t just be a response of, “Oh, now we’re going to do a different process because there’s a real virus,” or, “Now we’re going to do a different process because our critical systems are really down,” versus what we’re doing when we practice chaos engineering and things like that. And all of that leads you to having a more informed and better understanding of your processes, so that you don’t panic when something actually goes wrong.
Julie Gunderson: Absolutely. I never would have thought about the fact that maybe we aren’t always washing our hands the right way, and you’re right. We wash our hands the same way over and over and over again for maybe 10 seconds, then something happens and we start doing it right. Which I think is an excellent point: when practicing for the real thing, make sure you are following the right methodologies. So what would you say some of those methodologies are?
Jason Yee: Yeah. In terms of the systems we build, the methodology really comes down to, when it comes to chaos engineering, make that practice rigorous. Come up with a good hypothesis, and be rigorous about how you test it. That’s part of the reason that I love Gremlin. At Datadog, we did a lot of game days and we actually did them very well. We had our own process, but we tried to make them scientific so that they’re repeatable. And the idea there is that when I inject failure, I need to do it in the same way to have the same effect, because if I test a system and then I go back and try to fix it, or I try to build resiliency into my application, I need to go back and test it again. And that test needs to be the same. If I’m manually doing this, it’s not going to be as rigorous, and all of the different contributing factors could change, which means I’m now testing for something different. So I want to test for that same thing repeatedly. And then, like anything within engineering, if you start doing something repeatedly and it has generally the same result, you should automate it. So put it in your CI pipeline, so that it always tests for this failure scenario or environment and gets the same results: that your monitoring throws the right alerts, or that you’re able to see things, or that your application is just able to deal with certain outages or certain failures in the environment. And then once you’re doing that and running these game days, the other part is to have a rigorous process around how you handle incidents, which I’m sure you could talk to more, things like having an incident commander and the processes that you follow. I don’t know if you want to talk more about that though.
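As a sketch of what a rigorous, repeatable experiment might look like, here is a hypothetical harness that pins down the failure type, magnitude, duration, and hypothesis so the same test can be rerun, or dropped into a CI pipeline, with identical parameters. The injection functions are placeholder stubs and the health-check URL is invented; this is not a real Gremlin or Datadog API, just an illustration of the structure.

```python
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class ChaosExperiment:
    name: str
    failure: str       # what to inject, e.g. "cpu"
    magnitude: int     # e.g. percent of CPU to consume
    duration_s: int    # how long the injection lasts
    health_url: str    # hypothesis: this endpoint stays healthy throughout


def inject_failure(exp: ChaosExperiment) -> None:
    """Placeholder for your failure-injection tooling of choice."""
    print(f"[inject] {exp.magnitude}% {exp.failure} for {exp.duration_s}s")


def halt_injection(exp: ChaosExperiment) -> None:
    """Placeholder: always revert the injection, even if the hypothesis failed."""
    print(f"[halt] stopping {exp.failure} injection")


def service_is_healthy(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


def run(exp: ChaosExperiment) -> bool:
    """Inject the failure, verify the hypothesis while it is in effect, then revert."""
    print(f"Running experiment: {exp.name}")
    inject_failure(exp)
    held = True
    deadline = time.time() + exp.duration_s
    while time.time() < deadline:
        if not service_is_healthy(exp.health_url):
            held = False   # abort condition: stop early if the hypothesis breaks
            break
        time.sleep(5)
    halt_injection(exp)
    print("Hypothesis held" if held else "Hypothesis violated -- investigate before rerunning")
    return held


if __name__ == "__main__":
    # The same parameters every run, so results are comparable before and after fixes.
    run(ChaosExperiment("cpu-spike-checkout", "cpu", 80, 30, "http://localhost:8080/health"))
```

Because the parameters live in the experiment definition rather than in someone’s head, rerunning the test after a fix, or on every build, exercises exactly the same scenario.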
Julie Gunderson: Well, that’s an entire episode all on its own, but again, it goes back to that practice, making sure that you’re doing things the correct way all the time. And I think that when chaos engineering comes into play, you’re making sure that all of your incident response practices are the right things. PagerDuty’s incident response is successful because we’ve iterated on it over the years. We didn’t just do one thing and stick with it because it’s the one thing that we started to do. We looked at: how is this working for us? What can be changed? What can be improved upon? Which is exactly what we’re looking at with chaos engineering at the beginning: what about our processes needs to be more resilient? But one of the things that I wanted to talk to you about is that we recently did a chaos engineering tabletop exercise workshop at PagerDuty Summit, with [inaudible 00:20:29] actually. I was facilitating one of the tables, and it was just a bunch of people from different companies, and the resounding theme from quite a few people was, “We want to do chaos engineering, but our company won’t let us. They’re too scared.” Do you want to talk about that? Do you want to debunk the myth that chaos engineering has to be a scary thing?
Jason Yee: Absolutely. That’s a great question, Julie. I think the fear of doing chaos engineering, and how scary it seems, often reflects back to what you were talking about with Failure Fridays: “Why would you do this on Friday, when things could blow up and now we have to stay late or work the weekend to resolve it?” And it really comes back to that idea of practicing. You don’t practice by playing real games; you practice little components. For example, if I were a football player, I’d practice little components of throwing the ball, catching the ball, having good starts, hiking the ball, things like that. And then I would take those components and build them up into larger practices, where I might practice plays or certain maneuvers or certain sets of plays. Then I’d build that up into scrimmaging, practicing against my teammates. And then eventually I’d get to real games. It’s similar with Failure Fridays and chaos engineering, you want to start small. So you want to start in your development environment, and with little bits of your components. Start with those easy tests of, does my application still work if I max out the CPU? And that’s just in development, so it’s a safe place. And you build that up until you’re comfortable that your application does survive this small test of spiking CPU. Then from there, you can move up into your staging environment. And again, you’ll start small and build up until you’re comfortable. And then from there, you push that into production and run those tests there. It’s not that different from pushing code. Pushing code directly to production is just as scary as starting chaos engineering directly in production, and you shouldn’t do either. You should develop in a development environment, get comfortable, have tests, be okay with that code, and then push it into staging, and from there, push it into production. So when people say that it’s scary, well, I think you’re coming about it the wrong way. It’s obviously scary if you follow bad practices and work directly in production from the get-go. But if you follow traditional practices of starting in development, just like you would with your code, it’s not that scary; you’re following a process that’s been tried and true for decades. That said, you could potentially end up in production and break things, because production never completely matches staging and usually never matches development. But you’ve built up those skills through this process, both the incident response skills and the chaos engineering skills, to be able to handle those incidents in production. And when you’re doing it in production, you have people around to help. It’s much better to do it in a game day with chaos engineering than to have it happen naturally, because that’s when PagerDuty wakes you up at 3:00 AM: when you haven’t practiced for these things and you’re not doing it in a controlled environment. So that’s really the advantage of chaos engineering and why it shouldn’t be scary: building up to these experiments, having people around, and declaring that it’s a chaos engineering experiment, versus having failures happen organically.
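As an illustration of that first small step, here is a minimal sketch of a development-environment CPU experiment: burn every core for a short, fixed window and watch whether a local instance of the application keeps answering its health check. The endpoint and timings are hypothetical; the point is the small, controlled blast radius rather than the specific numbers.

```python
import multiprocessing
import time
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # hypothetical local dev instance
DURATION_S = 30                              # keep the first experiment short


def burn_cpu(stop_time: float) -> None:
    """Busy-loop on one core until the deadline passes."""
    while time.time() < stop_time:
        pass


def healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


if __name__ == "__main__":
    stop = time.time() + DURATION_S
    # Max out every core for the duration of the experiment.
    workers = [multiprocessing.Process(target=burn_cpu, args=(stop,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()

    checks = []
    while time.time() < stop:
        checks.append(healthy())
        time.sleep(5)

    for w in workers:
        w.join()

    print(f"Health checks passed during CPU spike: {sum(checks)}/{len(checks)}")
```

Once you are comfortable with results like this in development, the same experiment can be repeated, gradually, in staging and then production, as Jason describes.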
Julie Gunderson: Thank you. I like that. I like the building up. I think that’s an important piece, taking those baby steps. What’s the one thing you wish you would’ve known sooner when it comes to running software in production?
Jason Yee: There are so many things that I wish I had known sooner, but I think the biggest for me is that failure is okay. A lot of times in my early career, not understanding that failure was okay caused me to spend a lot of cycles that were maybe unnecessary, trying to account for failures or error scenarios that probably would never have happened, while ignoring the easy ones that actually could have. So I ended up over-engineering a lot and not iterating as quickly as I could have, simply because I thought failure was completely unacceptable.
Julie Gunderson: And that’s a good point, failure is acceptable. That’s how we learn. Jason, thank you very much for your time today. It’s been wonderful talking to you. I know we talked about quite a few things, including really what’s going on in the world and how it’s affecting all of us. And I think it’s interesting how it relates to everything that we do today. Do you have any parting words for us?
Jason Yee: I’d just like to thank you for having me on the podcast. It’s been fantastic. Also, I always enjoy talking to you, especially with how much of an epidemics nerd you are. I always learn something new about the state of our world and sometimes how fragile it is, but also, overall, how resilient we are as people. So yeah, happy to be on the show. Thanks again.
Julie Gunderson: Well, thank you. And this is Julie Gunderson wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter @pageit2thelimit, using the number two. That’s @pageit2thelimit. Let us know what you think of the show. Thank you so much for joining us. And remember, uneventful days are beautiful days.
Julie and Jason talk about how COVID-19 is affecting the conference industry and how we are adapting our work.
Jason Yee: “There are other implications of how do you operate as a company when you are impacted by outside forces such as viruses and outbreaks, and what does that mean for things like disaster recovery and resiliency, not only for your systems but for your people systems.”
The conversation turns to how things are moving to online options and remote work.
Jason and Julie talk about how you imagine what failure looks like.
Jason Yee: “Failure of imagination means that we often fail to think about ways that things can break, and in hindsight they look fairly obvious.”
Jason goes on to talk about how you think about failure and how you imagine what failure states look like. Jason and Julie talk about how PagerDuty tests for failure through Failure Fridays, and how we can translate learnings into less technical learnings.

Chaos Engineering with People
Jason talks about the practice of chaos engineering with people, and ensuring that knowledge is distributed.
Jason Yee: “If we actually spend time and imagine what our processes would look like by sort of messing with the people [vacation and schedules] in it, we could probably come up with some more interesting ones as well.”
Julie and Jason talk about practicing for failures and disasters and how practicing leads to comfort and the reduction of chaos in actual emergencies and incidents.
Jason and Julie talk about what we can learn from our systems and the past and how we can learn from those things moving forward. Jason talks about the three categories: Work Metrics, Resource Metrics, and Events and how early indicators feed into larger objectives.
Jason Yee: “What are those early things that I can monitor and take a look at that contribute to the overall objectives, and if I can monitor those indicators and get advanced warning on those to see if something is potentially wrong, then I could potentially head off issues before I violate my objective or agreements.”
Jason and Julie talk about how proper handwashing is 20 seconds and the songs you can sing while doing this, but more so how it’s about practicing doing things the right way so that when you are in an emergency situation you don’t have to retrain bad behaviours.
Jason Yee: “Practicing correctly should be the same process as what you do in real life, it shouldn’t just be a response to like “oh now we’re going to do a different process because it is a real virus” or “now we’re going to do a different process because our critical systems are really down” versus what we are doing when we practice chaos engineering.”
Jason talks about the methodologies behind chaos engineering.
Jason Yee: “In terms of the systems we build, the methodology really comes down to, when it comes to chaos engineering: make that practice rigorous, come up with a good hypothesis, be rigorous about how you test that.”
Jason continues to talk about how you need to test in a scientific and repeatable way, and how you need to do things in the same ways to have the same effect when you are testing.
Julie and Jason talk about concerns organizations can have around chaos engineering, and how chaos engineering doesn’t have to be scary when you implement the right methodologies.
Jason Yee: “With Failure Fridays and chaos engineering, you want to start small. So you want to start in your development environment, and with little bits of your components.”
Jason gives us advice on how to build up to staging environments and to production with chaos engineering.
Jason Yee is Director of Advocacy at Gremlin where he helps people build more resilient systems by learning from how they fail. Previously, he was Senior Technical Evangelist at Datadog, a Community Manager for DevOps & Performance at O’Reilly Media, and a Software Engineer at MongoDB. Outside of work, he likes to spend his time collecting interesting regional whiskey and Pokémon.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.