Approaching Incident Response Compassionately With Connie-Lynne Villani

Posted on Tuesday, Sep 8, 2020
In this PagerDuty Summit Special episode, Julie Gunderson talks with Connie-Lynne Villani, Director of Incident Management at Fastly, about her upcoming PagerDuty Summit talk, Approaching Incident Response Compassionately, and why it is important for both people and the organization.

Transcript

Julie Gunderson: Welcome to Page it to the Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson, @Julie_Gund on Twitter. Today we have a special episode with Connie-Lynne Villani, director of incident management at Fastly.com. You can find Connie at @CLYNNEXX on Twitter. Connie is going to talk to us a bit about her talk on approaching incidents with compassion at PagerDuty Summit. We’re very excited for this talk. It’s titled Approaching Incident Response Compassionately. Don’t miss out on the full session at PagerDuty Summit, which is this year from September 21st through the 24th. It is free, virtual, and registrations are open. So Connie-Lynne, to get us started, why are you excited about speaking at Summit?

Connie-Lynne Villani: A number of reasons. I’m so excited about this topic of compassion, and being kind to people, and actually taking this approach with integrity to work and to incidents. And I had been thinking about… I’d been kind of writing this talk secretly inside my head. And I had been asked to speak at another conference, and then that conference… And so I was like, “Great, I got this talk like half written. The night before, when I’m knocking out my slides, it’s going to be great.” And then that other conference happened to get canceled. And I was like, “Oh, but I have this idea in my head that I have to get out.” And so I think it was Amanda with PagerDuty who reached out to me initially, but because you and I know each other, then you pinged me and you were like, “Hey, are you going to answer my friend’s question? Or sorry, my colleague’s question or what?” And I was like, “Oh, yes, no, I want to get… I want to have this talk.” And so that’s really it, is that I’m just super excited to get this notion that we can be effective and still acknowledge the humanity in ourselves and in our colleagues all at the same time. I want to get that out into the world.

Julie Gunderson: And this is something that we’re hearing more and more about compassion, especially as everything has changed so much in the last few months. And J. Paul Reed talked to us about socio-technical systems and thinking about our humans. And so this talk, Approaching Incident Response Compassionately, talk to us about that. What does that mean?

Connie-Lynne Villani: What the real important thing here I think is, incidents suck. Nobody enjoys having to respond to an incident. Nobody enjoys having anything break, and everybody feels terrible about that. And yet we still, when we are doing incident response, and when we’re doing after action reviews, even if we’re doing blameless postmortems, we focus on the icky parts. We focus on everything that went wrong. And I feel like that doesn’t put people in a good place to learn. I don’t think it puts people in a good place to understand. And most importantly, even if people are not consciously thinking this during an incident, they’re thinking afterward in an incident, they’re thinking about, if I take accountability, if I’m accountable for this, if I’m held accountable for this, how unpleasant is that going to be? But the reality is, is that the rest of us fighting the incident a lot of the time, really do feel like we don’t care who or why somebody caused… If you think about how you feel when you cause an incident, versus how you feel when you’re helping to resolve an incident, most of the time… And maybe this is Pollyanna of me, but I actually don’t think that’s true. I think that people really are good and really kind, and what they’re thinking is how do we fix the problem? Oh, I feel terrible for this other person. And how do I solve things for this other person? And how do I make this better? And so you have to acknowledge that, and look at things from that perspective. And remember that everybody else is thinking, great, we’re going to solve this problem. We’re going to take care of this problem. We don’t like being interrupted, but we can solve it. And we feel really bad for the other person. And just to make yourself feel okay is really important. Because that way you’re open-minded when you’re coming to solve a problem, you’re happy to say, “Hey, I did this thing, or I did that thing. And that contributed to the incident, et cetera.”

Julie Gunderson: And I love that. Thank you. Because… Connie-Lynne, thank you for that. That hones in on something that we talk about a lot. You said the person, they feel like they caused the incident. And we talk about that quite a bit when we talk about how we speak to people, how do we make things blameless, and removing that root cause analysis. Because there are contributing factors. So people oftentimes can have that feeling that I caused this. Our systems are complex though. And there’s normally just not one cause.

Connie-Lynne Villani: Right.

Julie Gunderson: That’s why we talk more about John Allspaw’s Infinite Hows versus the five whys, when we talk about blameless.

Connie-Lynne Villani: I think that there’s a place for both of those, if you will. But I do agree that root cause, particularly when you’re talking about complex systems, is an outdated concept. And so when I talk about people who caused an incident, the language that we actually try to use when we’re talking in detail about this stuff, is we say the trigger for the incident. And we talk about the contributing factors, because there’s almost always a thing that caused the incident. But just as an example, if you say somebody deployed a bad piece of code. Okay. So in that, what’s the triggering cause there? Even within that, you could look and say, “Well, the triggering cause is that we did a deploy.” Okay, whatever, you have to deploy code. You can’t get it out there… You could say, “The triggering…” You can’t just… It’s the only way to get it out there is to write it, and get it, and deploy it. No other choice. You can say that it’s because somebody wrote bad code. Well again, people are going to write bad code. It happens. If we could automatically write error-free code that did exactly what we wanted, most of us wouldn’t have jobs. And you can say, “Oh, well, we didn’t have the appropriate checks.” Again. A colleague of mine, Paul Longo, likes to say, “We’re always coming up with new and innovative ways to break the internet.” And I think that that’s really true. You can anticipate, and you can learn from previous mistakes, but there’s no way to write checks that will cover every situation. Particularly when you’re talking about operating at scale. So those are all things that might be the trigger. But obviously none of those is the root cause when you’re talking about professional, systematic root cause. None of those are the underlying condition that caused the event to happen. And so you can talk about five whys. And that’s fine, digging into things and asking why and why.
But again, that only works if you’re talking about a production line that has an actual sort of start and finish. And when we’re talking about computer systems, people display it as the cloud, and my friends and I like to talk about the clown instead, which is this big colorful mess. And the question is really, is something a mistake or is it just happy little trees? And maybe it really is happy little trees in one case. But then in another case it was a mistake because it made life bad for other people. But it’s all interlocked. It’s all a big mess. And so yeah, Infinite Hows is another good way to look at that. For me, all of these methodologies are tools, ways to dig into how to make things better. Because again, Infinite Hows is still looking at how things went. And I do care about how things went wrong, because we want to understand what contributed to that. But we also want to, really what our actual goal is how do we make things better? And so that’s what we want to look at, is how do we make things better? How do we improve things? How do we make it so that we don’t have this problem or a problem of this kind again, or that if we do have a problem that’s like this or similar to this, we know about it faster? We could fix it faster. Developing additional tools because of a particular incident is really useful because that’s where you learn. You’re like, “Oh, observability, let’s do this.” Or this could have gone much faster if we had something that would let us do something en masse, et cetera. And those are all questions, not of how did this happen, or how did this go sideways? It is, how do we make things better? And that’s really the main thing I want to do. And that’s where I think compassion comes in because you’re saying, you’re looking at the positives, you’re not focused on the negatives. The negatives happen, you acknowledge that the negatives happen.
And that’s key too, because you have to acknowledge, you need to go back to that person who happened to be the one who wrote or deployed the piece of code that set this incident off, or the person who did the misconfiguration or the actual situation, the person who backed in a data center, backed into the main database server on three separate occasions during their first week. Still a dear friend of mine. We worked together 20 years ago. But that person is a person who feels terrible already. So how do we make things better? How do we make it so that fewer people feel that way in the future? How do we make it so that when people do that, fewer people are panicking. Fewer people are freaking out, everybody gets to solve a problem in an hour instead of 12 hours, nobody has to be woken up in the middle of the night. These are all the human side and acknowledging that we’re all actual people is the most important part.

Julie Gunderson: Well, and tell us that. Let’s talk about that. How can organizations make this a better experience for people, especially when we talk about incident response?

Connie-Lynne Villani: Right. So making it clear, I think that people are going to be supportive is the most important thing. And that the organization is there to support you as well, is the most important piece of how organizations can make this better. And there’s a lot of kind of little things that organizations can do. Obviously we all know that incidents run better if you’re clear on the SLOs and you’re clear on what the escalation paths are, and that everybody’s trained ahead of time, and this sort of thing. Those are all really obvious things. And those do make things better. Those are definitely part of compassionate incident response. But think about how you would staff any other project, think about staffing an incident in the same way that you’d staff another project. So do you have adequate people to be on call? Now, if your company is only a 50-person or a hundred-person company, you probably have one or two people who are on call for incident response. And they’re probably on call 24/7, and that’s probably totally reasonable for the size of your company. But as organizations scale, you can’t do that. It’s not that people get worse at writing code or anything, when you get a bigger organization, but the code gets more complex, and you’ve probably got more services in play. And guess what? That means you’re going to need more people, because you’re going to have more incidents. So remember to staff up that. Don’t look at your incident response team as a cost center, look at your incident response team as somebody who helps resolve very quickly production impacting, product impacting, timeline impacting things. And to that end also make sure that in addition to adequately staffing things, you do not expect people to do terrible things like stay up all night fixing an incident, and then still get something out on time that’s due the next day or even due that week.
For some people being interrupted during their sleep, and not everybody, but for some people being interrupted during their sleep means that they’re wrecked for the next day. So be kind about that. And I think my favorite one is provide people with delivery food in the middle of an incident. For remote, this makes a lot of sense. And for remote, it’s more like, go ahead and expense this. But for in the office, if you’re not running the incident and you’re not involved in the incident and you want to help out your colleagues, go around, take food orders, order some food to be delivered and set it down at their desks. One of the best things that ever… One of the kindest things that ever happened to me was this was a few years ago. And when my husband was still alive, and I had woken up at like five o’clock on a Saturday morning and I was responding to some stupid incident. And I didn’t even have a cup of coffee. Because I was just there in the incident. He woke up an hour later, and just next to me appeared a cup of coffee. A few minutes later appeared a bowl of fresh fruit. Oh my God. Just knowing that I was cared for in that moment, and that somebody knew that I needed coffee, and that I needed food and I needed healthy food. That was wonderful.

Julie Gunderson: Cool. That’s amazing. And thank you. And I know you have a whole bunch more tips that you’re going to be talking about. So for everybody that wants to know more, and listen to an amazing talk, don’t forget to register for PagerDuty Summit. You can find Connie-Lynne’s talk on the agenda. And thank you, Connie-Lynne, for joining us on this special Page it to the Limit episode. And thank you all for listening. With that, this is Julie at PagerDuty, wishing you an uneventful day. That does it for another installment of Page it to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast, if you like what you’ve heard. You can find our show notes at Pageittothelimit.com and you can reach us on Twitter at Page it to the Limit, using the number two. That’s @Pageit2thelimit. Let us know what you think of the show. Thank you so much for joining us. And remember uneventful days are beautiful days.

Show Notes

Why Connie-Lynne Is Excited About Speaking at Summit

Connie-Lynne shares her excitement about speaking at PagerDuty Summit, and where her talk stemmed from. She talks about the notion that we can both be effective and still acknowledge the humanity within ourselves and our colleagues.

Connie-Lynne: “I’m so excited about this topic of compassion and being kind to people, and actually taking this approach with integrity to work and to incidents.”

Approaching Incident Response Compassionately

As a continued topic on the podcast, Connie-Lynne and Julie discuss why it is important to have compassion during the incident response process. Connie-Lynne discusses the stresses on people when an incident happens, and how it can affect thinking.

Connie-Lynne: “To make yourself feel ok is really important, because that way you are open-minded when you are coming to solve a problem.”

How Do We Make Things Blameless

Connie-Lynne talks to us about how there are contributing factors to incidents, and how avoiding blame is important. She shares with our listeners the words she uses on her team at Fastly.

Connie-Lynne: “Root cause, particularly when you are talking about complex systems, is an outdated concept.”

She continues to talk about the language used in incidents, and using the words trigger and contributing factors instead of terms that point blame.

Talking about the Clown

Connie-Lynne talks to us about how there may be mistakes, but they can also be “happy little trees”. She continues to talk about how methodologies are tools to use, and how the goal is to make things better and learn from mistakes.

Connie-Lynne: “That’s what we want to look at, how do we make things better… and that is where compassion comes in, you are looking at the positives. You aren’t focused on the negatives, the negatives happen, you acknowledge that the negatives happen.”

To see Connie’s full talk on 9/23, register free for PagerDuty Summit at the link below.

Additional Resources

Guests

Connie-Lynne Villani

With degrees in both Electrical Engineering and Theater Management, Connie-Lynne brings more than 25 years of System Engineering experience to the table, as well as a keen understanding of how to handle drama in the workplace. Currently Director of Incident Management at Fastly, Connie-Lynne has worked at Linden Lab, Change.org, and Caltech, but admits that her most fun position is serving as a board member for the Grilled Cheese Invitational, an annual food festival celebrating all things cheesy.

Hosts

Julie Gunderson

Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.