Mandi Walls: Welcome to Page It To The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve the system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me @LNXCHK on Twitter. All right, welcome back folks. This week I’m talking to a very long ago former coworker, Matt Davis. Matt, welcome to the show. Tell the folks about yourself and what you’re up to these days.
Matt Davis: Hi there. Thank you for having me, Mandi. A little bit about myself. I’ve been in this business, yeah, a long time, Mandi and I know each other from way back and I haven’t left the business, so I’m still here.
Mandi Walls: Right? We’re all still here.
Matt Davis: And you’re still here. So I’ve done a lot of different things. I’ve done a lot of storage. I grew up in data centers. I built a lot of data centers and I did a lot of stuff in storage, storage systems, storage SaaS. I worked for one of the first storage SaaS companies ever, and databases. I think a lot of that led me into databases, which led me into ad tech. So I was in ad tech for a little while, and then after ad tech, which is where I got most of my SRE education, dived into a few more startups. So my sweet spot is these mid-level startups that maybe have been there a while, maybe a couple hundred people. That seems to be the place where I’ve been able to do the most work. What am I doing today? I do a lot of resilience engineering types of stuff. So in the past five years I’ve been an active member in the learning from incidents community, doing a lot of talking and speaking and some writing about resilience engineering, especially what we would call “applied resilience engineering,” which is what we’re going to talk about today. Another thing that I really love about my job is the way that… I like to tie music into what I do. I think that everybody has music in them and I think all humans relate to music, and so when I try to help people understand systems, understand complexity, understand how to work together, I try to use music as an example. And that really resonates with a lot of people so I keep doing it.
Mandi Walls: Yeah, that’s awesome. We might come back to that as I have an off topic thing that I might have to ask you a question about later, but let’s get started with what we’re going to talk about today, which is continuous learning. Like you mentioned the learning from incidents community and that as a practice really, but there’s a whole lot more to thinking about continuous learning and stuff like that. So how would you describe continuous learning for folks who are new to it, haven’t heard of it before, aren’t sure where to go with that?
Matt Davis: I think the best way to describe continuous learning is to contrast it with what is called mass learning. M-A-S-S, like a mass of objects. So a mass learning is what you’re probably used to. You’re a new employee at a company. You go into the company and you have five days of orientation, but then you never cover that material ever again. So that’s known as mass learning. You take all of the mass of knowledge that you want to get into people’s brains, and you do it all at once. So continuous learning is the opposite. You can think of continuous learning as, “Let’s break up that mass learning and let’s spread it out over time.” So what we get then is you get instead of one big thing that people get trained on once and then never again until they have to re-up their training or whatever, you’re constantly learning new things and not only that, but you’re building on what you learned last time. That’s the main goal of the continuous learning cycle is that you’re refreshing what you learned last time, you’re building on what you learned last time, and it helps you really learn that. It helps you really embody what you’re learning, I think.
Mandi Walls: Excellent. How does that help folks with resilience engineering and those kinds of approaches to technology? Is this something new for folks to apply to this?
Matt Davis: Well, I think it is new. It’s a new way of thinking. It’s a mindset. It is a new mindset. So I think this traditional mass learning way of training new employees, going on-call is another perfect example. And we tend to do this. We tend to even expect it. “Oh, I haven’t done the on-call training yet.” Have you shadowed someone? So that’s a one-time thing. You only get to shadow once. I think the way that I would put this, and I’m going to speak in terms of on-call because it’s very pertinent, is that, well, first of all, the knowledge doesn’t go stale. Depending on what you’re training for, let’s take a salient example, PagerDuty. So this is something that I come across a lot, believe it or not, when people get on PagerDuty, it’s very easy to just use PagerDuty in whatever way you want to. It’s actually one of the values of PagerDuty is that it doesn’t lock you into an opinionated way of working unless you want to.
Mandi Walls: Yes.
Matt Davis: A lot of people just don’t know that, and in fact, I don’t think you can even learn everything that PagerDuty can do in one sitting.
Mandi Walls: No, there’s so much stuff in there.
Matt Davis: Yeah, there’s so much different… Are you using the Slackbot? Are you using the incident pieces? Are you actually managing incidents in PagerDuty? Oh, here’s one I love is how are you defining services in PagerDuty?
Mandi Walls: Oh, that’s a good question for lots of folks. Yeah.
Matt Davis: Yeah, and it’s like, “Okay, well, yeah, we can all get on PagerDuty.” Everyone has accounts put in. It’s an on-call rotation. It’s an alert. But if you really want to learn to use the tool, the way the tool is meant to be learned, it needs an iterative approach. It needs practice. Being able to know what Slackbot commands does PagerDuty have that I can use during the incident. It would be nice to know that before you have to do it in an incident, wouldn’t it?
Mandi Walls: True. That’s what we hope, right?
Matt Davis: Yeah. So I think it really helps people latch onto the pieces of their job that they don’t see very often. On call is one of those things, and you may not even be on-call every four weeks, every five weeks. I don’t know how big your teams are. So you forget all of that. You’re doing your development, you’re doing your feature work, and you forget the process. So the way I look at it, if you do it iteratively, then it’s just a process that builds and it becomes salient or it becomes embodied in what you do and it’s automatic when you get into the incident.
Mandi Walls: You get some muscle memory around that stuff that you don’t if you only touch it once. Yeah. There’s a lot of practice and things like that that come into it. So all of that takes time, right? It’s investment, it’s resources that folks need to be very intentional about, I guess?
Matt Davis: Yeah, intentional is a great word for it because one of my pet peeves, I’m going to call it a pet peeve, I may as well, is that people, teams, orgs, I don’t know which one you want to pick depending on the context, look at incidents as extracurricular activity. So they’re not, they’re part of our jobs. So wouldn’t it make sense to be trained on a part of our job? And I think that might be one of the places where this gets confused is that if you look at incidents as something that’s extra that you have to do outside of your job, then you’re never going to be incentivized to work on learning inside your job. It all has to be part of everything.
Mandi Walls: That’s one of those fun things. People always want information from us or recommendations from us. You’re going to bring teams on to on-call. How do you remunerate them or how do you put this together? And years and years ago, some of the bigger players in our industry were like, “Well, we’ve got this huge pool of X number of SREs or whatever they are, and on-call is voluntary.” I’m like, “That seems bizarre. And yeah, you’re going to give them a bonus for that, but don’t you want everybody to know how to handle when things go wrong?”
Matt Davis: Yeah, and learn. Incidents are the perfect opportunity to learn about the system, and that’s why I say they’re part of our jobs. We know they’re going to happen. I had some friendly fights with an exec a few months ago who floated this idea that there could be something known as no incidents. Can we get reliability to the point that we have zero incidents? Well, if we have zero incidents, then we don’t have to spend time to train people to be on-call. That’s just the wrong way to think about it.
Mandi Walls: You got zero incidents, a pretty good chance you have zero customers, but…
Matt Davis: Yep. Or you haven’t had a customer yet. I don’t know.
Mandi Walls: Right. You are pre customer. Absolutely. Yeah. You’re still [inaudible 00:09:36]. When folks want to be intentional about it, how do they go about… Do you create a program for it? Do you hope it happens a little bit organically? How do you guide folks from different teams into these kinds of practices?
Matt Davis: It can absolutely be done organically, and in fact, that’s the way that I would recommend doing it. Especially depending on the leadership or whatever, if you have to try to get permission to do something like continuous learning, continuous training, whatever, it might get a little bit difficult to do that if you have to get all kinds of approvals. Don’t even worry about that. Start a guild, start a special interest group, start a community of practice. These are all things that employees can do. You don’t need managerial approval to start doing continuous learning. You really don’t. For example, let’s say that you have an environment where management sees it as wasting our precious engineer time on something “soft” like learning. Well, then don’t talk to management about it. Just get together as a community of practice and look at it that way. And start off with a book club even. That’s a great way to start it, actually. A book club… I do this thing where I don’t call it a book club. I call it a read club.
Mandi Walls: Okay.
Matt Davis: Yeah. So it’s the same-
Mandi Walls: So you’re dropping out the wine part of the hanging out and getting tipsy for the book club part of it.
Matt Davis: Right, right, right. So there’s so many great papers out there that are just… They’re nice and short and they’re easy to digest, and it’s a perfect way to start people talking to each other because that’s what we want to do. In addition to learning the declarative knowledge, the procedures that you have to do on-call, that stuff, it’s like a musician has to learn scales. You have to learn scales, you have to go into the practice room, practice your chops, practice chord progressions, whatever it is, you need to do all that. But when you get together, you’re sharing. And that’s how something like a read club works. A great place to start is Dr. Richard Cook’s paper, How Complex Systems Fail.
Mandi Walls: Yes.
Matt Davis: Yeah. There is a website, how.complexsystems.fail, and there’s also a paper, just search for Richard Cook, Complex Systems Fail. It’s 18, I would call them axioms. That’s how I look at them. And they’re these truths of complex systems, and even though you read each one individually, they do build on each other several times. This is a great paper to start with because it’s short, and each of those little axioms is literally only a paragraph. Make a date, everyone get together to read it. Pick one that you want to read, pick one that you want to talk about, and then you’ll find that engineers, SREs, developers, even non-engineers, when they start to think about the kinds of things that Dr. Cook puts in that paper, they start to think about the work they do, and that starts to come out, and it’s awesome to see because there’s a big paragraph in there about how root cause is never the reason behind complex system failure. So that starts people thinking because in the other paragraphs, Dr. Cook sets that up for you. So by the time you read, “Root cause is the wrong way to think about complexity,” you already have this background, so it will make you think about all different kinds of things within your own complexity. I love the paper, obviously. But there are other great papers, like Lisanne Bainbridge’s Ironies of Automation.
Mandi Walls: Yes. I cite that one a lot in my talks. Yeah.
Matt Davis: Yeah, that’s a good one. That’s another good, nice short, concise paper that gets a lot of discussion going because it’s so relevant to what we do.
Mandi Walls: Yeah, absolutely. And I also like Westrum’s paper on types of organizations. That’s another good short one. He derived it from medical organizations, but there’s still a lot of good stuff in there. And for folks that are looking for other resources for something like this, paperswelove.org is a repository there, and they have some local chapters. I don’t know how many of them are meeting now after the pandemic, but there used to be a pretty active one here in the New York area.
Matt Davis: Oh, cool, cool, cool. Yeah, that’s a great resource.
Mandi Walls: Yeah, absolutely. So they focus mostly on computer science papers, but this seems like something that would be helpful for, like you said, the folks across the organization, and especially [inaudible 00:14:25] product managers and those kinds of folks who might benefit a little bit from…
Matt Davis: Yeah, yeah, yeah. I have a great little story about this, about people outside of engineering joining into these things. A couple jobs ago, I was running a session every week, and we called it Practice of Practice [inaudible 00:14:45]. I won’t go into the details of why it’s called that right now. It was a weekly thing, we all got together, as I’ve been describing here. It was something that we just started in the SRE team, and at first it was only SREs that would go. Little by little, as the months go by, more developers would come in and more engineers, other software people, some QA people, some customer support people, and then the marketing folks, they heard about this and they wanted to join because they want to know how SREs work. They come to the session, and we did one of these games I call the Wheel of Expertise, and it’s very conversational, it’s interactive, and we use a big wheel and we spin the wheel. So everyone in the session gets a chance to go. So engineers go, SREs have a chance to go. And so this one woman in marketing had a chance to go, and she didn’t know anything about this technology. So we went through the game to learn about the technology, and then afterwards she came to me and she told me something surprising that I wasn’t expecting was that, “It was so awesome, Matt, to be in that session, and I learned so much empathy for those engineers.” The light bulb went on in my head when she said that. I was like, “Oh, wait. You’re right. This is a perfect place and opportunity to combine different parts of the business together.” And this session like other continuous learning sessions is not during an incident. We’re looking at things that we’d like to look at during incidents without the incident over our heads.
Mandi Walls: Yeah, yeah, you’re in calm waters.
Matt Davis: Yeah, yeah, yeah. Calm waters. Production pressure is non-existent basically, and we can take the time to dig into topics. So she found that doing that in the session was so incredibly important to her job as a marketer because she got to get some empathy for what the engineers… What their jobs were like. And I don’t think marketers get to see that a lot. They do their own thing, especially in this particular company was building tools for SRE, so they’re especially interested in understanding how SREs work. So it was incredibly valuable to them. And I think it may have been in that session or a different one with the same woman who we played a different game and she gave a marketing, a customer support themed answer, and the engineers mentioned the same thing. They’re like, “Wow, I didn’t even realize.” If I remember correctly, she described something that had to do with customer and the kinds of work that she has to do. It was difficult work to be able to handle the customer request and then do… Back and forth. She just went into all these details about it and the engineers had the same reaction. It was like, “I never realized that that’s all the stuff that marketing does. I just didn’t know.” When else would they have a chance to know that? Yeah. So the same applies for other divisions in the company, whether it’s… Customer service is a great one, they actually need to know how engineering works, so it’s especially good for them. I even had a technical writer come to one of these sessions once. We’ve had salespeople in these sessions and managers that will come in, which is awesome because that’s one of actually the places where I’m always afraid we’re going to have a conflict is with the software engineers. If I could pull out what’s one of the most difficult things about doing this? 
Whether or not you’re officially sanctioned by the company to do it, or if it’s a grassroots effort or if it’s already organized, it’s trying to convince people that it’s time well spent, especially in the days of Agile where you’re planning your work down to the minute, down to the hour, and it’s like, “Well, if I go to this 90 minute continuous learning session, well that’s 90 minutes that I could be using towards my sprint.” And I think that a lot of feature work people struggle with that when it comes to things like this. The idea that they’re not allowed to do things like that, that has come up. That sounds weird and silly, but yes, people think that way. They’re not incentivized to do it maybe because they’re not on-call. “Why do I need to go to this thing to learn about the complexity of the system when I just need to worry about my feature and I’m not on-call?” Well, for all those other reasons we just talked about.
Mandi Walls: Yeah. You’re part of the whole, buddy.
Matt Davis: Yeah. Anything that you can do to try to tear down those silos is what I try to do. And sometimes you have to work with the managers. I remember a couple of times where we were getting pretty low attendance, but we were also not doing great on incidents, and so I was trying to talk to some managers going, “Hey, you know that weird mishap that happened during that incident? This is the session that we get together and talk about those things. It’s like an extended meta-learning review.” Because learning reviews normally… A postmortem is another name for them, retrospective, learning review is just the word that I try to steer people towards. Those are specific to an incident. I would look at a session like this, whether it’s something like what I did, whether it’s less formal, it builds reciprocity, and I think that’s extremely important between the people who are on-call and the people who are not. When those people understand each other, then the people on-call feel better about escalating at 3:00 AM. Think about this, I don’t even know this person. I’m on-call. Something happened. I need their help, but I have never met them. I’ve never met them. I don’t even know what their voice sounds like. I don’t know how they operate. I don’t know their personality. All those bits of anxiety at 3:00 AM when the system is down, that doesn’t have to happen. People need to realize that that’s not normal. If you have these sessions and you’re learning reciprocity and you’re understanding the ways that your fellow humans at the organization work, then you don’t get so much anxiety at 3:00 AM because you’re like, “Oh, Beth… Yeah, Beth and I know each other. She won’t mind. I know she said that she won’t mind if I call her in the middle of the night. She’s told me that.” Or, “I’ve seen her do that for other people, and she’s okay with it, so I’m going to go ahead and escalate.” It’s just learn what the rest of the team is doing and where the expertise lies, and it’ll make everyone more resilient.
Mandi Walls: Absolutely. It helps the whole organization, not necessarily coalesce, but know when you need to come together for larger incidents and things like that to have that practice, and it builds camaraderie when you’re in a really stressful place with a major incident in particular.
Matt Davis: There’s an interesting thing that you can see during incidents when this is working. I know there’s all kinds of different kinds of incidents, so I’m not trying to pigeonhole incidents, but basically I’m talking about two kinds. It’s the kind that I just described where nobody knows each other. They’re in an incident and they’re yelling demands or they’re barking instructions at each other and it feels really uncomfortable, but we get to the end and we do it and we mitigate it. And then there’s this other incident. Let’s just pretend these two incidents are the same severity, priority, whatever you want. In this other incident, everybody knows each other. You see people joking around. You see people a little bit more happy. You see people curious about things. When people feel the freedom within a good just culture, they have the freedom to feel curious and it opens things up. The incident becomes less stressful even though it’s the same priority, severity, whatever. We’re just working together like we always do. I’m a weirdo. I love incidents. I love being on-call.
Mandi Walls: You are a weirdo, dude.
Matt Davis: But one of the things that I actually love about incidents is getting to see people and work with people that I never get to work with.
Mandi Walls: Yeah. It’s one of those weird things where something’s happening. I’m an incident commander, but I usually end up being a scribe, which is fine. It’s been a long time since I’ve been hands-on with anything in production. So lots of folks come in and I’m learning all these teams that I haven’t heard of before and all these weird backend services that I’ve never heard of before, and everybody’s just chatting and figuring things out, and it’s a far less stressful environment here than it has been in other places I’ve worked where you have more of an antagonistic, “Well, that’s totally your fault,” kind of play when you’re on an incident with folks.
Matt Davis: Yeah. You are not as quick to point the blame finger when you have reciprocity.
Mandi Walls: Yeah.
Matt Davis: Think of those two different incidents I just described. What do you think is going to happen in the one where everyone’s mad that there’s an incident and everyone’s basically angry at each other?
Mandi Walls: Yeah, the post-incident review is going to be a horrendous mess, and you’d be lucky if anybody learns anything out of it.
Matt Davis: Right. Or they get to that thing and what have we learned? That’s my favorite unquestion in an incident is like, “Okay, what have we learned?” I don’t know. A lot of people learned a lot of things we’re not going to talk about. And then the other one, it’s just like the learning is already happening and people in the incident are already learning and they’re already spreading the learning to other people and they’re like, “Hey, Matt, did you know that this config item actually changes this service so that it can’t talk? I never knew that. I don’t even know anything about that, the service config.” You start to learn more, and the only way that you get to do that is during incidents, so make a time where you can do it when it’s not in an incident.
Mandi Walls: Yeah, excellent. Yeah, I will definitely put a bunch of stuff in the show notes for folks, all the things that we mentioned there, so you can hopefully found your own, even if it’s just a Slack channel to start out with, places to share things and learn stuff.
Matt Davis: A Slack channel is a perfectly good place to start talking about those communities of practice or guilds, I’ve heard them called guilds or whatever. It’s the same thing. It’s a bunch of people getting together because they have a common interest. Make it easy, make a Slack channel. I like to make a resilience channel when I’m in Slack so that we can have a channel like this, and I invite people to that channel, and I say, “This channel is for discussions about resilience, and memes are totally okay.”
Mandi Walls: Memes are required.
Matt Davis: They’re almost required.
Mandi Walls: So with this, I’m sure folks have a lot of preconceptions, or maybe there are some myths around the whole idea of continuous learning as a practice. What kinds of things stand out for you? Is there a myth you’d like to debunk for folks who maybe came into this episode thinking that continuous learning was one thing, so we can point them in the right direction?
Matt Davis: So one of the myths, which I’ve already alluded to, is that it’s just for people going on-call. It’s not. Now, it’s going to greatly benefit the people on-call, obviously. They’re learning more about the system, they learn more about pieces of the system when they share knowledge together. But like I described about the marketers and the product people and the sales and support people, they will bring knowledge to this continuous learning that they would never have any other chance to bring. I mentioned this before. Another big myth is that incidents are extracurricular activities. They’re not. They’re part of the job, and preparing for incidents, doing continuous learning, which prepares you for much more than just incidents, that is time well spent. That is proactive. Sometimes I have trouble explaining to executives what proactive in SRE really means, and guess what? It’s not monitoring.
Mandi Walls: All right, cool. Is there anything else you’d like to leave folks with today? Any small nugget if they’re still on the fence?
Matt Davis: I think I would just underline the fact that anyone at the org can start this. It doesn’t have to be company sponsored. Don’t wait for it, do it, organize, look at making sure that, “Hey, we’re building in these hours to our Agile, to our Scrum.” It’s an Agile corporate overhead kind of thing. This counts as corporate overhead and you really need to have it there. Do it, even if it’s just one or two of you. Practice of Practice [inaudible 00:28:20], the first few sessions that we did were just two or three of us together, and it’ll look that way and it may look that way for a long time, but that’s okay. At least you’ve started.
Mandi Walls: Make that investment for yourself.
Matt Davis: Yeah.
Mandi Walls: Awesome. Well, this has been great. That’s super good. The one off topic thing that I was thinking about at the very beginning when you said about everybody has a relationship with music or music in them or one of those things, one of the sidetracks that brought to mind, especially when we’re talking about the health of the system and things like that, it was like with alerting… Years ago I saw a short piece, or maybe it was an article or something about changing medical equipment to have different pitches of sounds, different kinds of noises to represent different states, and I feel like we’re still missing that in our part of the world. It feels like the steady state should have a nice tone to it, and when something goes wrong, it should be slightly discordant.
Matt Davis: Before we worked together, I was at a data center company called Digex.
Mandi Walls: Oh, yeah, Digex.
Matt Davis: Yeah, yeah. In Maryland. So this was the early ’90s, so computers didn’t come with sound cards back then, and these were all Sun hardware, and sometimes a machine that didn’t get labeled or something like that in the data center would have a problem and we would actually pipe an audio file to /dev/audio and it would play out the teeny little speaker on the SPARCstation 2, and that would help us locate the machine. And even though it was kind of a joke, because I think we played the Imperial March from Star Wars or something stupid like that, it’s a signal, it’s a legitimate signal. I’ve heard people joke about, “Well, if we just have all the sound going on, then people will notice when the sound changes and things like that,” and it’s like, I think being creative with these signals is definitely a good thing to do. I don’t know how else to do it. It’s sometimes easier to hear patterns in music than it is to see patterns, so it’s a really interesting thing to think about. It’s almost a brain hack. And I kind of do this sometimes myself, what kind of little mnemonic devices can I use to help me triangulate that signal? Can music be that? That would be really cool. Not as a joke, but as something legitimate that we can use to figure out the system.
Mandi Walls: But if anybody out there is already doing this and we haven’t heard about it, let us know. You can always drop us a line. We’re at community-team@pagerduty.com. We’re always looking for interesting things to showcase and today has been no exception. It has been super interesting, Matt. Thank you so much.
Matt Davis: Yeah, absolutely. I love talking about this stuff. I love getting into conversations about how humans are the resilient part of our complex systems.
Mandi Walls: Absolutely. Where can folks find you if they want to follow you on social media and know more?
Matt Davis: I’m on LinkedIn, Matt Davis, look for the LFI specialist tag on my name. I am on Mastodon, but really the easiest way to get to all these different pieces of me, go to sounding.com. That’s my private blog and I have a row of links to all my social media up at the top.
Mandi Walls: Awesome. We’ll put that in your bio as well so folks can find you. All right, Matt, thank you so much. Everybody else out there, we’ll wish you an uneventful day and we’ll talk to you again in a couple of weeks. That does it for another installment of Page It To The Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit using the number two. Thank you so much for joining us and remember, uneventful days are beautiful days.
Matt Davis is a musician that finds himself in the cyclone of managing socio-technical systems in a fast-paced digital world. He has recently led the transformation of Incident Management and Resilience Engineering practices at multiple organizations. His practitioner research on improvisation and team collaboration has led to the development of a Practice of Practice technique that opens opportunities for group learning. Connect with him on the Fediverse at c.im/@dtauvdiodr, keep up with his progress on teamwork at practiceofpractice.com, or subscribe at youtube.com/dtauvdiodr.
Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.