George Miranda: Hello, and welcome to a brand new podcast, Page It to the Limit. This podcast focuses on what it means to operate software in production. We do that by asking how you respond when unplanned disruptions impact your users. What can we do to make that experience better? Our goal is to cover the leading practices that we use in the software industry to improve both system resiliency and the lives of the people responsible for running that software. Some of our upcoming episodes focus on topics like chaos engineering practices with Stitch Fix’s Bruce Wong, observability with Honeycomb’s Christine Yen, and smart alerting with Google’s Yuri Grinshteyn. But today, we’re going to get started by introducing this show, getting to know all of our hosts, and giving you a more in-depth look at some of the things that we’ll be discussing in future episodes. I’m one of your co-hosts, George Miranda, @gmiranda23 on Twitter.
Matt Stratton: I’m Matt Stratton, @mattstratton on Twitter.
Julie Gunderson: I’m Julie Gunderson, @julie_gund on Twitter.
Scott McAllister: And I’m Scott McAllister, @stmcallister on Twitter.
George Miranda: This podcast is about running software in production and what happens when disruptions occur, right? And that can take a lot of different forms. So let’s spend a few minutes talking about what this podcast is and why we’ve decided to do it. So as a lead up to that, we’re going to start by picking on a little bit of a pet peeve. So for today’s topic, which is running software in production, we’re going to start by talking about one particular myth that we would each like to debunk. So Matt, why don’t we start with you?
Matt Stratton: We’re about running software in production, and my myth has to do with what’s going to sound like maybe something that happens after, but it’s really because it’s a continuous process. And my friends know that I’m really passionate about the term root cause, and it might give me shivers up my spine when I hear it. And I much prefer to think about contributing factors. And the reason is there’s no one single root cause for things. We have complex systems. And I know talking about the words we use might sound pedantic and I’m getting picky, but I think words influence how we think about things, and the phrases and terms we use could be really influential. So the myth I want to debunk is that there is a root cause. There is one root cause. You know what it is? It’s the big bang. Julie?
Julie Gunderson: I would have to say that one of my pet peeves is talking about just tools. I think there’s a lot more to running software in production than just a tool. I think a lot of it is about the culture and the processes on the team. And I would say that psychological safety may be one of the most important things here.
Scott McAllister: For me, as a developer, I focus more on the API and APIs in general. So the myth that I think about is that software engineers will just “get” your API with very little documentation or samples. Well, some will. To really fully support your whole community, you need to be able to have good documentation, good sample code, and good tools to help people get started so you can support your entire community.
George Miranda: I think for me, the myth that I want to take on is that you don’t need to run software in production to be a good software engineer. And I think that’s a little bit like saying that you don’t actually need to see a bridge that you built withstand a real traffic load to know that you’re a really good civil engineer, right? I mean maybe in some edge case scenarios, that might be true and you can sort of infer. But generally speaking for the most part, you need to understand how the things that you’ve built get used in real life by real people in real situations. If you don’t have that experience and feedback from how it’s actually used, how do you know that the things you’re doing really work well, right? I think without that direct experience of dealing with it in how it’s used, you can’t know for sure how your software operates. So with that, which discussion topics or practices do you think are relevant to cover if our goal is to help folks understand how to build both more resilient systems and better quality of life for engineers?
Matt Stratton: I think a key practice is the idea of learning from incidents. And what I mean by that is how can we get better at actually learning and taking incidents as gifts and not just something we react to and create action items and tickets about? But how do we bubble this up into our larger organization and share the information, but then with that information, how do we learn? Identifying information is just the beginning. We have to then learn from it.
Julie Gunderson: To follow up on that, Matt, especially when you talk about collaboration and bubbling up, I would say that that’s one of the things that I find to be really exciting to me, is how can we work together more to learn from each other and to own our code and services throughout the entire life cycle and throughout the team and the organization, and break down those silos and function more as a whole unit?
Scott McAllister: The things I want to focus on are more like tools for automation and for development, things like Terraform, and also ideas on how we can write clean code as engineers who work in teams together. We spend more time reading our code than writing it, so we should write clean code. So there are different opportunities there, and principles there of just being a little more organized with our code.
George Miranda: I got to say, I kind of love that. I love the variety of viewpoints that I think we’re covering given some of our different backgrounds. I think that’s a really good list to start with. And the only thing that I would add is that there’s such a wealth of topics that fall under this umbrella. The only thing that I would add that we haven’t really covered yet is I think it’s important to hear from other practitioners, from people that are doing this day in day out and get an even wider sample of varying viewpoints, right? So I think one of the topics … it’s not quite a topic, but I think one of the essential things to include is different viewpoints. And so I’m excited that we have a number of different guests lined up to help us do that.
Matt Stratton: I want to loop back on Scott. You brought up tools and I think it’s important that we understand the tools aren’t what gets you there by themselves. And sometimes I think we over-correct, and we say it’s not about tools, it’s just about people, and then we sacrifice actually talking about tools when tools are important. So this could be a whole episode by itself. But Scott, since you brought up tools, I’m interested to know what you think about how tools influence culture.
Scott McAllister: I think there is a lot of importance put on tools. In a lot of software shops, people pick a framework. They pick a stack and they stick on that tool and they shape their decisions on those tools that they’ve picked up. So there is a focus there on that. I think it actually does kind of shape your culture there. Tools are an aspect of it, but your people are also another aspect. And I think the more important aspect is the people that you have that can solve the problem. So tools are just a piece of it, but I think they’re an important piece.
George Miranda: I think Matty was absolutely trolling me when he said that because I immediately wanted to spew off a number of things that you’re right, could be an entire other episode. But the only thing that I’ll add to what Scott said is that having grappled with this quite a bit, I think tool choices matter, but it’s not so much about the tool itself, it’s the default ways in which they make you work. So for a long time, I’ve said that it’s really hard to change culture, right? So how do you even do that? The tools that you use enforce certain default behaviors. Those daily behaviors become your habits, and those habits are what become your culture. So it’s not so much about the tool, but it is about how it affects your workflow and the things that you do.
Matt Stratton: So along those lines, we’re thinking about, this is a new show. We’ve got a bunch of hosts, and we have the ability right now, we have a captive audience I hope. So let’s let our audience know what topics in general you as an individual host of Page It to the Limit, what are you particularly interested in? Julie?
Julie Gunderson: Well, I think I mentioned it before, psychological safety. That is something that is really exciting to me. And we’re going to put a link to the Google study on psychological safety in the notes of this podcast, and then full service ownership. How do we work together to own our code in production? And one of the things that actually fascinates me is when there’s a resistance from software engineers to owning that code, I would like to know why. What is it about the responsibility of what we’ve created? And I do think that there’s an ethical stance, and that’s something that’s really exciting to me.
Scott McAllister: I’m really excited to talk about developer education and experience, just kind of the experience of using software products as a developer, but also working with APIs. Again, that’s kind of my horse that I keep going back to, is talking about APIs. And so I think that’s important and that’s something that as engineers, we face every day. So that’s an important topic. The other thing that has been interesting to me lately is open source software communities and dealing with different projects and seeing how different projects handle contributions from inside a community, but also inside of a company and how that relationship works together. That’s just been fascinating me lately and just seeing how we can collaborate together and come together from lots of different backgrounds to create tools that other people use and that are good quality. And it’s really fascinating to me.
George Miranda: I love open source communities and I’m so glad that you brought that up. I think that many of us here have spent quite a bit of time in open source communities. And so I think that’s going to continue to be a focus that you’ll hear from us quite a bit. I would skew in a slightly different direction, and I would say that for me, I love emergency response systems. I love understanding what the plan is when things go wrong, and I love making those approaches better. So some of that is incident response patterns like Matt mentioned. Some of that is ways that we do continuous learning in an organization. And then some of that is how we actually test in production … to use a popular phrase that’s been thrown around lately … which is how do we do these things for real, right? It’s sort of, like I mentioned in the myth-busting, how do you know that your software is resilient unless you actually break it, right? So looking at practices like chaos engineering and controlled game day scenarios, right? How do we introduce failures live in a controlled way that can help us run better systems?
Matt Stratton: So topics I’m really interested just to kind of bring this around, again, it’s really the human factors because our systems are made up of bits and ones and zeros and wires somewhere in cloudy things, but they’re all built by humans, they’re operated by humans, they’re consumed by humans. And thinking about the human factors all the way through, from the people who designed them, the people who ask for them, the people who run them and the people who use them. And then again, I talked about it before. I’m very passionate about learning from incidents. And so I think we’ve got a really nice gamut of the whole life cycle and our interest level. So I think there’ll be a little something for everyone as you subscribe to this podcast, and we hope that you will.
Scott McAllister: But first we forgot to tell you who we are. Let’s do some intros and get to know the co-hosts. Matt, let’s start with you.
Matt Stratton: Yeah, great. I am a DevOps advocate at PagerDuty. What does that mean? Who knows? Doesn’t matter. But I’ve worked in technology ops since the 1990s, and I’ve always said that I’ve been an ops practitioner for decades and now they pay me to talk about it. So it’s not a bad gig if you can get it. I’ve also been really involved in the DevOps community and the DevOps movement for many years. And I have a podcast called Arrested DevOps. So when we started talking about doing this show, I was like, “Hey, another podcast. Why not? What could go wrong?”
Julie Gunderson: Hi, I’m Julie Gunderson. I’m a DevOps advocate with PagerDuty. I actually had the opportunity to work in many different roles in my life in technology. I started as a recruiter before I even knew that recruiters were not loved and got to meet some amazing people and then convinced my company that we should be part of the community. From there, I worked with cloud partnerships, and knowing PagerDuty’s reputation, I was really excited to have the opportunity to work over here. On top of that, I am a co-founder and organizer of devopsdays Boise, a co-organizer of DeliveryConf, and a community moderator for opensource.com.
George Miranda: Matt, I want to loop back to something that you said. I love your take, and I’m often struck by how similar our backgrounds are, but one of my go-to lines is that I used to be an engineer, now I just talk about being an engineer, which is hilarious, right? That’s basically your take. It’s not quite that bad. I still get some command line time, and there are projects. And we’ll talk about some of those in future episodes. But again, I’m George Miranda, and my background, I was a career long infrastructure engineer dealing with large scale … or at the time, what seemed like large scale distributed systems in finance and the video game industry mostly. I hated being on call, I mean, loathed. So many on-call disaster stories and so many that just took a toll on my personal life. And really what happened is I ran away to become a vendor so that I didn’t have to be on call anymore. After moving into vendor life … I did some consulting at Chef, which is actually where I first met Matt, mostly focused on building CI/CD pipelines and figuring out how to do the DevOps. Then I did some work with containers and the Kubernetes ecosystem by dealing with things in the service mesh realm. And then ironically, eventually I came back around and I ended up working at PagerDuty. And I say ironically because I hated being on call so much, but it turns out there’s a lot that we can do to make it better, and there’s this good Venn diagram of interests that all come together when we talk about emergency response.
Scott McAllister: And I’m Scott McAllister, I’m a developer advocate at PagerDuty. I focus on the experience that developers have working with our APIs. My background is … I’ve been a software engineer for several years, started out in ColdFusion, moved over to Java, then did some Python and then some Node, and then a lot of front end stuff like Angular, tested around with React a bit. Working on that actually lately in my spare time. And even a little bit of Golang in there. And along the way, I found that I love building communities. I built a community of engineers back around ColdFusion, back when I was in college, and loved the fact of bringing people together and learning from one another and also bringing in speakers from outside who knew about things that we didn’t know. So that’s what makes me excited about being in developer relations and being a developer advocate here at PagerDuty. I’m based outside of Seattle. And right now I help manage the Bellevue JS meetup group.
Julie Gunderson: Well, as you can see, we all have pretty diverse backgrounds. So let’s talk a little bit about what brought us all here today and towards focusing on incidents and incident response practices. George, do you want to talk to us about what brought you to PagerDuty?
George Miranda: Sure. And I’m going to do a little aside first. So as our listeners may have figured out, we all work for PagerDuty. And Page It to the Limit is not a PagerDuty-specific podcast. It is not really about the PagerDuty product, but we’ve been talking about this for a while, right? We’ve been talking about doing a podcast around specific incident response practices, because I think we all have things that really led us here or make that really relevant. So for me in particular, I don’t know how to say this, but I basically have some questionable hobbies, whether it’s motorcycle racing or doing back country snowboarding. I work with fire art. I do dangerous things. And so I became an EMT and wilderness first responder out of sheer need. Long story short, I was in a situation where somebody got hurt and I felt very helpless because I didn’t know what to do. And so ever since, I’ve pretty much geeked out on emergency response systems and ways to deal with emergency situations that catch you off guard, right? I am not a super planner in some ways, but in some ways I am. Like you need a plan. You need to know what you’re going to do when things go wrong. And what happens when an emergency unfolds is it never goes according to plan, right? It never goes exactly the way you think it’s going to go. But studies have shown repeatedly that you are an order of magnitude more likely to navigate disasters successfully if you’ve defined what that response is going to look like. And a lot of that is practicing it and committing it so that it’s muscle memory, and you kind of know what the right thing to do is when it happens. And so when I said there was a nice little Venn diagram here, it’s a Venn diagram between my professional career and some of my personal hobbies and I think the things that I’m super interested in. So for me, it’s that I love emergency response systems. I don’t really love emergencies themselves. Those kind of suck, but they’re inevitable, right? What I do love is being prepared for them.
Scott McAllister: With all software, we have incidents with different systems, different … everything has something we’re watching in real time. And to do that, to watch that, to watch our different systems, we need to be able to connect them to the alerting tool. And so for me, that’s what attracted me to this space, was helping people build software that can more easily alert you of incidents. Short answer, but that’s where I’m at.
George Miranda: Scott coming in with a focus on tools. Love it.
Matt Stratton: So I spent a lot of time in the DevOps world before thinking about how people ran their systems and how they released software. George talked about that before. Coming from Chef, I thought a lot about configuration management of your infrastructure, thinking about continuous deployment, continuous delivery, how do you ship software. And as I moved through helping organizations transform how they did that stuff, it all came back to the people, right? Tools are easy, people are hard. And like I said, I’ve been in ops for a long time. And I carried a pager, sometimes a literal pager, for most of my career. And it was really kind of nice to get away from that for a while, like George talked about. It was nice to go work for a vendor and say, “I don’t have to carry a pager anymore. I don’t have to be on call. There’s no sales engineering emergency.” My sales reps might’ve disagreed, but that’s a different story. So when the opportunity came to work with PagerDuty, this was a way to get back to that part of the work, to that part of the ops side. But then it was more than that, is what I started to learn as I spent more time in the space. And the other thing is PagerDuty’s approach to being about people and the care of those who are on-call and making all of that better is what has resonated with me and what keeps me here. It’s always important for me to feel like where I work is making someone’s life better, right? It doesn’t necessarily have to be making all of humanity’s civilization move forward to the singularity of Nirvana, but someone’s life is improving because of the work that I do. And I really do think that PagerDuty and the solutions we provide, they make people’s work life better. Someone’s having a better day than they would have otherwise because of PagerDuty. And that’s why I’m here, and that’s what brought me here, and it’s what keeps me here.
George Miranda: Matt, I know that’s not where you were going with it, but I got to say the first time that I could ditch my physical pager and I was not on call, that did feel like a step towards the singularity Nirvana.
Matt Stratton: The other amazing thing that happens when you ditch your physical pager … and this is dating me … but when I was able to turn in my pager and Nextel, the cuffs of my khakis were evenly worn because for years of on-call, the right cuff of my pants was always frayed because I carried my pager and my Nextel on my right hip. And so your pants kind of do this little drop on that side, and so the cuff gets dirty and frayed. It’s super duper true. So the beautiful thing about us moving to phones instead of pagers is our clothing is more durable now, or it doesn’t have to be as durable.
George Miranda: Nothing says tethered to being on call for the rest of your days like having a communications device tethered to your belt.
Scott McAllister: Going off of something else Matty said, one of the things for me is once I met the people at PagerDuty … you mentioned people … that is when I realized this is a place where I wanted to be. PagerDuty focuses on making people’s lives on call better, but they also focus on their people inside. So that’s something that attracted me to the company.
George Miranda: One thing that I’ll add to that is I, after many years of not being on call, am back willingly in on-call rotations. So I think, Scott, to your point, that actually says something about how seriously we take it.
Julie Gunderson: And I would say for me, as I mentioned, I started as a recruiter, which means that I got to hear why people were leaving their jobs. Why didn’t they like their companies? What was it that made them miserable? And to have the opportunity to go out and talk about the best ways to do things, the ways to make people’s lives easier, it was just one of those that I couldn’t pass up. Beyond that, the reputation of PagerDuty, I was really proud to be able to join the team and to work with such brilliant people. So for me, I guess being able to spread sparkly DevOps cheer is a part of my motivation.
Matt Stratton: All right. So the show is called Page It to the Limit. What the heck does that even mean? And I will tell you something. We know naming things is the hardest thing in the world, so we spent a lot of time trying to think of a name for this show. We had some interesting suggestions from our fellow employees and community folks. So a couple of … so I’m just going to share some of them. Maybe you’ll think we should have picked one of them. Too late because we already picked it, and this is immutable. That was maybe a different kind of DevOps joke there. Some of the suggestions we had, Escalate After Listening, Paigey Talks … and if you’re listening and you don’t know, Paigey is the mascot of PagerDuty … Press One To Act, Press Three To Listen, The On-Call, and then also What’s Up Time. So I think those are all great suggestions, but I’m pretty happy with Page It to the Limit. I think it kind of drives … we’ve got some connection back to the organization, to thinking about on-call, but we’re talking about pushing the limits, man. We’re going for it and being extreme, and Mountain Dew, if you’re listening, we want a sponsorship.
George Miranda: Matt, I groaned as you were reading off some of those suggested names. Thank you for digging that up and calling us all out. Page It to the Limit had some, let’s say, interesting inspiration. It was a name that was thrown out there. And you know what, I think aside from some of the other names that you mentioned, which were funny, or a little bit of an inside joke, the thing that I really liked about Page It to the Limit is that, like you said, Matt, Mountain Dew sponsorship aside, and I’m not saying don’t sponsor us … although I personally am a Monster energy drink kind of person, so either one will be good … but aside from that, you’re right. I mean, we are talking about pushing the leading edge of response practices, right? So what happens when you get paged, right? How do we make that experience better? And it’s not just about that, but about all the preventative things that we can do to keep those situations from happening, or to mitigate that, or to hopefully get to a place where the things that are eventful that cause us a bunch of stress are actually a little more uneventful and a little more routine, and everything that goes into that, right? So it’s not just pushing the edge on what happens when we get paged, but pushing the edge on what we can do to learn from incidents afterwards. And for me, the reason I said yes when I heard it is because it’s about pushing the limit in multiple directions.
Scott McAllister: So what’s on the horizon for content?
George Miranda: Well, we recorded a number of episodes at PagerDuty Summit. And I think the one that we started with, I am super jazzed about. So I sat down with Bruce Wong from Stitch Fix to talk about chaos engineering practices. So at PagerDuty Summit, we ran a workshop where we approached chaos engineering in I think what’s a pretty realistic and tangible way that anybody can use. We did a number of exercises to get you to sort of think through where you start. And the thing that I loved about this workshop is that it really proves that … look, man, I mean, I know that I used the phrase testing in prod earlier, which can be a little bit of a loaded term. You know what? It doesn’t even have to be testing in prod to make this approachable. There are a number of considerations you can think out by just sitting down, talking to your teams, and fleshing out considerations. And so Bruce goes through I think what the iteration is of how you get started with simple thought exercises all the way up to what you can do to run tests in production, simulated failures, chaos engineering, and how to get people to buy in along the way. I think it’s a really good episode.
Julie Gunderson: Additionally, at summit, we recorded an episode with Yuri Grinshteyn from Google. And Yuri has a really big passion around smart alerting and making sure that people are not woken up in the middle of the night for non-actionable alerts. So we talked a little bit about the history of alerting and monitoring and how you can tune some of those systems. Really excited to hear more about what Yuri has to say.
Matt Stratton: I got to sit down with Christine Yen from Honeycomb to talk about observability, which is a really hot topic in the space right now. You’ve probably heard of it. If you are curious about what observability is, what it really is and how it can help you, I really encourage you to listen to this episode. We had a great chat. I learned a lot. I thought I knew stuff about observability. Christine taught me a lot, which was awesome. So if you’re curious about this, check that episode out. It’s a great conversation that talks about how observability can really bring you to that next level of being able to understand your systems and how your systems are trying to talk to you and getting information to people who can do things when they have questions.
George Miranda: I love that roundup. I think that’s a really great place to start. So we’re going to cut it off there, even though we have some other content on the horizon, because all of these episodes are available for you to listen to today. So if you’ve started with this episode, check out your feed. You can find those episodes available for you to hear now. So that’s all we have for you today, but if you like what you’ve heard, go ahead and remember to subscribe to this podcast because you’ll be seeing more content soon. And with that, I’m George Miranda, @gmiranda23 on Twitter.
Matt Stratton: I’m Matt Stratton, @mattstratton on Twitter.
Julie Gunderson: I’m Julie Gunderson, @julie_gund on Twitter.
Scott McAllister: And I’m Scott McAllister, @stmcallister on Twitter, and we’re all wishing you an uneventful day.
The hosts start out with their pet peeves: which myths about running software in production would we like to debunk?
Scott: “As a developer, I focus on APIs… So the myth I think about is that software engineers will just ‘get’ your API with very little documentation or [code] samples. While some will, to fully support your community you need to have good documentation, good sample code, and good tools to help people get started.”
We also cover “root cause” vs “contributing factors”, the role of psychological safety in comparison to picking the right tools, and the obligation software engineers have to manage their software in production if they want to write “good” code.
The goal of this podcast is to discuss practices that build both more resilient systems and better quality of life for the people supporting them. In order to do that, which topics would we like to cover?
Matt: “I think a key practice is the idea of learning from incidents. And what I mean by that is how can we get better at actually learning and taking incidents as gifts and not just something we react to and create action items and tickets about.”
We also cover collaboration, development tools, and the importance of diverse viewpoints.
What are each of the hosts focused on?
Julie: “How do we work together to own our code in production? When there’s a resistance from software engineers to owning that code in production, I’d like to know why.”
The hosts each discuss their own interests which span across psychological safety, developer education & experience, open source software communities, emergency response systems, chaos engineering & game days, human factors, and learning from incidents.
But wait, WHO are these hosts?
George: “I hated being on-call. I mean loathed being on-call. So many on-call disaster stories that just took a toll on my personal life… Ironically, I ended up working at PagerDuty… It turns out there’s a lot we can do to make [on-call] better and there’s this good Venn diagram of interests that come together when we talk about emergency response.”
Everyone does a quick roundtable introduction.
The hosts find common ground from their diverse backgrounds.
Julie: “I started [in IT] as a recruiter which means I got to hear why people were leaving their jobs… What made them miserable?… To have the opportunity to help make people’s lives easier was one I couldn’t pass up.”
Everyone shares a brief version of their origin stories. Scott continues to focus on tools. LOL
Mysteries are revealed.
The hosts talk about the next few episodes and what to expect on the show.