Julie Gunderson: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson, @Julie_Gund on Twitter.
Today, we’re going to talk about the important topic of change management and why it matters. We’re joined by our VP of Engineering here at PagerDuty, Gautam Prabhu, who’s going to talk to us about change management and how we think through it here at PagerDuty. Thank you for being with us.
To get us started, how would you describe change management for anyone that’s new to a more formalized practice?
Gautam Prabhu: The framing that I always like to use for change management is what is the best way to get to your goals? When we put that in the context of a software organization or a software team within a software company, what are our goals? Our goals are first of all, to run safe, stable infrastructure, and product for our customers, but also to improve that over time. Those are sometimes conflicting goals where you’re, I want to run something that’s just running and you can rely on no matter what, but I also need to change it a lot. There is a well-trod path of a process that you use to balance those goals against each other, in a way that gets you the best possible results keeping the customer in mind.
Julie Gunderson: Thank you. One of the things that we always ask is, what is a myth or a common misconception that you want to debunk when it comes to change management?
Gautam Prabhu: I have many. The first myth that I want to debunk is that change management is something that is unique to software. I mentioned we’re a software company and I’ll be talking about change management within the context of a software organization, but change management is something that’s useful everywhere. If you have a sales process, a process for billing, for invoicing, for marketing, all of these things have processes that benefit from someone looking at them and saying, how do we make sure that our processes are best achieving our goals? The first myth is that change management is something that sits purely inside the software world. It doesn’t. It’s super necessary for software teams, but there is an art to it that’s even beyond just the software part of the equation. I think that’s one myth.
Gautam Prabhu: A second myth is that there’s one way to do it. There is no one perfect way of doing this. To dive a little bit deeper into that, you have to take the circumstances that are actually facing you to make the perfect change process, and if I were to give some extremes, you’ve got a two-person startup that has no product, no customers, and just needs to get stuff done as fast as they possibly can. They will have a very different change management process than, I don’t know, someone running Gmail who basically has hundreds of millions of people who rely on them every single day. There’s not one way to do it. It’s an art more than a science. There are some things you can study to say like, “Well, how do I approach this problem?” But each company will have its own kind of unique way and unique flavor of change management, because each company is slightly different in terms of the thing that it’s being asked to do.
Really, what I find is you need to have some opinions on, how have I seen this done well, what are some general principles that I want to be true? But you cannot just say, give someone else, “Well, here’s the flow chart that you would follow to get to a perfect change management process,” because you don’t know all the variables. It’s going to be different from place to place from company to company, from team to team, and from phase to phase. The change management process that was present at PagerDuty in 2009, when the company started and there were three people sitting in a room together, it will be very different from where we are today, and where we are today will be very different from where we are two years in the future.
Julie Gunderson: Thank you for that. I know we talked to a lot of organizations, as you mentioned, different sizes, different verticals, right? I know that for example, financial services organizations may have very, very strict change management, versus, as you mentioned, that startup with two or three people. Let’s talk a little bit about how we do it at PagerDuty. How do we approach change management? If you want to talk about how we kind of did it when it was just Alex and the other two in a room on folding chairs versus how we do it today. That’d be great. How does PagerDuty handle this?
Gautam Prabhu: Having been in that same situation that Alex was in with the folding chairs, with my own startup, I can pretty much promise you there was no change management. The change management was just get it out, however you want to get it as fast as you can, in the earliest days. Fortunately for all of our customers, that was the right decision then, and we have matured way past that now. When we think about how we approach change management, like I said, the first thing we have to ask ourselves is what are our goals? Let’s put some numbers and some actual like meat to those goals.
The first thing we have to think about is we need to run something stable. We are an infrastructure company, and some of the biggest names in software, that everyone knows, rely on us to be part of their infrastructure. As Tim Armandpour, our SVP of Engineering, says, “We have to be at our best on the days that our customers are at their worst.” The first thing we have to say is, when we say we need to run something stable, “What does that mean?” We have numbers. We have SLAs, service-level agreements that we have with our customers, and internally we have what are called SLOs, which are service-level objectives, which are even more rigorous. If I promise you 99.9% uptime, I’d better be shooting for much higher than that inside.
That just helps us put a framing on when we craft a process: what are the numbers that need to stay solid? If our change management process doesn’t let us get to those numbers, it’s the wrong process by definition. Now at the same time, there’s another thing you have to look at, which is we do need to change the software. No one buys PagerDuty, or really any SaaS software, saying, “I expect it to stay the same forever.” You’re buying this offer because you expect it to improve over time, and you put out roadmaps for customers of here’s where we think this is going.
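To make the SLA and SLO framing concrete, here is a minimal sketch of how an availability target translates into a downtime budget. The two targets (99.9% and 99.95%) and the 30-day window are assumptions chosen only for illustration; they are not PagerDuty's actual commitments, and the function name is hypothetical.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in a period at a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


# Hypothetical targets, for illustration only.
SLA_PCT = 99.9   # what is promised to customers
SLO_PCT = 99.95  # stricter internal objective

sla_budget = allowed_downtime_minutes(SLA_PCT)
slo_budget = allowed_downtime_minutes(SLO_PCT)

print(f"SLA budget: {sla_budget:.1f} minutes per 30 days")
print(f"SLO budget: {slo_budget:.1f} minutes per 30 days")
```

Because the internal objective is tighter than the external promise, a change process that keeps the team inside the SLO budget leaves headroom before the customer-facing SLA is at risk.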
There’s this balancing act we have to think about where, how do we achieve our roadmap in a predictable way while maintaining this high SLA that we offer our customers. Now we have to back fit a process to that, that fits those requirements. That’s kind of like the highest level framing. If you go down now from the bottom-up, PagerDuty has a philosophy of what we call continuous delivery. If you look at maybe a couple of opposite ends of the spectrum, you have on one end a scheduled deployment, meaning here’s the day on which we’re sending out as many changes as we can at the same time. A scheduled deployment also generally implies a much longer timeframe of planning, meaning we need to plan out what’s going to happen in the next three weeks, or in some cases three months, and it’s all going to hit on this day, and there’s benefits and drawbacks to that.
On the other end of the spectrum, you have total chaos. In the middle you have something that has come to be called continuous integration and continuous delivery. The kind of mentality you have to have is, when someone is writing code, their goal is to get that code out to production in a safe way, as fast as possible. Don’t wait on other things, get it out to production, make sure it’s working well in production, and then go back to your next task. Rather than work on 10 things, batch them up, have hundreds of other people batch up their 10 things and send out all 2000 at the same time.
We have chosen this philosophy of continuous delivery, and that’s for a few reasons. The first thing is in terms of our ability, when we talk about roadmaps, we can release things on a pretty tight time frame, because we do have this idea of, we can send things to production at any time. Sometimes that will be a feature that a customer really needs. It could be a bug fix that a customer really needs, or it could be part of a feature that’s building up to where it’s something bigger than just sitting there kind of hidden, but we can get it out, sort of like in little, little chunks and get it out fast. That’s one reason we do it, is we can get things out to production faster and that makes our customers happier.
The second thing is, it’s kind of counterintuitive. There’s some improved reliability you can get by doing what’s called a continuous delivery. The best way to think about it is that you and 500 people have to get together, and launch 500 changes at once and something goes wrong. First of all, many things could go wrong, because 500 things have changed. Second of all, when you’re looking at it, it’s hard to say what change caused this problem, because there’s 500 of them. Sometimes it will be very obvious. Sometimes it won’t. Sometimes it might be two things that if they’d gone out separately, wouldn’t have caused a problem, but now they went out together and it’s very hard to say like, “Oh, it’s because this changed and this other team made this change, it caused this bug.”
When you have a problem like that, what you usually fall back to is a very lengthy deploy process with a very lengthy QA cycle, and a lot of gates and checks to basically say like, “Are we good? Are we good? Are we good?” Then if anything goes wrong, what you’ll often do is you’ll roll the whole thing back. One thing went wrong, out of these 500 changes, but because it’s so difficult to figure out what went wrong quickly, and you don’t want your customers to be experiencing that pain for a long period of time, you’re like, “Okay, everything gets rolled back, all 500 changes.” With continuous delivery, you can release something very small. If it’s gone wrong, you can roll it back very quickly.
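The batch-size argument above can be put into rough numbers. This is a simplified sketch, assuming every change independently has the same small chance of being faulty (the 1% figure is made up for illustration); real changes are not independent, which is part of the point about interactions. The function names are hypothetical.

```python
from math import comb


def p_batch_has_failure(n_changes: int, p_bad: float) -> float:
    """Probability that at least one change in a batch is faulty,
    assuming each change fails independently with probability p_bad."""
    return 1 - (1 - p_bad) ** n_changes


def pairwise_interactions(n_changes: int) -> int:
    """Number of distinct pairs of changes that could interact in a batch."""
    return comb(n_changes, 2)


P_BAD = 0.01  # assumed per-change failure probability, illustration only

for batch_size in (1, 10, 500):
    risk = p_batch_has_failure(batch_size, P_BAD)
    pairs = pairwise_interactions(batch_size)
    print(f"batch={batch_size:>3}  P(at least one bad)={risk:.3f}  pairs={pairs}")
```

Under these assumptions a single change has nothing to interact with, a batch of 10 has 45 possible pairs, and a batch of 500 has 124,750, which is why narrowing down the culprit in a large release is so much harder.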
Julie Gunderson: Now that’s amazing, and that’s one of the things we hear too, is about being able to roll back quickly. But let me ask you, with all these changes happening at the same time, how do we know that we’re in the right environment, the most up-to-date environment?
Gautam Prabhu: That’s a really interesting question, and it kind of gets to the phases of how to do continuous delivery as safely as possible. The first environment that any change should always be made on is a developer’s local environment, like their laptop. There’s an expectation that that’s the first place that you validate that whatever you’ve done is running well. There’s many ways to do that. You can test things manually. You can run a suite of automated tests, or a subset of a suite of automated tests, but that is the first place you make sure things are going well. Then traditionally, what happens is there’s another environment. Some people will call it the QA environment, or the staging environment, or the test environment, but that’s now, I’ve taken my change that I’ve made off of my laptop, and I put it somewhere else where I can test it with all of our other systems kind of as close to what production looks like as possible.
That’s also a super important experiment to do, because you know what? Your local environment might not have all the most recent changes, and so what you’re trying to do is get something that’s much closer to what production looks like. Now, you have a local testing environment and then you have a staging or QA testing environment. Then the next step is to go to essentially a pre-production environment. Your goal there is at this point, you’ve done a bunch of tests. You’ve tested it locally. You test it in staging, you’ve run automated tests, you’ve run kind of like tests that run through maybe a user journey, like in an automated way. You’ve had someone review your code. All of these tests are passing, but there’s still risks that something could have gone wrong.
Now what you do is, you go to a stage where some percentage of your customers will experience this change, but it’s a small percentage, and you’re looking at what’s happening. You’re looking at error rates, you’re looking at response times, and if you see something you don’t like, you roll it back, and you have affected some customers, but it’s a small percentage of them, and it’s for a very short period of time. One common term for that, which we use at PagerDuty, is canarying: the change is in production, but for a very small subset. The last phase is, once you feel comfortable with what you see in this canary environment, because you are running on your production infrastructure at this point, then you release it out to production and everyone’s got it.
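The canary step described above can be sketched as a simple gate: expose the change to a growing share of customers, check error rate and response time at each stage, and roll back at the first sign of trouble. Everything here is a hypothetical illustration, not PagerDuty's tooling: the stage percentages, the thresholds, and the `observe` callback are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class StageMetrics:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile response time


# Hypothetical thresholds and rollout stages (percent of customers).
MAX_ERROR_RATE = 0.01
MAX_P95_LATENCY_MS = 500.0
ROLLOUT_STAGES = [1, 10, 50, 100]


def is_healthy(metrics: StageMetrics) -> bool:
    """A stage passes only if both error rate and latency are within limits."""
    return (metrics.error_rate <= MAX_ERROR_RATE
            and metrics.p95_latency_ms <= MAX_P95_LATENCY_MS)


def run_rollout(observe: Callable[[int], StageMetrics]) -> Tuple[str, int]:
    """Widen exposure stage by stage; stop and roll back on the first unhealthy stage.

    Returns ("released", 100) on success, or ("rolled_back", percent) with the
    exposure level at which the problem was detected.
    """
    for percent in ROLLOUT_STAGES:
        if not is_healthy(observe(percent)):
            return ("rolled_back", percent)
    return ("released", 100)


# Simulated metrics: the change misbehaves once it reaches 10% of customers.
def fake_observe(percent: int) -> StageMetrics:
    error_rate = 0.2 if percent >= 10 else 0.001
    return StageMetrics(error_rate=error_rate, p95_latency_ms=120.0)


print(run_rollout(fake_observe))
```

In this simulated run the problem is caught at the 10% stage, so the remaining 90% of customers never see the faulty change.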
Julie Gunderson: What I’m hearing is we have the notion of a blast radius.
Gautam Prabhu: Exactly. The blast radius will get bigger and bigger, and then ultimately at the end, all of your customers will have it. We all know it’s impossible to say that bugs never make it out all the way to the wild, but our job is to filter out as many of them as we can at these stages that we put in the way beforehand.
Julie Gunderson: Then what I also heard you say was we have peer review here. We don’t have major CABs that are looking at the 50,000 lines of code. Right? Can you talk a little bit more about how that’s worked for us?
Gautam Prabhu: Yeah. I think peer review is one of those things, which is just general good hygiene for any software company. It’s actually something that we take really seriously here, so any change that anyone is making to a code base must be reviewed by someone else. We don’t have super strict requirements on, it must be reviewed by a manager, or it must be reviewed by a senior engineer. What we ask is put a second set of eyes on it, because there’s a lot of utility that you get out of that. The first is, sometimes you catch just obvious bugs or errors. The second thing is, if there’s questions that you have, or sometimes design questions, that’s a good place to go back and forth with someone, where it’s almost like, “Here’s what I’m thinking of doing. Let’s have a conversation about it with a written record, and add people to the conversation as needed,” so it’s a good place to get some feedback on designs.
Then, to go a bit further, when things go wrong, it is nice to have more than one person who understands what this change was about. Maybe the person who made the change isn’t around when the thing goes wrong, maybe they’re a little bit over their head. Having someone else that you can pull in and say, there’s more than one person who’s been involved with this, and they’re not trying to come up to speed when our customers are suffering pain, the better. Peer review is something that we take really seriously and any software change that we make goes through that process.
It also is a good validation step for, have you done your homework getting up to this peer review? There’s an expectation that all of our tests are passing, and that you’ve written new tests to cover the changes that you’ve made, so that the next person who’s changing code is not breaking anything of yours. Peer review is like a really good just human lens to catch a lot of kind of low-hanging fruit.
Julie Gunderson: Then how, or what advice would you give to organizations that say, “We just can’t do that. We just have too much regulation. Our deploys have to be scheduled. We can look at once a quarter,” but they want to change that. Where would you say to start?
Gautam Prabhu: Yeah, that’s a good question. There are definitely places where, due to regulations, or just the reality of what they’re releasing and history, it’s hard to just say we’re going to go to a continuous release. The first thing I would say is it’s not a binary situation. It’s not that you’re on continuous delivery or you’re on one-year delivery. There’s places which have scheduled delivery, which I’ve worked with before, where it’s one week instead of one month or instead of three months. The first thing I would say is presuming that you see the advantages of going to a continuous delivery model, and you’re not bound by strict regulations, the first thing to do is just to try to increase the cadence of those deploys. If you’re on three months, ask yourself the question of what would it take to get to two months, and what are the parts of the process which are really necessary, and what are the parts of the process which are just making it longer unnecessarily?
I think that’s one thing you can do, is you can try to bring it down over time. It’s not the end of the world to batch some things together. If you batch 500 things together, and then you’re able to get that down to 10 things, that’s still a huge advantage. The problems that I’m talking about, reduce in scope in terms of the number of things that need to be rolled back, the number of changes that could kind of interact with each other in some unexpected way. I think that’s one way to do it.
Second way is, a lot of times those long deploy cycles are representing a pipeline of people, not just the code. It’s like, well, developers write code and then hand it off to a QA team who validates code, and then if they don’t like it, it comes back, and that cycle is one that’s been used by many companies to ship successful software, but it’s also something where it’s not the only way to do it, and to get that sort of injection of quality earlier, you can actually embed QA resources within each individual team as an example.
That process of assessing quality can happen almost in real time as changes are being developed versus waiting for all of them to be done, and then sitting there on a QA team, taking 500 changes at once and having to test it like this all has to get tested together. Reducing that cycle time is possible by kind of rethinking the way that you assess quality and the people involved, in getting them involved earlier.
I think that’s one of the general lessons I’ve learned in software development, is the more that you involve people early in the process, the less you end up generating this sort of series of handoffs, which takes a long time. I don’t know if you’ve ever had a podcast on just the premise of agile, scrum-like development versus waterfall development. They have pros and cons. I’m not saying one is the perfect way to do it, but that’s kind of the same thing. If you want to get to an agile process, it means getting people involved at the beginning.
Julie Gunderson: I mean, I think that’s one of the main things that we talk about with DevOps, right?
Gautam Prabhu: Yeah. Absolutely. There’s all these handoffs that you can do in a traditional org from dev to ops, and ops out to production. PagerDuty is taking a different approach. That’s a really good point, Julie. We have this idea of build it, ship it, own it. Teams actually have embedded resources who are capable of getting things out from their laptop to production as a team. That sort of embedding of people who have understanding of how to get things to production in every team, lets you do it a lot faster, because otherwise you end up with a more traditional sort of team that’s like, this is the production deploy team, and they turn into the take a number team like, “Okay, well, we’ll get to you after we’ve gotten to these other 10 people.”
Julie Gunderson: I would say a lot of that goes back to knowledge sharing and an ultimate trust. You have to have a lot of trust in your folks. You have to trust that they’re doing things in the right order, that they’re running all of their tests.
Gautam Prabhu: Yep.
Julie Gunderson: Ultimately, it’s gotten PagerDuty to where we needed to be. We also have the benefit of kind of growing up in this DevOps culture as an organization versus a lot of the more legacy. Sometimes I’ll mention ITIL as a not-so-nice phrase when we talk about processes that just maybe don’t serve people as much anymore. While there are still some things that make sense for organizations, we’re now in the phase, and I mean, I think we all saw this during the pandemic, right? People needed to be able to change, and to scale, and to move fast, and having those super long development life cycles was not going to make organizations successful. I think a lot of organizations came to that conclusion.
Gautam Prabhu: Yeah. I think that’s a really good point. Just thinking back to earlier roles that I’ve had, you often, when you’re running a project, you ask yourself what’s the longest pole? What is the longest thing here, and then you set your process based on well, what takes the longest? There were jobs where sometimes that longest pole was, someone’s got to drive down to the hosting place, plug in machines and there’s a real sort of human cost of delay that’s introduced in that.
Then even in a slightly more modern version, it’s like, “Well, we’ve got to call the hosting provider and get compute spun up, so we’re going to base everything around that.” But we’re now in this more modern world where you have these cloud providers, you can spin up compute very quickly, so the processes have to evolve for that, because probably what they were anchored in is not true as much anymore, where there’s this huge upfront cost of spinning up infrastructure. That’s a lot of what I think has kind of pushed this kind of DevOps revolution that, like you said, PagerDuty’s been part of, because we’ve been cloud native from day one.
Julie Gunderson: Well, as we’re running out of time, I want to make sure that we get to the two questions that we ask every guest on this show. I think you actually kind of already addressed the first one, but I’m going to ask it anyway. What’s the one thing you wish you would’ve known sooner when it comes to running software in production?
Gautam Prabhu: I think the thing that’s most important is when you’re running software in production, you need to have a very, very healthy system for understanding what has changed, and a very, very quick system for correcting errors. Those two things, if you have those two things, you are able to run software in production responsibly. If you can figure out what changed, and if something changed that you don’t like, undo that as quickly as possible. You can run a service that is giving your customers what they need. It’s not really as much about the journey for how it gets out there, but it’s once it’s out there, do you know what to do with it responsibly?
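The two capabilities named above, knowing what changed and undoing it quickly, can be sketched as a small change log that records each change alongside the state it replaced. This is a toy illustration under stated assumptions: the `ChangeLog` class, its methods, and the configuration keys are all hypothetical, and a real system would track deploys rather than an in-memory dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    change_id: str
    author: str
    description: str
    state_before: Dict[str, str]  # snapshot needed to undo this change


class ChangeLog:
    """Tracks live state plus an ordered history of the changes applied to it."""

    def __init__(self, initial_state: Dict[str, str]) -> None:
        self.live: Dict[str, str] = dict(initial_state)
        self.history: List[Change] = []

    def apply(self, change_id: str, author: str, description: str,
              updates: Dict[str, str]) -> None:
        """Record the prior state, then apply the updates to live state."""
        self.history.append(Change(change_id, author, description, dict(self.live)))
        self.live.update(updates)

    def what_changed(self, last_n: int = 5) -> List[Change]:
        """Return the most recent changes, oldest first."""
        return self.history[-last_n:]

    def rollback_last(self) -> Optional[Change]:
        """Undo the most recent change by restoring its saved snapshot."""
        if not self.history:
            return None
        change = self.history.pop()
        self.live = dict(change.state_before)
        return change


log = ChangeLog({"timeout_s": "30"})
log.apply("chg-1", "alice", "raise timeout", {"timeout_s": "60"})
log.apply("chg-2", "bob", "enable new retry policy", {"retry_policy": "v2"})

print([c.change_id for c in log.what_changed()])
undone = log.rollback_last()
print(undone.change_id, log.live)
```

Keeping each change small means the snapshot to restore is small too, so the undo path stays fast.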
Julie Gunderson: Excellent. Then, is there anything about running software in production that you’re glad I did not ask you about?
Gautam Prabhu: My first boss at my first job used to say something that was, I think, mostly a joke, but it’s proven more and more true, which is like, “The world is our test bed.” Ultimately when software gets sufficiently complicated, you have these gates, you have these checks, you have these balances, but there is no test for your software like having it out in production in all of your customers’ hands. It has the biggest blast radius, but it’s also the most in-depth test you can do. All software on earth that’s being released ultimately is being tested by the customers in the end.
Julie Gunderson: Gautam, thank you. That is actually fantastic. I want to thank everybody for listening to us today. This is Julie Gunderson wishing you an uneventful day.
That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit, using the number two; that’s @pageit2thelimit. Let us know what you think of the show. Thank you so much for joining us, and remember, uneventful days are beautiful days.
Gautam Prabhu is based in San Francisco and has been leading engineering teams for about 15 years now. He is currently the VP of Engineering for PagerDuty, where he leads product delivery and infrastructure. Prior to that, he was VP of Engineering for Shared Services/Infrastructure for Zendesk. Before Zendesk, he was VP of Engineering at a small startup called PowerReviews, which he helped start with 3 others.