Melting servers, disaster recovery and AI agents with Laura Morgan

Transcript

Daniel Afonso (00:09): Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Daniel Afonso, and you can find me on X, LinkedIn and Bluesky at danieljcafonso. This episode was recorded at PagerDuty [00:00:30] On Tour London. If you’re listening to it before April 2nd, 2025, you can still join PD On Tour in person. Here are the upcoming dates: Sydney on April 2nd, Tokyo on April 10th and San Francisco on April 15th. To register, just head to pagerduty.com/on-tour and pick your favorite location. I’m looking forward to seeing you there. Now onto the episode. Okay, everyone, welcome to another episode of Page It to [00:01:00] the Limit. This one is a special one because we are here at PagerDuty On Tour in London, so we’re going to have very special episodes today. Some of our speakers are joining us to talk a bit about their experiences, have some fun questions, and yeah, that’s it. So we have our first speaker here. Would you like to introduce yourself?

Laura Morgan (01:19): Hi, yeah, thanks for having me. My name is Morgy or Laura if you want to be formal. I am the reliability area tech lead at Spotify. So [00:01:30] I am accountable and responsible for reliability strategy at Spotify. Yeah,

Daniel Afonso (01:35): Okay. And what are some things you’re passionate about? Tell me about yourself.

Laura Morgan (01:41): Yeah, I’m passionate about quite a few things. Reliability is probably pretty far up the list actually, but outside of that, really passionate about music, also things like cycling and sports, but for me, music is probably number one. And so a big part of my passion [00:02:00] is making sure that the service we provide at Spotify is working well for our customers because I’m one of those customers and I care deeply about it.

Daniel Afonso (02:09): That’s really cool. Passion for music, Spotify just feels like it’s a perfect match.

Laura Morgan (02:14): Yeah, definitely.

Daniel Afonso (02:15): Okay. And now we’re here at PagerDuty On Tour in London. So could you tell me a bit about what you’re going to be doing here?

Laura Morgan (02:23): Yeah, so I’m actually here for a panel session. I’m going to be joining another person from a company [00:02:30] called Wiz, and we are going to be talking about mitigating the risk of outages and digital disruption. And that’s a panel session at about 2:30 today.

Daniel Afonso (02:41): Okay. And for people watching, or rather listening right now, sorry, the recordings are going to be available online so you can check them afterwards. So we’re not going to ask questions here focused on that session because, well, everything is going to be there, so we encourage everyone to check them out. So I would like to have some [00:03:00] fun here with you right now. Let’s have some fun questions, some more lighthearted stuff. So because you’re going to be on stage, you’re going to be talking to a bunch of people that we have here and sharing your experience, I would like to know if you have any fun incident stories you would like to share with us.

Laura Morgan (03:14): Oh yeah, so many. I feel like I could be here for hours talking about that, but there is one that always sticks with me that I really like to talk about. And it was pretty early on in my SRE journey. So I used to work for a very [00:03:30] small bank, sort of offshore bank on the Isle of Man when I lived there. And I was the infrastructure manager, if you want to call it that. We had two tiny data center rooms. It was very much an old school setup. And yeah, this was probably my first proper incident. So I got a call at about six, seven o’clock in the morning from the managing director of this branch and she said, Hey, [00:04:00] your room is making a lot of noise. And I said, what do you mean my room is making a lot of noise? And she said, she’s called Caroline.

(04:11): Awesome, awesome human being. She said, yeah, so the room you spend a lot of time in the one with all the servers, it’s really, really noisy. And I was like, oh my goodness. So I asked her on the phone as I’m sort of rushing to get ready to run down the road and get into the office, [00:04:30] I asked her to just put your hand to the door, don’t open it, but put your hand to the door and tell me does it feel warm to the touch? And her answer was yes. So as you can imagine, I sprinted out of my house down the road into the office and the air conditioning had failed in our server room, which would’ve been okay because we had DR tapes, we did regular testing, all of that. So it would’ve been fine, but we were in [00:05:00] between DR tests at that time.

(05:03): So our two banks of tapes were in the server room. So you can imagine there’s a lot of learnings coming from this that I will go into shortly. So quite a few of our servers were completely destroyed. The heat just really did a lot of damage to some of the older servers in there. We ended up, everyone who worked in the office was just bringing fans from their desk and fanning cold air from [00:05:30] the office into the server room while we were waiting for the maintenance people to come and fix the air conditioning. Obviously it was so small there. We didn’t have argon or anything like that. We didn’t have gases that could have helped in this situation. But yeah, nothing caught fire, thankfully. But some things did melt, the tapes included, they just did not function unfortunately. But we had an older backup from a few days ago, so we lost some data. Luckily the mission critical data that we [00:06:00] needed was elsewhere. It was on the internet, it was on people’s computers. It was fine. But yeah, overall I think it was just an interesting, very fun, also slightly scary moment. And what was very fun was that the one system that did not get affected by this was the dot matrix printer, which used to print out, I don’t even remember what we used it for. This was so long ago. But yeah, [00:06:30] the dot matrix printer for some reason was absolutely fine. Old tech, probably good, I dunno.

Daniel Afonso (06:37): Well, that’s scary. And even from your first incident perspective, I just imagine if something like that had happened to me, I would probably be panicking if it was my first incident of sorts, not expecting it, not being prepared for this. So can I ask a follow-up question here? So now, a couple of years after, how would you prepare [00:07:00] mentally for these types of steps? Because obviously probably the first thing that comes to your mind is panic. So how do you now approach incidents from this perspective, or something at this scale again? How would you do it differently now?

Laura Morgan (07:16): Yeah, so this is really interesting. Just looking back on this, I realize now as we talk that this was 20 years ago, so a lot has happened since then. But there’s a couple of things that I take from it that have stuck with me throughout those [00:07:30] 20 years. And that incident really made me realize that this is what I want to do. I realized that it wasn’t a panic for me, it was just I need to make sure things work for our customers and I need to make sure that our employees are able to do their job as quickly as possible. And that was just the first thing and the only thing that was in my mind. And I realized from that I actually enjoyed it. I actually enjoyed the pressure. I enjoyed [00:08:00] problem solving, all of this stuff that comes with that. And I think that was the one moment that I realized this is the right career or the right space for me. What comes after that is, there’s so much. I mean, proper disaster recovery testing, we were doing a good job there, but we were storing the data in the same place as the main failure mode.

(08:22): So yeah, being careful about spreading your failure domains, making sure they are small and [00:08:30] many as opposed to single and large, I think is one of the key things that I get from that. A second thing is just redundancies as well. We only had one air conditioning unit functional in there. And yeah, it just gave up. It was an old building, it was an old system, and having a redundancy would’ve probably mitigated that very, very quickly and meant that I didn’t have to get up too early in the morning. And I think also just the collaboration [00:09:00] part. One thing I realized in that moment that I always remind myself of is it’s okay to ask for help, especially when things are high stress. And it doesn’t necessarily matter if that person isn’t another SRE or sysadmin. It doesn’t matter if they’re not an expert in your domain. I had people that were customer service agents helping me with the servers and helping me unrack things. And it’s okay because we’re all in this together. [00:09:30] We all have the same mission. We all care about our customers and we care about each other. And I think that’s the key thing I think that I find very beautiful in reliability engineering in general, is that ability to ask for help, to help each other and build camaraderie for what we’re trying to do. Yeah,

Daniel Afonso (09:47): That’s really, really great. I’m standing in front of you, so for those of you who are listening, we’re standing in front of each other in a room, you can see the passion in your face when you’re talking about this thing. So I cannot help myself. Again, [00:10:00] I’m just smiling and grinning because it shares the enthusiasm. I can completely feel what you’re feeling there. So thank you so much for sharing this story. So still on the topic of incident response, how would you explain incident response to a five-year-old?

Laura Morgan (10:16): Oh my goodness. Yeah, I’ve thought about this a little bit and I really, I’m not particularly great at communicating with five-year-olds. They seem too smart. [00:10:30] They’re just a little bit

Daniel Afonso (10:31): They ask a lot of whys.

Laura Morgan (10:33): Which is good, which is healthy. And I would start with that. I would maybe say to a five-year-old, incident management, or just overall dealing with crises in technology, is a moment when we are able to solve a problem for people that really care about the outcome. Maybe that’s a little bit too complicated, I dunno. But I would say to that [00:11:00] five-year-old, do you like coloring in? Do you like watching cartoons? And I would say to that five-year-old, what if you weren’t able to do those things? They would be upset. So I would say to them, that’s what we want to do with incident management is make sure that you’re still able to watch those cartoons or use that coloring book or do the things you like to do. And how we do that is first of all, by asking each other questions. So can I have some help?

(11:28): What’s happening here? [00:11:30] Do people know what’s going on? Do I need to call someone and tell them? So I would maybe call mommy and daddy and say, the cartoons aren’t working. So that’s the first bit, the communication piece. The second bit I would say is sitting down and spending time trying to figure out how to fix it. I think that’s something a five-year-old would be able to understand. Okay, so we have a TV, maybe we need to make sure the power is plugged in. Maybe we need to see if the cable [00:12:00] is plugged in, all of these things. And then when we do fix it, what do we learn about what happened? So did we accidentally pull out the power socket or the cable or did we knock something over? So that’s the learning part, and this is where we can make sure that these things don’t happen again. So I think there’s a lot of analogies to day-to-day interactions that maybe a five-year-old would have that you can, yeah, definitely draw into incident management and crisis response, I think.

Daniel Afonso (12:29): Yeah, definitely. [00:12:30] And that also feels like there’s some learnings that they can take from that and apply in their life. At the end of the day, there are also practices that we can bring into our day-to-day. If we start talking about postmortems and talking about blameless culture, these are things that every human and everyone should take with them in their day-to-day. So there’s probably some wisdom that we can take to some five-year-olds from this.

Laura Morgan (12:53): I think so. I genuinely believe that being an SRE has made me a better person overall just because of all of these things that we [00:13:00] talk about.

Daniel Afonso (13:01): Well, that’s great. So I have two more questions before I let you go. So everyone’s talking about AI now. That’s the thing, now we are in 2025. There’s also the discussion of 2025 being the year of agents. What I wanted to know is what can you share right now about Spotify and AI integration in your digital operations at the moment? What’s the current status of it? What are some difficulties that you found [00:13:30] and where do you predict that AI will help you and bring more improvements in the next years, next couple of months for you?

Laura Morgan (13:39): I love this question. If you’d have asked me this two years ago, I would’ve been so, so skeptical. But I have changed my mind. I am no longer a Luddite in this space. So at Spotify, I’m really excited at the minute because we are starting to engage a lot more with this idea of agentic AI, [00:14:00] how we can leverage the support functionality of AI and machine learning a lot more. And obviously we’re working with PagerDuty quite closely on what we can do in that space as well. You’ve mentioned this is the year of agents, and I think that’s very, very true. I’m very excited to engage with the prospect that during an incident when we get woken up at night and we’re tired and we’re stressed, having that sort of friend almost to [00:14:30] answer questions when we might need some extra help, troubleshooting, things like that I think are going to be such a game changer.

(14:38): Just the idea of being able to say, has this happened before? Do we see any other anomalies in our observability stack? And just build those queries in a natural and human way I think is going to be such a huge, huge win for engineers when they need to feel collected and they need that [00:15:00] support. They need human support as much as they need compute support at this point in time, I think. So I’m really excited about that and I’m even more excited about what comes in the future. So for us at Spotify, I think we’re starting to learn that we need to build very good data structures around reliability. So we have Backstage, our developer portal, and we have those folks here today as well. So if anyone’s listening [00:15:30] to this and they’re in the room, please go check out our Backstage partner booth. But the data structures within Backstage that mirror our software graph, we have team data, who owns what, in there, that sort of thing.

(15:44): We haven’t really looked at how we bring that into reliability and crisis management and incident management. So now we’re working very, very deeply with PagerDuty to sort of bring our data structures in reliability platforms into the same [00:16:00] space and up to the same baseline that we have for our software ecosystem. And I think that’s going to be the big game changer. That’s the big learning that we have, that we need to do that to be able to make our use of AI and machine learning a lot more clean and a lot more efficient in the future. And what comes after that, I think, that I’m most excited about is this idea of potentially getting to a self-healing architecture sort of environment. That blows [00:16:30] my mind. I don’t think we’re that far away from it either. Again, it’s these data structures, making sure things are clean and neat and healthy and not overcomplicated, but you have a good observability stack, you have a good alerting stack. Things like PagerDuty, those sort of things in your environment, getting to a point where you’re able to have this pattern matching, this anomaly detection, look at your environment also [00:17:00] and suggest a change that could prevent a fault down the line. Or even automating that. It doesn’t even compute in my mind how huge that is potentially. And we’re so close, I think.

Daniel Afonso (17:17): Yeah, and it’s really exciting because like you said, if we thought about this two, three years ago, we were like, yeah, there’s no way that’s happening anytime soon. And now it just seems like it’s so cool that we can grasp it.

Laura Morgan (17:29): Oh yeah.

Daniel Afonso (17:30): [00:17:30] That’s so cool. So okay, we’re about to finish our time, and I’m going to have to let you go so you can get ready for your session. What is the one thing that you’re currently excited about and looking forward to exploring for the rest of the year?

Laura Morgan (17:44): What I mentioned there about how do we get to a more proactive state with incident management and reliability. I think merging all these things together, like lifting toil away from developers, how to make them [00:18:00] more effective in what they’re doing and therefore building a much more beautiful and reliable experience for our customers. There’s that, but also marrying that to this notion of proactivity. I think in reliability engineering in the industry in general, for years we have been talking about how do we get to a proactive state, and I don’t think anyone’s really done it in the way that they want to or really done it at all. [00:18:30] And I think now is the time in the industry that we will actually start to be able to answer that question of what it means to be proactive, what that means for our customers, what it means for our business strategies, but also what it means for us as human beings. How do we get people to feel less stressed at work? How do we lift time away from them to be able to focus on the things that they really want to be doing, like building these amazing products for their customers. So yeah, [00:19:00] that’s what I’m excited about this year. It’s a bit visionary, but I think we’re going to get there. Yeah,

Daniel Afonso (19:06): That’s really exciting. So before we close up, is there any place that people can connect with you, they can find you online that you would like to share?

Laura Morgan (19:13): Yeah, so you can find me on LinkedIn. My name is Laura Morgan. I’m at Spotify. You’ll be able to tell that’s me on there. Please drop me a message anytime. I’d be happy to hear from you. And if you want to nerd out about reliability, I’m always here.

Daniel Afonso (19:26): Thank you so much for joining us, and for everyone listening, Laura is still going [00:19:30] to do their session here at PD On Tour, so you can go and check the recording. It’s going to be here on the links for the podcast. So thank you everyone and have an uneventful day. That does it for another episode of Page It to the Limit. We would like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast on your favorite streaming service. If you like what [00:20:00] you’ve heard, you can find our show notes at pageittothelimit.com and let’s continue this conversation on PagerDuty Commons at community.pagerduty.com. Thank you so much for joining us and remember, uneventful days are beautiful days.

Guests

Laura Morgan

Hosts

Daniel Afonso (he/him)