Mandi Walls (00:09): Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me @lnxchk on Twitter.
All right, folks. Welcome back to Page It to the Limit. With me today, I’ve got Mark Hatch, and we’re going to be talking about, we’re going to make everyone feel old maybe. It’ll be great, but Mark’s going to talk us through some of his experiences with monitoring and learning over a long storied career. So Mark, welcome to the show. Tell us a bit about yourself and what you do.
Mark Hatch (00:51): Thanks, Mandi. Glad to be here. Mark Hatch, I’m currently a director in the infrastructure space, but I’ve spent my entire career, which is now about 25 to 27 years in the infrastructure space. It’s appropriate for me to be on a call like this because I’ve always felt myself as sort of the operational guy, the sentinel standing guard, and monitoring and alerting is at the crux of all of that.
So I’m just very passionate about the tooling, about the technology, but more important than that, I think the sort of cause of this and that is delivering value and performance and stability for customers. I’m very passionate about it. I’ve worked in a lot of different industries, so I’ve been able to see how automotive does it, how finance does it, how entertainment does it. I’ve been able to see a lot, and they’re all a little bit different, but there’s an undercurrent to it all. It’s just something I’m very excited about, and I’ve enjoyed my entire career focused in this area.
Mandi Walls (01:50): That’s awesome. Yeah, so as you have progressed and you’ve moved through things, where did you really get started with all of this stuff?
Mark Hatch (01:58): I started, I remember I was working at a custom build shop just in the nineties, and I was building custom build computers. And at the beginning, it was working on workstations. That was never enough. You learn about it, and the next thing you know… And there was jumpers at the time. I’m definitely here to date everyone. You had jumpers on the motherboard that you had to move around, but it wasn’t that complicated. It didn’t get very far.
And then, I remember on it, we had a shelf of a bunch of different operating systems, and I think I went through the path at the time that everybody else did. Everybody started with NetWare. You moved on to Microsoft because they had just released Windows for Workgroups. And then, I think every shop I had ever been in, there was this little disk on the side that nobody ever looked at and nobody ever bought it, and it was Linux. The first one for me was Slackware. I think it was when I got into Linux where it opened up the potential for the future, and I started seeing where I wanted to go with it and what it was. I wanted to be on that server side. I wanted to be in the back office and not working on workstations. So yeah, long time ago, building PCs and just looking at a shelf of software is what inspired me.
Mandi Walls (03:13): That’s crazy. Yeah, I remember those days. Slackware, parts of it would come in the Linux Journal or whatever. You’d get the discs in the mail with the magazine, or somebody would bring them to a swap meet or a Linux users group, and there’d be a stack of discs. Or somebody might have had a DVD burner at the time, or a CD burner at the time, that you pass around. And yeah, it was a different era for sure.
Mark Hatch (03:39): Yeah. I actually remember, about a year and a half after I started down this path, I realized that I should not have spent money on that CD. It was the entry point, so I’m happy that I spent the money, but I very quickly came to realize that I’m not using the CD ever again. I’m going straight to distro.
Mandi Walls (03:57): Yeah, definitely. I know I did too. I had a butterfly pack of Red Hat discs at one point, and I’m like, yeah, I love them, but I’ve used them once or twice.
Mark Hatch (04:10): Yeah, yeah. Exactly.
Mandi Walls (04:12): So that changes your user focus. If you started out with workstations, the user experience, the customer experience you’ve got there is super different from working in the backend and working on servers that users don’t necessarily know are back there. So how does it change how you feel about customer experience and putting things in front of folks?
Mark Hatch (04:34): I think it’s about a disconnect. I think when you’re working on workstations, the user is always directly interacting with it. They know what they want, they know what it looks like, they know what it feels like. But then, you start looking at the server and the hosting side, and they’re disconnected enough from what’s happening that the reports aren’t direct anymore. It’s not, “This is happening,” it’s, “I’m having this weird behavior. Can you look into it?”
I think the other one, and I think topical for the focus, is finding out about it became very difficult. When a user is having a workstation issue, they’re on the phone, or they’re walking over or whatever. But on the server side, I think the user’s tendency is, “Ah, something might be going on. It’s on me.” So it takes a little bit longer. That’s the important part of the beginning of this particular journey is how slow and cumbersome everything was in the beginning, and it’s because you didn’t have that direct interaction with everything. The users were a little bit more obfuscated away.
Mandi Walls (05:36): Absolutely, and we know things are different now. What did you see were points where things started to change? Or were there major points where there were inflection points, and all of a sudden, things got better? It felt like we had used the same tools for a very, very long time, and then all of a sudden, stuff changed, but that was what I saw. But what did you see on your path?
Mark Hatch (05:57): From my perspective, I think the big shift was actually the internet taking off and shifting out of dial-up and shifting away from direct access to resources and starting to use middleware. I remember in the beginning of it all, the first piece of software that I used was a tool called Big Brother.
Mandi Walls (06:17): Yes, okay.
Mark Hatch (06:18): Yeah, side note, us technologists love leaning on the negative for our names, right?
Mandi Walls (06:23): Definitely.
Mark Hatch (06:24): We like to co-opt it, right? Big Brother is a bad word. No, I’m going to make it a good word now. Early on, there was so much friction because you had to have a modem bank in your system. You had to directly configure all of the different users and their phone numbers and all this other good stuff. They weren’t the smartest tools. It was up down, so it wouldn’t know that, for example, this router went down, so you can ignore these 500 other alerts.
The other part of it is just the distance between you and the problem. I don’t know how many people remember this or had this experience. I presume it’s all of us that we were doing it at the time, but there are speed limits, and nine times out of 10, businesses don’t care about speed limits when you’re in the middle of an outage. I can’t tell you how many times I got one of my first pages, and I had to get in my car, and I had to drive to the data center. And the whole time you’re on the drive, you’re thinking to yourself, “All right, that’s another minute to the incident. That’s another minute. That’s another minute.” And then, you get there, and you’re trying to get through the card reader.
With that starting point, I think the really, really big inflection was when the internet really got its footing, and it stopped being a fad for us technologists. We got two things. One was the development of remote access. And then, number two was, instead of me managing the modem bank, I could ship that off to a PagerDuty where they’re making the calls. Or maybe not even PagerDuty, but I’ve used them since they existed, I believe. But the idea that I could send a message to the provider and ask them to send an SMS message or a text message as opposed to me having to send it directly, so making that call made it infinitely easier than picking up the line, doing the dial, sending the message.
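[Editor’s note: Earlier in this answer Mark describes up/down checkers that could not tell that one failed router explained 500 other alerts. A minimal sketch of dependency-aware suppression follows. It assumes a hypothetical `PARENTS` map from each host to the device it sits behind; the host names and the function name are invented for illustration and do not come from Big Brother or any other real tool.]

```python
# Hypothetical topology: each host maps to the upstream device it depends on.
PARENTS = {
    "web-01": "router-a",
    "web-02": "router-a",
    "db-01": "router-a",
    "router-a": None,
}


def actionable_alerts(down_hosts):
    """Return only the down hosts that have no failed device upstream.

    A host that is down behind a device that is also down is treated as
    a symptom rather than a cause, and is suppressed.
    """
    down = set(down_hosts)
    roots = []
    for host in sorted(down):
        parent = PARENTS.get(host)
        suppressed = False
        # Walk up the dependency chain looking for a failed ancestor.
        while parent is not None:
            if parent in down:
                suppressed = True
                break
            parent = PARENTS.get(parent)
        if not suppressed:
            roots.append(host)
    return roots
```

With all four hosts down, only `router-a` would be reported; a lone `web-01` failure would still be reported on its own.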
Mandi Walls (08:14): Yeah, you don’t have to maintain your own POTS connected, telephone connected stuff all the time. It was bad enough when you had incoming dial-up modems, but then your other stuff that was dialing out for things, a huge pain in the butt.
Mark Hatch (08:29): Yeah. And the next one on that is, I remember those systems were relentless. If you had 50 alerts, you were getting 50 alerts, and it wasn’t like you got a blast of 50 messages, and you got one alert saying, “Hey, you just got 50 messages.” It was alert, pause, alert, pause, and it kept going. And I remember there was occasions where I solved a problem, but I was still getting pages for 30 minutes afterwards, and it’s like, “Go away.” But as the person who cares about this and is passionate about it, I can’t disable the alerting. I can’t stop my trending. I need all of that stuff to be processed through. So yeah, it was very, very cumbersome at the beginning.
That said, it was so incredibly exciting. I remember talking to my brothers at the time and just telling them. I would get a message, and they’re like, “What’s that?” And I’m like, “I configured my servers to talk to me when they’re having a problem.” It felt like this huge leap, where it was just like I didn’t need a human to tell me. But at the same time, the minute you give a computer an inch, it’s going to annoy you until you, you know…
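[Editor’s note: The “alert, pause, alert, pause” behavior Mark describes is what batching is meant to avoid: 50 alerts become one notification that says 50 things happened. The sketch below is one simple way to do that, assuming alerts arrive as `(timestamp_seconds, message)` tuples and that a 60-second window is acceptable. Both are assumptions for illustration, not how any particular paging product works.]

```python
def batch_alerts(alerts, window_seconds=60):
    """Group (timestamp, message) alerts into batches.

    A new batch starts whenever an alert arrives more than
    window_seconds after the first alert of the current batch.
    """
    batches = []
    for ts, message in sorted(alerts):
        if batches and ts - batches[-1][0][0] <= window_seconds:
            batches[-1].append((ts, message))
        else:
            batches.append([(ts, message)])
    return batches


def summarize(batch):
    """Build the single notification text sent for a whole batch."""
    return f"{len(batch)} alert(s), first: {batch[0][1]}"
```

Fifty alerts inside one window would then produce a single summary line instead of fifty separate pages.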
Mandi Walls (09:44): That’s pretty true, but great. That time, it kind of felt like everything you were doing was sort of super new and that maybe there were a handful of other folks out there you were maybe talking to on Usenet or IRC or whatever, and they sort of had something maybe similar, but a lot of environments were completely and entirely unique. There wasn’t really any kind of overlap, even for folks trying to attempt the same sort of things. Everybody had so many individual components and had to structure things sometimes to fit in the closet and sometimes the limitations of the building and other weird stuff that always came into play.
Mark Hatch (10:24): Yeah, absolutely, and I’ll take that a step further. I worked at an automotive company. One of the fun things about that is, at the time, things have changed, you’re usually dealing with all of the big automotive firms, which means I wasn’t supporting an operating system. I had, at one point in time, like 90% of all Unix operating systems in one business. All of them are a little bit different. They require a little bit different attention. It’s just that complexity. Not only is every business a little unique at that time, but it was even inter-business or intra-business. There was this uniqueness that, that’s what we do. That’s what people expect of us. You know what a keyboard looks like. Can you work on every operating system out there?
Mandi Walls (11:11): I can’t even work on every keyboard out there. You’re going to meet somebody who’s using Dvorak or somebody who came in from the remote office and has a European keyboard and have no idea how to do anything.
Mark Hatch (11:22): Yeah, the user-specific tools, always a good time.
Mandi Walls (11:26): Absolutely. So how does that experience inform or influence how you work on things today?
Mark Hatch (11:33): The big influencer for me was quality of life.
Mandi Walls (11:37): Oh, nice. Yes.
Mark Hatch (11:38): I like to categorize people, and no shade in this statement. I respect all technologists, but I like to categorize your infrastructure folks in one of two buckets. You have your, I’m just going to use the term sentinel, and these are the people that love to stand guard. They’re kind of like the firefighters of our world. But then, on the other side, you have the architects and the engineers, and those folks don’t so much want to be interrupted. They want to be heads down, and they want to work on things. And again, I have great respect for all of them. I absolutely classify myself as a sentinel. There’s probably a little bit of hero complex in there, but that’s a conversation for a therapist.
Mandi Walls (12:14): Definitely.
Mark Hatch (12:15): That said, I don’t want to sacrifice my health, my wellbeing, my work-life balance, or those of my individual contributors as a director. So what it informed me on is what it meant to be healthily standing guard. There’s this idea that, I’ve been in organizations before, and it’s an interview question. How many alerts do you get on average? Every interview that I’ve been in since, I’ve done it. Even interviewing as a manager, I want to know what’s that discipline. So the false positives are a quality of life problem, overlapping detectors and that sort of thing is a quality of life problem.
I think the other part of it is I understood the value of response time. Early on, there was a lot of tolerance for problems more often than not. Over time, we became less and less tolerant of a little bit of downtime. I think nowadays we don’t even talk about our nines because it’s just everybody assumes a hundred percent, right? You’ve got PagerDuty, so that means that it’s a small aspect of your business that’s impacted, and that’s it. Right? You’re still up at all times, and to some extent it’s true.
So for me, it’s really shaped how, and it was me driving the charge for my own quality of life. And then, when I got into management, it was putting my foot down and ensuring that these were business and team objectives. No monitor sends an alert until it’s actionable or until you’ve understood it, so now we have a quiet period, where you have a monitor built. It’s not going out to wake somebody up in the middle of the night. You observe over time, and you wash away those false positives.
The other part of it is the response time being how quickly I can get to that because that impact becomes more and more relevant over time. So I want to be able to get the message out, and I also want to be able to enable people to triage quickly. So in the past, you got the alert, and then that was it. But I wanted to be able to silence the alert. I wanted people to know that it’s being worked on, that it’s getting attention, and I don’t need more phone calls. So we had to build sophistication into tooling and processes in order to eliminate the false positives, building out those quiet periods, making sure that the alerts were actually actionable, and then making sure that people are aware, yeah, we’re on it. We understand that that’s the case.
Conversely, I think one of the worst experiences for me is you’re in the thick of a problem, and everybody wants to talk to you about that problem. I need to increase your sense of urgency, but it’s like this is all I’ve been doing since I got the alert. That said, people could be forgiven for getting emotional while they’re under duress. Right? I’ve never met the owner of a business that was just cool with downtime.
So to my mind, it’s about, how do we facilitate that conversation or that status quickly? But ultimately, it all rolls up to this idea that I want to enable those sentinels to not be drones to the system so that you’re there, you’re aware, and you’re responsive, but you’re not constantly subservient and waiting.
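[Editor’s note: The quiet period Mark describes, where a new monitor is observed before it is allowed to wake anyone up, could be modeled roughly as below. The `Monitor` class, the 14-day minimum, and the 5% false-positive ceiling are all assumptions chosen for the sketch; Mark does not give specific thresholds.]

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    name: str
    days_observed: int = 0
    fired: int = 0            # times the monitor triggered while observed
    false_positives: int = 0  # triggers a human judged not actionable

    def record(self, actionable):
        """Log one trigger that a human reviewed during the quiet period."""
        self.fired += 1
        if not actionable:
            self.false_positives += 1

    def may_page(self, min_days=14, max_fp_rate=0.05):
        """True once the monitor has been observed long enough and is clean."""
        if self.days_observed < min_days:
            return False
        if self.fired == 0:
            return True  # never triggered: no false positives to wash out
        return self.false_positives / self.fired <= max_fp_rate


def route(monitor, message):
    """Send to the pager only if the monitor has passed its quiet period."""
    channel = "page" if monitor.may_page() else "log-only"
    return f"[{channel}] {monitor.name}: {message}"
```

The point of the design is that paging is a privilege a monitor earns: until then its output is only logged and reviewed during working hours.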
Mandi Walls (15:30): We talk about that a lot with folks as we’re bringing them on board to things like PagerDuty, and it can be hard. And I have definitely worked with operations folks, system administrators in the past who are very, I’ll say they’re nervous on-call engineers, where they’ll constantly have whatever monitoring system up, and that’s what they’ll be concentrating on, and they’ll be completely useless for any other kind of work. And honestly, you probably don’t want to be doing a lot of other work when you’re on call, you’re going to get interrupted.
But they find themselves really just glued to those blinking lights way too much, and that’s no way to live either. That’s just not a healthy practice. That’s why things are going to ping you on your phone. It’ll let you know when there’s something wrong. You don’t have to watch that stuff and make yourself nervous all the time. I think it’s a hard thing for certain kinds of folks to do. Like you say, they’re emotionally invested in that sort of stance, and it can be super hard on folks.
Mark Hatch (16:28): Yeah, absolutely. I think some part of that is conditioning, though. I think about the late nineties, early 2000s and NOCs, and I don’t hear many people talking about NOCs or SOCs or any of that other stuff anymore. I know that they’re still out there. I know that they still exist, but we’ve all shifted into the cloud native world, where that’s really more people that are in a physical space.
But at the time, the conditioning was, that’s the job. Because we didn’t have tooling, you had to stare at them. And sometimes, what you had is, you saw with your eyes detecting a drift in the trend and then actioning on it. And at that time, everybody didn’t have the same level of sophistication with things like historical trending and comparison, and algorithms that understand the history, and anomaly detection, and all that other good stuff. So you needed that group of people that was just sitting there staring at it, obsessing.
And I think a lot of what happens on the operational side of things is basically conditioning. So where you start, what your first job is, is going to have a considerable impact on where you end up going. I didn’t work in a NOC, and for that reason, it was always flipping back and forth. I had the on-call, I had my network management system over here because the network did it first with SNMP, so everybody just kind of leaned in on SNMP. So early on, it was all NPM, but we were looking at the operating system side and the software side. At the same time, I was doing the engineering work as well, so I would look away, but there was a lot of people that came into this industry that started staring at glass.
Mandi Walls (18:09): That’s where a lot of the jobs were at the time.
Mark Hatch (18:12): It’s a great entry point as well, the try it before you buy it concept on both sides. You work in a NOC, and you can find out what area you’re going to explore, where you’re going to get. Nobody goes into technology or computers and says, “I’m going to do that one thing.” You might be a network engineer, and you might be a system admin, you might be an operator or help desk, you might be a software developer, but you can get your foot in the door anywhere and then have an opportunity to find out where your passion is, what you’re good at. The NOC concept was really where a lot of early people, people that have been around for a long time anyways, that are old like me, it’s really where they got their footing, so it’s where that early conditioning came in. So I think it’s a result of that.
Mandi Walls (18:57): Awesome. No, that’s a good point, and like you say, that was the job. You could get into the NOC, and also in some places, into POP maintenance for dial-up ISPs. Those were the kinds of jobs you could get in high school or without going to college. You could work your junky retail job and help out the local BBS or whatever on your evenings and gave a lot of folks their early entry into what is now the internet, but was then mostly a bunch of nerds dialing into other computers, other places.
Mark Hatch (19:31): Yeah, absolutely. Absolutely. It’s been a wild ride. It’s been a fun journey. For me, I actually started really early on and got excited in sixth grade math class-
Mandi Walls (19:42): Wow.
Mark Hatch (19:42): … because I could start developing on Apple Basic.
Mandi Walls (19:46): Okay.
Mark Hatch (19:46): And I fell in love with it instantly, and obviously there was a lag between that and my first job. I like to tell people I got lucky. Not many people know in sixth grade where their career is going. And I knew at that time that it wasn’t software development that I was passionate about. That’s for creatives. I’m not creative. Like I said, I’m the sentinel. I’m going to stand watch. I’m going to understand the nuances.
The thing that kept me there rather than wanting to look elsewhere was my penchant for after the alert comes in. After the alert comes in, the real work begins, and that is what’s going on, and getting to root cause analysis and that sort of thing. That’s one of the things that I love the most about this job. We know everything about the origin of any piece of technology because it all started with Ada way back when she envisioned binary Boolean logic. And from that jump point, literally everything was a human creation and idea. So you can more often than not pinpoint where one person 30 years ago made a decision, where it’s like, “Oh, that’s why we’re still talking about HTTP.”
So yeah, to that end, for me, it was getting to that root cause analysis. It was fundamentally understanding the best times I ever had at the job, and nobody is going to agree with me on this. I’m unique, and I’m okay with that. It was when it was a really nasty problem, and there was 14 different systems involved. At this time, when it got to that level of complexity for me, I had spent 18 years on the systems administration side, and then I was on maybe my eighth year of the network engineering side. So I had played with all the operating systems, all the hardware on the system side, and now I’ve got access to the firewalls, the load balancers, the switches.
Mandi Walls (21:37): All the fun stuff.
Mark Hatch (21:38): So a problem comes in, and there is no part of the stack that I can’t jump into and have an understanding of. To that end, I was able to jump around fast enough that I personally would reduce the MTTR. I love that. Like I said before, I like being that sentinel. So I personally was able to reduce that MTTR by having that profound understanding of everything, and that was really solidified.
I know a lot of people that they want to elevate into the architecture, or they want to go into software development. I love infrastructure. I love infrastructure. I’ll always love infrastructure. I don’t say this to CEOs, but I say this to peers. I love problems because they’re challenging. Every one of them hopefully is unique and requires a little bit of a different thinking, but you’re under the gun. Time is of the essence, and when it gets into a challenge, that’s when I hunker down and have a great time.
So there’s a lot of personality that goes into it all. It’s not for everybody, and I’ve met people where this isn’t for them. The page goes off, and they get stressed out, or they need a break in between, and I respect that. Some of us are just better heads down doing engineering, building for the future, and others of us are just good standing guard and making sure that the customer is happy and the business is working on new features and not fighting fires.
Mandi Walls (23:03): Yeah, definitely. So you mentioned the future. What are you looking forward to in this space in the future?
Mark Hatch (23:09): We’re seeing a lot with respect to machine learning, where we can do better by way of anomaly detection. PagerDuty’s grouping feature, in my opinion, is still in its infancy. I envision a time where the tools can just flood the escalation program, the PagerDuty, with information, and it can say, “Oh, we’ve all seen this before. This is the alert that you need to see.”
One very, very underutilized feature in this space is a reactive action, which is to say, I just got an alert, and I want to take an automated action to correct it. The technology is there, but maybe we’re skittish about it. Maybe we don’t have the opportunity, but auto healing, right? Do you remember the IBM commercials from way back, magic server pixie dust?
Mandi Walls (23:58): Oh my God, yeah.
Mark Hatch (24:00): I never liked the commercials, but I was always interested in getting some of that dust.
Mandi Walls (24:04): Yeah, right. It was such a big thing for a while. I want to say it was like 2006 maybe, just before the cloud really got going, that there was a lot of this self-healing stuff. There was a bunch of research going on, and there was just a lot of energy around that, and we just never really got there. And like you say, I feel like we’re finally on the cusp of having enough experience with the right kind of automation that you can trust and being smart about how to deploy it that, yeah, I’m hoping that we get to see more of that.
Mark Hatch (24:37): Yeah, absolutely.
Mandi Walls (24:38): Awesome. So one of the other final questions we like to ask folks on the show is, is there a myth or a common misconception about operations or monitoring or alerting or any of that kind of stuff that you hear too often that you’d like to debunk for our listeners?
Mark Hatch (24:56): Yeah, I think so. I’m hoping I can get a little bit of leniency in this respect because it’s more…
Mandi Walls (25:01): Okay.
Mark Hatch (25:01): It’s more tangentially related, but the idea that any production change at all is easy.
Mandi Walls (25:08): Oh, okay.
Mark Hatch (25:09): So I can’t tell you how many times we’re talking about something, we’re talking about a fix or a change that needs to happen. Somebody on the call says, “Oh, that’ll be easy.” That always makes my ears perk up because I think that’s the wrong language. I think it puts us in this relaxed posture, where it’s like, oh, set it and forget it. When the truth of the matter is, Murphy’s Law suggests that you didn’t take something into account. So something as simple as a DNS change…
Mandi Walls (25:38): Something simple about DNS changes, they’re just not.
Mark Hatch (25:42): Exactly. There’s so much to it. And is it a CNAME? And what’s the origin record? And did you get the typo right? Do you need a trailing period on this software, or do you not need the trailing period?
Mandi Walls (25:53): Did you increment the thing, the counter at the top of the file?
Mark Hatch (25:55): Yeah, yeah, yeah. Did you get your formatting right? All of it. So something as seemingly simple as a DNS change, I’ve seen them go incredibly awry, where you make a DNS change, and at the other end of it, you’re doing your root cause analysis, and you’re like, “Well, I assumed.” When we say that something is going to be easy, we trick ourselves into thinking that our assumptions are sufficient. So from an operations perspective, I would like to convey this idea that, while it might be easy, a better way of phrasing it would be that this shouldn’t be difficult. It shouldn’t be problematic or something like that, but kind of flip it on its head to acknowledge the follow-up that needs to happen.
I think the worst thing that you can do in production is set it and forget it. You do a code change, you do an infrastructural change. It’s important to have established this idea that I’m going to sit, and I’m going to watch that and make sure that there’s not something that I didn’t take into account along the way.
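[Editor’s note: Two of the DNS pitfalls raised above, the counter at the top of the file and the trailing period, can be checked mechanically before a change ships. The sketch below assumes BIND-style zone-file semantics, where a target without a trailing dot is relative to the zone origin. As Mark says, whether the dot is needed depends on the software, so treat that rule as an assumption. The `(name, rtype, value)` record shape is also made up for the example and is not the output of any real parser.]

```python
def check_zone_change(old_serial, new_serial, records):
    """Return a list of human-readable problems; empty means none found.

    records is a list of (name, rtype, value) tuples.
    """
    problems = []
    # Secondaries decide whether to refresh a zone from its SOA serial.
    # Plain integer comparison is used here; serial wraparound is ignored.
    if new_serial <= old_serial:
        problems.append(
            f"serial not incremented: {old_serial} -> {new_serial}"
        )
    for name, rtype, value in records:
        # A target without a trailing dot gets the zone origin appended.
        if rtype in ("CNAME", "MX", "NS") and not value.endswith("."):
            problems.append(
                f"{name} {rtype} target lacks trailing dot: {value}"
            )
    return problems
```

A check like this only catches the assumptions someone thought to encode. It does not replace watching the change after it goes out.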
Mandi Walls (26:58): Absolutely. We all have horror stories about making a change, and then it feels all right for a couple of minutes. And you go to lunch, and then everybody’s trying to get ahold of you because something’s not right.
Mark Hatch (27:07): Yeah, yeah, exactly. The best, you’ve got the food in your mouth, and all of a sudden, it’s an emergency.
Mandi Walls (27:13): Yes.
Mark Hatch (27:14): Hopefully you brought your laptop because nowadays, you can remote in.
Mandi Walls (27:17): Absolutely, 100%. This has been great, Mark. Is there anything you’d like to recommend for folks or any pieces of well-earned advice that you can give to our audience?
Mark Hatch (27:30): I think everybody’s got some advice, so I’m not going to offer any. Trust yourself. Trust your intuition. I think far too often people don’t trust their gut. For that reason, they slip. They had a feeling, and they let it go, and then it slipped, and then all of a sudden, they were paying the price in the future. So I guess I just lied. My advice is trust your gut, follow your intuition. We don’t do that enough.
Mandi Walls (27:51): Awesome. That’s excellent advice, 100%. Your intuition is the culmination of all the things you have learned into one big glob of things that you know and feel, so that’s great.
Mark Hatch (28:01): That is a fact.
Mandi Walls (28:02): Awesome. Well, Mark, thank you so much for being on the show. This has been great.
Mark Hatch (28:06): Thanks for having me, Mandi. I appreciate it.
Mandi Walls (28:08): All right. That’s it for this episode. We’ll wish everyone an uneventful day.
That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter at Page It to the Limit using the number two. Thank you so much for joining us, and remember, uneventful days are beautiful days.
Mark Hatch is a Senior Director of Engineering at Recharge. He has spent his career building and supporting production Infrastructure with a focus on procedural, data-driven Operations driven by transparency in Engineering. His resume includes roles across a wide set of industries including eCommerce, Automotive, Financial and Entertainment where he was able to develop many different points-of-view on the singular subject that is supporting software in production. In his free time he can usually be found outside, exploring or trying to fight back nature, or inside, playing games of all kinds with his family.
Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.