Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting their system. I am your host, Mandi Walls. Find me @lnxchk on Twitter.
Mandi Walls: All right, welcome. This week on Page It to the Limit, we have a crossover episode. If you haven’t heard part one of this episode featuring Kolton Andrus and Alex Solomon, you’ll need to find it. It’s on the Break Things on Purpose podcast from our friends at Gremlin. So, you’ll find that at gremlin.com/podcast. You can listen to that episode and then come back and listen to our episode this week as we join the conversation in progress.
Julie Gunderson: There’s a lot of crossover—just like this episode, folks—between both Gremlin and PagerDuty as to how they help organizations be better. And again, going back to building a better internet. I mean, Alex your shirt—which our viewers—or our listeners—can’t see, says, “The world is always on. Let’s keep it this way,” and Kolton, you talk about reliability being no accident. And so when we talk about the foundations of both of these organizations, it’s about helping engineers be better and make better products. And I’m really excited to learn a little bit more about where you think the future of that can go.
Kolton Andrus: You hit it though. Like, the key to me is I’m an engineer by trade. I felt this pain, I saw value in the solution. I love to joke, I’m a lazy engineer. I don’t like getting woken up in the middle of the night, I’d like my system to just work well, but if I can go save some other people that pain, if I can go help them to more quickly understand, or ramp, or have a better on-call life, have a better work-life balance, that’s something we can do that helps the broader market.
And we do that, as you mentioned, in service of a more reliable internet. The world we live in is online, undoubtedly, after the last couple of years, and it’s only going to be more so. And people’s expectations, if you’re an older person like me, you know, maybe you remember downloading AOL for a couple of hours, or when a web page took a minute to load; people’s expectations are much different now. And that’s why the reliability, the performance, making sure things work when we need them to is critical.
Alex Solomon: Absolutely. And I think there’s also a trend that I see and that we’re part of around automation. And automation is a very broad thing, there’s lots of ways that you want to automate manual things, including CI/CD and automated testing and things like that, but I also think about automation in the incident context, like when you have an alert that fires off or you have an incident, you have something like that, can you automate the solution or actually even prevent that alert from going off in the first place by creating a set of little robots that are kind of floating around your system and keeping things running and running well and running reliably? So, I think that’s an exciting trend for us.
Mandi Walls: Oh, definitely on board with automating all the things for sure. So, of the things that you’ve learned, what’s one thing that you wish you had maybe learned earlier? Or if there was like a gem or a nugget for folks that might be thinking about starting their own company around developer tools or this kind of software, is there anything that you can share with them?
Alex Solomon: Kolton, you want to go first?
Kolton Andrus: Sure, I’ll go first. I was thinking a little bit about this. If I went back—we’ve only been at it about six years, so Alex has the ten-year version. I can give you the five, six-year version. You know, I think coming into it as a technical founder, you have a lot of thoughts about how the world works that you learn are incorrect or incomplete.
It’s easy as an engineer to think that sales is this dirty organization that’s only focused on money, and that’s just not true or fair. They do a lot of hard work. Getting people to do the right thing is tough. Helping with support, with customer success.
Even marketing. Marketing is, you know, to many engineers, not what they would spend their time doing, and yet marketing has really changed in the last 20 years. And so much of marketing now is about sharing information and teaching what we’ve learned as opposed to this old approach of you know, whatever you watched on TV as a kid. So, I think understanding the broader business is important. Understanding the value you’re providing to customers, understanding the relationships you build with those customers and the community as a whole, those are pieces that might be easy to gloss over as an engineer.
Alex Solomon: Yeah, and to echo that, I like your point on sales because initially when I first started PagerDuty, I didn’t believe in sales. I thought we wouldn’t need to hire any salespeople. Like, we sell to other engineers, and if they’re anything like me, they don’t want to talk to a salesperson. They want to go on the website, look around, learn, maybe try it out—we had a free trial; we still have a free trial—and put in a credit card and off to the races. And that’s what we did at first, but then it turns out that when doing so, and acquiring customers that way, there are folks who want to talk to you to make sure that, first of all, you’re a real business, you’re going to be around for a while and it’s not—you know, you’re not going to not be around tomorrow.
And that builds trust, being able to talk to someone, to understand, if you have questions, you have someone to ask, and creating that human connection. And I found myself doing that function, like, myself and then realized, there’s not enough time in the day to do this, so I need to hire some folks. And I changed my mind about sales and hired our first two salespeople about two-and-a-half years into PagerDuty. And probably got a little bit lucky because they’re technical engineering background type folks who then went into sales, so they ended up being rockstars. And we instantly saw an increase in revenue with that.
And then maybe another more tactical piece of advice is that you can’t focus on culture too early when starting a company. And so one lesson that we learned the hard way is we hired an engineer that was brilliant, and really smart, but not the best culture fit in terms of, like, working well with others and creating that harmonious team dynamic with their peers. That ended up being an issue. And basically, the takeaway there is don’t hire brilliant but asshole folks because it’s just going to cause a lot of pain, and they’re not going to work out even though they’re really smart, and that’s kind of the reason why you keep them around because you think, well, it’s so hard to hire folks. You can’t let this person go because what are we going to do? But you do have to do it because it’s going to blow up anyways, and it’s going to be worse in the long run.
Kolton Andrus: Yeah, hiring and recruiting have their own set of challenges associated with them. And similar to hiring the brilliant jerk, some of the folks that you hire early on aren’t going to be the folks that you have at the end. And that one’s always tough. These are your friends, these are people you work closely with, and as the company grows, and as things change, people’s roles change, and sometimes people choose to leave and that breaks your heart because you’ve invested a lot of time and effort into that relationship. Sometimes you have to break their heart and tell them it’s not the right fit, or things change.
And that’s one that if you’re a founder or you’re part of that early team, you’re going to feel a little bit more than everyone else. I don’t think anything you read on the internet can prepare you for some of those difficult conversations you have to have. And it’s great if everything goes well, and everyone grows at the same rate, everyone can be promoted, and you can have the same team at the end, but that’s not really how things play out in reality.
Julie Gunderson: It’s interesting that we’re talking about culture, as we heard about last week, on the Break Things on Purpose episode, where we also talked about culture and how organizations struggle with the culture shift with adopting new technologies, new ways of working, new tools. And so what I’m hearing from you is focusing on that when hiring and founding your company is important. We also heard about how that’s important with changing the way that we work. So, if you could give an advice to maybe a very established—if you are going to give a piece of advice to Amazon—maybe not Amazon, but an established company—on how to overcome some of those objections to culture change, those fears of adopting new technology. I know people are still afraid of holding a pager and being on call, and I know other people are afraid of chaos as we talk about it and those fears that you’ve mentioned before, Kolton. What would your piece of advice be?
Alex Solomon: Yeah, good—great question. This will probably echo what I’ve said earlier, which is when looking to transform, transform culture especially, and people and process, the way I think about is try to not boil the ocean and start small, and get some early wins. And learn what good looks like. I think that’s really important. It’s this concept of show, don’t tell.
Like, if you want to, you know, you want to change something, you start at the grassroots level, you start small, you start maybe with one or two teams, you try it out, maybe something like I mentioned before, in a greenfield context where you’re doing something brand new and you’re not shackled by legacy systems or anything like that, then you can build something new or that new system using the new technologies that we’re talking about here, whether it’s public cloud, whether it’s containerization and Kubernetes, or whatnot, or serverless, potentially. And as you build it and you learn how to build it and how to operate it, you share those learnings and you start evangelizing within the company.
And that goes to what I was saying with the show, don’t tell, where you’re like showing, “Here’s what we did and here’s what we learned. And not everything went swimmingly and here are things that didn’t go so well, and maybe what’s our next step beyond this? Do other folks want to opt-in to this kind of new thing that we’re doing?” And I’m sure that’s a good way to get others excited. And if you’re thinking about longer-term, like, how do you transform the entire company, well, this is a good way to start: start small, you learn how to do it, you learn about what good looks like, you get others excited about it, others opt-in, and then at some point through that journey, you start mandating it top-down as well because grassroots is only going to take you so far. And then that’s where you start putting together project plans around, like, how do we get other teams to do it, on a timeline? And when are they going to do it? And how are they going to do it? And then bring everyone along for the journey as well.
Kolton Andrus: You’re making this easy for me. I’ll just keep agreeing with you. You hit all the points. Yeah, I mean, on one hand, the engineer in me says, you know, a lot of times when we’re talking about this transformation, it’s not easy, but it’s worth it. There’s a need that we’re trying to solve, there’s a problem we’re trying to solve.
And in the end, what that becomes is a competitive advantage. The thought that came to mind as Alex was speaking is you need that bottoms-up buy-in; you also need that top-down support. And as engineers, we don’t often think about the business impact of what we do. There’s an important element and a message I like to reiterate for all the engineers: think about how the business would value the work you do. Think about how you would quantify the value of the work you do to the business because that’s going to help that upper level that isn’t, in the day-to-day, feeling the pain, understand that what we’re doing is important, and it’s important for the organization.
I think about this a little bit like remote-by-default work. So, when we founded Gremlin, we decided you know, we didn’t want offices. And six years ago that was a little bit exceptional. Folks were still fundamentally working in an office environment. I’m not here to tell you that remote-by-default is easy, works for everyone, or is the answer.
Actually, what we found is you need a little bit of both. You need to be able to have good tooling so folks can be efficient and effective in their work, but it’s still important to get folks together in person. And magic happens when you get a group of folks in a room and let them brainstorm and collaborate, chat on the way to lunch or on the way to dinner. But I think that’s a good example where we’ve learned over the last couple of years that the old way of doing it was not as effective as it could be. That maybe we don’t need to swing the pendulum entirely the other way, but there’s merit in looking at what the right balance is.
And I think that applies to, you know, incident management, to SRE, to Chaos Engineering. You know, maybe we don’t have to go entirely on the other end of the spectrum for everyone, but are there little—you know, is there an 80/20 solution that gets us a lot of value, that saves a lot of time, that makes us more efficient and effective, without having to rewrite everything from scratch?
Alex Solomon: Yeah, I like that a lot. And I think part of it, just to add to that, is make it easy for people to adopt it, too. Like, if you can automate it for folks, “Hey, here’s a Terraform thing where you could just hit a button and it does it for you, here’s some training around how to leverage it, and here’s the easy button for you to adopt.” I think that goes with the technology of adopting, but also the training, also the, you know, how-tos and learnings. That way, it’s not going to be, like, a big painful thing, you can plan for it. And yeah, it’s off to the races from there.
Kolton Andrus: I think that’s prudent product advice, as well. Make it easy for people to do the right thing. And I’m sure it’s tricky in your space; it’s really tricky in our space. We’re going out and we’re causing failure, and there’s inadvertent side effects, and you need to understand what’s happening. It’s a little scary, but that’s where we add a lot of value.
We invest a lot of time and effort in how do we make it easy to understand, easy to understand what to expect, and easy to go do and see what happens and see that value? And it sounds easy. You know, “Hey, just make it easy. Just make it simple,” but actually, as we know, it takes so much more effort and work to get it to be that level of simplicity.
Alex Solomon: Yeah, making something easy is very, very hard—
Kolton Andrus: [laugh].
Julie Gunderson: —ironically.
Kolton Andrus: Yeah. Ironically.
Mandi Walls: Yeah, so what are you excited for the future? What’s on your horizon that maybe you can share with us that isn’t too, like, top-secret or anything? Or even stuff, maybe, not related to your companies? Like, what are you seeing in the industry that really has you motivated and excited?
Alex Solomon: Great question. I think a couple of things come to mind. I already mentioned automation, and we are in the automation space in a couple of different ways, in that we acquired a company called Rundeck over a year ago now, which does runbook automation and just automation in general around something like running a script across a variety of resources. And in the incident context, if an alert fires or an incident fires, it’s that self-healing aspect where you can actually resolve the issue without bothering a human.
There’s two modes to this automation: There’s the kind of full self-healing mode where, you know, something happens and the script just fixes it. And then the second mode is a human is involved, they get paged, and they have a toolbox of things that they can do, that they can easily do. We call that the Iron Man mode, where you’re getting, like, these buttons you can push to actually resolve the problem, but in that case, it’s a type of problem that does require a person to look at it and realize, oh, we should take this action to fix it. So, I’m very excited about the automation and continuing down that path.
And then the other thing that really excites me as well is being able to apply AI and ML to the alerting and incident response and incident management space. Especially around pattern detection, looking for patterns in alerts and incidents, and seeing have we seen this kind of problem before? If so, what happened last time? Who worked on it last time? How did they resolve it last time?
Because, you know, you don’t want to solve the same problems over and over. And that actually ties into automation really nicely as well. That pattern detection, it’s around reducing noise, like, these alerts are not real alerts, they’re false alerts, so let’s reduce them automatically, let’s suppress them, let’s filter them out automatically because the signal to noise is really important. And it’s that pattern detection, so if something major is happening, you can see here’s the blast radius, here’s the services or systems it’s impacting. Oh, we’ve seen something similar before—or we haven’t seen something similar before, it’s something totally brand new—and try to get the right folks involved quickly so that they can understand that blast radius and know how to approach the problem, and resolve it quickly.
Kolton Andrus: So, it’s not NFTs as your PagerDuty profile picture?
Alex Solomon: [laugh].
Kolton Andrus: Because that’s, kind of, what I—no, I’m kidding. I couldn’t help but just like what do I not see—like, I’ve, I’ve tried to think of the best NFT joke I could. That was what I came up with. I agree on the AI/ML stuff. That opportunity to have more data and to be able to do better analysis of it, I’ve written some of that, you know, anomaly detection stuff—and it was a while back; I’m sure it could be done better—that’ll get us to a point.
You know, of course, I’m here to push on the proactive. There’s things we can do beyond just reacting faster that will be helpful. But I think part of that comes from people being comfortable sharing more about their failures. It’s a stigma to fail today, and regardless of whether we’re talking about a world where we’re citing things like blameless postmortems, people still don’t want to talk about their failures, and it’s hard to get that good outage information, it’s hard to get the kind of detail that would let us do better analytics, better automation.
And again, back to the conversation, you know, maybe we know what Amazon and Netflix look like, but for us to create something that will help solve a broader problem, we have to know what those companies are feeling in pain; we need to know what their troubles are hitting at. So, I think that’s one thing I’ve been excited about is over the past two years, you’ve seen the focus on reliable, stable systems be much more important. Five years ago, it was, “Get out of my way, I got features to write, we got money to make, we’re not interested in that. If it breaks, we’ll fix it.” And you know, as we’re looking at the future, we’re looking at our bridges, we’re looking at our infrastructure, our transportation, the software we’re writing is going to be critical to the world, and it operating correctly and reliably is going to be critical. And I think what we’ll see is the market and customers are going to catch up to that; that tolerance for failure is going to go down and that willingness to invest in preventing failure is going to go up.
Alex Solomon: Yeah, I totally agree with that. One thing I would add is, I think it’s human nature that people don’t want to talk about failures. And this is maybe not going to go away, but there is maybe a middle ground there. I mean, talking about postmortems, especially, like, when a big company has a big outage and it makes the news, it makes Hacker News, et cetera, et cetera, I don’t see that changing, in that companies are going to become radically more transparent, but where I do think there is a middle ground is for your large customers, for your important customers, creating relationships with them and having more transparency in those cases. Maybe you don’t post a full, detailed, nitty-gritty postmortem on a public status page, but what you do do is you talk to your major customers, your important customers, and you give them that deeper view into your systems.
And what’s good about that is that it creates trust, it helps establish and maintain trust when you’re more transparent about problems, especially when you’re taking steps to fix them. And that piece is really important. I mean trust is, like, at the core of what we do. I have a saying about this—[unintelligible 00:19:31]—but, “Trust is won in droplets and lost in buckets.” So, if you have these outages all the time, or you have major service degradation, it’s easy to lose that trust. So, you want to prevent those, you want to catch them early, you want to create that transparency with your major customers, and you want to let them in the loop on what’s happening and how you’re preventing these types of issues going forward.
Kolton Andrus: Yeah, great thoughts. Totally agree.
Julie Gunderson: So, for this episode of deep thoughts with Kolton and Alex, [laugh] I want to thank both of you for being here with Mandi and I today. We’re really excited to hear more and to see each of our respective companies grow and change the way people work and make life easier, not just for engineers, but for our customers and everybody that depends on us.
Mandi Walls: Yeah, absolutely. I think it’s good for folks out there to know, you’re not alone. We’re all learning this stuff together. And some folks are a little further down the path, and we’re here to help you learn.
Kolton Andrus: Totally. Totally, it’s an opportunity for us to share. Those that are further along can share what they’ve learned; those that are new or have some great ideas and suggestions and enthusiasm, and by working together, we all benefit. This is the two plus two equals five, where, by getting together and sharing what we’ve learned and figuring out the best way, no one of us is going to be able to do it, but as a group, we can do it better.
Alex Solomon: Yeah. Totally agree. That’s a great closing thought.
Mandi Walls: Well, thanks, folks. Thank you for joining us for another episode of Page It to the Limit. We’re wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit, using the number two. That’s @pageit2thelimit. Let us know what you think of the show. Thank you so much for joining us, and remember uneventful days are beautiful days.
Time for a crossover! Today Page It to the Limit host Mandi Walls, DevOps Advocate at PagerDuty, and Julie Gunderson, Reliability Advocate at Gremlin, host a special two-parter. They are interviewing Kolton Andrus, co-founder of Gremlin, and Alex Solomon, co-founder of PagerDuty. Each of them shares the origins of his company, both of which grew out of their work at larger organizations. Kolton and Alex reflect on how they identified the space where they could build their respective companies and on the shift from larger entities to startups. Each of them offers up some excellent insight!
Kolton is the founder and CTO of Gremlin. Previously he was CEO, building the initial product, team, and company to $10M in ARR. Before founding Gremlin, Kolton was an engineer and leader at both Amazon and Netflix, building their internal Chaos Engineering platforms and guiding the implementation of the practice leading to another ‘nine’ of uptime for both companies. Early in his career, Kolton worked for several start-ups and holds a BS and MS in Computer Science.
Alex Solomon is the Co-founder and General Manager of PagerDuty. He advocates for growing the community of PagerDuty practitioners by sharing best practices that advance real-time operations. Alex started PagerDuty in 2009 as founding CEO, leading the company through several stages of growth, including product-market fit, fundraising, and expansion of the product vision to what is now the PagerDuty Operations Cloud. He has served on the PagerDuty board of directors since 2010.
Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.
Julie Gunderson is a Sr. Reliability Advocate at Gremlin, where she works to further the understanding of reliability and the adoption of Chaos Engineering principles and methodologies. Julie is actively involved in the DevOps, Reliability, and Chaos Engineering space and is passionate about helping individuals, teams, and organizations understand how to leverage best practices and develop amazing cultures. Julie is also a founding member of DevOpsDays Boise. When Julie isn’t working, she is most likely making jewelry out of circuit boards, or traipsing around the mountains in Idaho.