Julie Gunderson: Welcome to Page It To The Limit, a podcast where we explore what it takes to run software in production successfully. We will cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Julie Gunderson, @Julie_Gund on Twitter. Today, we are talking with Corey Quinn, who was recently named by Analytica, the Number One Cloud Influencer for 2019. Corey, you love to needle PagerDuty and so it makes sense to have you on our podcast today. I’d love to hear from you, what would you like to talk about?
Corey Quinn: Great question. So Analytica wound up running some algorithm that populated out the top 100 influencers across a wide variety of different areas, and they picked cloud as one of those focus areas and I came out as number one. Since that happened I’ve been completely insufferable. No matter who I’m talking to, it doesn’t matter if I’m ordering toast, that’s the sort of thing that I will work directly into the conversation. The fun part about that is it gives me complete honor and credibility in a lot of weird spaces, which is nuts because honor and credibility has really been my stock-in-trade but now people are actually starting to go along with my mass delusion. So, yeah, you’re going to have to suffer through my slings and arrows a little bit longer, internet, because the worst thing you’ve done is given me positive reinforcement on my nonsense. Now, you’re right, I have been a fan of PagerDuty for a long time, largely because I remember a time of working on systems before PagerDuty existed where, oh, it’s time to go ahead and configure Nagios escalations by hand. It was the original Call of Duty for lack of a better term, because 2:00 in the morning something’s gone off, oh, it wasn’t actually something that you cared about but you forgot to put a silence notification in. Getting anything like a rotation established was brutal and it was more or less straining raw sewage with your teeth, which I generally tend to reserve for integrating it with SNMP but I digress. So when PagerDuty came out it was really a breath of fresh air because it was a centralized component in an alerting and monitoring toolchain. And the single thing that PagerDuty did spectacularly was it woke people up at the right time, the right people at the right time when something is melting. And that was awesome and I would have liked it a lot more except that as I went through my career I wound up in an actively abusive on-call situation.
It went from a rotation of four people to two when one person transferred out and my nimrod of a boss decided that, well, he was always on call so he was not going to go ahead and participate in the rotation. So, suddenly 50% of my life was on call and the app liked to break a lot. This wound up causing a very adversarial relationship between me and being on call for the rest of my career, to the point where now I fix the horrifying AWS bill as the core mission of the company I built. And people say why did I focus on that? Everyone assumes the answer was going to be, “Oh, I had this grand vision for how billing could be different.” Well, sure, that is a good story but here in reality the answer to that is I wanted a problem that was strictly business hours, the end.
Julie Gunderson: So now that you’re in business hours only and you’re focusing on those business outcomes, how do you see it tied together? Do you see a tie together?
Corey Quinn: Well, I see that business winds up being one of those things that encompasses a whole lot of different disciplines. Engineers like to call “the business” sort of the catch-all term for anything that is not explicitly engineering. Sometimes it’s marketing, sometimes it’s product, sometimes it’s strategy, sometimes it’s compliance, sometimes it’s finance; it’s just that finance is never brought into the conversation until it’s far, far, far too late. But one of the interesting pieces that we see around all those different business functions is that in most companies, engineering is usually the only group that needs to be awake at 2:00 in the morning in order to deal with an emergent event. I mean, sure, there are exceptions to this when there’s a crisis specifically in the arena that a company operates within, or there’s a giant business event that needs to be addressed, or you’re doing a midnight launch of something. Great, but by and large, the only people in most companies that get woken up unexpectedly in the middle of the night have been engineering. One of the things I’m starting to wonder is, is this the right path forward? I would make the argument that if it is that critical for a business to have its systems running 24/7, then it is worth staffing accordingly to the point where you can have people fielding some of those issues in either a follow-the-sun rotation or in shifts. Well, that gets super expensive. Well, yes, so does burning people out who are waking up at the drop of a hat to go and fix something where they’re in many cases not even empowered to fix the thing that woke them up in the event of a false positive. There’s a lot of crappy culture out here in the world of alerting, monitoring, and waking people up. And I’m starting to see a change in that but change comes slowly to larger organizations.
Julie Gunderson: I love that. And take us back a little because you left a job because of a horrible on-call rotation schedule, right? Two people, with an app that was constantly breaking. Aside from the advice of don’t do that, what advice would you give to companies that are trying to do this model because that cost of replacing those engineers, the cost of human sanity is extremely high. So how do you break those cycles?
Corey Quinn: Every time you’re waking someone up you can contextualize it as pushing someone one step closer to quitting. So the question is, when you take a look at the past month or so, every time someone got paged out of hours, which of those were truly business-critical and which ones were just noise? Start turning things off as far as alerts go with reckless abandon. If something’s broken or suboptimal, great. Unless it’s directly revenue-impacting it can probably wait until morning or the next business day. That is a sometimes challenging argument to make to various stakeholders. But the response to that becomes pretty easy: “Okay, I see where you’re coming from. How about you go ahead and carry the pager,” and suddenly people’s philosophy changes. I’m a big believer in that the only thing that teaches people how to be humane in the circumstance is pain, and the only way to get them to experience the pain is to share it.
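The review Corey describes, going back over a month of out-of-hours pages and separating business-critical ones from noise, can be sketched roughly as follows. This is only an illustration: the `Page` record, the `revenue_impacting` label (assumed to be filled in by hand during the review), the 9-to-5 weekday window, and the sample alert names are all hypothetical and do not come from PagerDuty or from the episode.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    alert: str                # hypothetical alert name
    fired_at: datetime        # when the page went out (naive local time)
    revenue_impacting: bool   # hypothetical label assigned during the review


def is_out_of_hours(ts, start_hour=9, end_hour=17):
    """True on weekends, or outside start_hour..end_hour on weekdays."""
    return ts.weekday() >= 5 or not (start_hour <= ts.hour < end_hour)


def noisy_alerts(pages):
    """Count out-of-hours pages that were not revenue-impacting, per alert.

    Returns (alert, count) pairs, noisiest first: candidates for turning off
    or deferring to the next business day.
    """
    counts = {}
    for page in pages:
        if is_out_of_hours(page.fired_at) and not page.revenue_impacting:
            counts[page.alert] = counts.get(page.alert, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample data for one month of pages.
pages = [
    Page("disk-80-percent", datetime(2020, 1, 7, 2, 15), False),
    Page("disk-80-percent", datetime(2020, 1, 9, 3, 40), False),
    Page("checkout-errors", datetime(2020, 1, 11, 1, 5), True),
    Page("cert-expiry-30d", datetime(2020, 1, 14, 11, 0), False),
]
print(noisy_alerts(pages))
```

With this sample data only the disk alert shows up as noise: the checkout page was revenue-impacting, and the certificate warning fired during business hours.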
Julie Gunderson: That’s fair enough, that’s great advice. Why don’t you debunk a myth for us? And Corey, you can pick any myth. If you want to keep talking about on-call, that’s fantastic. If you want to talk about what AWS bills look like, that’s fantastic. What would you like to debunk for us today?
Corey Quinn: Sure. But let’s look back at the myth that I guess is the central conceit of AWS Billing. People talk about how the bill is too expensive, they need to cut the spend that they’re putting on AWS. And people immediately jump in with a suite of services and tools and offerings that wind up addressing that, but that’s not usually the real pain. The pain as articulated by the business, now you’ve got me doing it too, is that someone wound up receiving a bill from Amazon that was 20% higher this month than last month’s. And okay, great, their first response is how many books is engineering buying, because contextualizing what the Amazon bill is for someone who doesn’t work with technology is not exactly a straight shot. Then you’re playing a game of corporate telephone. And by the time it gets to the person five communication cycles later, who has spun up a cluster that’s now costing an arm and a leg, what they hear is, “You’re spending too much money, stop doing it.” The question that’s actually being asked in most of these cases is, okay, so the bill is now 20% higher, is that a one-off? Is that the new normal? Which cost center does that get allocated against? Is that going to impact our 18-month projections? How do we predict and plan for that sort of event? Now, to tie this back to the previous conversation, some people have the bright genius idea of, “I know, we’re going to tie the AWS Billing system into PagerDuty.” Don’t do that, never do that.
Julie Gunderson: Please go on, tell us more.
Corey Quinn: Well, the billing system operates on, as best we can tell, an eight-hour eventual consistency model. So “holy crap, the bill spiked eight hours ago” is the last thing you want to hear at two o’clock in the morning. There’s a story about what that looks like and how that can be addressed. And then you’re also going to run into all of the enormous scary things where the first of the month hits and suddenly all the fixed fee stuff that you have, be it partial reservation payments or a payment upfront for Shield Advanced, winds up suddenly causing all the alarms to go off because it assumes that, “Wow, you spent a giant pile of money on the first of the month, you’re probably going to spend that every day of the month,” so your prediction is just Looney Tunes-high, so that’s part of it. Secondly, the bill is always a trailing function. Being able to figure out what’s driving that cost is going to be something that takes more analysis than you’re going to get by waking someone up. And it’s in-depth analysis, it’s finicky analysis, and it’s not analysis best done at three o’clock in the morning when you have someone angrily staring at the billing database.
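The first-of-the-month effect Corey describes is easy to see with a little arithmetic. The sketch below is a guess at the kind of naive straight-line extrapolation he is describing, not how AWS itself forecasts; the function names and dollar figures are made up for illustration.

```python
def naive_month_forecast(spend_to_date, day_of_month, days_in_month):
    """Assume every remaining day costs the same as the average day so far."""
    return spend_to_date / day_of_month * days_in_month


def fee_aware_forecast(spend_to_date, fixed_fees, day_of_month, days_in_month):
    """Count fixed monthly fees once and extrapolate only the usage portion."""
    usage_to_date = spend_to_date - fixed_fees
    return fixed_fees + usage_to_date / day_of_month * days_in_month


# Hypothetical day 1 of a 30-day month: $3,000 of fixed fees land at once,
# plus $100 of ordinary usage.
day_one_spend = 3000.0 + 100.0

print(naive_month_forecast(day_one_spend, 1, 30))        # 93000.0
print(fee_aware_forecast(day_one_spend, 3000.0, 1, 30))  # 6000.0
```

An alarm threshold that looks sensible against the second number fires immediately against the first, even though nothing unusual has happened.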
Julie Gunderson: Very fair, very fair. What is your advice then? I mean, I have heard about people alerting on their bills and like you said, it doesn’t make any sense to wake somebody up at three o’clock in the morning. What do you recommend?
Corey Quinn: I would be fine with that going to a Slack channel, for example, during business hours. One thing that I would love Slack to implement is the idea of a scheduled send. This drives me nuts when, for example, my business partner has a great idea over the course of a weekend or at two o’clock in the morning. I want to be notified about that at, let’s say, I don’t know, 10 o’clock on Tuesday if he’s having these ideas during the course of a weekend. Instead it winds up showing up directly in the off hours, which trains me A, not to look at anything he sends me because it’s all work stuff, and B, when I show up on Monday morning I have a giant backscroll I need to go through and sometimes declare bankruptcy on. This is a common problem that manifests itself in the area of alerts as well: if you wind up getting things pushed to you and demanding your attention that aren’t actually important, people are training themselves to ignore it.
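One way to get the "business hours only" behavior Corey is asking for is to compute a delivery time before anything is posted. The sketch below shows only that calculation; it does not call Slack or any other service, the 9-to-5 weekday window is an assumption, and it uses naive datetimes, so time zones and holidays are ignored.

```python
from datetime import datetime, time, timedelta


def next_business_time(now, start=time(9, 0), end=time(17, 0)):
    """Return `now` if it is inside business hours, else the next opening.

    Business hours are assumed to be Monday to Friday, start..end.
    """
    is_weekday = now.weekday() < 5
    if is_weekday and start <= now.time() < end:
        return now  # already in hours: deliver immediately
    if is_weekday and now.time() < start:
        return datetime.combine(now.date(), start)  # early morning: wait for opening
    # Evening or weekend: roll forward to the next weekday's opening.
    day = now.date() + timedelta(days=1)
    while day.weekday() >= 5:
        day += timedelta(days=1)
    return datetime.combine(day, start)


# A bill alert raised at 2:00 on a Saturday is held until Monday morning.
print(next_business_time(datetime(2020, 1, 11, 2, 0)))
```

A non-urgent notification would be queued with this timestamp and sent then, while anything that is a real emergency bypasses the deferral and pages as usual.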
Julie Gunderson: Absolutely, that’s one of the reasons that we say don’t send it to distribution lists, right? That gets put in a folder and your alerts go unnoticed because people learn to ignore the things that aren’t important.
Corey Quinn: It’s stupider than that. Because in many cases you wind up with, for example, a G Suite account that runs for your organization. Every one of those alerts that smacks into your various folks’ distribution lists and winds up in their inbox and never gets read counts against API quota limits for number of emails that can be received domain-wide over a fixed period of time. So if something blows up and this is hitting your email, surprise, suddenly you’re not getting actual important emails for a while until everything stabilizes.
Julie Gunderson: Yep, absolutely. And again, you brought it up. Every alert that wakes somebody up should be humanly-actionable and important to the business at that moment, otherwise it should be handled during business hours.
Corey Quinn: These need to be actual emergencies, things that are actually damaging to the business if they don’t get fixed. There are exceptions to this but everything that wakes someone up needs to be viewed as a bug because it spends one of the most expensive commodities that we have, which is people’s patience. And as those things start eroding what people are able to focus on because they’re not getting a good night’s sleep, they’re worried about the next page going off, you wind up with terrible patterns. I’ve talked to a number of folks from very large, very admired tech companies, and they wind up on call for a few days out of every quarter for a particular service and their response is just they don’t plan to sleep for those few days, and they just wait until the storm passes and then they can hand it off to someone else. That’s freaking terrible. Why in the world would you do that to people? Fix the problem.
Julie Gunderson: Yeah, tune the service, right? You’re letting something sit there and destroy morale, cause insanity among your engineers, and in general is bad for your business, it’s bad for your customer. It’s a bad thing all around. But when you say that, when you say, “Do something about it,” what do they say back? What’s the why not?
Corey Quinn: Again, it’s easy to sit here and say this in the abstract as far as you should fix the thing that wakes people up but there’s a nuanced story around this. Every environment is different, there’s always concerns that need to be caught. Things that wake people up now are in many cases there because once upon a time they didn’t wake people up and it caused a massive outage, that becomes a challenge. We mentioned the story of working in an on-call environment that got terrible and I left for another job that paid the same. Professional advice, never make a lateral move and not make more money out of the deal, pro tip. But I did it because I was desperate to get out and their on-call rotation was simultaneously far better and far more monstrous.
Julie Gunderson: Well, absolutely. Because why? It left one person?
Corey Quinn: No, no. This was a reasonable rotation with enough people. What made it terrible was that the VP in charge of engineering was a big fan of a software package called Big Brother. Okay, fair, it’s like Nagios only somehow worse. Then this same VP was also a big fan of AWS. Now, back in those days the AWS connectivity to the rest of the internet and the latency issues were not as good as they are now. So we wound up getting a bunch of false positives against our service for latency reasons when the Big Brother installation in AWS wound up firing off. So the answer that this brain trust came up with was we’re going to have three of them running simultaneously. And when you get paged, you don’t have to do anything about it until you get three pages, in which case now you know it’s real. Now, people were about to walk out that day and what they did instead was, okay, between the hours of, I think it was 10:00 PM and 6:00 AM, all monitoring and alerting was suppressed because they wanted people to have a good night’s sleep. Okay, that is absolutely one way to approach this but you’re telling me then there is absolutely nothing that could ever possibly justify waking someone up out of hours because that’s how you’re configured now? I mean, I definitely got a good night’s sleep there but I lost sleep worrying about the long-term viability of the company. Surprise, they’re out of business now.
Julie Gunderson: Let’s talk a little bit about a recent podcast that we just had. We talked about owning your code in production, owning your services. How have you brought this into companies?
Corey Quinn: Generally speaking I tend to come at this from a perspective of advisory only. The challenge of trying to drive cultural transformation as an outsider is that, one, it’s nearly impossible, and two, it’s nearly impossible to measure which means you can start charging extortionate piles of money to do it and never really get anywhere. That’s never been the type of business I wanted to run, but cultural transformation is really what’s required to drive an awful lot of these larger scale changes. For better or worse, that’s not something that you can generally get without extreme levels of executive sponsorship. And you’re not going to find folks who are effective at driving those changes with a Google search. So that tends to be an area where you can have a whole bunch of folks telling stories about what it’s going to take in order to change their culture, but without buy-in you’re never going to get anywhere. So mostly I stay away from it and wind up talking specifically about the bill for the same reason I don’t do security assessments, compliance auditing, or legal review.
Julie Gunderson: Now, that really does sound great, Corey. Tell us what you think about dogmas and systems that people have in place just because it’s something that they were told to do. For example, is there a process that you see running over and over that these executives just believe in, that maybe is just not the right thing but they’re doing it because we’ve always done it that way?
Corey Quinn: There’s a lot. But there’s also the counter-argument where here in Silicon Valley, one of the ideas that is always gaining traction is that we’re smarter, better-looking and all around better than people who started companies before. So we’ve disrupted this one industry that we’re in, so now we’re going to disrupt everything else. It turns out that everyone else on the planet is terrible at doing job interviews, is terrible at employee compensation, is terrible at stating earnings appropriately, et cetera, et cetera, et cetera. And we try and reinvent everything from this point of first principles. There’s this ongoing narrative within the space of if you’re not reinventing everything that you’re doing then there’s something awful that’s happening. But we see this in other arenas too. We can even see it in how we talk about “legacy code.” When you say something is legacy I hear revenue-generating. And yeah, well, it’s old and ancient, you can tell it’s crap because it wasn’t written on the JavaScript framework that was released within the last six months. Yes, it’s also the bank’s transaction system, no touchy. There’s a very strange, I guess, attitude of, we just know better even in the complete absence of context when we’re talking about established environments. There’s also then as a result this idea that if you’re not doing what the latest whizzbang tech companies are doing, Kubernetes I’m looking at you, then you’re clearly falling behind and should be absolutely put out to pasture. That’s wrong. What you have done has worked to get you to where you are, whoever you happen to be. Throwing the entire thing out because someone on a conference stage somewhere says there’s a different better way to do it? Approach it with a bit more skepticism.
Julie Gunderson: Well, absolutely. And that’s kind of the start where you are, right? You don’t have to change everything that you’ve done if you’re looking to innovate and move forward, you can start where you are and look forward. And then if you need to make those changes on those legacy systems, after careful evaluation you can go ahead and do that, but change for the sake of change isn’t very productive.
Corey Quinn: Unless you wind up making your living by profiting off of change, in which case it’s essential.
Julie Gunderson: Very fair.
Corey Quinn: Always question the motives of people trying to sell you a thing.
Julie Gunderson: Well, everybody is trying to sell you a thing, Corey.
Corey Quinn: So it would seem. There are people who are selling things and people who don’t realize they’re selling things, that’s really the only big question in some cases.
Julie Gunderson: All right. I’m going to agree with that. All right, tell us, what’s the one thing that you wish you would’ve known sooner? If you could look back into your career 14 years in the past, what’s the thing that you would’ve told past Corey?
Corey Quinn: Honestly, the fact that everything that held me back in corporate life, namely my personality, would one day be the thing that made me the greatest cloud influencer in the world. It really comes down to finding your own path forward. Of being able to say with something of a certainty that whatever works for you doesn’t necessarily need to work for other people, and this applies to everyone, not just me. If there are things that make you unique among folks and everyone craps all over it all the time, question if maybe there’s a way forward. Now, be careful with that bit of wisdom because it could very easily turn into empowering something absolutely monstrous and that’s not my intention in the least. But it is worth considering from a perspective of what is the thing that makes you special that makes you stand out and gives you a unique voice. For me it was my snark, my sense of humor, my complete lack of respect for authority, and that caused massive freaking problems in the course of my career. For better or worse now it has indeed made me completely unemployable but on the other side of the coin it did empower me to do this ridiculous influencer thing/consulting thing.
Julie Gunderson: Corey, I would also add that your suits may have been one of those things that people said don’t do. And at the end of the day, after knowing you all of these years that we’ve known each other, you really have embraced who you are. And for people who struggle with that, what do you suggest they do? Do you suggest they leave an organization? Do you suggest they go out and try to be Corey Quinn and start something new or pick fights on Twitter?
Corey Quinn: That’s the challenge: when you find yourself marching to the beat of a different drum, then trying to pattern what you’re doing after what other people have done already means that at best you’re going to be the crappy second place version of that person, because you’re never going to be better at being them than they are. I looked for and could not find examples of people doing the snarky, sarcastic thing in an economically-successful way. Or the few folks who tried, an easy example of this is The Register, a journalism/tech rag that’s been doing this for almost 30 years, where they’re very snarky and sarcastic and they go after pretty much everyone with snark and cynicism and as a result they find themselves uninvited from an awful lot of events, companies won’t speak to them. Well, yeah, no kidding, there has to be something uplifting and you can only ever punch up and only to a limited point, otherwise you’re not really warming up for people, instead you’re driving folks away where you otherwise might want to get them to embrace what you’re doing.
Julie Gunderson: Absolutely. All right, well, now let’s talk back about running software in production. What are you glad I didn’t ask you about today?
Corey Quinn: That’s an interesting question. I’d say probably the best thing is what software are you running in production today? Because the answer is nothing that is critical path. The newsletter production system that I built is over-engineered and ridiculous, but if that were to go down due to a variety of different things, great, worst case I’m writing next Monday’s email newsletter by hand. There’s nothing else that I run that is critical path because when you’re providing advisory services, you generally don’t need to have the same constraints as a SaaS platform. That said, we have an awful lot of analytics tools that we wind up using that consume and slice and dice various AWS environments, their bills, et cetera. But that looks a lot more like a series of scripts and a hosted app that we run in a serverless context; it isn’t running software in the traditional sense. It’s push-button: it spins up an environment, runs its analytics, and turns itself off. If that ever goes away for a protracted period of time there’s something systemic happening. So I sleep super well at night from a software point of view, but I am rather glad that you didn’t ask me for tips and tricks on running software intelligently in production because my easy answer is don’t.
Julie Gunderson: Talk to me a little bit about analytics. Do you oftentimes see big organizations not using analytics to the best of their ability?
Corey Quinn: Find a single large organization that is using analytics to the best of their ability and we have a case study for something grand; everyone is struggling with analytics. There’s also this occasionally misunderstood belief that all of your giant pile of data is incredibly useful to your business. This school of thought is generally advocated by folks who stand to make a tidy profit from charging you for storage for that giant pile of data. And if you take a look at AI/machine learning approaches, they can sort through huge piles of data and find anything except a viable business model. It costs a crap ton of compute, it costs a crap ton of storage, and one wonders why all of the cloud providers are super behind that entire movement. But I’m not seeing definite business value come out of it yet, it feels like it’s almost entirely hype, similar to blockchain. Now, getting analytic data out of things that matter to your business and are material, well, that’s a bit of a different story but that comes down to being able to understand what’s going on in your business. ERP systems were a big approach towards this, mostly at manufacturing companies, but there’s a lot of other folks that are using similar things for various reasons, to more or less provide dashboards into what’s going on through every aspect of very large enterprises. You need something like that. I don’t pretend to say, “Oh, all of the ERPs are useless,” I just know that if we ever get to a point where I have to install one for a company I’m at it’s time for me to find a smaller company.
Julie Gunderson: All right, Corey, the last thing that I would like to ask you today is again, the Number One Cloud Influencer for 2019, what does 2020 hold for you?
Corey Quinn: Well, that’s the question. How do you get better than number one as being a cloud influencer? Maybe I’ll try serverless influencer, maybe I can become a blockchain influencer despite doing nothing with blockchain other than making fun of it. But honestly it comes down to attempting to come up with better stories to tell, find different venues to tell those stories in. I continue to have an ongoing love affair with the sound of my own voice so I will continue doing my own podcasts, namely Screaming in the Cloud and the AWS Morning Brief, but there’s a lot of other things I have planned as well, various stunts. Everyone thinks I’m kidding when I say this but I do in fact have a line item in the company budget that is labeled Spite. I look forward to tapping the Spite budget in the months to come.
Julie Gunderson: Thank you, Corey. And thank you for taking the time to be on Page It To The Limit. With that, this is Julie Gunderson, wishing you an uneventful day.
Corey Quinn: (singing).
Julie Gunderson: That does it for another installment of Page It To The Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit using the number 2. That’s @pageit2thelimit. Let us know what you think of the show. Thank you so much for joining us and remember, uneventful days are beautiful days.
Corey discusses how he became a cloud influencer, what that means for his credibility, and how people now listen to him.
Corey: “I have been a fan of PagerDuty for a long time, largely because I remember a time working on systems before PagerDuty existed.”
Corey discusses why waking the right people up at the right time makes it much less of a “Call of Duty” situation.
Corey wanted a problem that was strictly business hours only. Julie and Corey discuss how it all ties together with being on call.
Corey: “One of the interesting pieces that we see around all of those different business functions is that in most companies, engineering is usually the only group that needs to be awake at two in the morning that needs to deal with an emergent event.”
Corey goes on to explain that there are exceptions to this rule, and how businesses should staff accordingly if it is a 24 / 7 critical system, and how empowering folks to fix things that are broken is the start of cultural change.
Corey: “Every time you wake someone up, you can contextualize it as pushing someone one step closer to rage quitting.”
Corey talks about looking back and removing the noise by understanding and analyzing what is alerting, and knowing when to wake people up.
Corey debunks the myth that is the “central conceit of AWS billing”.
Corey: “People talk about how the bill is too expensive… The pain as articulated by the business, is that someone wound up receiving a bill from Amazon that was 20% higher this month than last month… contextualizing what the Amazon bill is for someone that does not work with technology is not exactly a straight shot.”
Corey goes on to tell us what kinds of questions to ask when the bill is 20% higher, and how tying the AWS bill into PagerDuty is a “DON’T DO THAT” situation, due to how billing works at AWS, as the bill is always a trailing function which requires more analysis than waking someone up at 3am.
Advice from Corey: “I would be fine with that going to a Slack channel during business hours.”
Corey and Julie talk about how people get trained to ignore alerts, either through distribution lists or API quota limits.
Corey: “If something blows up and this is hitting your email, surprise suddenly you’re not getting actual important emails for a while until everything stabilizes.”
Corey talks about how alerts that wake people up should be emergencies, things that are actually damaging to the business if they don’t get fixed… with exceptions. Corey goes on to talk about how people’s patience can get eroded and terrible patterns can emerge.
Corey talks about the nuanced story behind why things don’t get fixed, and why it becomes a challenge to tune services.
Corey: “What made it terrible [the on-call rotation] was that the VP in charge of engineering was a big fan of a software package called Big Brother… this VP was also a fan of AWS… the answer that the braintrust came up with was to have three of them running simultaneously, so when you get paged you don’t have to worry unless you get paged by all three of them and then you know it’s real.”
Corey talks about how driving cultural transformation does require executive sponsorship and that buy-in will get you everywhere.
Julie and Corey discuss ingrained dogmas and Corey talks about how we try to reinvent everything, and how that isn’t necessary.
Corey: “There’s this ongoing narrative within this space of if you are not reinventing everything you are doing then there’s something awful that’s happening.”
Corey goes on to discuss how you don’t have to throw everything out you have done to be successful.
From suits to personality, Corey talks about what he wishes he knew years ago. Corey talks about owning the “thing that makes you special”, and finding your niche.
Corey is the Cloud Economist at The Duckbill Group. Corey specializes in helping companies improve their AWS bills by making them smaller and less horrifying; hosts the Screaming in the Cloud and AWS Morning Brief podcasts; and curates Last Week in AWS, a weekly newsletter summarizing the latest in AWS news, blogs, and tools, sprinkled with snark.
Julie Gunderson is a DevOps Advocate on the Community & Advocacy team. Her role focuses on interacting with PagerDuty practitioners to build a sense of community. She will be creating and delivering thought leadership content that defines both the challenges and solutions common to managing real-time operations. She will also meet with customers and prospects to help them learn about and adopt best practices in our Real-Time Operations arena. As an advocate, her mission is to engage with the community to advocate for PagerDuty and to engage with different teams at PagerDuty to advocate on behalf of the community.