Managing Vendor Incidents with Jeff Martens

Transcript

Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me at LNXCHK on Twitter.

Mandi Walls: Alright folks, welcome back to Page It to the Limit. This week I have with me Jeff Martens, and we are going to talk about vendor outages. When this happened, CrowdStrike in July, the first thing I thought was we’re going to have to cover this on the podcast. And the second thing I thought was I want to talk to Jeff about this because he is the man who has been thinking about this very deeply for a number of years and will have capital-O opinions and that’s what we want. Jeff, welcome back to the show.

Jeff Martens: Thank you for having me. Yeah, I’m excited to talk about something that I’ve been thinking about a lot over the years and something I care about a lot and something we now are starting to realize is a pretty big issue when it comes to software reliability.

Mandi Walls: Absolutely. You start to feel like a broken record because when we first started moving things to the cloud, problem number one is you own your own availability. Your customers still demand performance from you regardless of what your choices are in your platforms and everything else. I feel like we’ve been trying to escape from that a little bit, but at the same time it’s become harder and harder and harder when you open your Okta panel and you have 70 different things that you can log into and all these dependencies and all of these tools. They’re great and they speed things up immensely. But at the same time, when you’re looking at nines, they multiply together and they go down the more you add.

Jeff Martens: Yeah, it’s interesting because all of the vendors that we rely on are typically also SaaS companies themselves, which means they have a ton of vendors and so that has a few different implications. So one is there’s this idea of chain dependencies and we could talk all day about that. The other thing though is that one of the things I find is a lot of companies expect a lot from their vendors, but then when it comes to their customers, they still take the position of power where they say, look, we’re going to set the SLA, we’re going to decide what an outage means. We’re going to force you to prove it. We’re not going to admit to it on our own. And so I really think there’s an opportunity for us to start looking at what is the reality of how we build software and the fact that in a way we’re kind of each other’s customers and we should be treating each other with the same respect that we expect from our own vendors. And that’s something that I’ve always kind of scratched my head at a little bit is why say one thing as a vendor and say a different thing as a customer.

Mandi Walls: Absolutely. And us too, I mean a lot of our dependencies are other SaaS providers. In a lot of places we have mutually beneficial relationships with them. We use their stuff, they use our stuff, and then it feels like potentially dominoes or a house of cards: one thing sort of flips out and a whole lot of folks are impacted and it just flows downstream.

Jeff Martens: And to go back to what you started with by saying we own our own availability, I 100% agree with that. And I always like to give props to somebody that helped me really grasp this. And that’s Tim Armandpour, the CTO of PagerDuty. When I worked at PagerDuty many years ago, one of the early engineering or product meetings that he was leading, he said, look, we will not ever blame our vendors. It is our fault and it’s unacceptable in any org that Tim runs to blame a vendor. And what I did is I went and looked at some of PagerDuty’s competitors and as I looked at their status pages, I saw that about two thirds of their incidents on their status pages directly blamed an upstream dependency. And it just kind of helped me really establish, hold on a second, what does it mean to be a good corporate citizen in SaaS?

Jeff Martens: And Tim might be sick of me singing his praises about that, but it has stuck with me for, what now, seven or eight years since he said that. And it’s now my own philosophy as well. It’s up to us. I would like our vendors to do more to help ensure that my reliability doesn’t suffer, but at the end of the day, software is hard. There’s no such thing as a hundred percent availability. We have to learn to deal with it and we can’t make it someone else’s problem. So as hard as it is, we have to take ownership of it.

Mandi Walls: Yeah, absolutely it is. It’s such a challenge. It’s such a challenge now. And I encourage people, when they’re drawing out their infrastructure, when we’re looking at places where they’re monitoring things, where we’re working with them through stuff: add in your vendors. Tell me, tell the rest of your teams, where your production flows, where your revenue-generating customer flows, are dependent on a third party. They are part of your architecture, and that is part of what you should be paying attention to. So many people don’t bring them in that way.

Jeff Martens: Yeah, I mean absolutely. Service maps are becoming a really common thing. There’s a lot of great tools out there to help us build them, but they still tend to be really focused on the services that we build and own. And what I found in my work in this space over the years is that the typical app operating at any level of scale is going to have 50 or more third-party dependencies that are connected to the product experience. And the average company is going to have 130 SaaS vendors. And so a lot of companies might only have 20 or 30 services that make up their product experience, but you could have 50 third-party dependencies. You could literally have more third-party dependencies than you have services that you own and manage yourself. And so if you don’t include those in your service maps, if you don’t have runbooks for those things, you’re missing potentially more than half of your infrastructure.

Mandi Walls: Yeah, oh absolutely. You’re bringing them in, because it saves you so much time. We’re thinking things like billing and all the shopping carts, the stuff that you don’t want to deal with and you don’t want to have to reinvent the wheel for security reasons, for support reasons or any of that. You bring that stuff in for very, very good reasons, but you cannot ignore it when it comes time to talk about what does it mean for our product to be available to our customers.

Jeff Martens: Yeah. This makes me think of something that I’ve talked about before and I’ve written about and that is that on the surface a lot of people might think to themselves like, okay, of course it makes sense for me to use someone else for payments. Someone else is going to be better at payment infrastructure than I’m going to be. And I agree with that. You absolutely should not build your own payment stuff. You should use one of the awesome products out there to do that. But every time you make that decision, you are reducing the best potential reliability that you can offer. And so a lot of companies want to offer four nines. Four nines is hard, three nines, it’s doable, four nines is doable, but it’s tough. But if you think about that, we often say, well, if I have 50 services and they all offer four nines and they achieve that promise, I will also have four nines.

Jeff Martens: But that’s actually not true because it’s rare that all 50 of your vendors are going to have outages at the same time. Say they all have 0.01% downtime: multiply that times 50 and then subtract that from your 99.99, and that is now the reliability that you can reasonably promise. Now, you could probably have better than that, but if you want to make a promise, you can’t promise any higher than four nines minus all of the services you use and whatever they have that’s under a hundred percent. And that’s where I think it becomes really eye-opening. You’re like, whoa, I really want to offer four nines to my customers, but now I don’t control a huge chunk of this. How can I do that? And again, like I said, you can still deliver better than that promise, but you still have to walk in understanding that the risk is there’s a very good chance you won’t be able to, depending on what happens with your vendors over time. And we all know outages happen. They’re going to happen. You can’t plan on them not happening.
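
The arithmetic described here can be sketched in a few lines. This is only an illustration, assuming that every dependency sits in the critical path, that failures are independent, and that your own service and all 50 vendors each deliver exactly four nines; the function names are made up for the example.

```python
def composite_availability(own: float, vendors: list[float]) -> float:
    """Best-case availability when every dependency is in the critical path
    and failures are independent: the product of all availabilities."""
    result = own
    for vendor_availability in vendors:
        result *= vendor_availability
    return result


def subtractive_estimate(own: float, vendors: list[float]) -> float:
    """The back-of-envelope version: subtract each vendor's downtime fraction."""
    return own - sum(1.0 - v for v in vendors)


# Assumption for illustration: 50 vendors, each delivering exactly four nines.
vendors = [0.9999] * 50
exact = composite_availability(0.9999, vendors)
rough = subtractive_estimate(0.9999, vendors)

print(f"product of availabilities: {exact:.4%}")
print(f"subtractive estimate:      {rough:.4%}")
```

Both approaches land at roughly 99.49 percent, well short of four nines. Real systems rarely match these assumptions exactly, since not every vendor is in every request path and outages are not perfectly independent, so treat the number as a ceiling on what you can promise rather than a forecast.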

Mandi Walls: No. And I’ll say the folks are improving on their notifications about issues, but not everybody is improving all together. And some folks are still trying to hide things, or it takes so long for them to post that, Hey, this subsystem isn’t working, or we’re seeing an outage here, or whatever it is, because they’re afraid of being front page New York Times or whatever their issue is. Then they have a glitch or a hiccup or anything, and those of us who are consuming their service don’t know when to pull the plug. Do I fail over out of this thing? Do I hit the red button and do some graceful degradation of this system because my vendor isn’t telling me that they’re having an outage, but it really looks like they are?

Jeff Martens: Yep, I’ve seen the same thing you’ve seen where there are a lot of companies that are doing better. I will say I think most companies want to do right by their customers. I think there’s very few that are actively saying, no, I don’t want to. But what happens is when you get into the nuts and bolts of how you do it and when you do it, that’s where things start to change. But it’s not because of bad intent for the most part. So some companies are getting better at it. One that I’ve always appreciated is Zendesk. I think the team there and the person that runs it that I know, they say, look, this matters to us. And actually for them, a lot of times it’s because they expect the same from their vendors. So they want to give that same respect to their customers.

Jeff Martens: But a lot of companies aren’t investing in improving there, and that’s why I’ve long called for a standard. I think we have to have a standard in the industry that says, this is what you need to do to be a reliable cloud vendor. We have standards for a lot of areas of risk. There’s tons of frameworks in the GRC space for how you prove that you’re secure, how you prove that you have disaster recovery. And this is an area where I think we can have a standard. And the reason I think that matters is it can start the conversation and it can get all these companies that know they need to do more off the sideline and start making improvements towards this standard. I don’t necessarily think we have to have certifications. You don’t have to have a SOC 2 for how you communicate with your customers, but if we can set an expectation that vendors and customers can get behind, we can start working towards that and we can create a roadmap for what does it mean to be a customer-first SaaS vendor?

Jeff Martens: I think that might be a lot of what’s happening here: people are saying, the head of SRE is saying, I want to do more, but I often don’t know when the right time is to communicate. You get an alert and you’ve got an issue and it’s happened for the last five minutes. Do you go push a button and blast all of your customers and say you have a problem, or do you wait until it’s been going on for 10 minutes or until you know what the problem is? That’s a hard question to answer and I can’t answer it for anybody. I can answer it for my business and my company and my product. But how do you think about that? How do you evaluate that? How can each individual SaaS vendor decide what their process is to determine that, with the goal of notifying the customer as soon as reasonable and giving them all the information that they need? And maybe that’s just starting by saying, look, we’re investigating a problem. That is what a lot of people say now. Unfortunately, they typically say that 29 minutes after their incidents have started, and that’s data that I validated from monitoring these third parties directly. And so instead of waiting 29 minutes to say, Hey, we think we have a problem, we’ll let you know more, do that at five minutes in and then be able to come back with more specific details.

Mandi Walls: Absolutely. And I think the notification early will also help the vendor because then your users can actually tell you, oh yes, they can coalesce with that. Yes, we are seeing this as well. This is how it’s impacting us. This is what we’re seeing on our end. And you can use that to coordinate the rest of your user data back into that incident or however you’re managing it so that you have more telemetry to then triangulate off of for impact and for your own SLAs. Without getting too much into just blasting everyone off, having the status pages, I think for a lot of folks is a huge improvement over just having nothing or trying to manage it over social media, which we were doing for so long.

Jeff Martens: And I really like what you just said about this idea of crowdsourcing this information together for the benefit of the vendor. I want my vendors to be able to resolve their incident faster too. So why aren’t we working together and why aren’t all the customers of a specific vendor saying, Hey, here’s what I’m seeing. And maybe that’s manual by clicking a thumbs up or thumbs down. Maybe that’s sending a support ticket in, or maybe it’s just sharing our monitoring data. Because I think for a lot of companies, the ideal scenario is they’d have outside in monitoring of their own systems for every single customer. The reality is that’s not going to happen. It’s time consuming. It’s expensive. It’s going to be a load on your infrastructure. So I don’t think anybody should do that. But if you’re the customer and you say, look, I have 50 dependencies and here’s my eight most critical ones, I’m going to start doing some direct monitoring of them, why not share that back? And so you can help the vendor have a better understanding of the blast radius of their incident. Is it geographic or is it a certain version of an API? What exactly is it? And use that to resolve the incident faster or at the very least just put better information on the status page.

Mandi Walls: Absolutely. And that’s one of the first things we ask when we have a big incident. What’s the customer impact? Is it related to just one subsystem? How wide is it? What’s the workflow? Any of that. And if more of that data comes in from customers, you get more information. It just adds more information to your troubleshooting so you can pinpoint where the issues are even faster. So yes, I would definitely appreciate folks putting that stuff together. So as we work with vendors and we hope for these relationships to improve, in the meantime, incidents are going to happen. And so as a user of a vendor or the consumer of a vendor, what are some things you have thoughts on? What should you collect and know about your vendors before an incident happens to help you sort of manage, have a good outcome there?

Jeff Martens: Yeah, absolutely. So I think the first thing is to know who your point of contact is at a vendor and make sure you’re talking to them before you have an incident. If the first time you contact them is in the middle of an incident, you may not be getting the support or the attention that you could be getting if you had built a relationship with them. There’s some small things that I think could be really impactful. One of the things that I’ve seen happen before at companies I’ve worked at, but also clients that I’ve worked with, is an employee three years ago signed up for a service, integrated it into their app, and then a year ago they left. And nobody knows whose email address is even used to log into this service. And even if you do know, now that email address is gone. How do we get a password reset?

Jeff Martens: And so I think, and maybe this is in your runbooks, maybe it’s in your service directory or both, but just having an understanding of what is our dependency on the service? What is all the information we need to know about the account number, how we log into it, who our rep is. And I think we should be pulling our vendors into more of our operations like post-incident reviews, but let’s not wait until our annual contract renewal comes around to build that relationship. And unfortunately, that tends to be what happens is your annual contract renewal comes up, your vendor sends a product manager, an engineering director out to shake hands and kiss babies, and you have a warm fuzzy feeling and then you sign the contract, they know they’ve got you for another 12 months and then neither of you contact each other again. So that’s an area where I think just building relationships and having those connections can be a huge help.

Jeff Martens: But I also think for the things that you’re most critically dependent on, things like payment processors, things like your compute and storage and databases, set up active direct monitoring. It’s something that absolutely can be done. It’s really not hard. It’s essentially synthetic monitoring. But where most synthetic products are designed for you to use against your own services, you’re just taking it and flipping it on its head and saying, I need to validate the services that I depend on the most. And again, going back to the start of our conversation, treat it like it’s a service that you built. You would frequently set up synthetic tests for a service you’re dependent on within your org. Why wouldn’t you also do that for a third party? And I’ve read a ton of terms of service and almost all terms of service allow you to do that as long as you’re doing it for the benefit of knowing the reliability of this vendor.

Jeff Martens: So you’re totally within contract terms to do that. Oftentimes you can do that without incurring any extra expense, but it gives you this level of visibility that you just can’t have through typical methods of observability like APM. You’re going to get so much more, because one of the things I’ve always said is it’s not hard to keep an API gateway responsive, but it is hard to make sure that all the services behind it are doing what they’re supposed to do. And APM is not going to tell you what functionally is or isn’t happening. It’s just going to tell you if the HTTP response succeeded and what the code was. So you have to do more nowadays.
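
One way to picture the kind of direct check described above is a probe that looks past the HTTP status and validates what the response actually says. The sketch below is an assumption-heavy illustration, not any vendor's API: the `charge_succeeded` validator, the `"status": "succeeded"` field, and the sandbox URL in the comment are all hypothetical, and the fetch step is passed in as a function so the logic can be exercised without a network call.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    status: int
    latency_s: float
    reason: str


def http_fetch(url: str, timeout: float = 5.0) -> tuple[int, bytes]:
    """Plain GET with the standard library; returns (status code, body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; report those as ordinary results.
        return err.code, err.read()


def probe(fetch: Callable[[], tuple[int, bytes]],
          validate: Callable[[bytes], bool],
          max_latency_s: float = 2.0) -> ProbeResult:
    """Run one synthetic check: status, functional validation, then latency."""
    start = time.monotonic()
    try:
        status, body = fetch()
    except Exception as exc:  # timeouts, DNS failures, refused connections
        return ProbeResult(False, 0, time.monotonic() - start, f"request failed: {exc}")
    latency = time.monotonic() - start
    if status != 200:
        return ProbeResult(False, status, latency, "unexpected status")
    if not validate(body):
        return ProbeResult(False, status, latency, "response failed functional check")
    if latency > max_latency_s:
        return ProbeResult(False, status, latency, "too slow")
    return ProbeResult(True, status, latency, "ok")


def charge_succeeded(body: bytes) -> bool:
    """Hypothetical validator: a sandbox test charge must report success."""
    try:
        return json.loads(body).get("status") == "succeeded"
    except (ValueError, AttributeError):
        return False


# A gateway that answers 200 while the service behind it is failing:
result = probe(lambda: (200, b'{"status": "error"}'), charge_succeeded)
print(result.ok, result.reason)

# Against a real dependency you would pass something like
#   lambda: http_fetch("https://vendor.example/sandbox/charge-check")
# where the URL is a placeholder for whatever test endpoint your vendor offers.
```

The simulated response returns HTTP 200 and still fails the probe, which is the gap between "the gateway responded" and "the service did its job." Check your vendor's terms of service and rate limits before pointing a scheduled probe at a real endpoint.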

Mandi Walls: Yeah, absolutely. I think too for folks on the business side, I know after the past couple of years there’s been a lot of contraction of budgets and things like that. And for the vendors that are in your line of revenue, you really should be buying support contracts that are commensurate with your reliance on that vendor. And I know it’s a cost center, we know it costs money, but the relationship that you get, the answers to the questions that you get, the support that you get off of a premium contract, or platinum contracts or whatever they call them at your vendors, are going to help you in these situations as well. And you’ll often have that VIP line when there is an incident or when there is a problem going on. We see so many folks that are, for budgetary reasons, trying to balance that out and scrape off part of that contract, and then something happens and you don’t have that anymore.

Jeff Martens: Yeah, it’s one of those things that I don’t think people realize the value of those support contracts until they need it and then they go, oh my God, I’m so glad I had it. But how many people have gotten to the point where they didn’t have it and they don’t even know what could have been? I mean, as you were sharing what you said and what I fully agree with, the analogy that came to mind for me is if I buy a Ferrari, I’m not getting just liability only car insurance, I’m getting comprehensive. But if I have a $2,000 beater that I use to haul firewood around in, you know what? liability only is probably just fine. And I think you could take that to your vendors as well. You look at the 50 or so that you have and put ‘em into tiers. This is a tier one vendor and for all of our tier ones, this is what we’re going to do.

Jeff Martens: One of those things should be if the support contract’s available, get it. But then you have a bunch of other things like, Hey, these are the relationships I need to build with my account manager, with my solutions architect at that vendor, whoever it may be. These are the ones I’m going to start with when I’m making sure I have runbooks for my services. And then you can have tier two and tier three and tier four just like you have for your own services and just like how you evaluate incident severity. It’s not hard to do. We just have to take the time to do it. And unfortunately as we’ve advanced in how we develop software, we just have to do it 50 or a hundred times. And I think that’s the problem. Any one service, it’s not hard to put this level of visibility and relationship in place. It gets hard when you have to do it dozens of times.

Mandi Walls: Absolutely. Absolutely. And there’s so many things that are at different parts of your workflow, and it’s harder to justify buying up on things that are very much on the left, up in the early parts of the workflow, versus putting your money where your mouth is in your production systems and those dependencies and being very clear about these are the things that are customer-facing. And maybe there’s money to be saved on the things that are a little bit further upstream from your customers or a little bit to the left of your application development process that can be reaped back, but yeah.

Jeff Martens: Yeah. Well, and it’s also, it can help you operate better as well. Maybe we don’t need to scramble for a tier four vendor unless they’ve been down more than 60 minutes or whatever it is for you, but we don’t need to have chaos in our real time work. We could plan how we approach our real-time work. We may not know what that work’s going to be, but we can prepare ourselves to respond to it and act on it in a way that makes sense for our business, what matters to us, what we promised to our customers and what we know is in the critical path.

Mandi Walls: Absolutely. There’s stuff that you can just wait out, okay, maybe we want to have a deployment freeze because there’s something down in the chain, but it doesn’t mean that we need to scramble to an incident call and call a SEV-1 on this particular vendor incident. We just kind of let it ride.
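
The tiering idea from this part of the conversation can be written down as a small policy table. The sketch below is illustrative only: the 60-minute threshold for tier four comes from the discussion, while the other thresholds, the `Action` names, and the `in_deploy_path` flag are placeholder assumptions you would replace with your own.

```python
from enum import Enum


class Action(Enum):
    WATCH = "watch"                      # note it and keep working
    DEPLOY_FREEZE = "deployment freeze"  # pause releases, no incident call
    OPEN_INCIDENT = "open incident"      # page people and start a response


# Minutes of vendor downtime tolerated before opening an incident, per tier.
# Only the 60-minute figure for tier four is from the conversation;
# the other values are placeholders.
INCIDENT_AFTER_MINUTES = {1: 0, 2: 15, 3: 30, 4: 60}


def vendor_outage_action(tier: int, minutes_down: float,
                         in_deploy_path: bool = False) -> Action:
    """Decide how to respond to a vendor outage based on its tier and duration."""
    if minutes_down > INCIDENT_AFTER_MINUTES[tier]:
        return Action.OPEN_INCIDENT
    if in_deploy_path:
        return Action.DEPLOY_FREEZE
    return Action.WATCH


print(vendor_outage_action(4, 45).value)
print(vendor_outage_action(3, 10, in_deploy_path=True).value)
print(vendor_outage_action(1, 2).value)
```

Writing the thresholds down ahead of time is the point: the decision about whether to scramble is made calmly, before the outage, instead of during it.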

Jeff Martens: Yeah. You actually just touched on something that I hear frequently, which is folks saying, well, sure, I’d like to know more and faster when I have cloud dependency issues, but I can’t do anything about it. So it’s just nice to have information. But I disagree. There is a lot you can do about it. And I think for certain tier one vendors, it makes sense to have a backup. If your business is dependent on processing financial transactions every second, like an e-commerce provider would be, you should probably have a backup payment provider. And how could you put logic into your app to help determine when you process with one payment provider over another? I met a startup recently that is approaching this third-party risk from that standpoint: how can we make it so you don’t even have to worry about a payment processor outage, because we are helping you manage that to the point where a human doesn’t even need to come into play.
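
As a rough picture of what "logic in your app" for choosing between payment providers might look like, here is a minimal sketch based on a rolling failure rate. It is not how any particular product works; the `ProviderHealth` class, the window size, and the 50 percent threshold are all assumptions made for illustration.

```python
from collections import deque


class ProviderHealth:
    """Rolling record of the most recent call outcomes for one provider."""

    def __init__(self, window: int = 20):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)


def choose_provider(primary: ProviderHealth, backup: ProviderHealth,
                    max_failure_rate: float = 0.5) -> str:
    """Prefer the primary unless it is failing too often and the backup looks healthier."""
    primary_rate = primary.failure_rate()
    if primary_rate > max_failure_rate and backup.failure_rate() < primary_rate:
        return "backup"
    return "primary"


primary, backup = ProviderHealth(), ProviderHealth()
for outcome in [True, True] + [False] * 8:   # primary starts failing
    primary.record(outcome)
for _ in range(10):                          # backup is healthy
    backup.record(True)

print(choose_provider(primary, backup))
```

A production version would need more care than this, for example idempotency keys so a retried payment is not charged twice, and a way to send a trickle of traffic back to the primary to detect recovery.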

Jeff Martens: Now, you don’t need to do that for everything. That could get pretty expensive. It can also get risky to have somebody managing your logic for you, but if it really matters to your business, you probably should. And then I think about things like AWS, Azure, GCP, like your core infrastructure. You’re probably not going to be doing a lot of failovers in that case because that’s a little bit harder to do. It’s more time consuming. In some cases it just doesn’t make sense. But we have disaster recovery plans for if something happens to an entire AWS availability zone or region, but do we have disaster recovery plans for when a certain thing just fails and it’s going to be down for an hour? So we often think about disaster recovery from these big scary things.

Mandi Walls: Catastrophic.

Jeff Martens: Exactly. But what about that hour long outage? Because as we know, depending on your company size, you’re losing hundreds of thousands of dollars in both real money, but also brand reputation for every hour you’re down. And so I think having some kind of plan of action around how do we make the decision that we want to shift everything to a different region? What does that decision process look like? And then you can take action. And so at the surface level, it seems like, well, there’s not much I can do, but when you dig into it, there is a lot you can do both when an incident’s happening, but also when you prepare for incidents. Another thing I’ve always encouraged people to do is evaluate and analyze the reliability of your vendors. Have that conversation with them before you re-sign your contract. Take a look at the competitors and make sure you’re using a vendor that has a reliability and an SLA that matches what your needs are. Have those discussions with them. Let them know what your needs are, share the data you have about them with them and make this an ongoing conversation. So there definitely is a lot you can do. It just takes that time to kind of sit back and say, well, where is my greatest risk in my stack and where can I start chipping away at this?

Mandi Walls: Yeah, absolutely. And it’s a discussion to have not during the vendor’s incident. Even though this may not have been an incident that was on your systems, having a post-incident review of a vendor incident I think is super important if it’s one of your tier one vendors, as we kind of tagged them, to be able to say, here’s what we saw. Here’s how they notified us, here’s when we saw it versus when they notified, here’s how it impacted us, revenue impact, blah, blah, blah. And then you have now sort of a baseline to look at the switching costs if you want to maybe change to another vendor or take another look, and start having those internal discussions to say, here’s the risk associated with this vendor based on this current outage that we’re talking about, plus maybe the last three or four over whatever timeframe you want to look at. Be very methodical about it and actually have those kinds of discussions instead of going on vibes, because it’s super easy to say, well, we’re using this vendor and they’re down all the time.

Mandi Walls: And then you look at it, it’s like it’s been maybe eight minutes over the past 12 months. It just feels more impactful.

Jeff Martens: Yeah. This is why I think it’s absolutely critical that you’re monitoring your vendor’s reliability on your own. Nothing is going to tell the story better than data. You can put that data together without any kind of active monitoring, but it’s typically time consuming. It’s inaccurate at times, and it can be really hard for your vendor to accept it if you just go through and you’re manually adding up minutes and you’re saying, well, I think we were down then, or I think things were too slow then. So I think it’s critical to have your monitoring set up so you are independently tracking these dependencies, and you could use that in your post-incident review, but you can also use that in your relationship building going forward. And about post-incident reviews, one of the things that I learned from the folks that manage reliability over at Zendesk is the idea of asking your vendors to join that post-incident review. If you’re paying a significant amount of money, I bet if you ask, they’ll send somebody to your post-incident review, and that’s going to allow them to develop empathy around what it means to you when their software goes down, and it’s going to allow them to be part of the solution. So it’s this idea of treating your vendors as just an extension of your team, not a totally separate team.
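
To make the contrast between "vibes" and data concrete, here is a small sketch that turns probe results into downtime minutes and a measured availability figure. It assumes probes run at a fixed interval and that each failed probe counts as one full interval of downtime, which is a simplification; the function names and sample data are invented for the example.

```python
from datetime import datetime, timedelta


def downtime_minutes(probes: list[tuple[datetime, bool]], interval: timedelta) -> float:
    """Count each failed probe as one full probe interval of downtime."""
    failed = sum(1 for _, ok in probes if not ok)
    return failed * interval.total_seconds() / 60


def availability(down_minutes: float, period: timedelta) -> float:
    """Fraction of the period the vendor was up, given total downtime in minutes."""
    return 1.0 - down_minutes / (period.total_seconds() / 60)


# Invented sample data: one probe per minute for an hour, failing from minute 10 to 17.
start = datetime(2024, 7, 1, 12, 0)
probes = [(start + timedelta(minutes=m), not 10 <= m <= 17) for m in range(60)]

down = downtime_minutes(probes, timedelta(minutes=1))
print(f"measured downtime: {down:.0f} minutes")
print(f"availability over a year: {availability(down, timedelta(days=365)):.4%}")
```

With the eight minutes over twelve months mentioned earlier, the measured availability comes out at roughly 99.998 percent, which is a very different conversation to have with a vendor than "they’re down all the time."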

Mandi Walls: Yeah, absolutely. At some place, there’s probably going to be some synergy where maybe they have recommendations for additional settings or tolerances that you could be using, or maybe your suggestions give them a better graceful degradation setup, or something back and forth there to give everybody a place to learn and grow from and improve from that you’d be missing without having them as part of the conversation. So we’re all learning all this stuff all the time.

Jeff Martens: And like I said earlier, I think most of us are trying to do our best. We have the right intentions. I think it’s time now that we make this a priority. In the past, over these years as we’ve been ramping up our usage of third parties, I think it’s always been one of those things on the side where it’s like, yeah, but they’re pretty reliable, so we’re good. And there’s not much I can do about it. But the more and more we rely on third parties, the more we hear about the issues they’re having in the news. And oh, by the way, most issues never hit the status page, let alone the news. Of course. So there’s way more than we think. I think it’s time that we say, you know what? This is now how we develop software. We rely on amazing solutions from cloud dependencies, and we need to start making this a first class element of the way we observe our systems and the way we manage incident response and the way we avoid outages in the future and the way we manage our vendors.

Mandi Walls: That sounds like an excellent call to action right there. Get folks enthusiastic on this.

Jeff Martens: It’s going to happen again. We can all agree right now as we talk about it and as we listen to this, but at the end of the day, we have to take action. And I do think that these incidents that our vendors have that can impact us, they end up being one of those things. It’s an urgent thing right when it happens. But after a few days or a couple weeks, our memory is not quite the same about how painful that was. And it was like, well, I’m still here today, two weeks later, and my business didn’t blow up and I didn’t lose all my customers, so maybe I don’t need to invest in this. But I think what we’re forgetting is that as these things add up over time, when we come around to those one and two and three year contract renewals, that’s when people say, wow, I really feel like my vendor didn’t do right by me these last 12 months, and that’s just too late for you to address it. So I really think this should be front and center, top of mind for us. This is a critical aspect of observability.

Mandi Walls: Yeah, I agree. Absolutely. Jeff, this has been great. This is exactly what I wanted to help folks deal with.

Jeff Martens: Awesome. Well, I always enjoy talking with you. As you know, this is something I care about deeply and this has been a lot of fun for me today. So thanks for having me.

Mandi Walls: Awesome, thank you. Where can folks find you if they want to learn more?

Jeff Martens: So you can find me on LinkedIn. My name’s Jeff Martens. I’ve got a personal website, jeffreymartens.com, not doing the Twitter thing quite as much, but every now and then you can find me there. And I’m going to get back to building soon and back to creating delightful user experiences. So keep an eye out for what I do next.

Mandi Walls: Awesome. We’ll be looking forward to it. For everybody out there, we’ll wish you an uneventful day. We’ll be back with another episode in a couple of weeks.

Mandi Walls: That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter at pageit2thelimit, using the number two. Thank you so much for joining us, and remember: uneventful days are beautiful days.

Guests

Jeff Martens

Hosts

Mandi Walls (she/her)