Chaos Engineering With Bruce Wong

Posted on Monday, Dec 9, 2019
In this episode, host George Miranda chats with Bruce Wong, Director of Engineering at Stitch Fix, about practical ways of getting started with Chaos Engineering.

Transcript

George Miranda: Welcome to Page It To The Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry, to improve both system reliability and the lives of the people supporting those systems. I’m your host, George Miranda, @gmiranda23 on Twitter. Today’s episode was recorded during the PagerDuty Summit in 2019. I got a chance to speak to Bruce Wong, director of engineering at Stitch Fix. Bruce and I were able to collaborate on running a chaos engineering workshop, where we spent an hour introducing a ballroom full of people to practical ways of getting started down the path of intentionally introducing failures into production systems. Bruce and I had just finished running that workshop when we sat down for this chat. Hey, Bruce, how’s it going?

Bruce Wong: Hey, George, pretty good.

George Miranda: So, why don’t you start us off by telling a little bit about yourself and give us an intro to who you are.

Bruce Wong: Yeah. So, my name is Bruce Wong, I’m director of engineering at Stitch Fix but I also like to say I’m an engineer and a developer at heart still.

George Miranda: So, you do a lot more than that, right? You are not just at Stitch Fix, you’re part of a number of boards, you’re part of our board, as well.

Bruce Wong: Yeah. So, I advise a lot of different companies. I like to center myself as an active member of our community, so attending various conferences like this one, as well as talking and networking about all various topics, including chaos engineering.

George Miranda: So, I think you’re maybe being a little modest. Tell us a little bit about your big claim to fame.

Bruce Wong: Yeah. So, my claim to fame is coining the term chaos engineering. I did it years ago when I was at Netflix and I had inherited the reliability effort across the company. And part of me was like, okay, looking at how do we do this? How can we go from three nines to four nines? How do we incentivize teams? How do we get this done without a lot of operations engineers? And so, one of the strategies I came up with was chaos engineering. We already had Chaos Monkey, we already were doing a lot of the practices but the thing that I came up with was, okay, let’s actually create a new team strategy and a vision around this and let’s double down on what we had already started. And so, in that fashion, wrote a blog post, introduced the term chaos engineering, introduced the term chaos engineer and then it was off to the races from there.

George Miranda: And so, for our audience, can you break that down a little bit? What exactly is chaos engineering? How do you wrap that up?

Bruce Wong: Yeah. So, I think there’s been a lot of different resources on chaos engineering. Google search it, there’s a lot of different definitions out there, some are academic in nature, some are very practical in nature. The way that I would summarize it is, this is being proactive and getting a chance to validate our resilience design. This is finding out about how well our systems are architected at 3:00 PM instead of 3:00 AM, I like to say.

George Miranda: And so, we intentionally introduce failures into production, in a controlled way, in order to make our systems a little more resilient. Is that fair?

Bruce Wong: That’s fair. Controlled and proactive is how I would say it.

George Miranda: Awesome. And on that note, what we just did, which thank you so much for the suggestion. We had an entire workshop at PagerDuty Summit, that was inspired by a practice that you gave to us, which I think is absolutely amazing. We ran some tabletop thought experiments as a way to introduce chaos engineering, maybe demystify it a little bit and hopefully give attendees a way to try this with their teams. So, can you tell us a little bit more about how those came about or what they are?

Bruce Wong: Yeah. So, after my run at Netflix, it was a great run, I learned a lot of things, I played a lot of different roles, I was there more and a half years. After that, I had this notion of chaos engineering, it was very successful and impactful for Netflix. And at the next company I was at, which was called Twilio, which is a cloud communications API company, there was an open question to me, it was, can I apply chaos engineering here? It’s a different set of products, different set of people, different set of challenges but that was the thing in the back of my head was like, is this practice, is this actually transferable beyond Netflix or not? And so, I got to try that out at Twilio. And so from that, that’s where I came up with, okay, the best way to get started is really getting started with the culture and the mindset. And that’s where getting started at Twilio and then later Stitch Fix, I utilize a tabletop with zero tech, I call it zero tech tabletops because I don’t want laptops, I don’t want distractions and excuses of why we can’t get started. And so, I run these tabletop exercises that are a lot of Q and A, there’s a whiteboard, there’s a drawing, we draw the architecture, we talk about what our detection strategy is, we talk about resilience, we talk about those trade-offs. And then we talk about, well, this part fails, what happens? What do we think should happen? And that’s the start of changing that mindset. And then from there, it’s very easy to build on the intentional failures and actually causing those failures in our systems because we already have that mindset of, what would go wrong and how do we anticipate that?

George Miranda: I love that. I think that’s one of the things that I really loved about the things that you said today, right? Which is, it’s a lot about figuring out how we get our teams to shift into the mindset of introducing these failures. And I got to tell you, the common critique that I hear whenever we say the term chaos engineering is, oh, well, there’s plenty of chaos in my stack already, haha, right? I don’t need to engineer more of it. But I think really what we’re talking about is, figuring out how you don’t really introduce chaos, right? You are purposefully breaking things in a controlled way so that you can learn from it and that’s a hard sell, right? That’s a really hard thing to get people on board with. But again, right? I think one of the critiques that I hear is, our systems aren’t ready for that, right? So, how do you address that? What do you do about that?

Bruce Wong: Yeah. So, first off, it depends on who I’m talking to. And then second off, I look at it as, well, to that point, our systems are already chaotic enough as it is. It’s like, well, if we’re not ready for this then are we really ready for production? And ready or not, failure will happen. And so, again, it’s not whether… it’s not an if, it’s not whether or not failure happens, it’s when it happens and it’s whether or not it’s on our terms or not our terms. And so, given the example of a server or instance or EC2 instance or whatever it is, a unit of deployment, unit of server, goes down, right? That happens all the time and people would argue, well, why do I need to introduce that? And it’s like, well, if I have the fix in place and I have the resilience to that, do I want to wait for it to randomly happen? Or would I rather get that feedback loop that it works now? Or that feedback loop that it doesn’t work now, right? So, the choice really is 3:00 PM or 3:00 AM and the choice is, controlled or uncontrolled and that’s really the choice. You cannot opt out of, well, I just don’t want things to fail because that’s just an inherent part of distributed systems and part of networks and part of our technology.

George Miranda: I love what you just said. So, I have a big background in emergency response and as emergency responders, one of the things that we realize is that, yes, failure is inevitable and you never know what those failures are going to be. But the way that you deal with it is by purposefully going through what you know the failure modes are going to be, right? And what are the most likely or most critical things that are going to create the most damage and practice those, right? Try to understand what’s going to happen in those modes and start there, right? And so, when I think about managing a production stack, how do you start to tackle that, right? What do you do with your teams to figure out, what are the big impact items? And then my followup to that would be, if it’s such a big impact item, how do you make management okay with the fact that you’re going to fail this thing you know causes a ton of collateral damage?

Bruce Wong: Yeah. So, I think there’s a couple of things. Starting small is the most important thing, right? Don’t simulate USCs [inaudible 00:09:22] going down, first, that’s probably a little bit vague. Start with an instance, perhaps, that’s a much smaller service and much easier exercise and there’s not much convincing that has to go there because the odds of an instance going down are extremely high, compared to that of an entire region going down, right? So, start small, start with a likely, right? And then build from there. So, okay, the instance goes down, now let’s try a zone or let’s try a whole service. And the other thing that I like to say is, outages are opportunities. And we all have experienced outages, we’ve all experienced a cloud provider outage that impacts something, right? Those are the best opportunities because it’s like, it allows us to take a look and be introspective about, well, what are the things in our control that we can do about this, right? And a lot of times, there was a reason we wanted to use a third party provider in the first place but there are still things that we can do around that, as part of that decision and owning that decision and what that means for us. And so, I think that’s the thing I would say is, if you wait long enough by not doing this, an outage will happen, eventually and that is the perfect time because that is top of mind because if an outage is large enough and it happens, an executive is very interested because there’s probably a cost to that outage. Here’s what the cost to our business is because of said provider outage. And then that’s top of mind, so you can talk about, well, here’s some of the things that we could do to mitigate that, to make the cost of that go down, to make the cost of it nullified completely, right? That’s the perfect time to talk about those things.

George Miranda: It’s true. I think those types of outages you know are going to happen, especially when you rely on third party services. I think that’s a fairly obvious place to start, it feels like something that’s beyond your control. I think, my concern or one of the things I often hear is, those services, we tend to rely on more and more increasingly, right? Very few services are entirely within our control. We’ve got a cloud provider running our infrastructure, we’re running some SaaS services on top, there’s so much abstraction that we rely on. How do you start even prioritizing that? How do you figure out which one is an obvious choice?

Bruce Wong: I think it depends on a lot of different situations but I usually start with the things that you have the most control over and that is our code. Software bugs are part of software development, right? We can try our best to avoid bugs but they still make their way into production, that’s why outages happen. And so, if that’s the place that we can start and we have the most control over, that’s a perfect place to start of, okay, we wrote the dependency into our code, right? Those dependencies don’t end up there magically, we wrote those dependencies there and when those dependencies go down because we wrote the code, now we actually can write code around what should happen or should not happen. So, there’s a choice, when some dependency goes down, whether internal or external, should we display a 500 to the user or not? Probably not. Okay, so then what do we do instead? How do we think through that experience? Furthermore, like I also say, start with instrumentation. Do you have the monitoring and telemetry to see failure when it happens? And if you don’t, well, let’s get started there. So, those are very easy to get started, those are very valuable to get started. The worst time to find out that I don’t have the right logging, is the actual outage.
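
To make the "what do we do instead of a 500" question concrete, here is a minimal Python sketch of a fallback path that is also instrumented. It is an illustration only, not code from Netflix, Twilio, or Stitch Fix: the service, the function names, the default list, and the metric name are all hypothetical, and the in-memory counter stands in for whatever metrics client a team actually uses.

```python
import logging
from collections import Counter

logger = logging.getLogger("storefront")
metrics = Counter()  # hypothetical stand-in for a real metrics client

# Hypothetical safe default shown when the dependency is unavailable.
DEFAULT_RECOMMENDATIONS = ["bestseller-1", "bestseller-2"]


class DependencyDown(Exception):
    """Raised when a (hypothetical) downstream service cannot be reached."""


def call_recommendation_service(user_id: str) -> list[str]:
    # Stand-in for a real network call; it always fails here to simulate an outage.
    raise DependencyDown("recommendation service unavailable")


def recommendations_for(user_id: str, fetch=call_recommendation_service) -> list[str]:
    """Return recommendations, degrading to a default instead of surfacing an error."""
    try:
        return fetch(user_id)
    except DependencyDown as exc:
        # Instrumentation first: the failure must be visible in logs and metrics.
        logger.warning("recommendations fallback for user=%s: %s", user_id, exc)
        metrics["recommendations.fallback"] += 1
        # Copy the default so callers cannot mutate the shared list.
        return list(DEFAULT_RECOMMENDATIONS)
```

With the simulated outage, `recommendations_for("u1")` returns the default list and increments the fallback counter, so the degraded path shows up in telemetry instead of reaching the user as an error.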

George Miranda: True.

Bruce Wong: The best time to find out that I don’t have a log that I need for that outage is 3:00 PM during the chaos engineering exercise. That, oh, it would be really useful if I had this in the logs or be really useful if I had this metric tagged by this dimension. I’d rather find that out not during an actual outage, I’d much rather find that out on a regular, normal day, knowing that I have full control.

George Miranda: One of the things that I love about the tabletop thought exercises that we were doing is that, there were instances where we could flesh out some of these things before we touched any code. One of the topics we started with was telemetry, right? If we’re going to introduce a failure into production, how do we know it’s actually happening? And as I stood there listening to one of the tables, they started talking about metrics and dashboards and one of the concerns that immediately came up was, well, do the right people have access to those dashboards, right? In my organization, you’ve got very limited views and access control over who sees what. And in a failure scenario where you might involve multiple service owners and multiple teams, just thinking about who is even able to access those systems, you’re right, you don’t have to wait until this outage happens in production. You don’t even have to wait until the chaos exercise happens for real, right? You can sit and think through those things with your teams.

Bruce Wong: Absolutely.

George Miranda: And as a follow up to that, I would say, so when is the right time to take a failure, right? We’ve talked about starting small and fail the instance [inaudible 00:14:45] region. When do you know that you’re ready to try it for real?

Bruce Wong: Yeah. So, I actually would say the question to ask is, when’s the time that you want to start writing more resilient software? That’s probably today. And that’s what I mean by, it’s about getting started. And so, one of the things that I instituted at Twilio was part of onboarding, we were hiring, just like everyone else, hiring is hard but you get someone, you get someone new to your company, into your team and of course they have to carry the pager too. That was the first part of onboarding was, chaos engineering because yeah, I want to find out that they don’t have access to the dashboards or they do have access but they don’t know what the metrics actually mean. And so, there’s all of that learning that you could sit down and have someone explain all these things or you could sit down and actually simulate a failure that you can see in your dashboards and then explain it in that kind of context. And I think those muscles that you build of, okay, this is this shape, this is a shape of what that looks like on my dashboard. This is the dashboard and this is the difference between this one and this one. I have access or I don’t have access, all things you absolutely want to learn, not at 3:00 AM in the morning. And so, I integrated that as part of onboarding any new employee to a team because that’s critical, critical information that you want to have before your first [inaudible 00:16:16].

George Miranda: Yeah. And I think one of the things that can easily get uncovered as well is, just because you have metrics, it doesn’t mean that people understand what those metrics are or what the baseline is, right? Or what is abnormal behavior.

Bruce Wong: Right. And it’s much easier when it’s a controlled exercise, in the middle of the afternoon, no customers actually impacted, that, yeah, I can go into my code and I can see exactly what is happening and exactly what that metric means. But I can’t tell you how many times that, the team itself learns together because I love seeing teams debate like, no, that metric means this. No, I don’t think so, that metric means something else. And then they open GitHub and look at the code and the code doesn’t lie, right? And everyone rises together and has a better shared understanding of the important operational aspects of our software. The interesting thing that I’d say is fascinating, when you do this often enough, my team’s behavior is predictable and people, humans, adapt and it only took, it literally only took, three weeks. Doing this once a week, for three weeks, where all of the developers on the team, including the most junior developers, whatever they were developing, they knew that I was going to break it every single week at the chaos engineering meeting.

George Miranda: Right.

Bruce Wong: And so, what I noticed that happened within three weeks, for everyone, from our most seasoned senior engineer to our most junior engineer out of college, all of a sudden our code started having… It was always instrumented beforehand, there was never a cache there. We always had adequate logging, we always had resilient code, out of the gate, the first time out, it wasn’t an afterthought because we had to change that mindset. And the funny thing is, they were just doing it at first to prepare for the game day.

George Miranda: Right.

Bruce Wong: And then it clicked with them, they’re like, well, our system is rock solid, right? Because it’s, yeah, because we’re actually doing this. And so, that’s the most fascinating thing is, when you see that mind shift change on a team, that the code is just that much better and the PRs are that much higher quality because we never are missing instrumentation. We always have resilience, we always have fall-backs for everything that we introduce, that’s when you know that you’ve really changed how a team is thinking about this. Now, the funnest part of doing all this in my career is, those moments when the real outage happens. When you’re the team and the big outage happens and you’re the team that’s actually celebrating, you’re celebrating because, hey, this thing failed exactly as we planned and exactly as designed and there’s nothing that happens, there’s nothing for us to do, we’re just sitting back, watching the show.

George Miranda: Yeah. Being able to celebrate those failures, I think, is absolutely important. Not to branch in a different direction but one of the takeaways from the post-mortem workshop that we did earlier in the week was a really good conversation that I had with folks around doing learnings around wins, a situation where a failure mode happened exactly as you expected and the right thing happened, there was no incident but taking time to review that and learn from it is still valuable. So, let me ask you, the things that you learned in these incidents, in these chaos engineering exercises, where you’re intentionally introducing these failures. How do you make sure that some of the work that falls out of that, right? I mean, your team start instrumenting their code better. Right? You start building some of these mechanisms in but inevitably you always find some other thing you didn’t think about, right? And that’s extra work that was unplanned. So, how do you deal with making sure that that actually enters the work stream and something happens?

Bruce Wong: I mean, I think, it depends on the different teams but for most of my teams, we practice a lot of the standard agile ceremonies. So, we do sprint planning and the key that I figured… The first time I did this, I did sprint planning and then the chaos engineering exercise. And then I realized, nope, that’s the wrong order. Because after the chaos engineering exercise, there was all these things that we wanted to do but we had just done sprint planning. So, then the next week we flipped them and we always, always, always, always did the chaos engineering exercise and then we did sprint planning. And so, it just naturally flowed from, well, what did we learn and what do we want to do in light of the experiments and exercises we did? And then it was, well, let’s start prioritizing those first because we’re now in sprint planning, the very next meeting is sprint planning, right? And so, that broke us into a cadence in this, it literally was integrated into our weekly sprint cadence. And so, it was top of mind and you just knew that was the order we did it in. We always did chaos engineering first and then we did sprints right after that.

George Miranda: That’s an awesome, that’s a really great virtuous cycle, right? Identify those problems and just work them right in. So, I think that’s a really good overview and tour of chaos engineering principles. Do you have any parting advice for listeners that are thinking about how they introduce this in their organization and how they can get started?

Bruce Wong: Yeah. I think my parting advice is to summarize, kind of the things I said, is, remember to start small. Change starts with one person, then one team. You don’t need fancy tooling, you need three lines of code, right? If my user, fail this call. Start really, really, really small, start really, really simple and then expand from there. Don’t try to do a company-wide change overnight. Get the small wins, get that validation, build those muscles, build that language on how to talk about these things and start really, really, really, really small. And then talk about it, talk about those successes, and those wins will go from, oh, we lowered our error rate and that’s great and we did this through this mechanism. And then you’ll get some of those big ones because failure is inevitable and outages happen and you’ll be that team. You’ll be that team that the big outage happened and your systems were fine, everything failed as planned, as designed, with perfect visibility and those are the great moments that you get to talk about. And when you get to have one of those stories, you’d be surprised how fast your boss or your executive will catch that and then start pointing other teams to be like you. And then you have change.
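
As a rough illustration of the "three lines of code" idea, a minimal Python sketch of user-scoped failure injection might look like the following. Everything named here is hypothetical (the test user ID, the exception, and the placeholder function that stands in for a real dependency call); the point is only that the injected failure is limited to one designated user, which keeps the blast radius small.

```python
class InjectedFailure(Exception):
    """Raised when a call is failed on purpose for a chaos experiment."""


# Hypothetical allow-list: only these test users ever see injected failures.
CHAOS_USERS = {"chaos-test-user"}


def fetch_recommendations(user_id: str) -> list[str]:
    # The "three lines": if it is my user, fail this call.
    if user_id in CHAOS_USERS:
        raise InjectedFailure(f"injected failure for {user_id}")
    # Placeholder for the real dependency call.
    return ["item-1", "item-2"]
```

Every other user takes the normal path, so the team can watch how callers, dashboards, and alerts react to the one failing user before expanding the experiment.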

George Miranda: I love it. I think that’s a great way to prepare for the disasters we know are inevitable and a great way to approach that and make it something that we can easily start practicing. So, thanks Bruce. Thanks for being here, thanks for being on the podcast, thank you so much for inspiring the workshop. And for the folks listening at home, thank you so much for listening to another episode of Page It To The Limit, this is George Miranda wishing you an uneventful day. That does it for another installment of Page It To The Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you liked what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter at page it to the limit using the number two. That’s @pageit2thelimit, let us know what you think of the show. Thank you so much for joining us and remember, uneventful days are beautiful days.

Show Notes

Creation of the term “Chaos Engineering”

Bruce tells us about how the term “chaos engineering” came to be and the mindset behind using the term.

“Let’s create a team strategy and vision around [Chaos Monkey and the practices around it] and let’s double down on what we already started. So in that fashion, we wrote a blog post that introduced the term ‘Chaos Engineering’ and introduced the term ‘Chaos Engineer.’”

What does Chaos Engineering really mean?

Bruce breaks down the pragmatic reasons this practice exists and why we should think about adopting it.

“It’s being proactive and getting a chance to validate our resilience design: finding out how well our systems are architected at 3pm instead of 3am.”

Chaos Engineering Thought Exercises

We discuss how tabletop thought exercises serve as a valuable tool to help you flesh out considerations long before touching any production systems.

“We call it ‘zero tech’ tabletops. I don’t want laptops. I don’t want distractions and excuses for why we can’t get started. And so I run these tabletop exercises, with a whiteboard, with a drawing of the architecture and we talk about our detection strategy, resilience, trade offs, and the parts that fail.”

But I’m not ready for Chaos Engineering!

A common response to the suggestion that a team adopts Chaos Engineering is that they’re simply not ready to get started. We discuss some ways to address these concerns.

“If we’re not ready for this, then are we really ready for production? Ready or not, failure is going to happen.”

Identifying big impact components to test

How do you prioritize which components of your stack to test? What are the considerations for figuring out where to start? Bruce gives some practical advice for where in your stack to start and finding opportune moments to seize upon.

“Cloud provider outages… are the best opportunities. They allow us to identify and be introspective about the things in our control that we can do about this.”

When should you start?

No, really. Big outage aside, when should we get started? Here’s where we see George’s managerial background kick in. Can we start today? Bruce provides some great practical wisdom around getting started as early as when new team members are being onboarded.

“When’s the time you want to start writing more resilient software?”

When the real outage happens

It’s important to celebrate wins. The Chaos Engineering wins are when you’re the team relaxing as a failure happens.

“You’re celebrating because this thing failed exactly as we planned! It happens and there’s nothing for us to do. We’re just sitting back and watching the show.”

Capturing what we learn from Chaos Engineering

Building more resilient systems means taking the things we learn from Chaos Engineering exercises and ensuring that resulting action items make it into our work streams. How can teams do that successfully?

“The first time I did this, we did sprint planning and then the chaos engineering exercise. Nope. That’s the wrong order!”

Parting Advice

Bruce wraps up with practical tips for moving your teams in the right direction.

“You don’t need fancy tooling. You need 3 lines of code: if my user, fail this call.”

Additional Resources

Guests

Bruce Wong

Bruce Wong is director of engineering at Stitch Fix. He formerly resided at Netflix and Twilio, where he founded the Chaos Engineering effort to stress and proactively introduce failure into critical production systems to validate resilience. He is passionate about tackling challenging problems, scaling engineering teams, and building compelling products. In his spare time he can be found applying engineering principles to iterate on BBQ and chocolate chip cookies.

Hosts