George Miranda: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve both system reliability and the lives of the people supporting those systems. I’m your host, George Miranda, @GMiranda23 on Twitter. Today, we’re going to talk about system complexity and the impact that system design has on downtime. Simpler systems have less downtime. That sounds pretty intuitive when I say it, but for some reason, we don’t really seem to optimize for that in the software industry. In other industries, like shipping and logistics for example, there’s a laser focus on simplicity and the results that they’re able to get with that simplicity are pretty amazing. So for example, Maersk, the world’s largest container ship and supply vessel operator, has really ginormous container ships, and their largest container ship only needs a 13-person crew to get that huge ship halfway across the globe reliably. Those ships are enormous, and their internal systems are massive. Their ships make prevalent use of automation, but the underlying systems beneath that automation are a few things. They’re simple, they’re redundant, and the system internals are easy to understand whenever anything goes wrong. The shipping industry, as well as many other industries, has learned that simpler systems have less downtime. And yet in the software industry, it’s a generally accepted truth that complexity only tends to increase over time. Why are we okay with that notion? Why have we not applied some of those same lessons to the software that we create? So on this episode, we’re going to discuss how complexity filters into the production software systems that we build and what we can do to maybe apply some of those same lessons and deal with increasing complexity a little bit better. Joining me today to discuss this is Austin Parker from LightStep.
Austin, welcome. Glad to have you on the show.
Austin Parker: I’m glad to be here. Thank you so much for having me on. This is something that’s super, super interesting to me. I wrote a blog post talking about some of my thoughts on the system complexity thing. And I think it’s a fascinating topic, philosophically, because I think you actually hit on it in the intro: we just take it as normal that our systems are going to increase in complexity as we build them. We don’t challenge that assertion a lot. We don’t step back and say, “Why does it have to be this way?” And so I think it’s good to have these conversations and try to get at that deeper insight into, well, why do we build things the way we build them?
George Miranda: Yeah, absolutely. And that blog post is part of the reason that we wanted to have you on the show. So before we get started, do you want to tell us a little bit about your background so that our listeners know who you are?
Austin Parker: Yeah, absolutely. So I’m a principal developer advocate at LightStep. I came into the observability space from a more traditional DevOpsy background. I used to work for a company called Apprenda, who’s no longer with us, but they built this really cool [inaudible 00:03:32] at the time, a platform as a service focusing on .NET. So we had actually invented a way to containerize Windows services before Docker was a household name. Well, some people’s households, I guess. So while I was there, one of my big responsibilities originally was to implement a lot of DevOps and SRE principles, I guess you would say, around how we were deploying software, how we were making it maintainable and observable, before those terms maybe really got big. And I wanted to stay in that same role, but I really wanted to take a lot of the lessons I had learned about building these distributed systems and build tools for it, which led me to LightStep, and here I am now. So I’m also involved a lot in the OpenTelemetry project, which is a big open-source project. Dozens of companies are involved in building this: LightStep, Microsoft, Google, Honeycomb, to name a few. And the idea with OpenTelemetry is that if you’re building modern software, you need to be able to understand what’s going on in that system. And to do that, you need telemetry data. OpenTelemetry is an API, an SDK, and a set of tools to help you instrument your software for observability, and eventually for that to become an out-of-the-box feature.
George Miranda: Awesome. Yeah, it sounds like you have been steeped in really complex systems and some good solutions around that. Let’s try to get started with maybe zooming back a little then, and talking about the importance of designing simpler systems, especially since we both have backgrounds in dealing with a lot of complexity. So how do you start thinking about that? How would you frame the problem?
Austin Parker: This is actually really interesting because I was talking to someone just a couple of weeks ago, and this was a platform team at a company. And they had run into this problem where… And I feel like this is a very common problem: we have this large portfolio of services and applications, and there’s a mandate that these all need to share some common layer. For the purposes of this, let’s say authentication. You want single sign-on. Cool. So how do we solve that problem? Because individually, all these are simple applications, but now that they’ve all been combined under the umbrella of this much larger company that wants to use single identity, now we’re starting to add complexity, because they’re written by different teams with different idioms, different languages, different deployment methodologies. And the role of this platform team is to impose some simplicity onto this complex design that has been foisted on them. And so I’m sitting there and we’re whiteboarding, we’re brainstorming and coming up with all sorts of very complicated solutions to, like, well, how would you do this? How would you do that? And at some point someone kind of brought this up, he’s like, well, let’s step back a second, let’s look at what is something that all these applications are spinning around, what is the one thing they have in common, and let’s start there. And once we had reframed that discussion into, instead of looking at the differences, looking at the similarities, we were able to make a lot of progress very quickly and say, like, oh, okay, well, you should do this, this, this, this, this. And I think that conversation shows a lot of the problems that are inherent in system design. Because even in, say, a greenfield situation, you’re going to have more than one person.
And different people have different ideas, they’re going to have different things they prioritize, different things they have expertise on, and we tend to, I believe, go to not the most complicated solution, but a solution that feels sufficiently complex, because you never want to feel like it was just the easy thing. You feel like, oh, that easy thing, I must be missing something. This is too simple to work. But oftentimes it’s not too simple to work. Oftentimes that simple thing is actually the easiest thing to maintain.
George Miranda: Well, that leads me to the intro question that we usually ask most folks on this show. What is a myth about simplicity that you would like to debunk?
Austin Parker: I think the biggest myth about simplicity is that simple and easy are the same.
George Miranda: Interesting. Say more about that. What do you mean?
Austin Parker: If I say simple, a lot of people will hear easy. And if I say easy, some people will think simple, but it’s not actually the case. There’s a lot of things in life… I mean, think about it this way: we’re breathing right now. Breathing sounds very simple, but it’s not very easy if you actually think about what is happening in your body to make you breathe. It’s an incredibly complex series of systems, physical, biological systems that… I’m not a doctor or a biologist, so I don’t really understand them, but it’s something that is very simple, so simple that we do it from birth. But I think if you talked to someone that was an expert in biology and asked, “Is breathing easy?” they would probably say, “Well, no, all these things have to go in a certain order and all these things have to happen, and if something was off, it would not work. You would not be able to breathe.” And so you can see that also in software design. So sometimes the simplest thing to do is not the easiest thing to do. And sometimes things that appear simple are actually very difficult to implement, both behind the scenes and also organizationally. So I think it’s important to keep that distinction in mind, just because simplicity can be hard to actually implement and hard to force through… I shouldn’t say force through, but it can be hard to convince people about the simple thing, because of both technical reasons, like actual technological reasons, but also because of organizational and human reasons.
George Miranda: Yeah. That really makes me think about the things that I consider simple. And I think you’re right. I think they’re definitely not the same thing, but when it comes to simplicity, I think that the concept of something being simple means that we can look at it and we can really easily reason about what’s happening. And so there might be, like in your example, when we breathe, there’re so many moving parts, but we take it for granted. We know it’s a breath in a breath out, we know what happens if we don’t get it, and we can wrap our mind around what it means without a whole lot of effort. And I think that’s the difference, it’s really what we can reason about and that I think tends to make all the difference, especially when it comes to maintaining uptime or troubleshooting.
Austin Parker: Yeah. But I think that goes to… I wrote about that exact point in my blog: those things that we can simply reason about tend to hide very complex abstractions, and it’s easy to design for when things go right, but it’s hard to design for when things go wrong. Like you talked about in the intro, these container ships that are running off of tons of automation and are running very lean, a lot had to be sacrificed to make them that simple. Or, I don’t know if simple is the word, maybe it’s like uncomplex. Maybe there is some way you can reduce complexity to a certain point where, if you take out everything else that doesn’t need to happen, you say, this is the bare minimum, these are the variables we need to control. And we need to have a way to understand the state of those variables. And we need to have that be presented and managed by humans in a way that doesn’t require a lot of excess humans. Then you’ve reached some, like, perfect crystalline state and you can say, aha, this is a simple system. Because we have quantified all the abstractions in a way that they are all manageable.
George Miranda: All right. Yeah. I’m with you on that. Let’s shift the focus a little bit back to software. I think because I’m an infrastructure engineer, I tend to think of a lot of the abstractions that we’ve seen on the infrastructure side these days. And I’ve got to say, systems just seem to keep getting increasingly complex. Some of it is for the reasons that you described. Sometimes we want to build something super resilient, and something easy doesn’t feel like the right approach, and we tend to confuse those things. But regardless of that, it seems like overall as an industry, with some of the platforms that we build, with some of the new tools that become available, nothing is really getting any simpler. You’re right, I think some of the abstractions may be easier to reason about, but the things that are happening beneath the covers are so complex, and it seems to just be getting even more complex. So what do we do about that? How do we fight against that inevitable wave of complexity? Or do we even need to?
Austin Parker: That’s a great question. Some of it is unavoidable, right? I think it’s interesting because you see on Twitter these days, if you follow a lot of DevOps thought leaders, let’s say, that the microservices versus monolith debate has sprung up again. Certainly between when I wrote that blog and now, we’ve seen a lot of discourse around, like, oh, you probably don’t actually need microservices. And what’s interesting is most of those arguments seem to really fall back to a technical perspective. They seem to be arguing like, look, there’s all these tools and technologies that you can use, and yes, they’re cool, but if they’re not actually solving a problem, then why are you using them? But I think the underlying theory is that microservices were never really about, like, this is technically a better way to design software. This was never a technical way to manage complexity. It was always about managing organizational complexity. It was about managing how people build software as humans versus how you write code as a programmer. So one way you manage the complexity of the technical side of this argument is that you have to have a deeper understanding of your software and what it’s doing. You have to be able to introspect the behavior of your application at every single point that it’s used. A friend of mine has a metaphor where he talks about how you build applications like you build a house. There are a lot of rooms inside that house, and people outside of it maybe see one or two things, but inside of it, a lot of stuff happens. And as you discover, like, oh, it breaks in this case, or it’s slow in this other case, you start to add windows into that house. But they’re extremely small and they’re very narrow. I have this one little window that’s pinpointed into the kitchen and I can see what’s going on in the kitchen, at least in this one corner of the kitchen, but I can’t see the rest of the kitchen.
And maybe I decide, oh, the house sometimes is on fire on the inside, so I put a little probe for the temperature in a room, because that’s the last place that caught on fire. And I’ve got a window into that room because it’s the last place that caught on fire, and I want to make sure it doesn’t happen again. But that only tells me what’s going on in that one room, in this one case. I have a temperature probe, I have a window, I can tell you how hot the kitchen is, and I can tell you if there’s anything visually wrong in there, but I know nothing about the rest of the house, except what I see on the outside. And as you keep doing this, I think a production application, something people have been using for a long time, eventually, from the outside, has like thousands of windows and tens of thousands of probes. And people are standing on scaffolds and ladders, craning their heads, looking inside this thing, and no one is ever able to actually step back and say, well, this is actually what the house looks like anymore. Because we’ve covered it with so many ways to measure it and understand it that we’ve actually lost sight of what the house even is. I think that’s one of the differences, and if I took that and extended this metaphor even further, those windows are like logs: I see what’s happening. Those probes are metrics of some sort: I’m measuring something. And if you look at this concept called observability, that I brought up before, which is this idea of a holistic view of all this telemetry data, you need to have not just metrics, not just logs, you need to have contextual data, you need to have the ability to trace requests as they flow through the entire house, the application architecture. And that gives you the context so that you’ll know what window to look through, what probes to measure. That idea of creating telemetry, creating an observable structure, that is the way you manage complexity.
There’s a lot you can do in a design, like in how you’re designing the software, but all software will eventually reach this point. You could write Kubernetes yourself if you wanted to, and it would be your perfect, beautiful vision. But as soon as other people touch it and add their own stuff, then you need observability. You need to be able to have this shared language and logic of performance, and a way to understand performance.
George Miranda: I want to say the analogy that you use about windows and probes and losing the vision of the house that you originally built just because there are so many different ways to look at it, it’s really interesting because you present that as part of the technical argument of what’s happening. And I think that a lot of the problem with complexity is that we’re really solving for a people problem that just happens to be implemented with technology. Because at the end of the day, these systems end up being monitored and managed by people. So how do we make people better understand all the things that are happening? And I think you’re right, we’re at a really interesting inflection point where we’ve built systems that require all these different views, require all these probes, all these windows and what’s happening, and there’s just an overwhelming amount of them. And so we are also now at the point where we’re able to crunch enough data to look at all of those mirrors and help us as people better find the signal in the noise. And just help us figure out what is the thing that we really need to look at.
Austin Parker: Yeah. And this is obviously monitoring in some way. You are monitoring some value, you’re monitoring some state for a change, or for it to cross some threshold. I wrote another thing where I argue against monitoring as a term, because like you said, you’re right, people are at the heart of this. People are the ones that built all those windows and probes, people are the ones that set those alerts, people are the ones that respond to them. So whenever you set up a monitor, or whenever you decide, here’s something I’m going to get paged on, that’s a value judgment about what’s important in your application. And it’s a way that you, as a human, are saying, this is what’s normal. But you constantly have to be re-evaluating that. And I think this is one place that I see people get stuck a lot: we like to think of these as day two problems. Like, how often do people sit down when it’s like, all right, we’re going to go build the next great application, or we’re going to go build this new feature? And I feel like a lot of times we sit down and we say, okay, we’re going to use Kubernetes, or we’re going to use this cool new technology, or a library, or product, we’re going to do some vision, voice recognition, I don’t know. But also at that point, you need to be saying, well, how are we going to understand it? How are we going to adapt to the change that happens? Because maybe we do come out and say, hey, here’s our cool new app, we’re going to deploy a new feature into production. And we don’t think a ton of people will use it, but what happens if some people do use it? All of those assumptions get made maybe at the very end, when I was doing all the stuff I didn’t want to do, like documentation. We make those decisions about normalcy at the tail end, using what we’ve known up to that point, but it’s not really a formal thing.
It’s not something people are sitting down looking forward to and saying like, oh, I’m going to go write a bunch of logging statements today. Or I’m going to come up with the best solution for understanding application performance for my new thing. Usually we come to it later. And I think that’s the thing that needs to change.
George Miranda: I think there is some value there. I mean, nobody wants to sit down and think about writing a bunch of log statements, but there is a lot of value there, because one of the things that’s difficult, especially when you have so many moving pieces to monitor and look at, is to try to understand what the signal that this thing is giving you even means. And I think, to use your analogy, we had a fire in a room one time and so we put a temperature probe in that room, and now we’ll get an alert that tells us, oh, it’s 50 degrees Celsius in that room. Is that good? Is that bad? Why do I even care? And so I think when it comes to simplicity, one of the things that’s really important to keep in mind is to just think about where you’re inserting telemetry and how somebody reasons about why it’s there and what the signal even means.
Austin Parker: Yeah. I think that’s a good… It goes back to your original container ship thing. Like how are they able to make something so complex as moving… Hey, how many tons are those boats? They’re certainly heavy. It’s design, right? It’s design and it’s thinking about these systems in a holistic way. It’s thinking, what are the critical things that people need to know? Because the most important part of the system isn’t necessarily the stuff you plan for, it’s stuff you don’t plan for. It’s the if something happened that we didn’t expect, how are we able to restore service or get back to normalcy or to a steady state? And you can see this in the design philosophy I think of a lot of, I don’t know if durable equipment is the word I want to use, but [inaudible 00:22:02] the space program, right?
George Miranda: Right. Super resilient systems.
Austin Parker: Yeah. Extremely resilient systems tend to have this in common: the information that you need is there and it’s presented in a uniform way. A good example, there’s a story I love telling about this. I was reading a book on human error, I think it’s called Understanding Human Error, and I think the author’s name is Sidney Dekker. [crosstalk 00:22:27]-
George Miranda: Yeah, Sidney Dekker. It is.
Austin Parker: Okay. So I love one anecdote that he has, talking about redesign. He was looking at a redesign of an airplane cockpit, and the designers had changed the altimeter, or the airspeed indicator, I believe, from a dial, where you have zero to whatever the max is and a little needle that moves, to a tape. So there’s a straight line and the current speed is always in the center. And the thesis was, well, it’s always in the center, it’ll be easier for people to read. The designers thought, this way when someone looks at it, they won’t have to visually ascertain where the needle is, they’ll just look at the middle and know, that’s my current speed. What they found was people actually had a harder time reading the airspeed, because what pilots had done was… What their brains were doing was looking at the relative position of the needle on the dial, and then they didn’t actually read the number. They just saw, like, oh, here’s the position of the needle, and that lets me know what the speed is. When you’re in a high-stress situation, let’s say flying a plane and something goes wrong, anything these people can do to shave off the time required to understand the state of their system is critical. And they found that this change that, intuitively… Like, intuitively I would say, oh, yeah, I know you’re reading the tape in the middle. Cool. Sure. That actually took cycles away that could be better spent on doing something else. And I think the idea of [inaudible 00:24:01] resiliency in systems, and resiliency in the way you’re designing your monitoring and observability, has to be: find the things that are important. You can make some pretty educated guesses beforehand on the stuff that’s important, but just make sure that everything that’s important is available. Don’t try to get too precious, I think, about creating very complicated ways to interpret that data for you.
If the information is all there and presented in a uniform way, then that lets people easily understand it and make decisions and try to quickly restore whatever normal is at the moment.
George Miranda: You know, that’s something we talk about a lot in incident response, which is, it’s better to be clear than it is to be concise. And it’s almost somewhat counterintuitive, because you think, if I take an acronym like the Incident Command System, for example, and I say ICS instead of Incident Command System, which is about three times as long to say, it seems like it’s going to shave off some precious time, because people will just understand ICS quickly and understand what that means. But if you have somebody that’s unfamiliar with the lingo or doesn’t quite understand the context of these things, in some way you’ve made it a lot harder for someone to understand what’s happening, in an effort to try to shave off just a little bit of time. And it’s the same thing that’s happening in that story. We engineered for what we thought was a more concise solution, but it ends up introducing more obscurity than we intended.
Austin Parker: Yeah. And I think this echoes back to what I said earlier: simplicity is hard. Those are the sorts of things, the decisions, that you have to sit there and make and reason through to get a simple system. Because I think there is… Well, simple and easy aren’t connected, but I think simple and good are. There’s a meme going around where the junior programmer is like, oh, no, you need to do this very complicated algorithm, and then there’s the more senior programmer that’s like, aha, a nested for loop. Or a nested if statement. [inaudible 00:25:59] dating this episode already. But it’s true. I think one of the things you learn, the more time you spend programming computers, is knowing when to be clever and complex, and when things matter and when they don’t. Because at the end of the day, I can look at nested if statements or a nested for loop and say, like, oh, okay, I get what this does. Whereas if it’s some complicated algorithm, I’m going to have to go crack open Wikipedia or whatever to actually understand, what the heck is going on here?
George Miranda: So on that note, are there practical tips that you have, maybe, for our listeners, on ways that you can start making choices that really simplify some of the systems that you work with?
Austin Parker: Yeah. I’ve got a few. I think the first thing is a cultural thing. I think part of it is you have to incentivize understanding. If you like thinking about things as North Stars, then the ability to understand your systems needs to be kind of a North Star for both you and the people that you’re working with. Because if you’re building systems, be they complex or simple or anywhere in between, you have to be able to quantify that simplicity, you have to be able to understand what’s actually happening. And I suggest that people look into modern observability frameworks and tools, like OpenTelemetry, OpenTracing, OpenCensus, to make it easy to get broad-based, sane, simple traces and metrics out of your application code, out of your service code. Because if you want to do anything involving knowing what’s going on with your system, you have to have that. I think a good example of this is chaos engineering. Everyone loves chaos engineering, and they love seeing what happens when they turn off things at random. But you need to be able to actually know what happened when you turned something off; you need to have that telemetry from your system. So incentivize the adoption of… I don’t want to say monitoring, but I do want to say understanding. Promote tools that are open source, open standards, OpenTelemetry. And the last practical tip is, don’t be afraid of pushing for simplicity when you can. Because I think there’s a bias towards, hey, we want to use the newest, shiniest thing, we want to throw in Istio or a service mesh or whatever, wherever we can, but sometimes maybe you’ll be just fine with Nginx as a reverse proxy. Maybe you don’t need a fancy algorithm, maybe you just need a double for loop. I think it can take a lot of courage to push for, let’s do the simpler thing, make it easy to understand, and then be ready to scale, rather than jumping into the deep end.
Obviously it depends on a lot of factors.
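[Editor's note] Austin's tip about getting traces out of your services is easier to picture with a sketch. The following is a toy, standard-library-only illustration of what a trace is (spans that share a trace ID and record parent/child context); it is not the real OpenTelemetry API, and all names in it are invented for illustration:

```python
import contextvars
import time
import uuid

# Toy illustration only: real instrumentation libraries (e.g. OpenTelemetry)
# provide this machinery, plus propagation across process boundaries.
_current_span = contextvars.ContextVar("current_span", default=None)

class Span:
    """A named unit of work that remembers its parent and its trace."""
    def __init__(self, name):
        self.name = name
        self.parent = _current_span.get()
        # Child spans inherit the trace ID; a root span starts a new trace.
        self.trace_id = self.parent.trace_id if self.parent else uuid.uuid4().hex
        self.start = self.end = None

    def __enter__(self):
        self.start = time.monotonic()
        self._token = _current_span.set(self)
        return self

    def __exit__(self, *exc):
        self.end = time.monotonic()
        _current_span.reset(self._token)
        return False

def handle_request():
    # Imagine these inner spans wrapping real calls to other services.
    with Span("handle_request") as root:
        with Span("check_auth") as auth:
            pass
        with Span("render_response") as render:
            pass
    return root, auth, render

root, auth, render = handle_request()
# All three spans share one trace, and parentage is recorded: that is the
# context that isolated logs (windows) and metrics (probes) lack.
assert auth.trace_id == root.trace_id == render.trace_id
assert auth.parent is root and render.parent is root
```

The point of the sketch is the context propagation, which is exactly what Austin's house metaphor says individual windows and probes are missing.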
George Miranda: Austin, I think you’re trolling me because I have a lot of thoughts about the service mesh and how we deal with that [crosstalk 00:29:00]-
Austin Parker: I might be. You never know. Can we talk about that next time?
George Miranda: Exactly. So on that note, it’s been a really good conversation. Are there any things that you think our listeners should check out? Maybe an easy way to get started with some telemetry?
Austin Parker: Yeah, absolutely. If you want to get started with OpenTelemetry, the best place to go is the website, opentelemetry.io. We launched that into beta, so there’s a lot of good resources there. A lot of good documentation that explains what all this is, and also shows you how to very easily get started integrating it into your software. If you’d like to… [inaudible 00:29:35] the plug, they do pay me, so I had to shout them out, but my employer LightStep, we are [inaudible 00:29:42] platform built for these modern distributed systems, these complex systems, these deep systems as we call them. And you can actually try us out for free. If you go to lightstep.com/sandbox, there’s an interactive sandbox that you can go in and play around with, and actually see what our ideas look like in practice, about how to solve problems with your systems using LightStep.
George Miranda: Sweet. Well, we’ll put a link to that in the show notes. So one last thing before we go, there are a couple of questions that we ask every guest on this show. I’m going to start with the first one, which is, what’s one thing that you wish you would’ve known sooner when it comes to running software in production?
Austin Parker: How to write good logs.
George Miranda: What’s the value of writing good logs?
Austin Parker: So I’ll extend this. It’s the value of being able to write good instrumentation code and being able to write… This is a thing they don’t teach you. This is the thing that… And I think this is one of the… There’s a bigger picture. Part of the problem is how we mentor junior developers, but there’s very rarely… I feel that you don’t start out by thinking, hey, how am I going to actually understand what’s going on? What is a logging statement? Or what does a trace look like that’s useful to other people? Or what is a metric that is useful to other people? We talk about the how do you do this? But less the why? And I think that there is a difference between, here’s a logging statement that says something, but it could be better. Here’s a trace that tells me something happened, but it doesn’t give me enough information to really figure out the underlying state. So yeah, I think if I had known earlier about how to write good telemetry then that would have helped.
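[Editor's note] Austin's point about good logs can be sketched concretely. The field names below ("order_id", "trace_id", and so on) are hypothetical, not from any particular logging standard; the sketch just contrasts a log line that says something happened with one that carries enough context to reason about system state:

```python
import json

# Hypothetical example: the event shape and field names are illustrative.
def bad_log():
    # Tells you something happened, but not enough to reason about state:
    # which payment? which user? why did it fail?
    return "payment failed"

def good_log(order_id, user_id, error, trace_id):
    # Structured and machine-parseable, and it carries the identifiers
    # needed to correlate this event with traces and other log lines.
    return json.dumps({
        "event": "payment_failed",
        "order_id": order_id,
        "user_id": user_id,
        "error": error,
        "trace_id": trace_id,
    })

line = good_log("ord-123", "u-456", "card_declined", "4bf92f3577b34da6")
record = json.loads(line)
assert record["event"] == "payment_failed"
assert record["trace_id"] == "4bf92f3577b34da6"
```

Including a trace ID in each log line is one common way to tie logs back into the kind of holistic telemetry Austin describes.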
George Miranda: It goes back to just making sure that you can reason about what these alerts that you want to have pop up actually mean, and pointing them in the right direction.
Austin Parker: Absolutely. I’ve learned a lot about that since then, so…
George Miranda: Well lastly, is there anything about running software in production that you’re glad that we did not ask you about?
Austin Parker: Glad you didn’t ask me about running production software in Windows because I have locked all those feelings deep, deep away, but let me tell you-
George Miranda: That’s hilarious. I had every intention of asking you that when we first started because I did not know about your Windows background, but then we got into a really good conversation. And then I did not think about it, well, until now.
Austin Parker: Yes! The problem actually was not Windows. The problem was, when I left Apprenda, or when I was leaving Apprenda, I swore to myself, next place I work, it’ll all be one or the other. It’s going to be all Windows or it’s going to be all Linux, no mixing of the two. Because the problem we had, one of my biggest challenges, one of the things that lasted until the last day I was there, was we had… Our platform could have both Windows hosts and Linux hosts for applications. However, the control plane ran on Windows hosts, with Linux hosts as workers. But because of that, it had to be installed from the Windows host, which means you needed to coordinate deployments between Windows and Linux. And this was long before Microsoft had decided to love Linux, and I was having to do so many just awful, hacky things with third-party SSH and FTP and every protocol you can imagine to automate all this stuff. And there was no good answer. So yeah, I will say I think it’s gotten better since then.
George Miranda: Austin, it sounds like we could do a whole complexity episode just on that right there.
Austin Parker: We can do a whole complexity episode on Windows and Linux, but thankfully everyone’s just… Even Microsoft has decided to just use Linux now, so problem’s solved.
George Miranda: And on that note, Austin, I want to thank you for being on the show. It’s a really great insight into how we reason about complex systems. So thanks again for agreeing to be on. We’ll put a link to your blog in the show notes.
Austin Parker: Thank you. It’s great to be here and hopefully, yeah, I’ll get back out here and talk at you all again soon.
George Miranda: All right. Well, thank you very much. And signing off this is George Miranda wishing you an uneventful day. That does it for another installment of Page It to the Limit. We’d like to thank our sponsor PagerDuty for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter at page it to the limit using the number two. That’s @pageit2thelimit, let us know what you think of the show. Thank you so much for joining us and remember, uneventful days are beautiful days.
Austin uses the example of a platform team managing a large portfolio of applications that all need to share a common layer, like SSO. Sounds simple, but complexity only gets shifted around depending on your approach.
“The role of this platform team is to impose some simplicity onto this complex design that has been foisted upon them. We started coming up with some very complicated solutions to how to do different tasks for different systems. Until we stepped back and instead focused on what these systems had in common. Once we reframed into looking at the similarities instead of the differences, we were able to make a lot of progress very quickly.”
Simplicity is being able to reason about what is happening, even if what is happening is a very difficult and involved process. Just because something is easy, that doesn’t mean it’s simple. Just because something is simple, that doesn’t mean it’s not difficult to do. Simplicity can be very hard to implement when you see it that way.
“Breathing sounds very simple, but it’s not very easy if you think about what is happening in your body to make you breathe. It’s an incredibly complex series of physical biological systems: I’m not a biologist or doctor, so I don’t even understand them all.”
The microservices vs. monolith debate keeps popping up. Austin argues that microservices were never about managing technical complexity; they were about managing organizational complexity. The abstractions in microservice platforms manage our organizational problems, but at a high technical cost. It’s how we try to overcome that cost that might be making our lives more difficult.
“On the technical side, that now means you have to be able to have a deeper understanding of your software: you have to be able to introspect the behavior of your application at every single point that it is used … At some point, we’ve covered it with thousands of probes … so much so that no one can even step back to see the application anymore, because we’ve lost sight of what it actually is.”
People are at the heart of these problems because they’re the ones who need to constantly re-evaluate system state. Austin makes the point that people usually fail to do that hard part of system design because it’s seen as a Day 2 problem.
“We make decisions about system normalcy at the tail end of design, using what we’ve known up to that point. Most people aren’t thinking about how they write logging statements to understand application performance at that point.”
We talk about “The Field Guide to Understanding Human Error” for the first time on this show! How did that take so long??? When it comes to understanding systems, it’s better to be clear than to be concise.
“The designers of an airplane cockpit changed the airspeed indicator from a round dial to a linear tape, figuring that it would be easier to quickly read airspeed if your target was in the middle of the tape… What they found was that it actually made it harder to understand and it took cycles away from pilots and added more response time whenever things went wrong.”
Designing a good experience means thinking about how humans interpret signals. It’s important to know when to be clever and complex and when it’s better to optimize for reasonability. Austin gives some practical tips to simplify choices when it comes to designing simpler systems.
“While simplicity and easy aren’t necessarily connected, simple and good kind of are.”
Austin wishes he had known how to write good logs sooner. Very rarely do people stop to consider how to make logs, traces, and metrics useful to other people. It’s less about how than why.
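Austin’s point about useful telemetry can be illustrated with a small sketch in Python. This example is ours, not from the episode: the `format_event` helper and its field names are hypothetical, and real systems would typically use a structured logging library instead. The idea is the difference between a log line that says something happened and one that carries enough state for another person to reason about it.

```python
def format_event(event, **fields):
    """Render an event name plus key=value context, sorted for stable output."""
    context = " ".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{event}: {context}" if context else event

# Vague: tells a responder something happened, but not enough to
# figure out the underlying state.
vague = format_event("payment failed")

# Useful: the same event, with the context someone else would need
# to understand what actually went wrong without reading the code.
useful = format_event(
    "payment failed",
    order_id="ord-12345",
    amount_cents=2499,
    error="card_declined",
)
```

The second line costs a few extra keystrokes at design time, which is exactly why it tends to get skipped when telemetry is treated as a Day 2 problem.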
Austin is also glad we didn’t ask him about running production software on Windows-based systems. The problem wasn’t Windows; it was dealing with heterogeneity.
Austin Parker has been solving - and creating - problems with computers and technology for most of his life. He is the Principal Developer Advocate at LightStep and maintainer on the OpenTracing and OpenTelemetry projects. His professional dream is to build a world where we’re able to create and run more reliable software. In addition to his professional work, he’s taught college classes, spoken about all things DevOps and Distributed Tracing, and even found time to start a podcast. Austin is also the co-author of Distributed Tracing in Practice, published by O’Reilly Media.
Austin is an international speaker, having presented to audiences in Europe and North America on topics relating to Observability and DevOps. In addition, he has led or assisted with workshops on OSS projects such as OpenTelemetry and OpenTracing at events such as QCon SF 2019 and QCon London 2020, and O’Reilly Infrastructure and Ops 2020. Finally, he has extensive experience speaking to diverse audiences in a variety of media formats through his podcast On-Call Me Maybe and his event livestreams such as OPS Live!
George Miranda is a Community Advocate at PagerDuty, where he helps people improve the ways they run software in production. He made a 20+ year career as a Web Operations engineer at a variety of small dotcoms and large enterprises by obsessively focusing on continuous improvement for people and systems. He now works with software vendors that create meaningful tools to solve prevalent IT industry problems.
George tackled distributed systems problems in the Finance and Entertainment industries before working with Buoyant, Chef Software, and PagerDuty. He’s a trained EMT and First Responder who geeks out on emergency response practices. He owns a home in the American Pacific Northwest, roams the world as a Nomad with his wife and dog, and loves writing speaker biographies that no one reads.