Julie Gunderson:
Today, we have one of our summit special episodes and we’re joined by Tammy Bryant, @tammyxbryant on Twitter. She’s going to talk to us a little bit about her PagerDuty Summit session, building and scaling SRE teams. So don’t miss out on the full session at PagerDuty Summit, which is virtual and free this year. That is going on from September 21st to the 24th.
So Tammy, to get us started, do you want to tell us a little bit about yourself?
Tammy Bryant:
Hi, everyone. Hi, Julie. I’m Tammy and I work at Gremlin as a principal SRE, and so I get to do a lot of chaos engineering, we build a chaos-engineering-as-a-service platform. Over the years, I’ve scaled SRE teams. So prior to working at Gremlin, I was the SRE manager for databases, as well as block storage and also code workflows, so developer tools. I’m Australian, I live in Florida, the most Australian place I could find in America. Yeah, I just love getting outdoors, being active, the beach, but I’ve always loved working in tech. I’ve done it for over 10 years now, and I’m really excited. I’ve been on-call for many years as well, pretty much 10 years, and have lots of fun on-call stories as well. But I love to sleep, if I can, 12 hours a night. So, I love reducing incidents too.
Julie Gunderson:
Well, thank you, Tammy. So to get us started today, Tammy, what excites you about building SRE teams?
Tammy Bryant:
So, one of the things that I really love to do is help engineers think like an SRE. So there’s not that many SREs out there in the world, but if you’re passionate about this space, it’s definitely possible to become an SRE. So, how do you think like an SRE? One of the most important missions of an SRE is to protect, provide for and progress the software and systems behind really important services that people use and rely on every day. So, you have to have this ever watchful eye on durability, availability, latency, performance, and capacity. So to me, I love it, because it’s a really important, critical role, and I’m excited about where we’re heading. In the future, I can imagine there’ll be a chief reliability officer role, just like you have a chief security officer role. To me, if your system and services aren’t up and running, then your customers just can’t even use your services. So to me, SRE is critical, and that’s what excites me about it, just where we’re heading. The internet is so important, especially right now. Obviously with this pandemic, it’s the way that we can talk to each other, we can connect, we can actually get access to food and our family and our friends. So, that’s what really excites me.
Julie Gunderson:
Have you seen any changes in the way that you’re operating or that others are operating because of the pandemic?
Tammy Bryant:
Yeah, definitely. So just in terms of our customers, at Gremlin we’ve got some customers that are doing, for example, curbside pickup. That’s a new thing that they had to prepare for and get ready for, and so we helped make that more reliable, made sure that customers were getting a good experience. Because obviously, folks show up, they want to be able to get their item really fast and then get back home, so you need to be able to service that transaction really fast. Then we also have a lot of banks that are our customers. Obviously now, for example in countries like Australia, you can only leave your house in Victoria for an hour a day before 8:00 PM, and apart from that, you’re locked inside your home. So in that hour, you need to be able to go to the ATM, get money out, pay any bills that you need to pay, buy food and groceries for your family. If ATMs aren’t working, you can imagine what a nightmare situation that is. That recently just happened. I saw there was a big outage, Commonwealth Bank of Australia had one just the other day. So to me, yeah, I’m seeing a big change, it’s just completely different. When you only have a very small window to do something that’s critical, it makes it even more important that it works and that it’s reliable.
Julie Gunderson:
Absolutely. That’s a question that we commonly get asked as advocates, to explain SRE and the importance of SRE, and then how do you either embed SRE in your team or work with the SRE teams? Do you want to touch on that a little bit?
Tammy Bryant:
Yeah, sure thing. So SRE is relatively new for a lot of organizations, and I’ve spoken to a lot of our customers and they’re building out SRE teams as well. To me, the most important thing is to figure out, yeah, who is responsible for reliability within your organization? Because if no one’s responsible for that, then I think that’s a dangerous situation. It’s really important to make sure that someone owns that responsibility and they care about it and they’re like the steward for that, they’re the guardian of the reliability of the services. To me, that’s what an SRE org is all about. So if you think about that, yeah, who’s responsible for uptime, who’s responsible for making sure that data is never destroyed or lost? When I compare it to say, for example, security engineering, security engineers care if you get popped, care if data is stolen and leaked. I used to be a security engineer at the National Australia Bank for a while, and so you go, “Okay, well, that’s obviously very bad if data is stolen, but what happens if data is just completely wiped, completely destroyed, because backups don’t work?”
This is why I also love chaos engineering as a part of my role as an SRE, because I’m injecting failure on purpose to actually make sure that, for example, backup restores work accurately all the time and you have data consistency. But yeah, it’s definitely a really important role, and I think back in the day, we could probably get away with it. You could be like, “We’re having maintenance this weekend, you can’t use our service all weekend.” I don’t think people are happy with that anymore, they’re just not okay with maintenance hours or maintenance windows, so we really need to move away from that.
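As an illustration of the backup-restore check Tammy describes, here is a minimal sketch of a restore drill that injects a failure on purpose. It is not Gremlin's product or any bank's actual process: the function names (`restore_drill`, `run_demo`), the file name `data.db`, and the use of local file copies as the "backup system" are all assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def backup(source: Path, backup_dir: Path) -> Path:
    """Copy the source file into backup_dir (a stand-in for a real backup system)."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)
    return target


def restore(backup_path: Path, restore_dir: Path) -> Path:
    """Copy the backup into restore_dir (a stand-in for a real restore)."""
    restore_dir.mkdir(parents=True, exist_ok=True)
    target = restore_dir / backup_path.name
    shutil.copy2(backup_path, target)
    return target


def restore_drill(source: Path, workdir: Path, corrupt_backup: bool = False) -> bool:
    """Back up, optionally inject a failure, restore, and compare checksums."""
    expected = checksum(source)
    backup_path = backup(source, workdir / "backup")
    if corrupt_backup:
        # Injected failure (hypothetical): drop the last byte to simulate
        # silent corruption of the backup.
        backup_path.write_bytes(backup_path.read_bytes()[:-1])
    restored = restore(backup_path, workdir / "restore")
    return checksum(restored) == expected


def run_demo() -> tuple[bool, bool]:
    """Run one healthy drill and one drill with an injected corruption."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        source = workdir / "data.db"
        source.write_bytes(b"account,balance\nalice,100\n")
        healthy = restore_drill(source, workdir / "healthy")
        corrupted = restore_drill(source, workdir / "corrupted", corrupt_backup=True)
        return healthy, corrupted


if __name__ == "__main__":
    healthy, corrupted = run_demo()
    print(f"healthy restore verified: {healthy}")
    print(f"corrupted restore verified: {corrupted}")
```

The point of the injected corruption is that the drill should fail loudly when the backup is bad. A check that only ever runs against healthy backups tells you little about whether it would catch a real problem.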
Julie Gunderson:
That’s funny. This weekend I tried to use my debit card at the gas station and my bank was down, because it was a maintenance window. A six hour maintenance window on a Friday night.
Tammy Bryant:
Yeah, that’s just not okay anymore, it doesn’t make sense. It’s your money, you should be able to access it at any time. I’m really passionate about that, I just think that’s also a way to compete. If you’re a bank that doesn’t have maintenance windows, then, yeah, people are going to be more likely to want to use your service, because they know that they can get access to their money at any time. I think for me, it’s a competitive advantage. These days it’s really easy to change banks or service providers, so people will do it if they can get a better service.
Julie Gunderson:
Absolutely. I know we’re kind of shifting gears here a little bit, but I’d love to hear how do you talk to, let’s say financial institutions who seem to be very reluctant to practice chaos engineering? How do you talk to organizations that have a lot of fear around this?
Tammy Bryant:
So, one of the things that I used to do in a past life is I worked at the National Australia Bank for about six years, and while I was there, we started to use Chaos Monkey from Netflix, because we were moving to the cloud, we were moving to AWS for nab.com.au. We wanted to make sure that we were ready to handle that, because before that we’d basically just done on-prem. So that was a really exciting time to be there, and I got to work on that AWS cloud migration as well. But to me, that made me realize, okay, yeah, as a whole company, we need to be prepared for this. We need to think in a different way, we need to be ready for this change. It’s totally different, we can’t be responsible for that hardware. What happens if they just decide to reboot machines or they need to do a hardware upgrade, or they need to do a firmware upgrade, kernel upgrade, whatever it is? You don’t get to pick usually the windows of time that’s going to happen, and even if you can, it’s still a lot to prepare for. It’s better to just build your services in a way that they can withstand that failure. So yeah, everyone at the National Australia Bank just got it straight away, which I feel like NAB is a very innovative bank, and they recently wrote about how they’re doing chaos engineering on Medium, in an observability of chaos engineering post that they just published. But when I’m talking to other banks in the US, I like to tell stories of banks who’ve done chaos engineering in the past, and just to talk through those scenarios. Would you rather figure out your failure modes in the middle of the day and have your reliability teams be ready to handle those and focus on improving mean time to detection for incidents and mean time to resolution? Or do you want to just leave it up to luck and have folks be paged in the middle of the night, have no metrics around who’s being paged, how long they’re working on incidents for?
I think actually what was happening in the past is a lot of engineers were on-call, and a lot of people at their company didn’t even know they were getting paged and woken up in the middle of the night.
That’s an interesting thing that I’ve realized over the last few years, there’s not too many people that actually know that information. But yeah, that’s a hard life to be constantly getting paged, leads to a lot of folks then being tired the next day. It’s really hard to be able to do your job, plus be woken up in the middle of the night constantly. So telling those real stories and getting engineers to just share, “Hey, this is what my life is like, I’d like it to be a lot better. If we could focus on just improving this, it would be awesome.” Just some little examples too, of, “Hey, have you ever had a time where the right person didn’t get paged when a service went down?” “Yeah, everyone has those stories.” “Then did it take you ages to figure out who should get paged?” “Yep, that happens a lot too.” It’s a really easy thing to fix actually, but you need to do the work proactively to make sure that you have all of that set up correctly, and that you manage it too continuously to improve it. So yeah, I’m really excited about the idea of using chaos engineering to improve incident management. I’ve been a PagerDuty user for a really long time now, I think over six years, and it’s just a really cool way to make sure that you have everything set up correctly. My personal goal is being able to detect incidents in five minutes or less. I just know a lot of people don’t detect incidents for hours, days, and the worst I’ve seen is years. So yeah, it can get really bad.
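To make the metrics Tammy mentions concrete, here is a minimal sketch that computes mean time to detection and mean time to resolution from incident timestamps and flags incidents that missed a five-minute detection goal. The `Incident` fields and the sample data are invented for the example and are not a PagerDuty API. Resolution time is measured here from the start of the incident, which is one common convention; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    name: str
    started_at: datetime   # when the failure actually began
    detected_at: datetime  # when an alert fired or a person noticed
    resolved_at: datetime  # when service was restored


def mean_time_to_detection(incidents: List[Incident]) -> timedelta:
    """Average of (detected_at - started_at) across incidents."""
    seconds = [(i.detected_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolution(incidents: List[Incident]) -> timedelta:
    """Average of (resolved_at - started_at); assumes timing from incident start."""
    seconds = [(i.resolved_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def missed_detection_goal(incidents: List[Incident], goal: timedelta) -> List[str]:
    """Names of incidents that took longer than the goal to detect."""
    return [i.name for i in incidents if i.detected_at - i.started_at > goal]


# Hypothetical sample data, invented for illustration only.
SAMPLE = [
    Incident("db-failover", datetime(2020, 9, 1, 2, 0),
             datetime(2020, 9, 1, 2, 3), datetime(2020, 9, 1, 2, 45)),
    Incident("api-latency", datetime(2020, 9, 2, 14, 0),
             datetime(2020, 9, 2, 14, 11), datetime(2020, 9, 2, 15, 15)),
]

if __name__ == "__main__":
    print("MTTD:", mean_time_to_detection(SAMPLE))
    print("MTTR:", mean_time_to_resolution(SAMPLE))
    print("Missed 5-minute goal:", missed_detection_goal(SAMPLE, timedelta(minutes=5)))
```

Tracking these numbers per team also answers the question raised above about who is being paged and for how long, since the same records can be grouped by responder.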
Julie Gunderson:
Well, Tammy, I love this, and it sounds like we’re going to have to have you back for another episode to talk about some of these things. Now tell us a little bit about your summit talk, which is the building and scaling SRE teams.
Tammy Bryant:
Sure thing. Yeah, I’m real excited to give this talk. So often I get pinged and people say, “Hey, you built SRE teams in the past. How did you do that and where do you find SREs from?” So I’m going to share some of my secrets, like my secret places where I hire SREs from. There’s one school that I’ll talk about that actually trains people to become SREs, and I hired some of their graduates. There’s also a few places in the world that you can find amazing SREs from. Then I’m going to talk too about how you set your SREs up for success, how you align them to your core values within your company, the types of metrics that you should be tracking when you establish your SRE team and how you can make sure that your SRE team is always helping your company move forward and delivering really positive, impactful results. I think that’s what it’s all about. Hiring great people is the first key and then making sure that you set them up for success, and measure their success and then tell the great stories of what they’re doing. That’s what it’s all about. So, I’m excited to share that.
Julie Gunderson:
Can you give us one little sneak peek maybe on how you measure success or even who the right people are?
Tammy Bryant:
Yeah, sure. So there’s a few different ways that I like to measure success, but one of the key things that I think is important is to figure out what are the core goals that you have as a business. So for example, are you launching a new product? Do you need to have your SRE team focus on that or do you have a high amount of incidents that you need to reduce? It really depends on where your biggest pain points are as a company. So I’d say to everybody, first think about what keeps you up at night? What’s the thing that you’re most worried about? Is it that you want to make sure that when you launch a new product, it goes perfectly smoothly? Do you have a big press release lined up? Do you have marketing campaign dollars? A lot of people spend a lot of money on that, of course you would like it to work. So, that’s the first thing that I like to focus on, or if you have a lot of technical debt, you can have your SREs focus on reducing that. But that to me is the key thing, and then I like to actually say, “Hey, set some small goals. Don’t try and put together a 12 month roadmap, just try and get some small wins in first in the first three months.”
Something that I did at Dropbox in my first three months was get a 10x reduction in incidents for the databases team, that then freed us up to be able to work on other projects. Later we worked on the release of Dropbox Business and we went from, I think it was 200 million to 400 million users in a year. Which is obviously a lot more people that you have to be able to manage, in terms of capacity planning and traffic and load and different usage patterns. But reducing incidents gives you the time and ability to be able to make sure that that works really smoothly. So, that’s my key thing I like to say. Also, my other tip is don’t go too big too fast, so start off by having a small team. I think you only really need two to five SREs to get started and someone to lead that team, I wouldn’t hire 20 SREs just to get started or transfer 20 people over into an SRE team or change their role title. I would just start small, get some wins and then gradually expand the great work that you’re doing.
Julie Gunderson:
I love that. I love that starting small advice too, it’s one of the best ways to do it, right? Well, Tammy, in just our last little minute left here, being a PagerDuty user for so long, do you want to share one of your on-call horror stories?
Tammy Bryant:
Yeah, sure. One of the biggest incidents that I ever worked on probably was the big S3 outage. When that happened, that went for many hours. So I was the incident manager on-call for all of Dropbox, and that was a five hour outage, I think, where most of the internet went down, because everyone was relying on one region of S3. Even for us at Dropbox, we had an issue with thumbnail previews not working at the time, so we had to get that resolved. We did get that resolved while the incident was actually happening and everything actually worked out fine, we were able to get out of the incident after then that region was down. But in terms of that situation, what I think it really taught a lot of folks is, yeah, don’t rely on just one region for your data storage, make sure that you have redundancy, make sure that you can do failover. This is all what reliability engineers are all about in terms of chaos engineering, and then you want to proactively practice that and prepare for that and make sure that you’re ready. We know that these horror stories do happen. That was just the one that took down most of the internet, so it’s one that I greatly remember.
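The advice about not relying on a single storage region can be sketched as a read path that fails over to a second region, with the outage injected on purpose so the failover is practiced before it is needed. This is a minimal in-memory sketch, not how Dropbox or S3 works: `FakeRegion`, `RegionUnavailable`, `read_with_failover`, and the region names are hypothetical, and no real cloud SDK is used.

```python
from typing import Dict, List, Tuple


class RegionUnavailable(Exception):
    """Raised when a region cannot serve a request."""


class FakeRegion:
    """In-memory stand-in for an object store in one region (hypothetical)."""

    def __init__(self, name: str, objects: Dict[str, bytes]) -> None:
        self.name = name
        self.objects = dict(objects)
        self.available = True  # flip to False to inject a regional outage

    def get(self, key: str) -> bytes:
        if not self.available:
            raise RegionUnavailable(f"{self.name} is down")
        return self.objects[key]


def read_with_failover(key: str, regions: List[FakeRegion]) -> Tuple[str, bytes]:
    """Try each region in order; return (region name, data) from the first that answers."""
    errors = []
    for region in regions:
        try:
            return region.name, region.get(key)
        except RegionUnavailable as exc:
            errors.append(str(exc))
    raise RegionUnavailable("all regions failed: " + "; ".join(errors))


if __name__ == "__main__":
    data = {"thumb.png": b"img"}
    primary = FakeRegion("region-a", data)
    secondary = FakeRegion("region-b", data)  # assumes data is already replicated

    print(read_with_failover("thumb.png", [primary, secondary])[0])
    primary.available = False  # injected failure: primary region outage
    print(read_with_failover("thumb.png", [primary, secondary])[0])
```

The sketch assumes the data has already been replicated to the second region. In practice, replication lag and consistency are the hard part, which is why rehearsing the failover matters.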
Julie Gunderson:
Absolutely. That’s one that I think I include in my chaos engineering talk and a few others, because I think almost everybody remembers that, right? Well, Tammy, it’s so great to have you here, we’d love to have you back to talk about some of these things in more depth later on. But for everybody else, just to remind you, PagerDuty Summit is going on right now. So Tammy’s talk will actually be tomorrow, please feel free to register. It’s free, it’s virtual. If you can’t make some of the talks, if you’ve registered, they’ll be available on demand. Additionally, Chaos Conf run by Gremlin is October 6th through the 8th. So that’s also a free conference, PagerDuty will be presenting as well. I’ll be there along with the community team and we’ll be talking about responding to chaos with PagerDuty in an interactive workshop. So, hopefully we’ll see you all there. Thank you for being with us today. This is Julie Gunderson wishing you an uneventful day.
That does it for another installment of Page It to the Limit. We’d like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you’ve heard. You can find our show notes at pageittothelimit.com and you can reach us on Twitter @pageit2thelimit, using the number two. That’s @pageit2thelimit. Let us know what you think of this show. Thank you so much for joining us, and remember, uneventful days are beautiful days.
Tammy Butow is a Principal SRE at Gremlin where she works on Chaos Engineering, the facilitation of controlled experiments to identify systemic weaknesses. Gremlin helps engineers build resilient systems using their control plane and API. Tammy previously led SRE teams at Dropbox responsible for Databases and Storage systems used by over 500 million customers. Prior to this Tammy worked at DigitalOcean and one of Australia’s largest banks in Security Engineering, Product Engineering and Infrastructure Engineering. Tammy is also the co-founder of Girl Geek Academy, on a mission to teach 1 million women technical skills by 2025.