eBPF and the Observability Landscape With Shahar Azulay

Posted on Tuesday, Nov 5, 2024
This week, Groundcover CEO and Co-Founder Shahar Azulay joins us to talk eBPF and what’s next in observability.

Transcript

Mandi Walls: Welcome to Page It to the Limit, a podcast where we explore what it takes to run software in production successfully. We cover leading practices used in the software industry to improve the system reliability and the lives of the people supporting those systems. I’m your host, Mandi Walls. Find me at LNXCHK on Twitter.

Alright, welcome back folks. Welcome back to the show. This week I have with me Shahar Azulay, who is the CEO and co-founder of groundcover, and he's going to be telling us about what they do and a bit about eBPF, which we haven't covered on the show before and I haven't been keeping up on. I heard about it a number of years ago and it's a little bit outside of what I do. So we're all going to learn some things today with Shahar. So welcome to the show. Tell us about yourself and what you do.

Shahar Azulay: Great. Thank you for having me. Great to be here. So as you said, I’m the CEO and Co-founder of groundcover and basically groundcover is a full stack observability solution for cloud native environments. Built on kind of two major different approaches that we can definitely touch on. One of them being eBPF, which is a very modern, interesting way to monitor systems at scale without working too hard, which is something that developers usually like. Right? Thumbs up for not hard.

Mandi Walls: Oh yeah.

Shahar Azulay: So powerful coverage, yet very easy. And on the other hand, a very unique way of how we store, process, and maintain the data in the cloud environment of the customer, rather than save it on our end as an observability vendor. So kind of both combined is what groundcover is about, and great to be here.

Mandi Walls: Awesome. All right, well let’s dig into eBPF to start. I’d imagine at this point folks have probably heard about it, it has been in the marketplace for a few years anyway and is starting to get more commercialized. I guess we could say there’s a lot more folks dipping into it and providing it as a service or at least guidance that way. Give us your perspective on it, what it does for folks, how it sort of works and what kind of benefit it provides.

Shahar Azulay: I think eBPF is one of these things where it was kind of clear that it's going to be dominant, just not clear how fast. But as you mentioned, it's not new. Great technology takes time to bake all the way through. It's been with us from the days of tcpdump and things like that, basically allowing you to do sophisticated stuff with the help of the kernel in case you want to do stuff that requires some heavy lifting. Over the past few years, eBPF has kind of transitioned into what we see today, in the sense that it became just more generalized. Basically, you can actually run complex logic or programs, if you want, inside the actual Linux kernel, and that has multiple use cases. One of them is observability, as a way to sit inside the Linux kernel, on top of which containers are floating around, and monitor whatever they're doing. Or security, in case you want to collect very intimate security events. And all the way through to networking and network policy control and stuff like that. So it's definitely very dominant. We see it in cloud providers and security vendors just a few years since groundcover started. It's pretty clear that it's going to be with us for a long time. So that's kind of eBPF in a nutshell.
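
If you haven't seen eBPF up close, here is roughly what "running a program inside the kernel" looks like in practice. This is a minimal sketch using the BCC Python bindings, not anything groundcover-specific: it loads a tiny C function into the kernel and attaches it to the openat() syscall, so it sees every file open on the host without modifying any application. It assumes root privileges and the bcc package are available.

```python
#!/usr/bin/env python3
# Minimal eBPF "hello world" via the BCC Python bindings (requires root and
# the bcc package). A small C program is compiled, loaded into the kernel,
# and attached to the openat() syscall; it prints a line every time any
# process on the host opens a file, with no changes to those processes.
from bcc import BPF

prog = r"""
int trace_open(void *ctx) {
    bpf_trace_printk("openat() called\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("openat"), fn_name="trace_open")
print("Tracing openat()... Ctrl-C to stop")
b.trace_print()  # streams the kernel trace pipe to stdout
```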

Mandi Walls: Yeah, absolutely. So since there's such a wide variety of things that it does for folks, or can do, what do you kind of see as the best practices, or the things that folks use to get started with it, to sort of dip their toe in and figure out how they want to use it?

Shahar Azulay: So I think that one of the first places where we see eBPF being very dominant looking forward is things around tracing. Tracing, or APM as we used to think about it, is something that over time became just more and more complicated: more microservices, more languages to program in. And democratizing that approach, eventually you end up in a place where a company considers OpenTelemetry or some kind of way to get into the code and dip their toes into instrumentation to actually get tracing. Although as a community we hoped that would get easier over time, it actually got worse, because back then with Java, .NET, things like that, it was a bit easier. These maybe cumbersome yet very VM-oriented languages allowed us to do these out-of-the-box experiences. So weirdly, what was easy 15 years ago maybe isn't that easy now, right?

Shahar Azulay: With Go and Rust and a lot of these more efficiency-oriented languages. So eBPF unravels that back again to what we wanted back then in the Java and .NET era, and we can get tracing out of the box, which is very interesting. Basically, sitting in the Linux kernel allows you to observe network interactions between microservices, so you can already get the tracing layer and the application monitoring layer without doing anything inside the code. And in a vast cloud native environment, that's a really magical experience, because otherwise 15 different teams inside the organization would have to instrument their code. So that's kind of one of the strongest use cases we see.
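
To make the "tracing without touching the code" idea concrete, here is a toy sketch in the same spirit, again using the BCC Python bindings rather than groundcover's actual collector: it counts tcp_sendmsg() calls per process from inside the kernel, so you get a rough view of which processes are talking over TCP with no SDK, library, or redeploy on their side. The kprobe target assumes a typical Linux kernel.

```python
#!/usr/bin/env python3
# Count how many times each process calls tcp_sendmsg(), i.e. sends data over
# TCP, observed from kernel space. The services being watched need no agent
# library or code change. Requires root and the bcc package.
from time import sleep
from bcc import BPF

prog = r"""
BPF_HASH(counts, u32, u64);   // pid -> number of tcp_sendmsg calls

int on_tcp_sendmsg(void *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="tcp_sendmsg", fn_name="on_tcp_sendmsg")
print("Counting TCP sends per PID for 10 seconds...")
sleep(10)
for pid, count in sorted(b["counts"].items(), key=lambda kv: -kv[1].value):
    print(f"pid {pid.value}: {count.value} sends")
```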

Mandi Walls: Yeah, it's kind of rebuilding sort of a logical monolith out of the things that we've devolved out of the original monolith. Like you're saying, that Java, .NET, these massive applications that everybody worked on, where everyone had access to all the libraries and everything was connected, and now we're all doing our own microservices and we need a way of building back in what that experience looks like across all the crazy stuff that's now going on.

Shahar Azulay: Exactly. So we see it exactly like that: as the generalized layer to collect the data from all these different flavors, and eventually you shouldn't care about it. I think we're also in the era where we've formed new roles like DevOps and SRE and observability-oriented roles, and eventually they have to go and chase down these R&D teams and say, instrument your code, I want a unified experience. It doesn't make sense in bigger organizations, and I think eBPF also provides them the tool to actually get done what they need to be in charge of. I think that's kind of where we lost that connection. I can't be in charge of monitoring if I have to go back to the R&D team and get it into sprint X in three weeks. It doesn't really make sense from an ownership perspective. So

Mandi Walls: Absolutely. It's hard from an organizational and maybe even political perspective internal to your organization: how much power and authority does your downstream reliability team have versus the product manager and the application team, who are like, we've got feature work to do, man, we've got other things going on, we've got to ship other stuff. So it's a constant battle back and forth. Yeah,

Shahar Azulay: Definitely.

Mandi Walls: So what does groundcover bring to the game on this side?

Shahar Azulay: So in a sense, eBPF is an open source technology. It's been pushed by the kernel community; it's not ours to own. But using it properly for the observability use case is very complex. It's not an off-the-shelf approach; you would say that a Java agent back then was off the shelf. You have to do things that make sense for a specific use case like observability, collecting logs, metrics, and traces through eBPF in a very cost-effective and performant way. So that's one thing that groundcover does really well, taking advantage of eBPF to the maximum to get you all the information you can get without doing any instrumentation inside the code. I think we're probably one of the more interesting teams in the world doing that right now, but the industry is definitely moving there. So hopefully we're going to see more and more observability companies turning that way as a collection method.

Shahar Azulay: So that's one. And the other thing we're doing very differently is basically claiming that, regardless of being the eBPF company and collecting data a bit differently, something is a bit broken in the model of how we store the data and price the data. Because eventually eBPF is just yet another way to get more intimate, easier-to-get data, and then you're back to square one: how do I pay for this volume of data being collected, which again I can't control as an organization? Every developer creating a log line eventually propagates into my bill at the end of the month. So groundcover, regardless of eBPF, kind of reinvents that structure with what we call inCloud, which is a data plane that resides on the customer's end while providing a SaaS control plane kind of approach. So you get the same SaaS UI experience of sharing and working and collaborating, yet the data plane is running on your cloud premises, managed by us for scale and updates, and you pick up the bill at the end of the month from AWS, GCP, Azure or whatever. And it kind of allows us not to price by data volume anymore.

Shahar Azulay: It kind of brings us back maybe to the open source approach of, I know that running Prometheus on-prem is cheaper than storing it with a cloud vendor like Grafana, Datadog, whatever, but I don't want to maintain it at scale. So taking away that edge, maintaining it for you yet on your premises, that's kind of what we do, and it's a technology of its own that we call inCloud.

Mandi Walls: Oh, that's super interesting. That's a layer of transparency that, like you said, is often missing when you're just kind of putting all of that stuff into the vendor's world, and you're like, okay, well here's my bill at the end of the month, I have no idea what this actually represents. And if you're paying for the storage yourself, you can make those hopefully more confident decisions around the value that you're getting out of that data, based on what you're actually paying for it and exactly what that looks like. That's super interesting.

Shahar Azulay: Basically, as an organization you're already purchasing compute and storage from some cloud provider. So in a sense you already know how to purchase this; if you have some kind of discount plan or commitment to your cloud provider, you're already doing that. So it's kind of bringing that to your world and how you want to purchase compute and storage, rather than purchasing it from us at a premium. That's kind of what we're doing, and it can cut costs dramatically and also be a very predictable pricing model, because once you take the volume out of the equation, it's easier to predict how much it's going to cost next month. And I think that a lot of teams are already kind of fearful of their observability vendors, of opening that bill at the end of the month. We definitely don't want to be there, don't want to be that vendor. So that's kind of where we're at.

Mandi Walls: Yeah, for sure. OPEX is a scary place sometimes at the end of the month, for sure. What are they charging us for? Oh yeah. So with these tools combined, what do you see your customers getting, the benefits that they're getting? What are they sort of after, as far as what it's going to do for their products or their customer experience? What's the end goal usually for them?

Shahar Azulay: How we see it is basically the two methods combining into something which is an enabler, rather than something that can just cut costs or whatever. Because on one hand you have the inCloud approach that can eventually allow you to store data more effectively, and on the other you have eBPF that allows you to collect it more powerfully. Putting that together is an enabler, in the sense that suddenly you can store more observability data rather than fighting that battle of storing only the error logs, or sampling traces heavily, or reducing metric cardinality to make sure that you're not choked on your quota or whatever. So together that's an enabler in most cases. We see a lot of customers not monitoring lower environments, staging and dev, or making other decisions like that. So allowing them to collect data really, really easily and also taking away these trade-offs usually allows them to eventually store more data they can use for troubleshooting and for downtime recovery and stuff like that, which at the end is what we're all trying to do: make sure that we have the right data at the right place.

Shahar Azulay: To be honest, a few years into this journey, we've seen things that I couldn't imagine before. Companies having a daily log quota, so you come in the morning after a downtime at night and there are no logs to troubleshoot the incident, and companies live in that situation with a lot of the vendors they work with today. So I think you can't do just one or the other anymore. It doesn't make sense in the longer run to just provide a cooler tracing solution or just stuff that people didn't have before. They have too much, and they're already making a lot of trade-offs every day fighting these battles. So we're trying to think of both ends. I think that's kind of what allows us to eventually move customers from current solutions into groundcover, not just being one or the other. Eventually, they're already living in this complexity; we're trying to live in it with them.

Mandi Walls: Yeah. Oh, absolutely. It's easy to get to a point where you feel like you're drowning in data, but at the same time you're afraid of getting to a point where, like you say, you don't have exactly what you need when you need it. You deploy something new, you're not exactly sure how it's going to behave in production and what kind of telemetry it's going to push out that will be helpful when you need it, when there is an incident or when users are starting to do something a little bit unexpected that causes the software to behave in an interesting way. And that trade-off is important, it's a key thing. You're paying for storage of these things that you might need as risk abatement for possible problems in the future.

Shahar Azulay: Yeah, exactly. And eventually you already have evangelists inside every organization doing FinOps, thinking about observability at scale. You're burning a lot of fuel basically trying to make sure that you're in budget with these solutions, because they're definitely spinning out of control. The average observability vendor is charging about 20% of what you're paying your cloud vendor. That's pretty hefty. Eventually, that's kind of the second bill in line for the R&D organization, so it's definitely a big decision for these companies. So they invest in people working on those trade-offs all day, going back and forth on what they actually need to store, and that's kind of a place where people make mistakes or eventually pay some price, and hopefully we can limit some of that, right? It's not endless, but we can eliminate some of it.

Mandi Walls: Yeah. Oh, for sure. Yeah. Well, it's weird. Sometimes I equate it to when you watch a lot of crime shows on TV and they go to some store and ask, well, you have all these video cameras, what's on the recording? And you're like, oh, we only keep 48 hours, and you're like, that hardly seems like enough. But at the same time, we're all in the same boat. How much data do you have? How much can you keep that is easy and convenient and accessible, and, on top of that, that you know how to use if you need it? There's just so much of that that comes across there, for sure.

Shahar Azulay: Yeah. Yeah. I mean it is definitely not good for the planet, the number of logs we’re creating per day.

Mandi Walls: Well, I mean, that's a whole other story, the efficiency and the sustainability of some of these things that we've put together. And as we're looking into using more AI models and those kinds of things, and knowing just how many resources those kinds of models require, that is a whole other sort of ball game. I'm not sure how much people are looking into that yet.

Shahar Azulay: I mean, eventually it's kind of fighting against the market in some cases when you're trying to do that. In a sense, we all agree that it doesn't make sense. You want to come over to each R&D organization and say, do you really need all these metrics? Do you really need all these logs? But eventually it becomes something that you can't change in most cases, because they're not sure how they're creating this data, they're not sure how to govern it. And at least we're trying to work with it in some way and say, okay, if you have to store all that, let's find a more cost-effective way for you to pick up the bill, not paying it directly to us. I think we've seen a few vendors over the last few years trying to educate the market. Do you really need logs?

Shahar Azulay: Do you really need this or that? It's hard, to be honest, right? People don't really know how much data they're generating or where it's coming from, and they just want to make sure it's safe and someone knows what to do with it. I mean, take a look at compliance. It's completely off the charts, storing years of data that no one wants to look at. Just put it in Glacier, I don't want to look at it. So I think we're finding a way to work with it, because fighting against it is currently not doing the job. It should come as a second layer once you're already in: give me ways to filter streams or figure out what's actually being used, what's actually being queried. But I think it should come as a second layer after things work, because customers don't really know how to make these decisions at the gate.

Mandi Walls: Oh, sure.

Shahar Azulay: Oh sure, no problem, just throw away that stream, I don't need it. No one tells you that. No one, no,

Mandi Walls: No. I'm like, oh, we're all data hoarders. At the end of the day, there's so…

Shahar Azulay: …much, we want a warehouse.

Mandi Walls: We absolutely do. We want the warehouse at the end of Indiana Jones, all of the stuff in the warehouse, in case we might need it someday for something amazing. Yeah. Oh my gosh. So what kinds of things are you looking at for the future, along the lines of how folks are making better use of this? Do you see a future where we're not necessarily even using logs? I'm an old school sysadmin; that was kind of all we had. So moving away from that can feel very scary, but we have all these better tools now, eBPF being one of them, where we're getting more real-time data and better tracing off the systems, versus what the developers decided to kick off through log files. What does the future maybe look like on some of these things?

Shahar Azulay: So from a philosophy perspective, there's a lot of philosophy around it. Is there a fourth pillar of observability? What is it? Whatever. I'm a bit more down to earth when it comes to thinking about it, because I think logs aren't going away. They're one of the only human-generated inputs that we use for observability, and people love to see their own text, "customer did X" or whatever. It's kind of more tangible than anything else, than any other metric. But I think that what we're trying to do, and I think that's kind of the modern approach and eBPF is part of it, is at least breaking those tiers. I mean, these old products or legacy products used to work in tiers until now, right? Oh, you want infrastructure monitoring, there's an agent for that. You want logs, there's a log collector for that.

Shahar Azulay: You want traces, there's an SDK for that, and you buy each of them separately because they're even collected through different technologies. They're not necessarily the same thing; they evolved over time. So these tiers, I think, are where the problem is, rather than using logs or not, because most companies don't have tracing. I mean, eventually most companies are log-based, with hundreds or thousands of alerts on logs, generating millions of log lines that could easily be metrics. Let's go educate them. Eventually what we want to do is make sure that they don't think about it in tiers of logs, infrastructure monitoring, activating APM, and just give them everything out of the box, and then create this more coherent experience where maybe logs are part of the solution and not the only thing they're using. But I think that's mostly what we see in the market. People think that everybody's using APM, OpenTelemetry.

Shahar Azulay: It's really a fraction of the market, because it's so, so hard. And therefore we're already inventing all these great methods to troubleshoot, but they're accessible to a very small part of the market that actually uses all of them, for reasons of cost, complexity, there's a million reasons. But the reason people turn to logs isn't because they're just addicted to them. It's more that it's the easy way to get stuff. If you're a basic DevOps shop and you don't know too much, start there and then think about metrics, traces, whatever. We want to make sure that we break that barrier at least, so more people get into tracing and more application-oriented monitoring.
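
As a concrete example of the "log lines that could easily be metrics" point, here is a small hypothetical sketch using the prometheus_client library. The function names and labels are made up for illustration; the idea is just that the thing you alert on becomes a cheap counter, while the per-failure log line stays around for debugging a single incident.

```python
import logging
import random
from prometheus_client import Counter, start_http_server

log = logging.getLogger("payments")
REQUESTS = Counter("payment_requests_total", "Payment requests by outcome", ["status"])

def process(req):
    # stand-in for real business logic; fails occasionally for demonstration
    if random.random() < 0.1:
        raise RuntimeError("card declined")
    return {"id": req["id"], "status": "charged"}

def handle_payment(req):
    try:
        result = process(req)
        REQUESTS.labels(status="ok").inc()
        return result
    except Exception:
        # one log line per failure is still useful when debugging an incident...
        log.exception("payment failed for request %s", req["id"])
        # ...but the thing you alert and build dashboards on is the cheap counter
        REQUESTS.labels(status="error").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for a Prometheus scrape
    for i in range(1000):
        handle_payment({"id": i})
```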

Mandi Walls: Yeah, absolutely. And for larger shops that have, let's say, a mix of vintages of their applications, some of that older stuff is probably never going to get touched and updated; no one's going to go back in there on the engineering side and re-instrument it for modern tooling. There's always going to be older stuff sort of sitting around that's making you money. It's out there being important, but it's not in a place where you have the time or the resources to bring it up to speed with what we're doing now in 2024.

Shahar Azulay: That's one of the fun parts about eBPF, because you don't really care about that. There's always that legacy code, that monolith you're going to break into microservices next year, there's always that. And to be honest, I agree with you. In most cases it works, so why touch it, right? Yeah. The guy that wrote the code is already surfing in the Caribbean, and

Mandi Walls: I was going to say, he's retired to an island

Shahar Azulay: And

Mandi Walls: He is long gone.

Shahar Azulay: Yeah, for sure. I think part of it is also helping in these situations, or with third-party code. It's not like it's only our own code running in production. There are a million different third parties that you can't instrument, even if they're modern. So kind of equalizing that ground as well is also important.

Mandi Walls: Absolutely. Yeah, because everybody's got some payroll or time-tracker Java application that was written by an outsourced company 10 years ago that got bought, then got bought again, and there's not even a contract on it anymore, and you're just like, well, it still runs, so we still use it, and

Shahar Azulay: Yeah

Mandi Walls: It's out there doing important stuff internally. Yeah, absolutely. So for folks who want to get started with any of this, what kinds of recommendations do you have?

Shahar Azulay: So eBPF, again, is open source, so read about it. We have great coverage of eBPF on the groundcover website, if you want to go to groundcover.com. And if you want to experience what it looks like in a product, there's a free tier of groundcover and also a playground that you can just check out to see how powerful it can get and how deep it can get, before making your assumptions about what can work for you. Just experiment with it, throw eBPF on your environment and see what these sensors can do. So feel free to go to groundcover.com and check it out, and we have a community Slack and things to help you out if you want to get started.

Mandi Walls: That's awesome. Fantastic. We'll link to groundcover in the show notes so folks can find you quickly. And one question we sometimes ask on the show: are there any myths or misconceptions that you find people have, so we can set folks on the right path? Do you hear anything when you're talking to folks in the community that they have the wrong idea about, that you might want to set straight?

Shahar Azulay: Yeah, I think one of the things that people just underestimate is what happens once you instrument an SDK into your code. Even when you do this work, and we really appreciate the OpenTelemetry project, I think the standard is more important than the code itself, and standardization is definitely important for the community. But people usually underestimate the effect of instrumenting your code. In some cases we see a dramatic degradation in the response time of the application, because there's a piece of code that is now running on each request once you've instrumented your code. And I think it's kind of an unknown variable in everything that you do, because the only way to measure it is basically A/B testing, kind of ripping out that SDK and bringing it back in. So as part of the eBPF revolution, I think a lot of people are going to see that the performance of these SDKs is sometimes surprisingly bad, and are going to be more and more concerned about it, because as time passes there are more and more time-sensitive applications, like payments applications or health applications or cybersecurity applications or whatever, that eventually want to measure their SLA in front of their customers and don't want to instrument foreign code that might slow them down and do things like that.

Shahar Azulay: So I think that's one of the interesting parts that people are not aware of, because we've all been doing it for a decade, which is kind of interesting to think about.
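
For readers who want to see what that A/B measurement looks like in miniature, here is a rough, self-contained sketch. The traced decorator below is a stand-in for a hypothetical APM SDK wrapper, not any real vendor's agent, and the numbers only demonstrate the method of comparing instrumented versus uninstrumented calls, not what any particular SDK actually costs.

```python
# Compare the median latency of a trivial handler with and without a
# hypothetical "span recording" wrapper, the rip-it-out / put-it-back A/B idea.
import time
import statistics

spans = []

def traced(fn):
    # stand-in for an APM SDK wrapper that records a span around every call
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return fn(*args, **kwargs)
        finally:
            spans.append((fn.__name__, time.perf_counter_ns() - start))
    return wrapper

def handler(n=500):
    # trivial stand-in for request-handling work
    return sum(i * i for i in range(n))

def bench(fn, runs=20_000):
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - t0)
    return statistics.median(samples)

baseline = bench(handler)
instrumented = bench(traced(handler))
print(f"baseline {baseline} ns, instrumented {instrumented} ns, "
      f"overhead {(instrumented - baseline) / baseline:.1%}")
```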

Mandi Walls: Yeah, oh, for sure. Oh yeah, absolutely. We definitely do come across folks that are looking at that as they're implementing our stuff too. Anything to get them to a place where they have knowledge faster, can work through their incidents quicker and reduce their downtime or whatever, and then some of their solutions are sending them in the wrong direction, or some of it takes a lot more expertise than maybe they have on hand. And like you said, some of them aren't necessarily built for the kind of speed that a lot of applications require these days. So yeah, super interesting all around. Awesome. All right. This has been great. Is there anything else we should share with people before we go?

Shahar Azulay: Groundcover is going to be at KubeCon and re:Invent. We'd love to see folks. If you guys are there and want to hop over and say hey, you're definitely welcome.

Mandi Walls: Excellent. PagerDuty will be at both of those shows as well, so yeah, absolutely. Stop by both our booths, say hello. We love to talk to folks who are working with all these tools and how they fit together and yeah, sure. This has been great. Thank you so much for joining us this week.

Shahar Azulay: Thank you so much for having me. It was great. Awesome.

Mandi Walls: For everybody else out there, we'll wish you an uneventful day. We'll be back in a couple of weeks with a new episode. That does it for another installment of Page It to the Limit. We'd like to thank our sponsor, PagerDuty, for making this podcast possible. Remember to subscribe to this podcast if you like what you've heard. You can find our show notes at pageittothelimit.com, and you can reach us on Twitter at pageit2thelimit, using the number two. Thank you so much for joining us and remember, uneventful days are beautiful days.

Show Notes

Additional Resources

Guests

Shahar Azulay

Shahar Azulay, CEO and Co-Founder of groundcover, the company reinventing the cloud-native application performance monitoring domain with eBPF, is a serial R&D leader. Shahar brings experience in the world of cybersecurity and machine learning having worked as a leader in companies such as Apple, DayTwo, and Cymotive Technologies. Shahar spent many years in the Cyber division at the Israeli Prime Minister’s Office and holds three degrees in Physics, Electrical Engineering and Computer Science from the Technion Israel Institute of Technology as well as Tel Aviv University. Shahar strives to use technological learnings from this rich background and bring it to today’s cloud native battlefield in the sharpest, most innovative form to make the world of dev a better place.

Hosts

Mandi Walls

Mandi Walls (she/her)

Mandi Walls is a DevOps Advocate at PagerDuty. For PagerDuty, she helps organizations along their IT Modernization journey. Prior to PagerDuty, she worked at Chef Software and AOL. She is an international speaker on DevOps topics and the author of the whitepaper “Building A DevOps Culture”, published by O’Reilly.