An Interview with Úlfar Erlingsson
A few weeks ago, we sat down with Úlfar Erlingsson, who recently joined Lacework, our investment in cloud security, as Chief Architect. We wanted to learn more about Úlfar (the man behind the cloud) and about the Polygraph, Lacework’s novel framework for automating anomaly detection. Not only is Úlfar (formerly of Apple, Google, and Microsoft) maybe the world’s leading architect and researcher in security — he also speaks in fully formed paragraphs and is a devotee of James Joyce.
The global cloud security market is currently valued at $35 billion and rapidly growing. There are a number of active players in this space; what distinguishes Lacework among the other cloud security offerings?
Lacework is unique in my mind, not just in cloud security, but in general with respect to how security companies manage to build a product that brings value.
There are two diametrically opposite approaches to how you secure software activity. One of them is to build rules to recognize bad activity — bad things that could happen — and throw alerts when activity that matches those rules is detected at runtime. And that’s pretty much how the entire security industry has worked ever since the 1980s. The first antivirus vendors were doing signature-based antivirus and ever since all successful software security companies have taken much the same approach. However, maintaining such a list of rules is a Sisyphean task, and one which leaves defenders always one step behind attackers.
The opposite approach is to actually understand the customer’s workload, understand the customer’s infrastructure and what it’s supposed to be doing, and enforce security by just making sure that only those things that are supposed to be happening are actually happening. The general name for this approach is behavioral anomaly detection, whereas the approach traditionally taken by most security companies can be called rules-based intrusion detection. By learning what is normal, the defenders can build value by understanding what is necessary for the business, and simultaneously defend against anything unusual, including zero-day attacks.
For many reasons, in the past, learning what constitutes normal software activity has been incredibly difficult, for even one customer, and has been non-scalable, such that an equal amount of work has been needed for every new customer. Despite this, because of the inherent advantages of the anomaly-detection approach, there have been countless waves of research and industry attempts trying to do it, but those waves have never landed successfully.
However, Lacework has cracked the code: we really have managed to build a top-tier security product based on this approach of accurately understanding what’s normal in our customers’ environment, and reporting deviations from that sense of normality.
This is a fundamental shift in how security is done. It is a fundamental shift that makes Lacework unique in the security industry. And this fundamental shift is driven and enabled by the fundamental shift of software going into the cloud. Since Lacework is built natively for the cloud and has been a cloud-first company from the start, we were able to learn what’s normal, because we were focused on achieving the benefits of true anomaly detection only for cloud workloads.
Cloud workloads are basically characterized by containers talking with each other via networking pipes in a certain graph. If you abstract it in the right way, that graph is quite stable. You can’t have a fluctuating and randomly shifting infrastructure that is serving things at scale while remaining robust and reliable. By abstracting things out the right way, Lacework is able to learn what’s normal for this new type of cloud computing.
That’s a long answer to a simple question. But it really is at the heart of the Lacework value proposition. We’re not only good at what we do, we’re also very different. The fact that we’re very different actually has direct consequences for our customers. It means that when we give alerts, they’re very different from the alerts you might get from our competitors. They’re more understandable. There are fewer of them. There are false alerts — we have false alerts like everybody else, but we have maybe one a day. They happen because you’re actually changing something in your cloud infrastructure. We’re saying, “Hey, gee whiz, there’s been a change,” and sometimes the people running the cloud operations didn’t even know that that change had happened.
How does Lacework determine what’s normal? What signal is collected and how is it analyzed?
The key technology that drives Lacework’s unique value proposition is our Polygraph. A Polygraph is basically a virtual, abstract view of what is happening in the customer’s cloud.
The concrete details of what’s happening in the cloud involve things like processes launching, people completing their work, and things migrating from one machine to another. There’s a lot of flux there. IP addresses are not static. DNS names are not static. Even the processes and jobs are not static. A lot of people have moved to microservices, which run maybe for a few seconds or minutes, whereas traditionally, in enterprises, you might have had infrastructure components like client-server computing, which would run for months at a time without downtime. Now in the cloud, you have an abstraction that maybe is an ephemeral set of microservices, none of which lives more than 30 seconds.
That ephemeral set of services also has particular behaviors. Even though they live for a short time, they are getting their data from somewhere, they are answering questions that are coming from somewhere. We need to figure out that a particular ephemeral set of cloud activities is actually one component, and then do the same thing for the components that component talks to (which might also be virtual and ephemeral), and then figure out the right behavior for all those things to generate our Polygraph.
When a process all of a sudden opens a TFTP port, or starts reading data from a very different S3 bucket (things that are typical of bad behavior), we can spot that and say, “Aha, in this whole abstract set of processes, of which I’ve seen hundreds of thousands in the last few hours, none have ever done this before. This is a pretty big deviant change.”
The Polygraph builds up these abstractions, these virtual abstract entities, which represent a whole bunch of activities, a whole bunch of processes, a whole bunch of horizontally scalable machines that come and go. The Polygraph is a graph that explains how things normally communicate, depend on, and interact with each other. That’s our core technology. And that’s very hard.
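As a rough illustration of the idea described above, the following minimal Python sketch collapses short-lived processes into stable components, learns which connections each component normally makes, and flags anything never seen before. It is not Lacework's implementation: the event format, the choice of container image name as the component key, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical event format: (image_name, instance_id, destination, port).


def component_of(event):
    """Collapse an ephemeral process into a stable component name.

    Assumption: the container image name identifies the component,
    so instance ids (which come and go) are ignored.
    """
    image, _instance_id, _dest, _port = event
    return image


def learn_baseline(events):
    """Record every (destination, port) edge observed for each component."""
    baseline = defaultdict(set)
    for event in events:
        _image, _instance_id, dest, port = event
        baseline[component_of(event)].add((dest, port))
    return baseline


def find_anomalies(baseline, events):
    """Return the events whose edge was never seen for their component."""
    anomalies = []
    for event in events:
        known_edges = baseline.get(component_of(event), set())
        if (event[2], event[3]) not in known_edges:
            anomalies.append(event)
    return anomalies


training = [
    ("checkout", "pod-a1", "orders-db", 5432),
    ("checkout", "pod-b7", "orders-db", 5432),
    ("checkout", "pod-c3", "payments-api", 443),
]
live = [
    ("checkout", "pod-z9", "orders-db", 5432),  # new instance, known behavior
    ("checkout", "pod-q4", "203.0.113.5", 69),  # TFTP port, never seen before
]

baseline = learn_baseline(training)
anomalies = find_anomalies(baseline, live)
print(anomalies)
```

The new instance `pod-z9` raises nothing because its behavior matches the component it belongs to, while the TFTP connection stands out even though the process making it has never been seen before. A real system would need a far more careful notion of "component" than an image name.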
The reason I joined Lacework is that I’ve tried to do similar things. It’s extremely hard. It’s an amazing accomplishment that they managed to make it succeed. I’m here because I was amazed. I wanted to be part of the fun.
In trying to automate anomaly detection, identifying abnormal behavior, there are two issues that commonly come up. One is having false positives. The other is that you’re missing some sort of malicious activity. How does the Polygraph handle both of those?
This comes back to why I joined. People have tried to do this in the past. In the enterprise data center world, the reason things never really got off the ground is that we didn’t even know if this approach could work, because each company was basically unique. A company had a whole bunch of unique software, cobbled together typically over decades. If you tried to understand what was normal for one company, you might eventually be able to do a good job, but that wouldn’t help you with the next company. At best, you would have a consulting business where you would need to do a tremendous amount of work to deliver something useful.
In the cloud, things are different. Most cloud native companies are actually quite similar. We can work with them to continually improve what we do and gain better insights into what we’re doing, and we can monitor how well we’re doing. Lacework does amazingly well on the key metrics like false positives and false negatives.
With respect to false positives, we have a tremendously limited amount of alerting to begin with. We raise a high alert on something that is normal only once a week or every few days. Even then, it’s actually something that is a real change that the customer probably wanted to know about.
The way I like to think of it is: in this writhing swarm that is cloud computing, where everything is in constant flux, we have that perfect vantage point from which everything looks ordered and stable. Everything just looks perfectly crystalline and ordered. Finding that insightful vantage point is the key behind our intellectual property and our value proposition. And that’s what’s incredibly hard.
Now, that vantage point could still allow lots of bad things to happen. One of the ways of getting a stable vantage point and zero false positives is to just ignore everything. You’re not going to have any false positives, because you just ignore everything.
But in terms of false negatives, there are two main reasons we have to be very happy. One of them is that there’s a standard set of benchmarks — indicators of compromise — that essentially outline characteristics of security problems in the cloud. And we actually catch all of those. We catch these in a way that’s different than our competitors because we don’t write rules for them. We just catch them as a side effect of learning what’s normal for a company.
The other reason we have to be happy about the security value we bring is that often when we sell into companies and we deploy our product in a competitive sales situation, customers actually hire pentesters. An external firm or an internal red team will conduct penetration testing exercises on their own infrastructure to see how well our systems respond. In those cases, we come out really well, better than our competition.
That’s not to say that we think we have ultimate security. There are occasionally things that the pentesters do that we don’t actually catch. Those are the types of things that prompt us to go examine that stable vantage point. But we fail less than the competition, and every single time we fail to catch something, it’s a learning opportunity for us to shift our vantage point slightly, and catch a whole slew of other similar types of problems in the future. For our competition, it’s instead an endless, Sisyphean task to maintain an exhaustive list of custom rules.
This process clearly requires a lot of data. What sort of backend infrastructure is involved in collecting and analyzing all of this data?
We have best-of-breed scalable infrastructure that has been built up as the cloud has. Handling data ingestion at scale is something that we’re very good at, and we rely on a number of technologies there. Primarily, we rely on the ability to create S3 files at scale. The fact that S3 is an interface that exists across all of the cloud and is very scalable allows us to generate and collect all of these files. If we ever need to go back and look at what really happened, we have the data for that sitting in an S3 bucket.
The other thing I should mention about ingestion is that we have an incredibly reliable system. We are not relying on sampling a subset of the data. We actually collect all of the information about what happens, to the best of our ability. This comes about because we are a security product first; we don’t want to miss an attacker who is there just for a few seconds and made only one particular network connection.
Once we go past ingestion, we need a place that is able to process all of that data from the S3 files at scale. We have built the company from the very beginning to use Snowflake for that functionality; we were actually one of Snowflake’s first customers. We bring the data into Snowflake for the large-scale, computationally expensive querying we need to do.
A key aspect of this is Snowflake’s support for semi-structured data. For instance, all of the data about activities in the cloud control plane is JSON. Snowflake is able to deal with JSON data and allow us to do queries and data manipulations on it.
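To give a sense of what querying semi-structured data involves, here is a small Python sketch that uses only the standard library as a stand-in for a warehouse query. It does not use Snowflake or its API; the event field names (`eventName`, `user.name`, `request.bucket`) and the helper functions are hypothetical. The point is that records of different shapes can be queried by path without a fixed schema.

```python
import json
from collections import Counter

# Hypothetical control-plane events, one JSON document per line.
# Note that not every event carries a "request" object.
raw_events = [
    '{"eventName": "GetObject", "user": {"name": "svc-report"}, "request": {"bucket": "reports"}}',
    '{"eventName": "GetObject", "user": {"name": "svc-report"}, "request": {"bucket": "reports"}}',
    '{"eventName": "RunInstances", "user": {"name": "ops"}}',
    '{"eventName": "GetObject", "user": {"name": "svc-report"}, "request": {"bucket": "payroll"}}',
]


def get_path(record, path, default=None):
    """Follow a dotted path through nested dicts, tolerating missing keys."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def bucket_reads(lines):
    """Count GetObject events per (user, bucket) pair."""
    counts = Counter()
    for line in lines:
        event = json.loads(line)
        if event.get("eventName") != "GetObject":
            continue
        key = (get_path(event, "user.name"), get_path(event, "request.bucket"))
        counts[key] += 1
    return counts


reads = bucket_reads(raw_events)
for (user, bucket), n in sorted(reads.items()):
    print(user, bucket, n)
```

A summary like this is the kind of input a baseline could be built from: a service that has only ever read from one bucket and suddenly reads from another shows up as a new key.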
How did you get into security in the first place?
I had no particular interest in working in the field of security. I got into it by accident, like a lot of people who work in security.
I was doing a PhD in the mid-90s, and then the Internet happened; that is, everybody in the world started using the Internet through a Web browser. It was immediately clear that the Internet created a ridiculous number of security problems, to the extent that it seemed like this thing was never going to work. My PhD was going to be on some aspect of reliable distributed computing. Focusing on security was definitely a direct consequence of the Internet happening.
I founded my own startup, Green Border, 20 years ago to try to deal with the fact that every time you visited a web page, you were basically begging to be taken over by whoever ran that web server. We developed technology for Windows (since most Internet use was through Windows at the time) to virtualize and containerize the browser, very much like what Kubernetes applications or your operating-system-level virtualization does today. After Google acquired Green Border, that technology turned into the basis for how Google Chrome built a fundamentally better and more secure web browser.
More generally, I’ve never been interested in the reactive, rules-based approach to security, as you might be able to tell by the fact that my startup was doing containerization and mechanistic isolation to protect security. Most of my career I’ve been working on various systems that are often simply distributed systems that have a connection with security but, ultimately, are just systems for processing massive amounts of data reliably at scale. When I’ve worked on security, it has always involved a theme of trying to figure out: What are the things that have to happen for people to be able to do what they want to do? If we allow all the activity that needs to happen — and then define the complement of that necessary behavior as what we either don’t allow or raise an alert on — then nobody would run into security obstacles because everything that they wanted to have happen would be permitted.
Most notably, I’m known for something called Control-Flow Integrity, or CFI, which operates at a very low level of machine code instructions. It’s premised on the fact that, when a programmer is writing something, they have a model in their head of what’s supposed to happen on the machine at runtime. The compiler knows what that model is because the programmer writes functions, and functions have a start and an end. There’s effectively this contract between the person writing the program and the compiler stating that whenever a function starts, it should always start at the beginning. And whenever you call a function, or you make a “control flow,” you should be calling to the beginning of a function; the programmer wrote something thinking that this would be the only thing that should ever be allowed to happen. But it turns out that, at the lowest level, there are lots of other things that can happen. You can jump into the middle of a function, you can jump into the middle of a machine code instruction even… My big claim to fame is basically saying, “Hey, why don’t we just not allow any of that, and consequently a whole bunch of potential bad things will now be eliminated?” That’s actually been very influential and foundational to a lot of modern security, for instance in the instructions and hardware support in Arm and Intel processors. It doesn’t mean that security attacks are impossible, it just means that they are far harder to achieve. They’re either more costly or less likely, however you want to look at it. When they do happen, they have to happen in a way where they somehow appear as a part of the normal activity.
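The core check described above can be modeled in a few lines of Python. This is a toy simulation, not how CFI is implemented in compilers or hardware: the addresses, function table, and names are invented for the example, and it shows only the coarsest form of the policy (an indirect call must land on some function entry), whereas real schemes can restrict each call site to a narrower set of targets.

```python
# Toy "program": function name -> (entry address, end address).
# All addresses are hypothetical.
FUNCTIONS = {
    "parse": (0x1000, 0x10FF),
    "render": (0x1100, 0x11FF),
}

# The only legal targets of an indirect call: function entry points.
VALID_ENTRIES = frozenset(start for start, _end in FUNCTIONS.values())


class ControlFlowViolation(Exception):
    """Raised when a call targets anything other than a function entry."""


def checked_indirect_call(target):
    """Allow the transfer only if it lands on a known function entry."""
    if target not in VALID_ENTRIES:
        raise ControlFlowViolation(f"illegal call target {target:#x}")
    return target


def try_call(target):
    """Report whether a call to the given address would be permitted."""
    try:
        checked_indirect_call(target)
        return "allowed"
    except ControlFlowViolation:
        return "blocked"


print(try_call(0x1000))  # entry of parse
print(try_call(0x1040))  # middle of parse
```

The address `0x1040` lies inside `parse` but is not its entry point, so the jump is refused. An attacker who wants to redirect control is left with only the transfers the programmer's model already allows.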
You can look at what Lacework does very much as a high-level version of the same thing. Essentially, we aspire to figure out what’s normal, then we make sure that only that desired behavior can happen. Yes, bad things can still happen. But they will at least have to look like they’re part of what’s normal. And as we get better and better at figuring out what’s normal and constraining the system so that only normal things can happen, we can limit the attacker’s reach and minimize the security risk.
When you’re not tackling these major security threats, what do you like to do with your time?
I like skiing, sailing, and other ways of going fast. I also like building things. I’ve built several cars. I’m building an electric car at the moment. An electric Austin Mini. I am actually able to have the engine and transmission mounted into the subframe right here in front of me, in the office, which is a great improvement over working on old-style, oily combustion engines.
You obtained a PhD minor in English literature and wrote a thesis on James Joyce’s Ulysses. What is your favorite chapter?
Probably the Cyclops chapter, with the chariot going up into the sky. It’s very similar to a set of Icelandic heroic sagas which are full of exaggerated tales. Figures like Cú Chulainn in the Irish tradition. And, of course, the Irish are full of tall tales and even taller people in those tales.
The high order bit, I think, is that storytelling is so important to everything that people do, and especially for us technologists. Having a good background in stories, how to tell them and how to convey things is the key to making an impact.
~ Interview by Lee Ellison and Palmer Rampell
This interview has been lightly edited for clarity and concision. It was first published in Sutter Hill’s invite-only publication, Field Notes.
The views expressed here are those of the individuals quoted and are not the views of SHV or its affiliates. While taken from sources believed to be reliable, SHV has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice.