A call for data-first security

Over the past two decades, security has become increasingly granular, going deeper into the stack generation after generation: from hardware to network, server, container and now, increasingly, code.

It should be focused on the data. First.

The next frontier in security is data, especially sensitive data. Sensitive data is the data organizations don’t want to see leaked or breached. This includes protected health information (PHI), personally identifiable information (PII), personal data (PD) and financial data. A breach of sensitive data carries real penalties. Some are tangible, such as GDPR fines (up to €10m or 2% of annual global turnover, rising to €20m or 4% for the most serious infringements), FTC fines (e.g., $150m against Twitter) and legal fees. Then there are intangible costs, such as the loss of customer trust (e.g., Chegg exposed data belonging to 40 million users), restructuring pain, and worse.

Today’s data protection technologies rely too heavily on bolt-on approaches. Just look at identity management. It’s designed to verify who’s who. In reality, these approaches contain inevitable points of failure. Once authorized by identity management, users have carte blanche to access important data with minimal constraints.

What would happen if you made data the center of the security universe?

One of the most precious assets organizations want to protect is data, and massive data breaches and data leaks occur all too often. It's time for a new evolution of cybersecurity: data-first security.

Data is different

First, let’s acknowledge that data doesn’t exist in a vacuum. If you’ve struggled to comprehend and abide by GDPR, you know that data is tightly coupled to many systems. Data is processed, stored, copied, modified and transferred by and between systems. At every step, the vulnerability potential increases. That’s because the systems associated with these steps are vulnerable, not because the data is.

The basic concept is simple. Stop focusing on every system individually without any knowledge of the data they carry and the links between them. Instead, start with data, then pull the thread. Is sensitive data involved in chatty loggers? Is data shared with non-authorized third parties? Is data stored in S3 buckets missing security controls? Is data missing encryption? The list of potential vulnerabilities is long.
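To make the first of those questions concrete, here is a minimal sketch of what "pulling the thread" from the code could look like for chatty loggers. It assumes a small, hypothetical list of sensitive field names (SENSITIVE_NAMES) and treats any call to a method named like a standard logging level as a logging call, which will over-match in real codebases. It illustrates the idea and is not how any particular product works.

```python
import ast

# Hypothetical list of names treated as sensitive; a real list would be far longer.
SENSITIVE_NAMES = {"email", "ssn", "date_of_birth", "diagnosis", "card_number"}
# Method names that usually indicate a logging call (an assumption, not a guarantee).
LOG_METHODS = {"debug", "info", "warning", "error", "exception", "critical"}


def find_sensitive_logging(source: str) -> list[tuple[int, str]]:
    """Return (line, name) pairs where a sensitive name is passed to a logging call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        if node.func.attr not in LOG_METHODS:
            continue
        # Inspect only the arguments of the call, including values inside f-strings.
        for arg in list(node.args) + [kw.value for kw in node.keywords]:
            for sub in ast.walk(arg):
                name = None
                if isinstance(sub, ast.Name):
                    name = sub.id
                elif isinstance(sub, ast.Attribute):
                    name = sub.attr
                if name and name.lower() in SENSITIVE_NAMES:
                    findings.append((node.lineno, name))
    return findings


SAMPLE = '''
import logging
logger = logging.getLogger(__name__)

def register(user):
    logger.info("new signup: %s", user.email)
    logger.debug("plan selected: %s", user.plan)
'''

print(find_sensitive_logging(SAMPLE))
```

The same pattern extends to the other questions: swap the logging check for calls to third-party SDKs or storage clients, and the starting point stays the data rather than the system.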

The challenge with data security is that data flows almost infinitely across systems, especially in a cloud-native infrastructure. In an ideal world, we should be able to follow the data and its associated risks and vulnerabilities across every system, at any time. In reality, we are far from this.

Data-first security should start in the code. That means with developers: Shift left. According to GitLab, 57% of security teams have shifted security left already or are planning to this year. Start at the beginning of the journey, securing data while you code.

But the dirty secret of shift-left is that too often it simply means organizations push more work onto the engineering team. For example, they might have them complete surveys and questionnaires that somehow assume they have expertise in data governance requirements across global economies, local markets and highly-regulated vertical industries. That’s not what developers do.

So a data-first security approach must meet three requirements: 1) It can’t be another security liability; 2) It must understand ownership context; 3) It must protect against errors in custom business logic (not every breach involves a bug).

Not another security liability

Security is about mitigating risk. Adding a new tool or vendor goes against this basic principle. We all have SolarWinds in mind, but similar incidents emerge daily. Integrating a new tool with your production environment is a big ask, not only for the security team but also for the SRE/Ops team. Performing data discovery on production infrastructure means looking at actual values, potentially customer data, which is essentially what we are trying to protect in the first place. Maybe the best way to not become yet another risk is to simply not access sensitive infrastructure and data.

Since a data-first security approach relies on sensitive data knowledge, it might be surprising that this discovery can be performed from the codebase alone, especially when we’re used to data loss prevention (DLP) and data security posture management (DSPM) solutions that perform discovery on production data. It’s true that in the codebase we don’t have access to actual data (values), only metadata. Interestingly, though, discovering sensitive data this way is also very accurate: the lack of access to values is counterbalanced by access to a massive amount of context, which is key for classification.
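As a rough illustration of classifying from metadata alone, the sketch below looks only at a field name and the name of the model that contains it, never at a value. The patterns, the context hints and the Field type are all hypothetical and deliberately tiny; a real classifier would draw on far more context (types, comments, call sites, schemas).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical name patterns per category; real rule sets are much richer.
PATTERNS = {
    "PII": re.compile(r"(^|_)(email|phone|first_name|last_name|address|ssn)($|_)"),
    "PHI": re.compile(r"(^|_)(diagnosis|prescription|blood_type|allergy)($|_)"),
    "financial": re.compile(r"(^|_)(iban|card_number|account_number|salary)($|_)"),
}
# Words in the enclosing model name that resolve an otherwise ambiguous field.
CONTEXT_HINTS = {"patient": "PHI", "payment": "financial", "customer": "PII"}
AMBIGUOUS = {"name", "number", "notes", "id"}


@dataclass
class Field:
    model: str  # enclosing class or table name, e.g. "PatientRecord"
    name: str   # field or column name, e.g. "notes"


def classify(field: Field) -> Optional[str]:
    """Classify a field from its name and surrounding context, without any values."""
    name = field.name.lower()
    for category, pattern in PATTERNS.items():
        if pattern.search(name):
            return category
    # The name alone is not enough: fall back on the context it appears in.
    if name in AMBIGUOUS:
        model = field.model.lower()
        for hint, category in CONTEXT_HINTS.items():
            if hint in model:
                return category
    return None


for f in [Field("User", "email"), Field("PatientRecord", "notes"), Field("BlogPost", "notes")]:
    print(f.model, f.name, "->", classify(f))
```

The second and third fields share the same name, yet only the one inside a patient model is flagged. That is the kind of context a scan of raw production values does not see.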

As valuable as traditional shift-left security is, a data-first security approach provides even more value when it comes to not being yet another risk for the organization.

Ownership context

When it comes to data security and data protection, not everything is black or white. Some risks and vulnerabilities are extremely easy to identify, such as a logger leaking PHI or an SQL injection exposing PD. Others require a certain level of discussion to assess risk and ultimately decide on the best remediation. Here we enter the borderline territory of compliance, which is never very far away when we are talking about data security.

Why are we storing this data? What’s the business reason for sharing this data with this third party? These are questions that organizations must answer at a certain point. Today these questions are increasingly handled by security teams, especially in cloud-native environments. Answering them, and identifying associated risks, is nearly impossible without unveiling the “ownership.”

By doing data-first security from the point of view of the code, we have direct access to massive contextual information — in particular, when something has been introduced and by whom. DSPM solutions simply can’t provide this context by looking exclusively at production data stores.
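One way to see how much ownership context the code already holds is version control itself. The sketch below parses the output of `git blame --line-porcelain` to answer "who introduced this line, and when" for lines that mention a given keyword. The sample output, author names and file path are made up for illustration, and the helper names (parse_line_porcelain, blame, owners_of) are hypothetical.

```python
import subprocess
from datetime import datetime, timezone


def parse_line_porcelain(output: str) -> list[dict]:
    """Parse `git blame --line-porcelain` output into one record per source line."""
    records, current = [], {}
    for raw in output.split("\n"):
        if not raw:
            continue
        if raw.startswith("\t"):
            # A tab-prefixed line is the source line itself and closes the record.
            current["code"] = raw[1:]
            records.append(current)
            current = {}
        elif not current:
            # First line of a record: "<sha> <orig-line> <final-line> [<count>]"
            parts = raw.split()
            current = {"commit": parts[0], "line": int(parts[2])}
        elif raw.startswith("author "):
            current["author"] = raw[len("author "):]
        elif raw.startswith("author-time "):
            ts = int(raw.split()[1])
            current["when"] = datetime.fromtimestamp(ts, tz=timezone.utc)
    return records


def blame(path: str, repo: str = ".") -> list[dict]:
    """Run git blame on a file inside a local repository (requires git)."""
    out = subprocess.run(
        ["git", "blame", "--line-porcelain", "--", path],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return parse_line_porcelain(out)


def owners_of(records: list[dict], keyword: str) -> list[tuple[int, str, str]]:
    """Return (line, author, date) for every line whose code mentions the keyword."""
    return [
        (r["line"], r["author"], r["when"].date().isoformat())
        for r in records
        if keyword in r["code"]
    ]


# Made-up blame output for two lines of a hypothetical jobs/export.py.
SAMPLE = "\n".join([
    "a" * 40 + " 12 12 1",
    "author Jane Doe",
    "author-mail <jane@example.com>",
    "author-time 1700000000",
    "author-tz +0000",
    "summary Add signup export",
    "filename jobs/export.py",
    "\trow = {'email': user.email}",
    "b" * 40 + " 13 13 1",
    "author Sam Lee",
    "author-mail <sam@example.com>",
    "author-time 1700086400",
    "author-tz +0000",
    "summary Tidy formatting",
    "filename jobs/export.py",
    "\twriter.writerow(row)",
])

print(owners_of(parse_line_porcelain(SAMPLE), "email"))
```

In a real repository, `owners_of(blame("jobs/export.py"), "email")` would give the same kind of answer, which is often enough to route a question to the right person instead of to the whole engineering team.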

Too often organizations rely on “manual assessment.” They send questionnaires to the entire engineering team to understand which sensitive data is processed, why and how. Developers loathe these questionnaires and often don’t understand many of the questions. The poor data security results are predictable.

As with most “technical” things, if you are serious about data security, especially at scale, the most effective approach is to automate tedious tasks with a process that drops into existing workflows with minimal or no friction.

Custom business logic

As every organization is different, coding practices and associated policies differ, especially for larger engineering teams. We’ve seen many companies doing application-level encryption, end-to-end encryption or connecting to their data warehouse in very specific ways. Most of these logic flows are extremely difficult to detect outside the code, resulting in a lack of monitoring, and introducing security gaps.

Let’s take Airbnb as an example. It famously built its own data protection platform. What’s interesting to look at here is the custom logic the company implemented to encrypt its sensitive data. Instead of relying on a third-party encryption service or library (there are dozens), Airbnb built its own, Cipher. This provides libraries in different languages that allow developers to encrypt and decrypt sensitive data on the fly. Detecting this encryption logic, or more importantly the lack of it, on certain sensitive data outside of the codebase would prove very difficult.
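To show why this kind of check is tractable inside the codebase, here is a minimal sketch that flags writes to sensitive fields that do not go through an encryption helper. The `encrypt` function name and the field list are hypothetical stand-ins for whatever in-house library an organization uses; this is not Airbnb's implementation, and a real check would need to follow values across variables and functions.

```python
import ast

SENSITIVE_FIELDS = {"ssn", "card_number", "passport_number"}  # hypothetical list
ENCRYPT_FUNCS = {"encrypt"}  # hypothetical name of the in-house encryption helper


def is_encrypt_call(node: ast.AST) -> bool:
    """True if the expression is a direct call such as cipher.encrypt(...) or encrypt(...)."""
    if not isinstance(node, ast.Call):
        return False
    func = node.func
    name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
    return name in ENCRYPT_FUNCS


def find_unencrypted_writes(source: str) -> list[tuple[int, str]]:
    """Return (line, field) for assignments to sensitive attributes that skip encryption."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if (isinstance(target, ast.Attribute)
                    and target.attr in SENSITIVE_FIELDS
                    and not is_encrypt_call(node.value)):
                findings.append((node.lineno, target.attr))
    return findings


SAMPLE = '''
def save(record, form, cipher):
    record.ssn = cipher.encrypt(form["ssn"])
    record.card_number = form["card_number"]
    record.nickname = form["nickname"]
'''

print(find_unencrypted_writes(SAMPLE))
```

From the outside, both columns simply hold strings. Only the code shows that one of them skipped the encryption step.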

But is code enough?

Starting a data-first security journey from code makes a lot of sense, especially since many insights found there are not accessible anywhere else (although it’s true that some information might be missing and only found at the infrastructure or production level).

Reconciling information between code and production is extremely difficult, especially with data assets flowing everywhere. Airbnb shows how complex it can be. The good news is that with the shift to infrastructure as code (IaC), we can make the connections at the code level and avoid dealing with painful reconciliation.
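As a sketch of what making the connection at the code level could mean, the example below joins two things that both live in the repository: data flows found by scanning application code, and resources declared as IaC. The resource shape here is a simplified, hypothetical one and not the schema of Terraform, CloudFormation or any other tool, and the flow records are assumed to come from a code scan.

```python
# Hypothetical, simplified IaC declarations keyed by resource name.
IAC_RESOURCES = {
    "user_exports": {"type": "object_storage_bucket", "encrypted": False, "public": False},
    "billing_db": {"type": "sql_database", "encrypted": True, "public": False},
}

# Hypothetical output of a code scan: which declared resource each data type is written to.
CODE_DATA_FLOWS = [
    {"data_type": "email", "resource": "user_exports", "file": "jobs/export.py"},
    {"data_type": "card_number", "resource": "billing_db", "file": "billing/store.py"},
    {"data_type": "email", "resource": "legacy_bucket", "file": "jobs/archive.py"},
]


def link_flows_to_infra(flows: list[dict], resources: dict) -> list[tuple[str, str, str]]:
    """Return (file, resource, issue) for sensitive flows that land on risky resources."""
    issues = []
    for flow in flows:
        res = resources.get(flow["resource"])
        if res is None:
            # The code writes somewhere the IaC does not describe.
            issues.append((flow["file"], flow["resource"], "not declared in IaC"))
        elif not res["encrypted"]:
            issues.append((flow["file"], flow["resource"], "sensitive data in unencrypted resource"))
        elif res["public"]:
            issues.append((flow["file"], flow["resource"], "sensitive data in public resource"))
    return issues


for issue in link_flows_to_infra(CODE_DATA_FLOWS, IAC_RESOURCES):
    print(issue)
```

Because both inputs are versioned alongside the application, the join can run on every change, with no production access required.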

Considering the challenges associated with security and data, every security solution will have to become at least “data-aware” and possibly “data-first” at whatever layer of the stack it operates in. We can already see cloud security posture management (CSPM) solutions blending with DSPM, but will it be enough?

Guillaume Montard is cofounder and CEO of Bearer.
