Efficient data governance with AI segmentation

Digital transformation has fundamentally changed how businesses interact with their partners, supply chains, and customers. It has also exponentially increased the amount of data generated and stored by organizations.

Our data conundrum

Modern enterprises generally have hundreds of terabytes, if not petabytes, of data, much of which is unstructured. This type of data can make up 80 to 90% of an enterprise’s entire data footprint, and because it is unstructured, it is largely ignored. However, certain elements of unstructured data contain sensitive information that may fall prey to breaches.

The conundrum: We don’t know which data is sensitive; it’s like trying to find a needle in a haystack.

New tools may replace cumbersome data governance methods

With an abundance of data accumulated over many years, queries from regulators and discovery orders from legal authorities sprout up frequently.

A typical reaction by data managers may be to put an immediate process in place — perhaps having employees sign a statement vowing not to store sensitive data and then conducting training about personally identifiable information (PII). But this is a mere “Band-Aid” solution placed on the process as they hope for the best.

Alternatively, data managers can sift through mounds of data. They scan each and every document, trying to unveil sensitive data. But scanning the petabytes of unstructured data would take years. It is also quite costly and too time-consuming to achieve desired results, which causes many data managers to eschew this approach.

Sensitive data and the rise of AI-based data segmentation

An effective and efficient technology is available to replace such archaic methods and reduce risk fast, at a fraction of the cost: artificial intelligence (AI) segmentation.

With AI-based segmentation, we ascertain what attributes of a file point to it being more likely to contain sensitive data after scanning just a small statistical sample of files. This provides us with important information to prioritize our search for high-risk data. For example, are Word documents at a higher risk than PowerPoint presentations? Is there a particular folder that is more likely to contain sensitive data?
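
As a rough illustration of the idea, the sketch below tallies how often sampled files turned out to contain sensitive data, grouped by a file attribute such as extension or folder. The record layout, the field names (`extension`, `folder`, `has_pii`) and the sample values are all hypothetical, made up for the example; a real segmentation model would weigh many attributes together rather than one at a time.

```python
from collections import defaultdict

def hit_rate_by_attribute(sample, attribute):
    """Return {attribute value: (files scanned, share of them with PII)}."""
    scanned = defaultdict(int)
    flagged = defaultdict(int)
    for record in sample:
        key = record[attribute]
        scanned[key] += 1
        if record["has_pii"]:
            flagged[key] += 1
    return {key: (scanned[key], flagged[key] / scanned[key]) for key in scanned}

# Hypothetical scan results for a handful of sampled files.
sample = [
    {"extension": ".docx", "folder": "hr", "has_pii": True},
    {"extension": ".docx", "folder": "hr", "has_pii": True},
    {"extension": ".docx", "folder": "marketing", "has_pii": False},
    {"extension": ".pptx", "folder": "marketing", "has_pii": False},
    {"extension": ".pptx", "folder": "hr", "has_pii": True},
    {"extension": ".pptx", "folder": "marketing", "has_pii": False},
]

by_extension = hit_rate_by_attribute(sample, "extension")
by_folder = hit_rate_by_attribute(sample, "folder")

for name, (count, rate) in sorted(by_folder.items()):
    print(f"{name}: {count} files sampled, {rate:.0%} with PII")
```

In this toy sample the folder is a stronger signal than the file type, which is exactly the kind of finding that would steer where to scan first.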

Once we have our riskiest data highlighted, we can immediately start a full scan and remediation process, eliminating the highest risk as early in the process as possible. Thus, we have prioritized the remediation process to achieve the greatest risk reduction in the least amount of time.

For example, suppose we have many terabytes of data broken up into chunks of 100 terabytes. Indexing or scanning a single 100-terabyte chunk could require several months of work, and going through all of the data takes even longer.

However, if I instead take a statistical sample (that is, look at around 9,500 out of a total of 1 million files), I can be 95% confident that my results are within about one percentage point of the true figure.
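
For readers who want to check the arithmetic, here is a minimal sketch of one standard way to size such a sample: Cochran's formula with a finite population correction. The formula choice, the one-percentage-point margin and the worst-case assumption that half the files are sensitive (`p=0.5`) are my assumptions for the example; with them, the result lands close to the 9,500 figure.

```python
import math

def sample_size(population, margin=0.01, z=1.96, p=0.5):
    """Files to sample for a given margin of error.

    z=1.96 corresponds to 95% confidence; p=0.5 is the worst case
    (largest variance), used when the true PII rate is unknown.
    """
    # Cochran's formula for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks the sample slightly.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

n = sample_size(1_000_000)
print(n)  # 9513
```

Loosening the margin to two percentage points cuts the sample to roughly a quarter of that size, so the margin is the main lever on scanning effort.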

If, in the first 100 terabytes, my results say that 5% of the files contain personal information, I’d know that if I ran the same test another 100 times, then 95 times out of the hundred I’d be within one percentage point of that 5% level (that is, 4 to 6% of the files contain PII). I can perform this iteration in a fraction of the time, hours instead of months, and have a good idea of how large the issue is.
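
The range around an observed rate can be sketched the same way. The example below uses the normal approximation for a proportion, with a finite population correction, and assumes (hypothetically) that 475 of 9,500 sampled files were flagged, which is the 5% rate described above. The function name and inputs are illustrative only.

```python
import math

def proportion_interval(hits, sample_n, population=None, z=1.96):
    """Approximate confidence interval for the share of files with PII."""
    p = hits / sample_n
    standard_error = math.sqrt(p * (1 - p) / sample_n)
    if population is not None:
        # Sampling without replacement from a finite set of files.
        standard_error *= math.sqrt((population - sample_n) / (population - 1))
    margin = z * standard_error
    return max(0.0, p - margin), min(1.0, p + margin)

low, high = proportion_interval(hits=475, sample_n=9_500, population=1_000_000)
print(f"{low:.1%} to {high:.1%}")
```

Under these assumptions the interval sits comfortably inside the 4 to 6% band, because a rate near 5% has less variance than the worst case used to size the sample.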

Then, if I look at a second 100 terabyte chunk, and 20% contains PII, I now have a prioritization. I know that my time is best served by looking at that second chunk of data first.
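
Put as code, the prioritization is simply a sort on the sampled rates. The chunk names and the dictionary layout here are hypothetical; the two rates are the ones from the example above.

```python
def rank_chunks(estimates):
    """Order chunks from highest to lowest estimated PII rate."""
    return sorted(estimates, key=lambda chunk: chunk["pii_rate"], reverse=True)

# Estimated share of files containing PII, taken from each chunk's sample.
chunks = [
    {"name": "chunk_1", "pii_rate": 0.05},
    {"name": "chunk_2", "pii_rate": 0.20},
]

order = [chunk["name"] for chunk in rank_chunks(chunks)]
print(order)  # ['chunk_2', 'chunk_1']
```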

But we can do even better. For that second chunk of data, we can apply AI models to further segment the 100-terabyte chunk into buckets based on the expected probability of a file having PII. We may find that only one terabyte out of the total 100 terabytes has a probability of more than 50% of containing PII.
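
A minimal sketch of the bucketing step follows. It assumes a model has already assigned each file a probability of containing PII; the model itself is out of scope here, and the file paths, scores, sizes and the 50% and 20% cut-offs are all invented for illustration.

```python
def bucket_by_probability(scored_files, high_cut=0.5, medium_cut=0.2):
    """Group (path, probability, size_gb) tuples into risk buckets."""
    buckets = {"high": [], "medium": [], "low": []}
    for path, probability, size_gb in scored_files:
        if probability > high_cut:
            buckets["high"].append((path, size_gb))
        elif probability > medium_cut:
            buckets["medium"].append((path, size_gb))
        else:
            buckets["low"].append((path, size_gb))
    return buckets

def bucket_sizes(buckets):
    """Total size in GB per bucket, to show how much must be scanned first."""
    return {name: sum(size for _, size in files) for name, files in buckets.items()}

# Hypothetical model output: (file path, probability of PII, size in GB).
scored = [
    ("hr/offer_letters.docx", 0.91, 2),
    ("finance/payroll.xlsx", 0.74, 5),
    ("marketing/deck.pptx", 0.08, 40),
    ("eng/design.pdf", 0.31, 12),
]

buckets = bucket_by_probability(scored)
sizes = bucket_sizes(buckets)
print(sizes)  # {'high': 7, 'medium': 12, 'low': 40}
```

The point of the size totals is the same as in the text: the high-risk bucket is usually a small slice of the whole, so a full scan of it is cheap relative to the risk it removes.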

I will then scan that highest-risk terabyte in full and remediate the issues. I can then move on to the next riskiest area and then the next riskiest area. Progress has improved by leaps and bounds compared to sifting through all 200 terabytes from beginning to end. This approach is an effective, efficient, reliable and accepted means of validating data.

Regulators and legal authorities are always looking for companies to take reasonable steps toward compliance. This approach is pragmatic and results in the fastest possible reduction in files containing sensitive data.

Save time and reduce cost as you work toward compliance

Using a prioritized approach to data governance makes good sense. AI segmentation and scanning, based on statistical sampling with a reasonable confidence interval, helps identify sensitive data efficiently and effectively. While I’ve focused mostly on privacy use cases, this same process for identifying data can be applied to many other use cases, including highlighting corporate IP, data relevant to divestiture, and regulated data. We can help find the needles in your haystack a lot faster through the use of sampling and segmentation.

Will Jaibaji is a cofounder and SVP of product strategy at Breakwater Solutions in Austin, Texas.
