Can data privacy and data intelligence coexist?

We have to change how we approach customer data. Instead of the current “more is better” approach to data acquisition, we need to focus on collecting only the minimum amount we need to remain intelligent.

In a business climate where data is considered one of the most important resources for financial success, this may sound counter-intuitive. However, it's a change businesses will need to make, and collecting less data is actually not as risky as it sounds.

The common assumption in business today is that the more data your systems have access to, the more intelligent they will be. This is not always the case, however. And even where it is, the inverse — that less data must thus equal less intelligence — is emphatically not true.

When the assumption prevails that more data is a competitive differentiator, businesses are, in effect, incentivized to pursue ever more ways to gather data, often to disastrous effect.

Every day we see news about data breaches, leaks, and exposed vulnerabilities. We hear horror stories of identity theft and financial fraud, and we witness businesses suffering reputational damage, regulatory punishment, and consumer backlash because of their inability to protect the data they collect.

Privacy is only one of the problems associated with this overwhelming push for more data. There are also substantial costs associated with massive-scale data acquisition and management: computation costs, storage costs, operational costs, and more. We are in an era of big data, AI, and machine learning, and yet if data volume continues to be equated with system intelligence, these costs will continue to skyrocket.

Businesses today want to know absolutely everything they can about a customer. Customers, however, recoil at the idea of their every move being watched, recorded, processed, and analyzed. The more data businesses collect, the more exposed customers feel, and when customer data gets stolen, everyone loses. Everyone except the criminals.

But if we're smarter about what data we collect and how we process and analyze it, we actually don't need anywhere near the amounts of data we think we do.

The most crucial step is to move away from collecting and relying upon individual data and towards processing and analyzing aggregated data. For example, instead of analyzing data from a single IP address, we can look at the IP prefix, and in doing so still derive the intelligence we need. The advantage of this approach is that the more we can process data at the group level, the less we need to know about individual users. While this may seem paradoxical, the truth is that we can derive more relevant intelligence even as we require less data. When we engage in feature engineering, a critical part of building advanced models, we can create features based on aggregated data for a specific period of time: for example, a feature that counts the transactions processed from a particular device whose amounts exceed a defined threshold. With this approach, we don't need to know individual transaction amounts precisely.
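To make the idea concrete, here is a minimal sketch in Python of what aggregated features of this kind might look like. It is an illustration under stated assumptions, not a description of any production system: the event layout, the function names, and the /24 and /48 prefix lengths are all hypothetical. Each event carries only a flag saying whether the transaction exceeded the threshold, not the amount itself.

```python
import ipaddress
from collections import Counter

def ip_prefix(ip, v4_bits=24, v6_bits=48):
    """Map an address to its enclosing network prefix.

    The prefix lengths are illustrative assumptions, not recommendations.
    """
    addr = ipaddress.ip_address(ip)
    bits = v4_bits if addr.version == 4 else v6_bits
    # strict=False lets us pass a host address and get its network back.
    return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

def aggregate_features(events):
    """Build group-level counts from (ip, device_id, over_threshold) events.

    The hypothetical event format keeps only a boolean flag, so the exact
    transaction amount is never stored.
    """
    events_per_prefix = Counter()
    large_tx_per_device = Counter()
    for ip, device_id, over_threshold in events:
        events_per_prefix[ip_prefix(ip)] += 1
        if over_threshold:
            large_tx_per_device[device_id] += 1
    return dict(events_per_prefix), dict(large_tx_per_device)

# Made-up events for one time window (documentation-reserved IP ranges).
events = [
    ("203.0.113.57", "dev-a", True),
    ("203.0.113.9", "dev-a", False),
    ("198.51.100.4", "dev-b", True),
    ("203.0.113.200", "dev-a", True),
]
per_prefix, large_per_device = aggregate_features(events)
```

The two resulting dictionaries describe prefixes and devices rather than individual users or individual amounts, which is the point of the approach.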

Additionally, with holistic analysis conducted at the group level, we can uncover patterns, trends, and commonalities across actions and accounts that wouldn’t be discernible at the individual level. This enables us to glean a unique layer of valuable insights without having to delve further into individual accounts. The net result is less demand for individual data and greater overall intelligence. Derived data adds another layer of benefit, in that, from one single data point, we can determine multiple additional features that enable us to further refine results. For example, we can look at IP ranges to draw distinctions between normal and abnormal mobility patterns, and in so doing, we can accurately determine whether an individual user is traveling, without having to know specifics such as flight and hotel details.
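As a rough illustration of derived data, the Python sketch below turns a single data point (the IP address seen on each event) into two mobility-related features. The function names, the IPv4-only /24 prefix, and the choice of features are assumptions made for this example. Deciding what counts as normal or abnormal mobility would require comparison against group-level behavior, which is not shown here.

```python
import ipaddress

def to_prefix(ip, bits=24):
    """Reduce an IPv4 address to its network prefix (the /24 length is an assumption)."""
    return str(ipaddress.ip_network(f"{ip}/{bits}", strict=False))

def mobility_features(ips_in_time_order):
    """Derive coarse mobility features from one account's time-ordered IPs.

    No location, flight, or hotel details are used, only the prefixes.
    """
    prefixes = [to_prefix(ip) for ip in ips_in_time_order]
    # Count how often consecutive events come from different prefixes.
    switches = sum(1 for a, b in zip(prefixes, prefixes[1:]) if a != b)
    return {
        "distinct_prefixes": len(set(prefixes)),
        "prefix_switches": switches,
    }

# Made-up sequence of addresses for one account.
features = mobility_features(
    ["203.0.113.5", "203.0.113.77", "198.51.100.9", "203.0.113.8"]
)
```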

Using these kinds of techniques represents a significant shift, and enables us to better align our efforts with evolving standards of big data ethics.

The more insight we can derive from the aggregated data, the less we have to ask of individual users. What makes this possible is unsupervised machine learning (UML).

Without UML, the focus is on using what we know about an individual to predict that individual’s future behavior. We have to run through this process over and over again, customer by customer. This is an extremely data-heavy approach. With UML, we can review users at the group level and derive valuable intelligence from observed correlations and patterns across accounts and actions. Ultimately, we only need a few data points about an individual in order to match them to a subgroup of users; we can then predict the future behavior of the subgroup.
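The idea of matching a user to a subgroup from only a few data points can be sketched with a small clustering example. The Python code below uses plain k-means as a stand-in for unsupervised grouping in general; it is not presented as the specific algorithm used in practice. The two features per account (for instance, a scaled event rate and a scaled count of distinct prefixes) and all values are hypothetical.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Tiny k-means. Initialization uses the first k points so the
    example is deterministic (a simplification, not a best practice)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                # New centroid is the per-dimension mean of the members.
                mean = tuple(sum(dim) / len(members) for dim in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids

def assign(point, centroids):
    """Match one account to its nearest subgroup using only its few features."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

# Made-up, unlabeled accounts described by two hypothetical features each.
accounts = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.9, 1.1), (9.1, 8.9)]
centroids = kmeans(accounts, k=2)

# A new account is placed in a subgroup from just two data points.
new_account_group = assign((1.1, 1.0), centroids)
```

No labels are supplied at any stage; the groups emerge from the data, and a prediction made for a subgroup applies to every account assigned to it.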

In my field, which is fraud prevention, proactive detection is vital to the success of our mission. To keep our customers safe, we must be able to detect burgeoning attacks before damage can occur. To achieve this, we work with aggregated data and employ holistic analysis to surface unusual correlations and patterns that indicate fraudulent and malicious actions and accounts. We can do this using only a small handful of data points obtained from and about any given user. We find that businesses often already have the data they need—they’re just not mining it effectively.

Developments across the global regulatory landscape make clear we are moving towards more privacy and transparency, and increasing restrictions on data collection. But this doesn’t need to mean a drop in our data intelligence. Through holistic data analysis practices and advanced AI and unsupervised machine learning, we can gain a high level of intelligence while preserving user privacy.

As realities around data acquisition and management change, the value of UML is rapidly becoming more apparent, particularly in comparison to supervised machine learning (SML). While SML makes a certain amount of sense in the context of an endlessly data-rich environment (the more we want to know, the more data we feed to our algorithms), there are significant problems associated with this unchecked pace of data acquisition. With UML, we can change the paradigm, because we can reduce the information we need to acquire from individuals. The implications for privacy are immediate. There are also clear advantages when it comes to matters such as bias, which can be introduced through the labeled data that SML requires. UML does not depend on human-assigned labels; it performs its groupings based solely on the patterns it discovers in unlabeled data. This enables us to identify new patterns that a traditional approach like SML would miss.

Already, we are seeing the banking and payments sector being proactive with these new capabilities. Financial services providers, for example, understand immediately the value UML can deliver. Intrusions into privacy, and heightened security and verification measures, have always been associated with increased friction for customers. With UML, these businesses are able to deliver the exemplary experiences their customers expect, without adding undue friction. Striking the right balance between risk management, customer experience, and ethical data acquisition is critical for financial institutions in today’s digital economy.

Today, we stand on the threshold of a new era, in which ethics and intelligence need not be mutually exclusive.

Yinglian Xie is CEO and Co-Founder of DataVisor
