Breaking 'bad data' with machine learning

"An underlying issue that most enterprise organizations struggle with is that their data is a disaster," noted Anthony Deighton, chief product officer at AI-powered data unification company Tamr. Deighton was moderating a panel at VentureBeat's Transform 2021 event today, which delved into practical and academic perspectives on how companies -- particularly financial institutions -- can use machine learning (ML) to improve the quality and reliability of their data.

Deighton was joined by Tamr cofounder Michael Stonebraker, a renowned computer scientist who specializes in database research and winner of the 2014 Turing Award, and by Jonathan Holman, head of digital transformation at financial services company Santander U.K., a Tamr customer.

So what is the problem that Tamr, ultimately, is setting out to solve?

"Every single large organization -- including Santander -- is heavily siloed," Stonebraker said. "You buy and sell organizations, and they come with their own data silos. There is huge business value to integrating the silos."

As Santander's U.K. chief for digital transformation, Holman is also well aware of the value that data holds.

"It's [data] the key enabler -- if you don't understand your business, and if you don't do that through data, you can't manage it, you can't achieve your strategy, you can't even really set your strategy," Holman said. "How can you know where you're going if you don't know where you're at?"

The data problem

Santander uses Tamr 360 to "cleanse and master customer data," joining the dots between internal and external data sources (structured and unstructured) to build a holistic view of its customers.

Given that data is often stored in siloed systems within different business units (e.g., retail and corporate) and in SaaS applications that aren't interoperable, there may be hundreds of duplicate customer records. To get a unified view of each customer, Santander reconciled records from multiple databases, product systems, and other data sources into a data lake. It used Tamr's data mastering smarts to identify potential duplicate records, and it set "confidence thresholds" for the machine learning models -- if a model exceeded the threshold, a feedback workflow then engaged experts in the bank to correct the data and improve the model.
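The thresholding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tamr's or Santander's implementation: the records, the field names, the 0.8 threshold, and the match_score helper are all hypothetical, and a simple string-similarity score stands in for a trained machine learning model.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records from two siloed systems (illustrative data only).
records = [
    {"id": "retail-1", "name": "Acme Holdings Ltd", "postcode": "MK9 1AA"},
    {"id": "corp-7", "name": "ACME Holdings Limited", "postcode": "MK9 1AA"},
    {"id": "corp-9", "name": "Birch Lane Bakery", "postcode": "LS1 4DY"},
]

# Assumed value; the article does not say what thresholds Santander used.
CONFIDENCE_THRESHOLD = 0.8


def match_score(a, b):
    """Stand-in for a model's duplicate confidence, in the range 0.0 to 1.0."""
    name_similarity = SequenceMatcher(
        None, a["name"].lower(), b["name"].lower()
    ).ratio()
    same_postcode = 1.0 if a["postcode"] == b["postcode"] else 0.0
    # Weighted blend of name similarity and an exact postcode match.
    return 0.7 * name_similarity + 0.3 * same_postcode


def candidate_duplicates(recs, threshold):
    """Return record pairs whose score exceeds the threshold, for expert review."""
    queue = []
    for a, b in combinations(recs, 2):
        score = match_score(a, b)
        if score > threshold:
            queue.append((a["id"], b["id"], round(score, 2)))
    return queue


review_queue = candidate_duplicates(records, CONFIDENCE_THRESHOLD)
for left, right, score in review_queue:
    print(f"Possible duplicate: {left} / {right} (score {score})")
```

In this sketch, only the two "Acme" records land in the review queue. In a real workflow, the experts' decisions on queued pairs would be fed back as training labels so the model improves over time; that feedback loop is not modeled here.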

The cleansed data is then fed back into the data lake, where it can be used to power all manner of decision-making processes. These include the bank's credit lending system, where Santander said the cleansed data enabled it to cut its corporate credit decision times in half. However, beyond keeping lucrative customers happy through expediency, getting the data management side of things right is particularly important for highly regulated industries such as financial services.

"How can you really be sure of your capital position as a financial services institution if you're not certain of your single customer record, with all the products, with all the transactions, with all the people that are associated with that business in one place?" Holman asked. "It means you can monitor them, you can assess them against ... whether or not they might be [for example] politically exposed or a sanctioned terrorist. Or just to understand how much money you might have out to them any one time, so if things went wrong, how much security or collateral you might have to reclaim that money."

Machine learning

Getting the data right, in which centralization and consolidation play a major part, is vital. But "getting data right" at scale -- in a bank with large volumes of data and many disparate sources -- benefits from an automated machine learning approach versus a manual rules-based approach, according to Stonebraker.

"If you want to do this for 5,000 records, use whatever you want, but at scale -- 14 million records -- rules-based systems do not work," Stonebraker said. "There is ample evidence for that. It just takes too many rules and no one can understand thousands of rules, so the technology just plain fails. At scale, if you want to do data integration of customers, suppliers, products, whatever it is that you're trying to master, you've got to do it using machine learning."

For big companies -- the sort of companies that will likely need to manage and consolidate data at scale -- Stonebraker advised bypassing manual rules-based systems from the get-go when consolidating and unifying data, even if they are starting off in a small way. "The way you make mistakes is by first of all trying to apply rules-based systems. If you think you are going to have to do this at scale at some point in the future, then go directly to machine learning," he said. "Machine learning is in your future, so if you can easily do so, hire a machine learning wizard that will make your journey go much smoother."

But before a business can enjoy the fruits of machine learning, it still has to get the data right.

"Every enterprise is interested in data science, and often that is machine learning models to understand your business," Stonebraker said. "However, if you have dirty data, your machine learning models are going to be crap."
