How AI 'data drift' may have caused the Equifax credit score glitch

From March 17 to April 6, 2022, a problem in the systems of credit reporting agency Equifax caused incorrect credit scores to be reported for consumers.

Equifax described the problem as a 'coding issue', and it has led to legal claims and a class action lawsuit against the company. Some have speculated that the issue was related to the AI systems the company uses to help calculate credit scores. Equifax did not respond to VentureBeat's request for comment.

"When it comes to Equifax, there is no shortage of finger-pointing," Thomas Robinson, vice president of strategic partnerships and corporate development at Domino Data Lab, told VentureBeat. "But from an artificial intelligence perspective, what went wrong appears to be a classic issue, errors were made in the data feeding the machine learning model."

Robinson added that the errors could have come from any number of different situations, including labels that were updated incorrectly, data that was manually ingested incorrectly from the source or an inaccurate data source.

The risks of data drift on AI models

Another possibility, suggested by Krishna Gade, cofounder and CEO of Fiddler AI, is a phenomenon known as data drift. Gade noted that according to reports, the credit scores were sometimes off by 20 points or more in either direction, enough to alter the interest rates consumers were offered or to result in their applications being rejected altogether.

Gade explained that data drift can be defined as the unexpected and undocumented changes to the data structure, semantics and distribution in a model.
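One common way to quantify the distribution part of that definition is the Population Stability Index (PSI), which compares how a feature is spread across bins in a baseline sample versus a live sample. The sketch below is illustrative only: the function name, bin count and synthetic data are assumptions, and nothing here reflects how Equifax or Fiddler AI measure drift.

```python
import math
import random


def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp live values outside the baseline range into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data (hypothetical): the live feature shifts by one standard deviation.
rng = random.Random(0)
baseline = [rng.gauss(50, 10) for _ in range(5000)]
same = [rng.gauss(50, 10) for _ in range(5000)]
shifted = [rng.gauss(60, 10) for _ in range(5000)]

psi_same = psi(baseline, same)
psi_shifted = psi(baseline, shifted)
print(f"no drift: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

A frequently cited rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, though appropriate thresholds depend on the feature and the application.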

He noted that drift can be caused by changes in the world, changes in the usage of a product, or data integrity issues, such as bugs and degraded application performance. Data integrity issues can occur at any stage of a product’s pipeline. Gade commented that, for example, a bug in the front end might permit a user to input data in an incorrect format and skew the results. Alternatively, a bug in the back end might affect how that data gets transformed or loaded into the model.
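The front-end example can be made concrete with a simple integrity check that rejects records whose fields have the wrong type or fall outside an expected range before they reach a model. This is a minimal sketch under stated assumptions: the field names, ranges and `validate_record` helper are hypothetical and are not drawn from any real credit scoring system.

```python
def validate_record(record, schema):
    """Return a list of integrity problems found in one input record."""
    problems = []
    for field, (expected_type, lo, hi) in schema.items():
        if field not in record:
            problems.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
        elif not lo <= value <= hi:
            problems.append(f"{field}: {value} outside [{lo}, {hi}]")
    return problems


# Hypothetical schema: field name -> (type, minimum, maximum).
SCHEMA = {
    "credit_utilization": (float, 0.0, 1.0),
    "open_accounts": (int, 0, 200),
}

good = {"credit_utilization": 0.35, "open_accounts": 4}
bad = {"credit_utilization": "35%", "open_accounts": 4}  # wrong input format

print(validate_record(good, SCHEMA))
print(validate_record(bad, SCHEMA))
```

Here the second record carries a percentage string where a fraction is expected, the kind of format error a front-end bug could let through.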

Data drift is not uncommon, either.

"We believe this happened in the case of the Zillow incident, where they failed to forecast house prices accurately and ended up investing hundreds of millions of dollars," Gade told VentureBeat.

In Gade's view, data drift incidents happen because the machine learning process of dataset construction, model training and model evaluation implicitly assumes that the future will be the same as the past.

"In effect, ML algorithms search through the past for patterns that might generalize to the future," Gade said. "But the future is subject to constant change, and production models can deteriorate in accuracy over time due to data drift."

Gade suggested that if an organization notices data drift, a good place to start remediation is to check for data integrity issues. The next step is to dive deeper into model performance logs to pinpoint when the change happened and what type of drift is occurring.
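As a rough illustration of pinpointing when a change happened, the sketch below scans logged model outputs window by window and reports the first window whose mean departs sharply from a baseline. The function name, the z-score threshold and the synthetic scores are all assumptions made for this example. It is not a description of any vendor's tooling, and a production system would likely use more robust statistics.

```python
from statistics import mean, stdev


def first_drifted_window(baseline, windows, z_threshold=4.0):
    """Return the label of the first window whose mean is far from the baseline mean.

    `windows` is an ordered list of (label, values) pairs, for example one per day.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variance")
    for label, values in windows:
        # Standard error of the window mean under the baseline distribution.
        standard_error = sigma / len(values) ** 0.5
        z = abs(mean(values) - mu) / standard_error
        if z > z_threshold:
            return label
    return None


# Synthetic logs (hypothetical): scores centered on 700, then shifted by 20 points.
baseline = [700 + (i % 21) - 10 for i in range(210)]
windows = [
    ("day-1", [700 + (i % 11) - 5 for i in range(110)]),
    ("day-2", [720 + (i % 11) - 5 for i in range(110)]),
    ("day-3", [720 + (i % 11) - 5 for i in range(110)]),
]

print(first_drifted_window(baseline, windows))
```

Knowing the first affected window narrows the search to the deployments, pipeline changes or data source updates that landed just before it.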

"Model explainability measures can be very useful at this stage for generating hypotheses," Gade said. "Depending on the root cause, resolving a feature drift or label drift issue might involve fixing a bug, updating a pipeline, or simply refreshing your data."

Playtime is over for data science

There is also a need for the management and monitoring of AI models. Gade said that robust model performance management techniques and tools are important for every company operationalizing AI in their critical business workflows.

Robinson also emphasized the need for companies to keep track of their ML models and ensure they are working as intended.

"Playtime is over for data science," Robinson said. "More specifically, for organizations that create products with models that are making decisions impacting people’s financial lives, health outcomes and privacy, it is now irresponsible for those models not to be paired with appropriate monitoring and controls."