The purpose of designing and training algorithms is to set them loose in the real world, where we expect performance to mimic that of our carefully curated training data set. But as Mike Tyson put it, “everyone has a plan, until they get punched in the face.” And in this case, your algorithm’s meticulously optimized performance may get punched in the face by a piece of data completely outside the scope of anything it encountered previously.
When does this become a problem? To understand, we need to return to the basic concepts of interpolation vs. extrapolation. Interpolation estimates a value within the range of known values; extrapolation estimates a value beyond that range. If you’re a parent, you can probably recall your young child calling any small four-legged animal a cat, because their first classifier used only minimal features. Once they were taught to factor in additional features, they were able to correctly identify dogs too. Extrapolation is difficult, even for humans. Our models, smart as they might be, are interpolation machines. When you set them an extrapolation task beyond the boundaries of their training data, even the most complex neural nets may fail.
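A toy sketch (my construction, not the author’s) makes this concrete: a polynomial fitted to sin(x) tracks the function closely inside its training range but diverges wildly beyond it.

```python
import numpy as np

# Fit a degree-5 polynomial to sin(x) on [0, 2*pi].
x_train = np.linspace(0, 2 * np.pi, 50)
y_train = np.sin(x_train)
model = np.poly1d(np.polyfit(x_train, y_train, deg=5))

x_interp = np.pi / 3    # inside the training range
x_extrap = 4 * np.pi    # far outside the training range

err_interp = abs(model(x_interp) - np.sin(x_interp))
err_extrap = abs(model(x_extrap) - np.sin(x_extrap))

print(f"interpolation error: {err_interp:.4f}")  # small
print(f"extrapolation error: {err_extrap:.4f}")  # enormous
```

The same failure mode hits far more sophisticated models: outside the region the training data covers, the learned function is unconstrained.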
What are the consequences of this failure? Garbage in, garbage out. Beyond the deterioration of model results in the real world, errors can propagate back into the training data of production models, reinforcing erroneous results and degrading performance over time. In the case of mission-critical algorithms, as in healthcare, even a single erroneous result should not be tolerated.
What we need to adopt, and this problem is not unique to machine learning, is data validation. Google engineers published their method of data validation in 2019 after running into a production bug. In a nutshell, every incoming batch of data is examined for anomalies, some of which can only be detected by comparing training and production data. Implementing a data validation pipeline had several positive outcomes. One example the authors present in the paper is the discovery of missing features in the Google Play store recommendation algorithm; when the bug was fixed, app install rates increased by 2 percent.
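The Google team’s actual implementation is TensorFlow Data Validation; as a rough hand-rolled sketch of the core idea (the function and field names here are illustrative, not from the paper), a schema inferred from training data can flag missing features and out-of-range values in a production batch:

```python
from dataclasses import dataclass

@dataclass
class FeatureSchema:
    """Range a feature was observed to cover during training."""
    name: str
    min_value: float
    max_value: float

def infer_schema(training_data: dict[str, list[float]]) -> list[FeatureSchema]:
    """Record each feature's observed range from the training set."""
    return [FeatureSchema(name, min(vals), max(vals))
            for name, vals in training_data.items()]

def validate_batch(schema: list[FeatureSchema],
                   batch: dict[str, list[float]]) -> list[str]:
    """Compare an incoming production batch against the training-time schema."""
    anomalies = []
    for feat in schema:
        if feat.name not in batch:  # feature silently dropped upstream
            anomalies.append(f"{feat.name}: missing from batch")
            continue
        for v in batch[feat.name]:
            if not (feat.min_value <= v <= feat.max_value):
                anomalies.append(
                    f"{feat.name}: {v} outside "
                    f"[{feat.min_value}, {feat.max_value}]")
    return anomalies
```

A batch that is missing a feature, as in the Play store example, would surface immediately rather than silently degrading recommendations.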
Researchers from UC Berkeley evaluated the robustness of 204 image classification models in adapting to distribution shifts arising from natural variation in data. Despite the models being able to adapt to synthetic changes in data, the team found little to no adaptation in response to natural distribution shifts, and they consider this an open research problem.
Clearly this is a problem for mission critical algorithms. Machine learning models in healthcare bear a responsibility to return the best possible results to patients, as do the clinicians evaluating their output. In such scenarios, a zero-tolerance approach to out-of-bounds data may be more appropriate. In essence, the algorithm should recognize an anomaly in the input data and return a null result. Given the tremendous variation in human health, along with possible coding and pipeline errors, we shouldn’t allow our models to extrapolate just yet.
I’m the CTO at a health tech company, and we combine these approaches: We conduct a number of robustness tests on every model to determine whether model output changes with variation in the features of our training sets. This step lets us learn each model’s limitations across multiple dimensions, and we also use explainable AI models for scientific validation. In addition, we set out-of-bounds limits on our models to ensure patients are protected.
If there’s one takeaway here, it’s that you need to implement feature validation for your deployed algorithms. Every feature is ultimately a number, and the range of numbers encountered during training is known. At a minimum, adding a validation step that checks whether each feature value in a given run falls within the training range will increase model quality.
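Combined with the zero-tolerance approach above, that takeaway can be sketched as a guard around inference (the wrapper and its names are hypothetical, not any specific product’s API): refuse to predict, returning a null result, when any feature falls outside the range seen during training.

```python
def guarded_predict(model, feature_ranges, features):
    """Run the model only if every feature is within its training range.

    feature_ranges maps feature name -> (min, max) observed during training.
    Returns None (a null result) for out-of-bounds input instead of
    letting the model extrapolate.
    """
    for name, value in features.items():
        lo, hi = feature_ranges[name]
        if not (lo <= value <= hi):
            return None  # abstain rather than extrapolate
    return model(features)

# Illustrative usage with a dummy model:
ranges = {"age": (0.0, 100.0)}
dummy_model = lambda f: "prediction"

print(guarded_predict(dummy_model, ranges, {"age": 42.0}))   # predicts
print(guarded_predict(dummy_model, ranges, {"age": 150.0}))  # abstains
```

For clinical settings, a returned null can be routed to a human reviewer instead of being treated as a model output.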
Bounding models should be fundamental to trustworthy AI. There is much discussion of designing for robustness and testing with adversarial attacks (inputs crafted specifically to fool models). These tests can help harden models, but only against known or foreseen examples. Real-world data can be unexpected, beyond the ranges covered by adversarial testing, which makes feature and data validation vital. Let’s design models smart enough to say “I know that I know nothing” rather than running wild.
Niv Mizrahi is Co-founder and CTO of Emedgene and an expert in machine learning, big data, and large-scale distributed systems. He was previously Director of Engineering at Taykey, where he built an R&D organization from the ground up and managed the research, big data, automation, and operations teams.