AI holds enormous promise for the public sector, at least in theory. In practice, integrating AI components into public sector applications is limited by the fragility of those components and by mismatches among them. For example, if a machine learning model is trained on data whose distribution differs from that of its operational environment, the model's performance can degrade dramatically.
This conundrum motivated researchers at Carnegie Mellon to investigate recurring classes of mismatches in AI systems integration and to identify the assumptions made by practitioners in different roles, including data scientists, software engineers, and operations staff. Their aim was to find ways to communicate the appropriate information explicitly while developing practices to mitigate the impact of mismatches.
The coauthors, whose paper detailing the work was accepted to this year's Artificial Intelligence in Government and Public Sector conference, point out that deploying AI models in production remains a formidable challenge. That's because a model's development and operation typically involve three different perspectives: those of a data scientist, a software engineer, and an operations staffer. The data scientist builds and trains the model and tests it against a set of common metrics; the software engineer integrates the trained model into a larger system; and the operations staffer deploys, operates, and monitors the entire system.
These three perspectives operate separately and use different jargon, the researchers say, leading to mismatched assumptions. As a result, the computing resources used during model testing often differ from those available during operations, causing poor performance. Worse, monitoring tools often aren't set up to detect diminishing model accuracy or system failures.
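To make the monitoring gap concrete, here is a minimal sketch of the kind of runtime accuracy check the researchers say is often missing. It assumes ground-truth labels eventually arrive for predictions; the window size and threshold are illustrative values, not from the paper.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks rolling accuracy over recent predictions and flags degradation."""

    def __init__(self, window=100, threshold=0.8):
        self.results = deque(maxlen=window)  # True/False per prediction
        self.threshold = threshold

    def record(self, predicted, actual):
        self.results.append(predicted == actual)

    def degraded(self):
        # No alert until at least one labeled outcome has been observed.
        if not self.results:
            return False
        accuracy = sum(self.results) / len(self.results)
        return accuracy < self.threshold

# Two of four recent predictions are wrong: accuracy 0.5, below threshold.
monitor = AccuracyMonitor(window=4, threshold=0.75)
for pred, actual in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    monitor.record(pred, actual)
print(monitor.degraded())  # True
```

Without a check like this wired into the operations toolchain, a model can silently lose accuracy as operational data drifts away from the training data.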
The team's solution is what they call machine-readable ML-Enabled System Element Descriptors, a mechanism for mismatch detection and prevention in AI systems. The descriptors codify attributes that make the assumptions from each of these perspectives explicit. They can be used for information and evaluation purposes in a manual, supervised way, or they can inform the development of automated mismatch detectors that run at design time and runtime.
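The shape such descriptors might take can be sketched as plain data structures, one per system element, each recording its perspective's assumptions as named attributes. The attribute names and values below are illustrative assumptions, not the paper's actual schema.

```python
# Each system element gets a descriptor making its assumptions explicit.
trained_model_descriptor = {
    "perspective": "data scientist",
    "training_data_schema": ["age", "income", "zip_code"],
    "test_metric": {"name": "f1_score", "value": 0.91},
    "test_hardware": "gpu",
}

operational_environment_descriptor = {
    "perspective": "operations",
    "input_data_schema": ["age", "income"],   # zip_code not collected
    "available_hardware": "cpu",
    "monitoring": ["latency"],                # no accuracy monitoring declared
}
```

Because the attributes are machine-readable, a reviewer can compare them by eye, or a tool can diff them automatically: here the hardware assumption and the data schema already disagree across the two descriptors.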
The researchers propose constructing the descriptors in three phases. They would first elicit examples of mismatches and their consequences through interviews with machine learning practitioners, after which they would identify attributes used to describe system elements in GitHub project descriptions and in the literature. In the subsequent mapping phase, they'd settle on a set of attributes that could be used to detect each mismatch, conducting gap and feasibility analyses to surface mismatches that don't map to any attribute.
The team would then re-engage the practitioners from the first phase to validate the work, with the goal of reaching 90% agreement. They'd also build a demonstration of automated mismatch detection, creating scripts that detect each mismatch; ideally, these examples would validate both the mapping between mismatches and attributes and the set of descriptors created from that mapping.
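A detection script of the kind described could be as simple as a function that compares attributes across two descriptors and reports conflicts. This is a hedged sketch under assumed attribute names and rules, not the researchers' implementation.

```python
def detect_mismatches(model_desc, env_desc):
    """Compare a model descriptor against an environment descriptor;
    return a list of human-readable mismatch reports."""
    reports = []

    # Rule 1: computing resources used in testing vs. operations.
    if model_desc["test_hardware"] != env_desc["available_hardware"]:
        reports.append(
            "resource mismatch: model tested on "
            f"{model_desc['test_hardware']}, environment provides "
            f"{env_desc['available_hardware']}"
        )

    # Rule 2: fields the model was trained on but operations won't supply.
    missing = set(model_desc["training_data_schema"]) - set(env_desc["input_data_schema"])
    if missing:
        reports.append(f"data mismatch: operational inputs lack fields {sorted(missing)}")

    return reports

model_desc = {
    "test_hardware": "gpu",
    "training_data_schema": ["age", "income", "zip_code"],
}
env_desc = {
    "available_hardware": "cpu",
    "input_data_schema": ["age", "income"],
}

for report in detect_mismatches(model_desc, env_desc):
    print(report)
```

Run at design time, such a check flags the conflict before deployment; run at runtime against live descriptors, it supports the continuous monitoring the researchers envision.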
“Our vision for this work is that the community starts developing tools for automatically detecting mismatch, and organizations start including mismatch detection in their toolchains for development of ML-enabled systems,” wrote the coauthors. “To enable [AI] components to be fielded in a meaningful way, we will need to understand the mismatches that exist and develop practices to mitigate the impacts of these mismatches.”