Exploring Amazon SageMaker's new features -- Clarify, Pipelines, Feature Store

Welcome to part 2 of our two-part series on AWS SageMaker. If you haven't read part 1, hop over and do that first. Otherwise, let's dive in and look at some important new SageMaker features:

Clarify, which claims to "detect bias in ML models" and to aid in model interpretability
SageMaker Pipelines, which help automate and organize the flow of ML pipelines
Feature Store, a tool for storing, retrieving, editing, and sharing purpose-built features for ML workflows.

Clarify: debiasing AI needs a human element

At the AWS re:Invent event in December, Swami Sivasubramanian introduced Clarify as the tool for "bias detection across the end-to-end machine learning workflow" to rapturous applause and whistles. He introduced Nashlie Sephus, Applied Science Manager at AWS ML, who works in bias and fairness. As Sephus makes clear, bias can show up at any stage in the ML workflow: in data collection, data labeling and selection, and when deployed (model drift, for example).

The scope for Clarify is vast; it claims to be able to:

perform bias analysis during exploratory data analysis
conduct bias and explainability analysis after training
explain individual inferences for models in production (once the model is deployed)
integrate with Model Monitor to provide real-time alerts with respect to bias creeping into your model(s).

Clarify does provide a set of useful diagnostics for each of the above in a relatively user-friendly interface and with a convenient API, but the claims above are entirely overblown. The challenge is that algorithmic bias is rarely, if ever, reducible to metrics such as class imbalance and positive predictive value. It is valuable to have a product that provides insights into such metrics, but the truth is that they're below table stakes. At best, SageMaker claiming that Clarify detects bias across the entire ML workflow is a reflection of the gap between marketing and actual value creation.
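To make concrete how little these metrics capture, here is a minimal sketch of two of them computed by hand. The records are invented toy data, and the normalized-difference form of class imbalance is one common definition that we assume here rather than a statement of exactly what Clarify computes.

```python
from collections import Counter

# Hypothetical toy records: (group, true_label, predicted_label).
records = [
    ("a", 1, 1), ("a", 0, 1), ("a", 1, 1), ("a", 0, 0), ("a", 1, 0), ("a", 1, 1),
    ("b", 1, 1), ("b", 0, 1), ("b", 0, 0), ("b", 0, 1),
]


def class_imbalance(records, group_a, group_b):
    """Normalized difference in group sizes, ranging from -1 to 1 (assumed definition)."""
    counts = Counter(group for group, _, _ in records)
    n_a, n_b = counts[group_a], counts[group_b]
    return (n_a - n_b) / (n_a + n_b)


def positive_predictive_value(records, group):
    """Share of the group's positive predictions that are truly positive."""
    predicted_positive = [y for g, y, y_hat in records if g == group and y_hat == 1]
    if not predicted_positive:
        return None  # undefined when the model never predicts positive
    return sum(predicted_positive) / len(predicted_positive)


print(class_imbalance(records, "a", "b"))
print(positive_predictive_value(records, "a"))
print(positive_predictive_value(records, "b"))
```

Both numbers are simple arithmetic over counts. Neither says anything about how the labels were collected, who was left out of the data, or what harm a wrong prediction causes.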

To be clear, algorithmic bias is one of the great challenges of our age: Stories of at-scale computational bias are so commonplace now that it's not surprising when Amazon itself "scraps a secret recruiting tool that showed bias against women." To experience first-hand ways in which algorithmic bias can enter ML pipelines, check out the instructional game Survival of the Best Fit.

Reducing algorithmic bias and fairness to a set of metrics is not only reductive but dangerous. It doesn't incorporate the required domain expertise and inclusion of key stakeholders (whether domain experts or members of traditionally marginalized communities) in the deployment of models. It also doesn't engage in key conversations around what bias and fairness actually are; and, for the most part, they're not easily reducible to summary statistics.

There is a vast and growing body of literature around these issues, including 21 fairness definitions and their politics (Narayanan), Algorithmic Fairness: Choices, Assumptions, and Definitions (Mitchell et al.), and Inherent Trade-Offs in the Fair Determination of Risk Scores (Kleinberg et al.), the last of which shows that three natural definitions of algorithmic fairness cannot all be satisfied simultaneously except in highly constrained special cases.

There is also the seminal work of Timnit Gebru, Joy Buolamwini, and many others (Gender Shades is one example), which gives voice to the fact that algorithmic bias is not merely a question of training data and metrics. In Dr. Gebru's words: “Fairness is not just about data sets, and it's not just about math. Fairness is about society as well, and as engineers, as scientists, we can't really shy away from that fact.”

To be fair, Clarify's documentation makes clear that consensus building and collaboration across stakeholders, including end users and communities, is part of building fair models. It also states that customers “should consider fairness and explainability during each stage of the ML lifecycle: problem formation, dataset construction, algorithm selection, model training process, testing process, deployment, and monitoring/feedback. It is important to have the right tools to do this analysis.”

Unfortunately, statements like "Clarify provides bias detection across the machine learning workflow" make the solution sound push-button: as if you just pay AWS for Clarify and your models will be unbiased. While Amazon's Sephus clearly understands and articulates that debiasing will require much more in her presentation, such nuance will be lost on most business executives.

The key takeaway is that Clarify provides some useful diagnostics in a convenient interface, but buyer beware! This is by no means a solution to algorithmic bias.

Pipelines: right problem but a complex approach

SageMaker Pipelines (video tutorial, press release) claims to be the "first CI/CD service for machine learning." It promises to run ML workflows automatically and to help organize training. Machine learning pipelines often require multiple steps (e.g., data extraction, transform, load, cleaning, deduping, training, validation, and model upload), and Pipelines is an attempt to glue these together and help data scientists run these workloads on AWS.

So how well does it do? First, it is code-based and greatly improves on AWS CodePipeline, which was point-and-click based. This is clearly a move in the right direction. Configuration was traditionally a matter of toggling dozens of settings on an ever-changing web console, which was slow, frustrating, and highly non-reproducible. Point-and-click is the antithesis of reproducibility. Having your pipelines in code makes it easier to share and edit them. SageMaker Pipelines is following in a strong tradition of configuring computational resources as code (the best-known examples being Kubernetes and Chef).

Specifying configurations in source-controlled code via a stable API is where the industry has been moving.

Second, SageMaker Pipelines are written in Python and have the full power of a dynamic programming language. Most existing general-purpose CI/CD solutions like GitHub Actions, CircleCI, or Azure Pipelines use static YAML files. This means Pipelines is more expressive. And the choice of Python (instead of another programming language) was smart. It’s the predominant programming language for data science and probably has the most traction (R, the second most popular language, is probably not well suited for systems work and is unfamiliar to most non-data developers).
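To illustrate what a dynamic language buys you over a static file, here is a minimal sketch of a pipeline defined in plain Python. It does not use the SageMaker SDK; the `Step` and `Pipeline` classes, the step functions, and the made-up scoring rule are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    run: Callable[[Dict], Dict]  # takes a context dict, returns an updated one


@dataclass
class Pipeline:
    steps: List[Step] = field(default_factory=list)

    def execute(self, context: Dict) -> Dict:
        # Run each step in order, threading the context through.
        for step in self.steps:
            context = step.run(context)
        return context


def prepare(context: Dict) -> Dict:
    # Stand-in for data extraction and cleaning.
    return {**context, "rows": [1, 2, 3, 4]}


def make_train_step(learning_rate: float) -> Step:
    def train(context: Dict) -> Dict:
        # Stand-in for real training: a made-up score that peaks at 0.1.
        scores = dict(context.get("scores", {}))
        scores[learning_rate] = round(1.0 - abs(learning_rate - 0.1), 3)
        return {**context, "scores": scores}

    return Step(name=f"train_lr_{learning_rate}", run=train)


def select_best(context: Dict) -> Dict:
    scores = context["scores"]
    return {**context, "best_lr": max(scores, key=scores.get)}


# Training steps are generated in a loop rather than written out one by one.
learning_rates = [0.01, 0.1, 0.5]
pipeline = Pipeline(
    steps=[Step("prepare", prepare)]
    + [make_train_step(lr) for lr in learning_rates]
    + [Step("select_best", select_best)]
)
result = pipeline.execute({})
print([step.name for step in pipeline.steps])
print(result["best_lr"])
```

Adding a fourth learning rate means changing one list. In a static file, the equivalent usually requires templating or a tool-specific matrix feature.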

However, the tool’s adoption will not be smooth. The official tutorial requires correctly setting IAM permissions by toggling console configurations and requires users to read two other tutorials on IAM permissions to accomplish this. The terminology appears inconsistent with the actual console (“add inline policy” vs. “attach policy” or “trust policy” vs. “trust relationship”). Such small variations can be very off-putting for those who are not experts in cloud server administration, which is to say, the target audience for SageMaker Pipelines. Outdated and inconsistent documentation is a tough problem for AWS, given the large number of services it offers. It is perhaps a victim of Walt Whitman’s line: “Do I contradict myself? Very well, then I contradict myself, I am large, I contain multitudes.”

The tool also has a pretty steep learning curve. The official tutorial has users download a dataset, split it into training and validation sets, and upload the results to the AWS model registry. Unfortunately, it takes 10 steps and 300 lines of dev-ops code (yes, we counted). That’s not including the actual code for ML training and data prep. The steep learning curve may be a challenge to adoption, especially compared to radically simpler (general purpose) CI/CD solutions like GitHub Actions.

This is not a strictly fair comparison and (as mentioned previously) SageMaker Pipelines is more powerful: It uses a full programming language and can do much more. However, in practice, CI/CD is often used solely to define when a pipeline is run (e.g., on code push or at a regular interval). It then calls a task runner such as gulp or pyinvoke, which brings the full power of a programming language. Both are much easier to learn; pyinvoke's tutorial is 19 lines. From there, we could connect to AWS services through the SDK for the language in question, like the widely used boto3 for Python. Indeed, one of us used (abused?) GitHub Actions CI/CD to collect weekly vote-by-mail signup data across dozens of states in the run-up to the 2020 election and to build simple monthly language models from the latest Wikipedia dumps. So the question is whether an all-in-one tool like SageMaker Pipelines is worth learning if it can be replicated by stitching together commonly used tools. This is compounded by SageMaker Pipelines being weak on the natural strength of an integrated solution (not having to fight with security permissions amongst different tools).
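The division of labor described above, where CI/CD decides when to run and a task runner decides what to run, can be sketched in a few lines. This is a plain-Python stand-in for a task runner such as pyinvoke, not pyinvoke's actual API; the `task` decorator, the task names, and the data are hypothetical.

```python
TASKS = {}


def task(fn):
    """Register a function as a named task."""
    TASKS[fn.__name__] = fn
    return fn


@task
def fetch_data():
    # A real task might download a file or call a cloud SDK such as boto3 here.
    return ["row-1", "row-2", "row-3"]


@task
def train():
    # Tasks are ordinary functions, so one task can simply call another.
    rows = fetch_data()
    return f"trained on {len(rows)} rows"


def run_task(name):
    """Look up a task by name and run it."""
    if name not in TASKS:
        raise KeyError(f"unknown task: {name}")
    return TASKS[name]()


print(run_task("train"))
```

A CI job then needs only a schedule or a push trigger plus a single command that invokes the chosen task, while all of the logic stays in ordinary, testable Python.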

AWS is working on the right problem. But given the steep learning curve, it’s unclear whether SageMaker Pipelines will be enough to convince folks to switch from the simpler existing tools they're used to using. This tradeoff points to a broader debate: Should companies embrace an all-in-one stack or use best-of-breed products? More on that question shortly.

Feature Store: a much-needed feature for the enterprise

As Sivasubramanian mentioned in his re:Invent keynote, "features are the foundation of high-quality models." SageMaker Feature Store provides a repository for creating, sharing, and retrieving machine learning features for training and inference with low latency.

This is exciting because feature management is one of many key aspects of the ML workflow that have been siloed within individual enterprises and verticals for too long, as in Uber's ML platform Michelangelo (whose feature store is called Michelangelo Palette). A huge part of the democratization of data science and data tooling will require that such tools be standardized and made more accessible to data professionals. This movement is ongoing: For some compelling examples, see Airbnb's open-sourcing of Airflow, the data workflow management tool, along with the emergence of ML tracking platforms, such as Weights and Biases, Neptune AI, and Comet ML. Bigger platforms, such as Databricks' MLflow, are attempting to capture all aspects of the ML lifecycle.

Most large tech companies have their own internal feature stores, and organizations that don't keep feature stores end up with a lot of duplicated work. As Harish Doddi, co-founder and CEO of Datatron, said several years ago on the O'Reilly Data Show Podcast: “When I talk to companies these days, everybody knows that their data scientists are duplicating work because they don’t have a centralized feature store. Everybody I talk to really wants to build or even buy a feature store, depending on what is easiest for them.”

To get a sense of the problem space, look no further than the growing set of solutions, several of which are encapsulated in a competitive landscape table on FeatureStore.org.

The SageMaker Feature Store is promising. You have the ability to create feature groups using a relatively Pythonic API and access to your favorite PyData packages (such as Pandas and NumPy), all from the comfort of a Jupyter notebook. After feature creation, it is straightforward to store results in the feature group, and there's even a max_workers keyword argument that allows you to parallelize the ingestion process easily. You can store your features both offline and in an online store. The latter enables low-latency access to the latest values for a feature.
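To show the difference between the offline and online stores, here is a toy in-memory sketch. It is not the SageMaker API: the `ToyFeatureGroup` class, its method names, and the sample records are hypothetical, and only the ideas of a record identifier, an event time, and a `max_workers` argument for parallel ingestion are borrowed from the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class ToyFeatureGroup:
    """Toy feature group: full history offline, latest value per ID online."""

    def __init__(self, record_id_name, event_time_name):
        self.record_id_name = record_id_name
        self.event_time_name = event_time_name
        self.offline = []  # append-only history, e.g. for building training sets
        self.online = {}   # latest record per ID, e.g. for low-latency inference
        self._lock = Lock()

    def put_record(self, record):
        key = record[self.record_id_name]
        with self._lock:
            self.offline.append(record)
            current = self.online.get(key)
            # Keep only the most recent record per ID, judged by event time.
            if current is None or record[self.event_time_name] >= current[self.event_time_name]:
                self.online[key] = record

    def ingest(self, records, max_workers=1):
        # Write records in parallel; the lock keeps both stores consistent.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            list(pool.map(self.put_record, records))

    def get_record(self, record_id):
        return self.online.get(record_id)


records = [
    {"customer_id": 1, "event_time": 1, "total_spend": 10.0},
    {"customer_id": 1, "event_time": 2, "total_spend": 25.0},
    {"customer_id": 2, "event_time": 1, "total_spend": 5.0},
]
group = ToyFeatureGroup(record_id_name="customer_id", event_time_name="event_time")
group.ingest(records, max_workers=3)
print(len(group.offline))
print(group.get_record(1)["total_spend"])
```

Because the online store compares event times, the latest value wins regardless of the order in which parallel workers happen to write.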

The Feature Store looks good for basic use cases. We could not determine whether it is ready for production use with industrial applications, but anyone in need of these capabilities who already uses SageMaker, or is considering incorporating it into their workflow, should check it out.

Final thoughts

Finally, we come to the question of whether or not all-in-one platforms, such as SageMaker, can fulfill all the needs of modern data scientists, who need access to the latest, cutting-edge tools.

There's a trade-off between all-in-one platforms and best-of-breed tooling. All-in-one platforms are attractive as they can co-locate solutions to speed up performance. They can also seamlessly integrate otherwise disparate tools (although, as we’ve seen above, they do not always deliver on that promise). Imagine a world where permissions, security, and compatibility are all handled seamlessly by the system without user intervention. Best-of-breed tooling can better solve individual steps of the workflow but will require some work to stitch together. One of us has previously argued that best-of-breed tools are better for data scientists. The jury is still out. The data science arena is exploding with support tools, and figuring out which service (or combination thereof) makes for the most effective data environment will keep the technical community occupied for a long time.

Tianhui Michael Li is president at Pragmatic Institute and the founder and president of The Data Incubator, a data science training and placement firm. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw.

Hugo Bowne-Anderson is Head of Data Science Evangelism and VP of Marketing at Coiled. Previously, he was a data scientist at DataCamp, and has taught data science topics at Yale University and Cold Spring Harbor Laboratory, conferences such as SciPy, PyCon, and ODSC, and with organizations such as Data Carpentry. [Full Disclosure: As part of its services, Coiled provisions and manages cloud resources to scale Python code for data scientists, and so does offer something that SageMaker also does as part of its services. But it's also true that all-in-one platforms such as SageMaker and products such as Coiled can be seen as complementary: Coiled has several customers who use SageMaker Studio alongside Coiled.]
