Exploring Amazon SageMaker's new features -- CloudFormation, Data Wrangler

Updated 2/14/21 at 9:14pm PST

The data tooling and infrastructure space is growing rapidly, and this trend is showing no signs of slowing down. Behemoth data storage firm Snowflake IPOed late last year and became more valuable than IBM, and Databricks recently raised a $1 billion Series G with a $28 billion post-money valuation, to name two examples. The long tail of the data tools space is becoming increasingly crowded, as evidenced by Matt Turck's 2020 Data & AI Landscape (just look at the image below).

AWS is one of the most prominent players in the space, and SageMaker is its flagship solution for the machine learning development workflow. When AWS announces new SageMaker features, the industry pays attention. Having written two reviews since SageMaker Studio’s inception, we were interested to see a swathe of new features come across the wire last December and at Swami Sivasubramanian's Machine Learning Keynote at re:Invent. After spending some time with the new features, we’ve put together a two-part piece on our impressions. This first part covers:

Better integration with AWS CloudFormation, which allows for easier provisioning of resources
The general usability of SageMaker as a platform
Data Wrangler, a GUI-based tool for data preparation and feature engineering

The second part covers:

Feature Store, a tool for storing, retrieving, editing, and sharing purpose-built features for ML workflows
Clarify, which claims to "detect bias in ML models" and to aid in model interpretability
SageMaker Pipelines, which helps automate and organize the flow of ML pipelines

Let’s get started!

One-click provisioning makes it easier to get started

Overall, we found the experience with SageMaker much smoother than last time. This time, the SageMaker Studio environment actually started and provisioned (it embarrassingly refused to launch last time during re:Invent). The overall experience felt much improved, and the tutorials and documentation are better integrated with the platform.

One of the environment's best features is AWS CloudFormation, which has been around since 2011 but seems to have been better integrated into SageMaker. Provisioning hardware and infrastructure safely is a significant pain point in computing -- getting S3 buckets, databases, and EC2 instances all up and talking to each other securely. This often meant hours of tinkering with IAM permissions just to get a “Hello World” server going. CloudFormation simplifies that by pre-defining infrastructure configuration “stacks” in YAML files (think Kubernetes Object YAML but for AWS infrastructure), which can be fired up with one click. An AWS spokesperson told us the integration was part of a move “to make SageMaker widely accessible for the most sophisticated ML engineers and data scientists as well as those who are just getting started.” Even better, many of the AWS tutorials now feature buttons to launch stacks with just one click:

(The buttons are reminiscent of a late '90s Amazon.com One-Click Shopping button and that resemblance may be subliminal marketing. Both distill immensely complex infrastructure, whether e-commerce or cloud, into a single consumer-friendly button that drives sales.)
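To make the idea of a “stack” concrete, here is a minimal sketch of what such a template can look like. It is not taken from any AWS tutorial: the resource names (DataBucket, DemoNotebook), the NotebookRoleArn parameter, and the instance type are hypothetical choices made for illustration. CloudFormation accepts JSON as well as YAML, so the sketch builds the template as a Python dictionary and serializes it with the standard library.

```python
import json


def build_template(instance_type="ml.t3.medium"):
    """Return a minimal CloudFormation template as a Python dict.

    All logical names here (DataBucket, DemoNotebook, NotebookRoleArn)
    are hypothetical and chosen only for illustration.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative stack: one S3 bucket and one SageMaker notebook instance",
        "Parameters": {
            # The IAM role must already exist; its ARN is supplied at launch time.
            "NotebookRoleArn": {
                "Type": "String",
                "Description": "ARN of an existing IAM role that SageMaker can assume",
            }
        },
        "Resources": {
            "DataBucket": {"Type": "AWS::S3::Bucket"},
            "DemoNotebook": {
                "Type": "AWS::SageMaker::NotebookInstance",
                "Properties": {
                    "InstanceType": instance_type,
                    "RoleArn": {"Ref": "NotebookRoleArn"},
                },
            },
        },
        "Outputs": {
            # Ref on an S3 bucket resolves to the bucket's name.
            "BucketName": {"Value": {"Ref": "DataBucket"}},
        },
    }


template = build_template()
template_json = json.dumps(template, indent=2)
print(template_json)
```

The point of the design is that the whole environment is declared in one document, so the same file can be launched repeatedly or shared behind a launch button rather than being assembled by hand in the console.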

SageMaker has improved but usability is still lacking, hindering adoption

Given the interest in deep learning, we wanted to try out deep learning on AWS. These models are on the leading edge of machine learning but are notoriously computationally expensive to train, requiring GPUs, which can be quite spendy. We decided to test out these newfound capabilities by running examples from FastAI’s popular deep learning book to see how easy it is to get started. Fortunately, the Deep Learning models come with convenient launch buttons, so you can get up and running pretty smoothly. The AWS instances were very powerful (for a fairly computationally intensive NLP example their ml.p3.2xlarge ran about 20X faster than the free tier Quadro P5000 available on Gradient), and for only $3.825 an hour.

Nonetheless, the tools were not without their hiccups. On AWS, most of the GPU instances are not automatically available; instead, users must request a quota limit increase. Requesting a limit increase appears to require human approval and usually takes a day, killing momentum. Also, the launch stacks sometimes don’t line up with the tutorial types: e.g., the entity resolution tutorial launches with a CPU instance type, which required 24 hours to approve. When the notebook ran, it required a GPU instance. Users are not given any resource quotas for this by default and must request an increase manually, adding another 24-hour delay. This assumes they are eligible for such increases at all (one of us was not, and only found a workaround after contacting an AWS representative). Some of this may have been due to the fact that we were using a relatively new AWS account. But great software has to work for new users as well as veterans if it hopes to grow and this is what we set out to test. Great software should also work for users who do not have the luxury of a contact at AWS.

Our experience is well-summarized by Jesse Anderson, author of Data Teams. He told us that “AWS’s intent is to offload data engineer tasks to make them more doable for the data scientists. It lowers the bar somewhat but it isn't a huge shift. There's still a sizable amount of data engineering needed just to get something ready for SageMaker.”

To be fair to AWS, service quotas are useful in helping control cloud costs, particularly in a large enterprise setting where a CIO might want to enable the rank-and-file to request the services they need without incurring an enormous bill. Yet, one could easily imagine a better world. At a minimum, AWS-related error messages (e.g. resource limit constraints) should come with links to details on how to fix them rather than making users spend time hunting through console pages. For example, GCloud Firebase, which has similar service quotas, does this well.

Even better, it would be nice if there were single-click buttons that immediately granted account owners a single instance for 24 hours so users don't have to wait for human approval. In the end, we expected a more straightforward interface. We’ve seen some tremendous improvements over last year, but AWS is still leaving a lot on the table.

Data Wrangler: right problem, wrong approach

There's a now-old trope (immortalized by Big Data Borat) that data scientists spend 80% of their time cleaning and preparing data:

Industry leaders recognize the importance of tackling this problem well. As Ozan Unlu, Founder and CEO of automated observability startup Edge Delta explained to us, “allowing data scientists to more efficiently surpass the early stages of the project allows them to spend a much larger proportion of their time on significantly higher value additive tasks.” Indeed, one of us previously wrote an article called The Unreasonable Importance of Data Preparation, clarifying the need to automate parts of the data preparation process. SageMaker Studio's Data Wrangler claims to "provide the fastest and easiest way for developers to prepare data for machine learning" and comes packed with exciting features, including: 300+ data transformation features (including one-hot encoders, which are table stakes for machine learning), the ability to hand-code your own transformations, and upcoming integrations with Snowflake, MongoDB, and Databricks. Users are also able to output their results and workflows to a variety of formats like SageMaker pipelines (more on this in Part 2), Jupyter notebooks, or a Feature Store (we'll get to this in Part 2 as well).

However, we're not convinced that most developers or data scientists would find it very useful yet. First off, it's GUI-based, and the vast majority of data scientists will avoid GUIs like the plague. There are several reasons for this, perhaps the most important being that GUIs are antithetical to reproducible data work. Hadley Wickham, Chief Scientist at RStudio and author of the principles of tidy data, has even given a talk entitled “You can't do data science in a GUI.”

To be fair to SageMaker, you can export your workflow as Python code, which will help alleviate reproducibility concerns to a certain extent. This approach follows in the footsteps of products such as Looker (acquired last year by Google for $2.6 billion!), which generates SQL code based upon user interactions with a drag-and-drop interface. But it will probably not appeal to developers or data scientists (if you can already express your ideas in code, why learn to use someone else’s GUI?).
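To illustrate how little code a “table stakes” transformation takes, here is a minimal sketch of one-hot encoding in plain Python. It is our own illustration, not code exported from Data Wrangler; the function name one_hot_encode and the sample rows are hypothetical.

```python
def one_hot_encode(rows, column):
    """Replace `column` in each row (a dict) with 0/1 indicator columns.

    One indicator column is created per distinct value, named
    "<column>_<value>". Categories are sorted so output is deterministic.
    """
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical sample data for illustration only.
rows = [
    {"id": 1, "color": "red"},
    {"id": 2, "color": "blue"},
    {"id": 3, "color": "red"},
]

encoded = one_hot_encode(rows, "color")
for row in encoded:
    print(row)
```

Because the transformation lives in a function, it can be version-controlled, reviewed, and rerun on new data, which is the reproducibility argument in a nutshell.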

There may be some value in enabling non-technical domain experts (who are presumably less expensive talent resources) to transform data and export the process to code. However, the code generated from recording an iterative exploratory GUI session may not be very clean and could require significant engineering or data scientist intervention. Much of the future of data work will occur in GUIs and drag-and-drop interfaces, but this will be the long tail of data work and not that of developers and data scientists.

Data Wrangler's abstraction away from code and the abstraction over many other parts of the data preparation workflow are also concerning. Take the "quick model" feature which, according to AWS Evangelist Julien Simon, "immediately trains a model on the pre-processed data," shows "the impact of your data preparation steps," and provides insight into feature importance. When building this quick model, it isn't clear in the product what kind of model is actually trained, so it's not obvious how any insight could be developed here or whether the "important features" are important at all.

Most troubling is Data Wrangler’s claim to be providing insight into your data and your model, when you can use it without any form of domain expertise at all. This is in stark contrast to tools such as Snorkel, a project that aims to “inject domain information [or heuristics] into machine learning models in higher-level, higher-bandwidth ways." This lack of input is particularly worrisome in an era rife with AI bias issues. One key aspect of the future of data tooling is forming the connective tissue between data science workflows and domain experts, but the abstractions Data Wrangler presents seem to be moving us in the opposite direction. We'll get to this in more detail when discussing Clarify, the SageMaker Studio tool that "detects bias in ML models."

So far, we’ve seen some wins and some misses for AWS. The apparently better integration with CloudFormation is a real win for usability, and we hope to see more of this from AWS. On the other hand, the steep learning curve and the UX shortcomings are still obstacles to data scientists looking to use the environment. This is borne out in usage numbers: A 2020 Kaggle survey puts SageMaker usage among data scientists at 16.5%, even though overall AWS usage is 48.2% (mostly through direct access to EC2). For reference, JupyterLab usage is at 74.1%, and Scikit-learn at 82.8%. Surprisingly, this may be an area of strength for GCloud. While Google’s cloud service holds an embarrassing third-place ranking overall (behind Microsoft Azure and AWS), it holds a strong second place for data scientists according to the Kaggle survey. Products like Google Colab, which only offer a fraction of the functionality of AWS SageMaker, are very good at what they do and have attracted some devoted fans in the data science community. Perhaps Google’s notorious engineering-first culture has translated into a more user-friendly experience in the cloud than that of its Seattle-based rival. We have certainly noticed that the documentation is kept a little better in sync and that the developer experience is a little sharper.

As we mentioned last year, user-centric design will be key in winning the cloud race, and while SageMaker has made significant strides in that direction, it still has a ways to go.

Join us in part 2, where we talk about Pipelines, Feature Store, Clarify, and the ML industry’s darker parts.

Tianhui Michael Li is president at Pragmatic Institute and the founder and president of The Data Incubator, a data science training and placement firm. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw.

Hugo Bowne-Anderson is Head of Data Science Evangelism and VP of Marketing at Coiled. Previously, he was a data scientist at DataCamp, and has taught data science topics at Yale University and Cold Spring Harbor Laboratory, at conferences such as SciPy, PyCon, and ODSC, and with organizations such as Data Carpentry. [Full Disclosure: As part of its services, Coiled provisions and manages cloud resources to scale Python code for data scientists, and so does offer something that SageMaker also does as part of its services. But it's also true that all-in-one platforms such as SageMaker and products such as Coiled can be seen as complementary: Coiled has several customers who use SageMaker Studio alongside Coiled.]
