Microsoft today announced the release of SynapseML (previously MMLSpark), an open source library designed to simplify the creation of machine learning pipelines. With SynapseML, developers can build “scalable and intelligent” systems for solving challenges across domains, including text analytics, translation, and speech processing, Microsoft says.

“Over the past five years, we have worked to improve and stabilize the SynapseML library for production workloads. Developers who use Azure Synapse Analytics will be pleased to learn that SynapseML is now generally available on this service with enterprise support,” Microsoft software engineer Mark Hamilton wrote in a blog post.

Scaling up AI

Building machine learning pipelines can be difficult even for the most seasoned developer. For starters, composing tools from different ecosystems requires considerable code, and many frameworks aren’t designed with server clusters in mind.


Despite this, there’s increasing pressure on data science teams to get more machine learning models into use. While AI adoption and analytics continue to rise, an estimated 87% of data science projects never make it to production. According to Algorithmia’s recent survey, 22% of companies take between one and three months to deploy a model so it can deliver business value, while 18% take over three months.

SynapseML aims to address the challenge by unifying existing machine learning frameworks and Microsoft-developed algorithms in a single API that is usable from Python, R, Scala, and Java. Developers can combine frameworks for use cases that require more than one of them, such as search engine creation, while training and evaluating models on resizable clusters of computers.
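To make the idea of a single API over several tools concrete, here is a minimal sketch in plain Python. It is not SynapseML code and does not use Spark: the `Stage` and `Pipeline` classes, the `tokenize` and `score` functions, and the word lists are all hypothetical stand-ins, chosen only to show how steps from different sources can be chained once they share one `transform` method.

```python
class Stage:
    """Wraps any per-record function behind one shared transform() method."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def transform(self, records):
        return [self.fn(record) for record in records]


class Pipeline:
    """Runs stages in order, feeding each stage's output to the next."""

    def __init__(self, stages):
        self.stages = list(stages)

    def transform(self, records):
        for stage in self.stages:
            records = stage.transform(records)
        return records


# Hypothetical stand-ins for tools that would normally come from
# different machine learning ecosystems.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "poor"}


def tokenize(text):
    return text.lower().split()


def score(tokens):
    positives = sum(1 for token in tokens if token in POSITIVE)
    negatives = sum(1 for token in tokens if token in NEGATIVE)
    return positives - negatives


pipeline = Pipeline([Stage("tokenizer", tokenize), Stage("scorer", score)])
results = pipeline.transform(["Great library good docs", "Bad build"])
print(results)
```

Because every step exposes the same method, swapping one tool for another means changing a single stage rather than the code around it. In this toy version everything runs in one process; distributing the work across a cluster is the part a library built on Spark would handle.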

As Microsoft explains on the project’s website, SynapseML expands Apache Spark, the open source engine for large-scale data processing, in several new directions: “[The tools in SynapseML] allow users to craft powerful and highly-scalable models that span multiple [machine learning] ecosystems. SynapseML also brings new networking capabilities to the Spark ecosystem. With the HTTP on Spark project, users can embed any web service into their SparkML models and use their Spark clusters for massive networking workflows.”


SynapseML also enables developers to use models from different machine learning ecosystems through the Open Neural Network Exchange (ONNX), a model format and runtime co-developed by Microsoft and Facebook. With the integration, developers can execute a variety of classical and deep learning models with only a few lines of code.

Beyond this, SynapseML introduces new algorithms for personalized recommendation and contextual bandit reinforcement learning using Vowpal Wabbit, an open source machine learning library originally developed at Yahoo Research. In addition, the API features capabilities for “unsupervised responsible AI.” These include tools for understanding dataset imbalance (e.g., whether “sensitive” dataset features like race or gender are over- or under-represented) without the need for labeled training data, as well as explainability dashboards that show why models make certain predictions and how to improve the training datasets.
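As a rough illustration of what a label-free imbalance check can look like, the sketch below compares each group's share of a dataset with an equal-share reference. It is plain Python, not SynapseML's API: the `representation_report` function, the `tolerance` threshold, the `group` column, and the row counts are all made up for the example, and treating equal shares as the target is itself an assumption that will not suit every dataset.

```python
from collections import Counter


def representation_report(rows, feature, tolerance=0.5):
    """Compare each group's share of the rows with a uniform reference share.

    No prediction labels are needed: only the values of one feature column.
    """
    counts = Counter(row[feature] for row in rows)
    if not counts:
        return {}
    total = sum(counts.values())
    # Assumption: the reference is an equal share for every observed group.
    reference = 1.0 / len(counts)
    report = {}
    for value, count in counts.items():
        share = count / total
        ratio = share / reference
        if ratio > 1 + tolerance:
            status = "over-represented"
        elif ratio < 1 - tolerance:
            status = "under-represented"
        else:
            status = "balanced"
        report[value] = {"share": share, "ratio": ratio, "status": status}
    return report


# Toy data with a hypothetical sensitive column called "group".
rows = (
    [{"group": "A"}] * 10
    + [{"group": "B"}] * 70
    + [{"group": "C"}] * 20
)
report = representation_report(rows, "group")
for value, stats in sorted(report.items()):
    print(value, stats["status"])
```

A check like this only describes how the data is distributed. Deciding which reference distribution is appropriate, and what to do about a skew, remains a judgment call for the team that owns the dataset.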

Where labeled datasets don’t exist, unsupervised learning, which includes the closely related approach known as self-supervised learning, can help to fill the gaps in domain knowledge. For example, Facebook’s recently announced SEER, a self-supervised model, was trained on a billion images and achieved state-of-the-art results on a range of computer vision benchmarks. Unfortunately, unsupervised learning doesn’t eliminate the potential for bias or flaws in the system’s predictions. Some experts theorize that removing these biases might require specialized training of unsupervised models with additional, smaller datasets curated to “unteach” biases.

“Our goal is to free developers from the hassle of worrying about the distributed implementation details and enable them to deploy them into a variety of databases, clusters, and languages without needing to change their code,” Hamilton said.
