Dataiku releases new version of unified AI platform for machine learning

Dataiku recently released version 10 of its unified AI platform. VentureBeat talked to Dan Darnell, head of product marketing at Dataiku and former VP of product marketing at H2O.ai, to discuss how the new release provides greater governance and oversight of the enterprise's machine learning efforts, enhances ML ops, and enables enterprises to scale their ML and AI efforts.

Governance and oversight

For Darnell, the name of the game is governance. "Until recently," he told VentureBeat, "data science tooling at many enterprises has been the wild west, with different groups adopting their favorite tools." However, he sees a noticeable shift toward consolidated tooling "as enterprises are realizing they lack visibility into these siloed environments, which poses a huge operational and compliance risk. They are searching for a single ML repository to provide better governance and oversight." Dataiku is not alone in spotting this trend, with competing offerings such as AWS's MLOps tooling tackling the same space.

Having a single point of governance is helpful for enterprise users. Darnell likens it to a single "watchtower, from which to view all of an organization's data projects." For Dataiku, this means project workflows that provide blueprints for projects, approval workflows that require managerial sign-off before new models are deployed, risk and value assessments that score AI projects, and a centralized model registry that versions models and tracks their performance.

In the new release, governance is centered around the "project," which contains the data sources, code, notebooks, models, approval rules, and markdown wikis associated with that effort. Just as GitHub went beyond mere code hosting to host the context around coding that facilitates collaboration, such as pull requests, CI/CD, markdown wikis, and project workflow, Dataiku's "projects" aspire to do the same for data projects. "Whether you write your model inside Dataiku or elsewhere, we want you to put that model into our product," said Darnell.

ML ops

Governance and oversight also extend into the emerging field of ML ops, a rapidly growing discipline that applies DevOps best practices to machine learning models. In its press release, Dataiku defines ML ops as helping "IT operators and data scientists evaluate, monitor and compare machine learning models, whether under development or in production." In this area, Dataiku competes against products like Amazon SageMaker Model Monitor, GCP's Vertex AI Model Monitoring, and Azure's MLOps.

Automatic drift analysis is an important newly released feature. Over time, the distribution of incoming data can shift due to subtle underlying changes outside the modeler's control. For example, as the pandemic progressed and consumers began to see delays in gym re-openings, sales of home exercise equipment began creeping up. This data drift can lead to poor performance for models that were trained on out-of-date data.
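To make the idea of drift concrete, here is a minimal sketch of one common way to quantify it: the population stability index (PSI), which compares how a feature is distributed in training data versus newer data. This is an illustration only, not Dataiku's implementation; the function name, the synthetic sales figures, and the bin count are all assumptions made for the example.

```python
import math
import random


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, binned on the reference sample's range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Synthetic, made-up data: weekly sales before and after a demand shift.
rng = random.Random(0)
training_sales = [rng.gauss(100, 15) for _ in range(5000)]
same_regime = [rng.gauss(100, 15) for _ in range(5000)]
after_shift = [rng.gauss(130, 15) for _ in range(5000)]

psi_stable = population_stability_index(training_sales, same_regime)
psi_drifted = population_stability_index(training_sales, after_shift)
print(f"PSI, same regime: {psi_stable:.3f}")
print(f"PSI, after shift: {psi_drifted:.3f}")
```

A widely used rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, though appropriate thresholds depend on the feature and the use case.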

What-if scenarios are one of the more interesting features of the new AI platform. Machine learning models usually live in code, accessible only to trained data scientists, data engineers, and the computer systems that process them. But non-technical business stakeholders want to see for themselves how the model works. These domain experts often have significant knowledge, and they often want to get comfortable with a model before approving it. Dataiku's what-if "simulations" wrap a model so that non-technical stakeholders can interrogate it by setting different inputs in an interactive GUI, without diving into the code. "Empowering non-technical users as part of the data science workflow is a critical component of MLOps," Darnell said.
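The mechanics behind a what-if tool can be sketched in a few lines: hold a baseline record fixed, vary one input, and re-score. The example below is a hypothetical illustration, not Dataiku's API; `toy_churn_model`, its coefficients, and the feature names are invented stand-ins for a real trained model, and a GUI would drive the same loop with sliders instead of a list.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


def toy_churn_model(features: Features) -> float:
    """Hypothetical stand-in for a trained model: a churn score clipped to [0, 1]."""
    score = 0.5 + 0.04 * features["support_tickets"] - 0.02 * features["tenure_months"]
    return min(max(score, 0.0), 1.0)


def what_if(
    model: Callable[[Features], float],
    baseline: Features,
    feature: str,
    candidate_values: List[float],
) -> List[Tuple[float, float]]:
    """Score the baseline record once per candidate value of a single feature."""
    results = []
    for value in candidate_values:
        scenario = {**baseline, feature: value}  # copy, so the baseline is untouched
        results.append((value, model(scenario)))
    return results


baseline = {"support_tickets": 2.0, "tenure_months": 12.0}
results = what_if(toy_churn_model, baseline, "support_tickets", [0, 2, 5, 10])
for value, score in results:
    print(f"support_tickets={value:>2} -> churn score {score:.2f}")
```

Because the wrapper only needs a callable that maps inputs to a score, the same pattern works regardless of how the underlying model was built.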

Scaling ML and AI

"We think that ML and AI will be everywhere in the organization, and we have to unlock the bottleneck of the data scientist being the only person who can do ML work," Darnell said.

One way Dataiku is tackling this is by reducing the duplicative work of data scientists and analysts. Duplicative work is the bane of any large enterprise where code silos are rampant. Data scientists redo work because they simply don't know whether it was already done elsewhere. A catalog of code snippets can give data scientists and analysts greater visibility into prior work so that they can stand on the shoulders of colleagues rather than reinvent the wheel. Whether the catalog succeeds will hinge on search, a notoriously tricky problem: it has to be fast, and it has to surface the relevant prior work easily. If it does, data scientists will be freed up to accomplish more valuable tasks.

In addition to trying to make data scientists more effective, Dataiku's AI platform also provides no-code GUIs for data prep and AutoML capabilities to perform ETL, train models, and assess their quality. This feature is geared toward technically proficient users who cannot code, and it empowers them to do many data science tasks. Through a no-code GUI, users can control which ML models are available to the AutoML algorithm and perform basic feature manipulations on the input data. After training, the page provides visuals to aid in model interpretability: not just regression coefficients, hyperparameter selection, and performance metrics, but also more sophisticated diagnostics like subpopulation analysis. The latter is very helpful for detecting AI bias, where model performance may be very strong overall but weak for a vulnerable subpopulation. No-code solutions are hot, with AWS also releasing SageMaker Canvas, a competing product.
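Subpopulation analysis boils down to computing the same metric per group and comparing it with the overall figure. The sketch below uses made-up predictions and a hypothetical `segment` field to show how a strong headline accuracy can hide a weak result for a small group; it is an illustration of the idea, not the diagnostic Dataiku ships.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return overall accuracy and accuracy per distinct value of group_key."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        hits[group] += int(rec["predicted"] == rec["actual"])
    overall = sum(hits.values()) / sum(totals.values())
    per_group = {g: hits[g] / totals[g] for g in totals}
    return overall, per_group


# Made-up predictions: segment A is large and well served, segment B is small.
records = (
    [{"segment": "A", "predicted": 1, "actual": 1}] * 90
    + [{"segment": "A", "predicted": 0, "actual": 1}] * 5
    + [{"segment": "B", "predicted": 1, "actual": 1}] * 2
    + [{"segment": "B", "predicted": 0, "actual": 1}] * 3
)

overall, per_group = accuracy_by_group(records, "segment")
print(f"overall accuracy: {overall:.2f}")
for segment, acc in sorted(per_group.items()):
    print(f"segment {segment}: {acc:.2f}")
```

In this toy data the overall number looks healthy while segment B is right less than half the time, which is the pattern a subpopulation view is meant to expose.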

More on Dataiku

Dataiku's initial product, the "Data Science Studio," focused on providing tooling to make the individual data scientist more productive. With Dataiku 10, its focus has shifted to the enterprise, with features that target the CTO as well as the rank-and-file data scientist. This shift is not uncommon among data science vendors chasing stickier seven-figure enterprise deals with higher investor multiples. This direction mirrors similar moves by well-established competitors in the cloud enterprise data science space, including Databricks, Oracle's Autonomous Data Warehouse, GCP Vertex, Microsoft's Azure ML, and AWS SageMaker, which VentureBeat has written about previously.