Databricks and Hugging Face integrate Apache Spark for faster AI model building

Databricks and Hugging Face have collaborated to introduce a new feature that allows users to create a Hugging Face dataset from an Apache Spark data frame. This new integration provides a more straightforward method of loading and transforming data for artificial intelligence (AI) model training and fine-tuning. Users can now map their Spark data frame into a Hugging Face dataset for integration into training pipelines.

With this feature, Databricks and Hugging Face aim to simplify the process of creating high-quality datasets for AI models. In addition, this integration offers a much-needed tool for data scientists and AI developers who require efficient data management tools to train and fine-tune their models.

Databricks says that the new integration brings the best of both worlds: the cost-saving and speed advantages of Spark combined with the memory-mapping and smart caching optimizations of Hugging Face datasets. The company adds that organizations will now be able to achieve more efficient data transformations over massive AI datasets.

Unlocking the full Spark potential

Databricks employees wrote the Spark updates and committed them to the Hugging Face repository. Through a simple call to the from_spark function and by providing a Spark data frame, users can now obtain a fully loaded Hugging Face dataset in their codebase that is ready for model training or tuning. This integration eliminates the need for complex and time-consuming data preparation processes.
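For readers who want to see what that call looks like, here is a minimal sketch. It assumes the pyspark and datasets packages are installed and that the installed datasets release includes Dataset.from_spark; the helper name spark_to_hf_dataset is hypothetical and used only for illustration.

```python
def spark_to_hf_dataset(spark_df):
    """Return a Hugging Face Dataset built from a Spark DataFrame.

    Illustrative wrapper only; the name is hypothetical. The import is
    done lazily so this module loads even where `datasets` is absent.
    """
    from datasets import Dataset

    # from_spark materializes the DataFrame into a cached, memory-mapped
    # Hugging Face dataset that training code can consume directly.
    return Dataset.from_spark(spark_df)


def build_example_frame(spark):
    """Create a tiny two-column DataFrame from an existing SparkSession.

    The column names and rows are made up for the example.
    """
    rows = [(1, "Spark is fast"), (2, "Datasets are memory-mapped")]
    return spark.createDataFrame(rows, schema=["id", "text"])
```

Given an active SparkSession named spark, calling spark_to_hf_dataset(build_example_frame(spark)) should hand back a dataset with the id and text columns, ready to pass into a training pipeline.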

Databricks claims that the integration marks a major step forward for AI model development, enabling users to unlock the full potential of Spark for model tuning.

“AI, at the core, is all about data and models,” Jeff Boudier, head of monetization and growth at Hugging Face, told VentureBeat. “Making these two worlds work better together at the open-source layer will accelerate AI adoption to create robust AI workflows accessible to everyone. This integration significantly reduces the friction bringing data from Spark to Hugging Face datasets to train new models and get work done. We’re excited to see our users take advantage of it.”

A new way to integrate Spark dataframes for model development

Databricks believes that the new feature will be a game-changer for enterprises that need to crunch massive amounts of data quickly and reliably to power their machine learning (ML) workflows.

Traditionally, users had to write data into Parquet files, an open-source columnar format, and then reload them using Hugging Face datasets. Spark dataframes were previously not supported by Hugging Face datasets, despite the platform’s extensive range of supported input types.

However, with the new “from_spark” function, users can now use Spark to efficiently load and transform their data for training, drastically reducing data processing time and costs.

“While the old method worked, it circumvents a lot of the efficiencies and parallelism inherent to Spark,” said Craig Wiley, senior director of product management at Databricks. “An analogy would be taking a PDF and printing out each page then rescanning them, instead of being able to upload the original PDF. With the latest Hugging Face release, you can get back a Hugging Face dataset loaded directly into your codebase, ready to train or tune your models with.”
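The older round trip that Wiley describes can be sketched roughly as follows. This is an illustration under stated assumptions, not Databricks' own code: the helper names parquet_glob and parquet_round_trip are hypothetical, and parquet_dir is assumed to be a path that both Spark and the Hugging Face datasets library can read from the same machine.

```python
def parquet_glob(parquet_dir):
    """Build a glob matching the part files Spark writes into parquet_dir."""
    return f"{parquet_dir.rstrip('/')}/*.parquet"


def parquet_round_trip(spark_df, parquet_dir):
    """Write a Spark DataFrame to Parquet, then reload it as a dataset.

    This is the two-step path used before from_spark existed. The import
    is lazy so the module loads even where `datasets` is not installed.
    """
    from datasets import load_dataset

    # Step 1: Spark writes the frame out as a directory of Parquet part files.
    spark_df.write.mode("overwrite").parquet(parquet_dir)

    # Step 2: Hugging Face datasets reads those files back in.
    return load_dataset(
        "parquet",
        data_files=parquet_glob(parquet_dir),
        split="train",
    )
```

The extra write and re-read is the "print and rescan" step in Wiley's analogy; from_spark is meant to remove it from user code.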

Dramatically reduced processing time

The new integration harnesses Spark’s parallelization capabilities to download and process datasets, skipping extra steps to reformat the data. Databricks claims that the new Spark integration has reduced the processing time for a 16GB dataset by more than 40%, dropping from 22 to 12 minutes.
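As a quick check of that claim, the arithmetic can be worked through in a few lines. The only inputs are the 22-minute and 12-minute figures quoted above; the variable names are illustrative.

```python
# Processing times quoted in the article for a 16GB dataset.
before_minutes = 22
after_minutes = 12

# Absolute and relative savings.
saved_minutes = before_minutes - after_minutes
reduction_pct = saved_minutes / before_minutes * 100

print(f"Saved {saved_minutes} minutes ({reduction_pct:.1f}% less time)")
```

Ten minutes saved out of 22 works out to roughly 45%, which is consistent with the "more than 40%" figure.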

“Since AI models are inherently dependent on the data used to train them, organizations will discuss the tradeoffs between cost and performance when deciding how much of their data to use and how much fine-tuning or training they can afford,” Wiley explained. “Spark will help bring efficiency at scale for data processing, while Hugging Face provides them with an evolving repository of open-source models, datasets and libraries that they can use as a foundation for training their own AI models.”

Contributing to open-source AI development

Databricks aims to support the open-source community through the new release, saying that Hugging Face excels in delivering open-source models and datasets. The company also plans to bring streaming support via Spark to enhance the dataset loading.

“Databricks has always been a very strong believer in the open-source community, in no small part because we’ve seen first-hand the incredible collaboration in projects like Spark, Delta Lake, and MLflow,” said Wiley. “We think it will take a village to raise the next generation of AI, and we see Hugging Face as a fantastic supporter of these same ideals.”

Recently, Databricks introduced a PyTorch distributor for Spark to facilitate distributed PyTorch training on its platform and added AI functions to its SQL service, allowing users to integrate OpenAI (or their own models in the future) into their queries.

In addition, the latest MLflow release adds support for the transformers library, OpenAI and LangChain.

“We have quite a lot in the works, both related to generative AI and more broadly in the ML platform space,” added Wiley. “Organizations will need easy access to the tools needed to build their own AI foundation, and we’re working hard to provide the world’s best platform for them.”
