Databricks brings deep learning to Apache Spark

Databricks is giving users a set of new tools for big data processing with enhancements to Apache Spark. The new tools and features make it easier to do machine learning within Spark, process streaming data at high speeds, and run tasks in the cloud without provisioning servers.

On the machine learning side, Databricks announced Deep Learning Pipelines, a library designed to let data scientists and AI novices alike use neural nets in their big data processing. It gives developers high-level APIs for tasks such as loading images, tuning a model’s hyperparameters, and adapting a more general model to a specific case.

The company has also integrated Spark’s Structured Streaming feature into the beta of its enterprise product to accelerate the processing of real-time data. The open source version of Structured Streaming has also been marked generally available. In Databricks’ own tests on the Yahoo Streaming Benchmark, Spark running the new streaming engine performed five times faster than competing streaming engines.

Those two announcements are aimed at helping companies process massive amounts of data, a task that is becoming increasingly important. Spark is one of the key technologies serving the big data market, and these tools make it accessible to more people while letting Spark experts do more with it.

Deep Learning Pipelines lets developers run pre-trained models as well as models built using Keras and TensorFlow. Another feature supports transfer learning, a technique for taking a general model and specializing it for a specific case.
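To make the idea of transfer learning concrete, here is a minimal sketch in plain Python. It does not use the Deep Learning Pipelines API, Keras, or TensorFlow. Every name in it (FROZEN_WEIGHTS, extract_features, train_head, predict) is hypothetical, and the "pre-trained" weights are made-up numbers standing in for a real general-purpose model. The general layer stays frozen, and only a small task-specific layer on top of it is trained.

```python
import math

# Assumption: these fixed numbers stand in for a pre-trained, general model.
# They are never updated during training ("frozen").
FROZEN_WEIGHTS = [[0.9, -0.4], [-0.3, 0.8], [0.5, 0.5]]


def extract_features(x):
    """Run the frozen layer: one tanh unit per row of FROZEN_WEIGHTS."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit a small logistic-regression head on top of the frozen features."""
    feats = [extract_features(x) for x in samples]  # computed once, reused
    weights = [0.0] * len(FROZEN_WEIGHTS)
    bias = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            pred = sigmoid(sum(w * v for w, v in zip(weights, f)) + bias)
            err = pred - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * v for w, v in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(head, x):
    """Classify x with the frozen features plus the trained head."""
    weights, bias = head
    f = extract_features(x)
    return 1 if sigmoid(sum(w * v for w, v in zip(weights, f)) + bias) >= 0.5 else 0


# Toy task-specific data: label 1 when the first value exceeds the second.
samples = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5], [0.5, 2.0]]
labels = [1, 0, 1, 0]
head = train_head(samples, labels)
print(predict(head, [3.0, 1.0]), predict(head, [1.0, 3.0]))
```

In a real setting the frozen part would be a large network trained on a broad dataset, and only the small head would need the comparatively scarce task-specific examples.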

One of the key benefits of using Deep Learning Pipelines in Spark is that it simplifies the integration of deep learning models into applications.

On top of all that, Databricks also announced its Serverless Platform for Apache Spark, which lets developers share a pool of computing resources for running data processing tasks without having to manage their Spark deployments. It’s a move by the company to hop on the serverless bandwagon and provide developers with a way to get work done without worrying about overhead.
