Databricks and Alteryx, two data startups, intend to drive Spark into the hands of more data analysts through a formal partnership, the companies have revealed to VentureBeat. Alteryx will become a primary committer to SparkR, part of the open-source, in-memory Spark engine often seen as the leading candidate to replace MapReduce, Alteryx said.
MapReduce, originally conceived at Google, is the initial programming model for the Hadoop ecosystem of open-source tools for analyzing lots of different kinds of data. But while MapReduce boasts strong scalability, fault tolerance, and throughput, it generally runs jobs on a batch basis. That is quite limiting in terms of latency and accessibility, argued Alteryx chief operating officer George Mathew in a conversation with VentureBeat.
You need a custom MapReduce programmer every time you want to get something out of Hadoop, but that’s not the case for Spark, said Mathew. Alteryx is working toward a standardized Spark interface for asking questions directly against data sets, which broadens Spark’s accessibility from hundreds of thousands of data scientists to millions of data analysts: folks who know how to write SQL queries and model data effectively, but aren’t experts in writing MapReduce jobs in Java.
The Spark framework is well equipped to handle those queries, as it exploits the memory spread across all of the servers in a cluster. That means it can run analytics models at blazing-fast speeds compared to MapReduce: programs can run as much as 100 times faster in memory or 10 times faster on disk. Those performance enhancements, and the subsequent customer demand, have prompted Hadoop distribution vendors like Cloudera and MapR to support Spark.
Databricks, founded by the creators of Spark, today announced $33 million in new funding, bringing its total venture financing to $47 million. It also revealed a new service for running Spark jobs and visualizing data on a Databricks-owned cloud. That’s another move by Databricks to make Spark as accessible as possible, a goal the Alteryx partnership will help push forward.
“We want to create a whole new generation of data blenders and analytics modelers that were never able to touch this stuff before,” Mathew said. “We’re just really excited to be working on this together.”
Alteryx will focus primarily on SparkR, while Databricks will focus largely on Spark SQL, according to Mathew.