Data chess game: Databricks vs. Snowflake, part 1

This is the first of a two-part series. Read part 2, which looks at how Databricks, MongoDB and Snowflake are making moves for the enterprise.

Editor's note: A previous version of this article incorrectly stated that Databricks, unlike Snowflake, "runs within a single region and cloud, as the Databricks service does not currently have cross-region or cross-cloud replication features." This statement has been removed.

June was quite a month by post-lockdown standards. Not only did live events return with a vengeance after a couple of years of endless Zoom marathons, but the start of summer saw a confluence of events from arguably the data world’s hottest trio: in sequential order, MongoDB, Snowflake and Databricks.

There may be stark and subtle differences in each of their trajectories, but the common thread is that each is aspiring to become the next-generation default enterprise cloud data platform (CDP). And that sets up the next act for all three: Each of them will have to reach outside their core constituencies to broaden their enterprise appeal.

Because we’ve got a lot to say from our June trip report with the trio of data hotshots, we’re going to split our analysis into two parts. Today, we’ll focus on the chess game between Databricks and Snowflake. Tomorrow, in part 2, we’ll make the case for why all three companies must step outside their comfort zones if they are to become the next-generation go-to data platforms for the enterprise.

The data lakehouse sets the agenda

We noted that MongoDB and Snowflake, as they move into analytics and transaction processing, respectively, may eventually be on a collision course. But for now, it’s all about the forthcoming battle for hearts and minds in analytics between Databricks and Snowflake, and that’s where we’ll confine our discussion here.

The grand context is the convergence of data warehouse and data lake. About five years ago, Databricks coined the term “data lakehouse,” which subsequently touched a nerve. Almost everyone in the data world, including Oracle, Teradata, Cloudera, Talend, Google, HPE, Fivetran, AWS, Dremio and even Snowflake, has had to chime in with a response. Databricks and Snowflake came from the data lake and data warehousing worlds, respectively, and both are now running into each other with the lakehouse. They’re not the only ones, but both arguably have the fastest-growing bases.

The lakehouse is simply the means to the end for both Databricks and Snowflake as they seek to become the data and analytics destination for the enterprise.

To oversimplify, Snowflake invites the Databricks crowd with Snowpark, as long as they are willing to have their Java, Python or Scala routines execute as SQL functions. The key to Snowpark is that data scientists and engineers don’t have to change their code.
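To make the idea of a routine "executing as a SQL function" concrete, here is a minimal sketch. It is an analogy only: it uses SQLite from the Python standard library as a stand-in, not Snowpark or any Snowflake API, and the function name `normalize_score` and the `readings` table are hypothetical. The point it illustrates is that the Python routine itself is not rewritten; it is simply registered under a SQL name and then called from a query.

```python
import sqlite3


# An ordinary Python routine, written with no knowledge of any database.
def normalize_score(raw, lo, hi):
    """Scale raw into the 0..1 range, clamping out-of-range values."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (raw - lo) / (hi - lo)))


def run_demo():
    """Register the routine as a SQL function and call it from a query."""
    conn = sqlite3.connect(":memory:")
    # Expose the unchanged Python function under a SQL name (3 arguments).
    conn.create_function("normalize_score", 3, normalize_score)
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 25.0), ("b", 50.0), ("c", 120.0)],
    )
    rows = conn.execute(
        "SELECT sensor, normalize_score(value, 0, 100) "
        "FROM readings ORDER BY sensor"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for sensor, score in run_demo():
        print(sensor, score)
```

The trade-off the article raises shows up even in this toy: the function is invoked once per row by the SQL engine, which is why the performance of the UDF path is an open question.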

Meanwhile, Databricks is inviting the Snowflake crowd with a new SQL query engine that’s far more functional and performant than the original Spark SQL. Ironically, in these scuffles, Spark is currently on the sidelines: Snowpark doesn’t (yet) support Spark execution, while the new Databricks SQL, built on the Photon query engine, doesn't use Spark.

The trick question for both companies is how to draw the Python programmer. For Snowflake, the question is whether user-defined functions (UDFs) are the most performant path, and here, the company is investing in Anaconda, which is optimizing its libraries to run in Snowpark. Databricks faces the same question, given that Spark was written in Scala, which has traditionally had the performance edge. But with Python, the differences may be narrowing. We believe that Snowflake will eventually add capability for native execution in-database of Python and perhaps Spark workloads, but that will require significant engineering and won't happen overnight.

Meanwhile, Databricks is rounding out the data lakehouse, broadening the capabilities of its new query engine while adding a Unity Catalog as the foundation for governance, with fine-grained access controls, data lineage and auditing, and leveraging partner integrations for advanced governance and policy management. Andrew Brust provided the deep dive on the new capabilities for Delta Lake and related projects such as Project Lightspeed in his coverage of the Databricks event last month.

Who’s more open, and does it matter?

Databricks and Snowflake also differ on open source. This can be a subjective concept, which we’ve documented here, here, here, here, and here, and we’re not about to revisit the debate. Been there, done that.

Suffice it to say that Databricks claims that it’s far more open than Snowflake, given its roots with the Apache Spark project. It points to enterprises that run Presto, Trino, DIY Apache Spark or commercial data warehouses directly on Delta without paying Databricks. And it extends the same argument to data sharing, as we’ll note below. To settle the argument on openness, Databricks announced that remaining features of Delta Lake are now open source.

Meanwhile, Snowflake makes no apologies for adhering to the traditional proprietary mode, as it maintains that's the most effective way to make its cloud platform performant. But Snowpark's APIs are open to all comers, and if you don't want to store data in Snowflake tables, Snowflake has just opened support for Parquet files managed by open-source Apache Iceberg as the data lake table format. Of course, that leads to more debates as to which open-source data lake table format is the most open: Delta Lake or Iceberg (OK, don’t forget Apache Hudi). Here’s an outside opinion, even if it isn't truly unbiased.

Databricks makes open source a key part of its differentiation. But excluding companies like Percona (which makes its business delivering support for open source), it’s rare for any platform to be 100% open source. And for Databricks, features such as its notebooks and the Photon engine powering Databricks SQL are strictly proprietary. As if there’s anything wrong with that.

Now the hand-to-hand combat

Data warehouses have been known for delivering predictable performance, while data lakes are known for their capability to scale, support polyglot data and run deep, exploratory analytics and complex modeling. The data lakehouse, a concept introduced by Databricks nearly five years ago, is intended to deliver the best of both worlds, and to its credit, the term has been adopted by much of the rest of the industry. The operative question is, can data lakehouses deliver the consistent SLAs produced by data warehouses? That’s the context behind Databricks’ promotion of Delta Lake, which adds a table structure to data stored in open-source Parquet files.
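As a rough illustration of what "adding a table structure" to a pile of files can mean, the sketch below models a table as an ordered log of commits, each of which adds or removes immutable data files. This is a deliberately simplified toy in the spirit of a transaction-log table format, not Delta Lake's actual on-disk format or API; the `Action` class, the `snapshot` function and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    op: str    # "add" or "remove"
    path: str  # name of an immutable data file (for example, a Parquet file)


def snapshot(log, version=None):
    """Replay commits 0..version and return the set of live data files."""
    if version is None:
        version = len(log) - 1
    live = set()
    for commit in log[: version + 1]:
        for action in commit:
            if action.op == "add":
                live.add(action.path)
            elif action.op == "remove":
                live.discard(action.path)
            else:
                raise ValueError(f"unknown op: {action.op}")
    return live


# Hypothetical history: each inner list is one atomic commit.
log = [
    [Action("add", "part-0.parquet"), Action("add", "part-1.parquet")],    # v0: initial load
    [Action("add", "part-2.parquet")],                                      # v1: append
    [Action("remove", "part-0.parquet"), Action("add", "part-3.parquet")],  # v2: rewrite one file
]

if __name__ == "__main__":
    print(sorted(snapshot(log)))     # files a reader sees at the latest version
    print(sorted(snapshot(log, 0)))  # files a reader sees at version 0
```

Because readers resolve the table through the log rather than by listing files, every query sees a consistent set of files, which is one ingredient of the warehouse-style predictability the lakehouse is aiming for.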

That set the stage for Databricks’ TPC-DS benchmarks last fall, which Andrew Brust put in perspective, and of course, Snowflake responded. At the conference, Databricks CEO Ali Ghodsi updated the results. Watching him extol the competitive benchmarks vs. Snowflake rekindled cozy recollections of Larry Ellison unloading on Amazon Redshift with Autonomous Database. We typically take benchmarks with a grain of salt, so we won’t dwell on exact numbers here. Suffice it to say that Databricks claims superior price performance over Snowflake by orders of magnitude when accessing Parquet files. Of course, whether this reflects configurations representative of BI workloads is a matter for the experts to debate.

What’s interesting is that Databricks showed that it wasn’t religiously tied to Spark. Actually, here’s a fun fact: We learned that roughly 30% of workloads run on Databricks are not Spark.

For instance, the newly released Photon query engine is a complete rewrite, rather than an enhancement of Spark SQL. Here, Databricks replaced the Java code, JVM constructs and the Spark execution engine with the proven C++ used by all the household names. C++ is far more stripped down than Java and the JVM and is far more efficient at managing memory. The old is new again.

Sharing data, spreading the footprint

This is an area where Snowflake sets the agenda. It introduced the modern concept of data sharing in the cloud roughly five years ago with the data sharehouse, which was premised on internal line organizations sharing access and analytics on the same body of data without having to move it.

The idea was a win-win for Snowflake because it provided a way to expand its footprint within its customer base, and since the bulk of Snowflake’s revenue comes from compute, not storage, more sharing of data means more usage and more compute. Subsequently, the hyperscalers hopped on the bandwagon, adding datasets to their marketplaces.

Fast forward to the present and data sharing is behind Snowflake's pivot from cloud data warehouse to data cloud. The pitch is that the Snowflake cloud should be your organization's destination for analytics. A key draw of Snowflake data sharing is that, if the data is within the same region of the same cloud, it doesn't have to move or be replicated. Instead, data sharing is about the granting of permissions. The flip side is that Snowflake’s internal and external data sharing can extend across cloud regions and different clouds, as it does support the necessary replication.
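The distinction between sharing by permission and sharing by replication can be sketched in a few lines. What follows is a toy model under stated assumptions, not Snowflake's implementation or API: the `Dataset` class, the `share` and `read` functions, and the choice to create a replica lazily on first remote read are all hypothetical, chosen only to show that a same-region consumer reads the one stored copy while a consumer in another region or cloud reads a replicated copy.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    home: tuple                                   # (cloud, region) where the data lives
    rows: list
    grants: set = field(default_factory=set)      # accounts allowed to read
    replicas: dict = field(default_factory=dict)  # (cloud, region) -> copied rows


def share(ds, account):
    """Sharing is only a permission entry; no data is copied."""
    ds.grants.add(account)


def read(ds, account, location):
    """Return the rows an account sees from a given (cloud, region)."""
    if account not in ds.grants:
        raise PermissionError(f"{account} has no grant on {ds.name}")
    if location == ds.home:
        return ds.rows  # the same stored object, zero copies
    # Assumption: a different region or cloud needs a replica,
    # which this toy creates on first use.
    if location not in ds.replicas:
        ds.replicas[location] = list(ds.rows)
    return ds.replicas[location]


if __name__ == "__main__":
    sales = Dataset("sales", ("aws", "us-east-1"), [1, 2, 3])
    share(sales, "acct_b")
    local = read(sales, "acct_b", ("aws", "us-east-1"))
    print(local is sales.rows, len(sales.replicas))
    remote = read(sales, "acct_b", ("azure", "westeurope"))
    print(remote is sales.rows, len(sales.replicas))
```

In this model, adding a consumer in the home region costs nothing in storage, which is consistent with the article's point that more sharing mainly drives more compute.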

The latest update to Snowflake Data Marketplace, which is now renamed Snowflake Marketplace, is that data providers can monetize their data and, in a new addition, their UDFs via a Native Application Framework, which certifies that those routines will run within Snowpark. They can sell access to the data and native apps sitting in Snowflake without having to pay any commission to Snowflake. The key is that this must happen within the Snowflake walled garden as the marketplace only covers data and apps residing in Snowflake.

Last month, Databricks came out with its answer, announcing the opening of internal and external data marketplaces. The marketplace goes beyond datasets to include models, notebooks and other artifacts. One of the features of Databricks marketplace is data cleanrooms, in which providers maintain full control over which parties can perform what analysis on their data without exposing any sensitive data such as personally identifiable information (PII), a capability that Snowflake already had.

There are several basic differences between the Snowflake and Databricks marketplaces, reflecting policy and stage of development. The policy difference is about monetization, a capability that Snowflake just added and that Databricks purposely refrained from adding. Databricks’ view is that data providers will not likely share data via disintermediated credit card transactions, but will instead rely on direct agreements between providers and consumers.

Databricks’ hands-off policy toward data and artifacts in its marketplace extends to the admission fee, or more specifically, the lack of one. Databricks says that providers and consumers in its marketplace don’t have to be Databricks subscribers.

Until recently, Databricks and Snowflake didn’t really run into each other as they targeted different audiences: Databricks focused on data engineers and data scientists developing models and data transformations, working through notebooks, while Snowflake appealed to business and data analysts through ETL and BI tools for query, visualization and reporting. This is another case of the sheer scale of compute and storage in the cloud eroding technology barriers between data lakes and data warehousing, and with it, the barriers between different constituencies.

Tomorrow, we’ll look at the other side of the equation. Databricks and Snowflake are fashioning themselves into data destinations, as is MongoDB. They are each hot-growth database companies, and they will each have to venture outside their comfort zones to get there.

Stay tuned.

This is the first of a two-part series. Tomorrow’s post will outline the next moves that Databricks, MongoDB and Snowflake should take to appeal to the broader enterprise.
