Databricks strikes back

The summer has barely started, but MongoDB World and Snowflake Summit are now past tense, even as the paint is still drying on all the announcements made at each event. With its Data + AI Summit kicking off as a hybrid virtual/in-person event in San Francisco today, Databricks is wasting no time responding, with a huge manifest of its own announcements.

Databricks' cofounder and chief technologist (and Apache Spark creator) Matei Zaharia briefed VentureBeat on all the announcements. They fall into two buckets: enhancements to open-source technologies underlying the Databricks platform -- like Apache Spark -- on the one hand, and enhancements, previews and general availability (GA) releases pertaining to the proprietary Databricks platform on the other.

In this post, I'll cover the full range of announcements. There's a lot here, so feel free to use the subheads as a kind of random access interface to read the bits you might care about most, then come back and read the rest if you have time.

Spark Streaming goes Lightspeed

Because Spark and its companion open-source projects have become de facto industry standards at this point, I'd like to start with the announcements in that sphere. First, for Spark itself, Databricks is making two roadmap announcements, covering streaming data processing as well as connectivity for Spark client applications. Spark Streaming has been a subproject of Spark for many years, and its last major enhancement -- a technology called Spark Structured Streaming -- GA'd five years ago. Essentially, this has meant that the tech around processing of streaming data on Spark has languished, a fact advocates of competing platforms had started to leverage.

In Zaharia's words, "We didn't have a very large streaming team, you know, after we built the Spark streaming APIs in the first three or four years of the company." He added, "We were just kind of maintaining that and we found it was one of the fastest growing workloads on our platform; it's growing faster than the rest."

This realization that Spark Streaming needed some love has resulted in an umbrella effort that Databricks is calling Project Lightspeed, to create a next-gen implementation of Spark Streaming. Databricks says Lightspeed will bring performance and latency improvements to processing streaming data; add new functionality, like advanced windowing and pattern matching; and make streaming operations easier in general.

Databricks has formed a new streaming team to drive Lightspeed and has named recent hire Karthik Ramasamy, formerly of Twitter and co-creator of Apache Pulsar, to lead it. Databricks also recently recruited Alex Balikov from Google Cloud, and has appointed him senior tech lead on the streaming team. Now let's wait and see if processing streaming data on Spark can become relatively manageable for the average developer.

RESTful access

Speaking of developers, another Spark roadmap announcement involves something called Spark Connect, which will essentially implement a REST API for Spark, both for operational tasks (like submitting jobs and retrieving results) and managerial ones (like sizing and load balancing Spark clusters or scheduling jobs). This will remove the hard requirement for using programming language- and version-specific client libraries and allow application developers to take a more loosely coupled approach to working with Spark, using just HTTP.

Delta Lake opens up

Sticking with open-source announcements but moving beyond Apache Spark proper brings us to two related projects, both domiciled at the Linux Foundation: Delta Lake and MLflow. Delta Lake is one of three popular technologies for bringing data warehouse-like functionality to data lakes stored in open storage formats like Apache Parquet. Delta Lake has seemingly been in the lead, but rival format Apache Iceberg has recently surged ahead, seeing adoption at companies like Dremio, Cloudera and Snowflake. One of the chief criticisms of Delta Lake has been that Databricks has maintained overly tight control of it and has commingled the open-source file format with Databricks-proprietary technology like time travel (which allows previous states of a dataset to be examined).

Perhaps in reaction to that criticism, Databricks is today announcing Delta Lake 2.0. The new version brings both performance enhancements and greater openness. Specifically, Databricks says it is contributing all of Delta Lake to the Linux Foundation open-source project, so that all adopters of the format can work with the same codebase and have access to all features.

MLflow, part deux

Open-source project MLflow forms the backbone of Databricks' MLOps capabilities. Although proprietary components, including the Databricks feature store, exist, the MLflow-based functionality includes machine learning experiment execution and management, as well as a model repository with versioning. Today, Databricks is announcing MLflow 2.0, which will add a major new feature, called Pipelines. Pipelines are templates for setting up ML applications, so everything's ready for productionalization, monitoring, testing and deployment. The templates -- based on code files and Git-based version control -- are customizable and allow monitoring hooks to be inserted. Although Pipelines are based on source code files, developers can interact with them from notebooks, which provides a good deal of flexibility. Adding Pipelines should be a boon to the industry, as numerous companies, including all three major cloud providers, have either adopted MLflow as a standard or documented how to use it with their platforms.

Databricks SQL matures

There's a lot going on on the proprietary side, too. To begin with, Databricks SQL's Photon engine, which brings query optimization and other data warehouse-like features to the Databricks platform, will be released to GA in July. Photon has recently picked up important enhancements, including support for nested data types and accelerated sorting capabilities.

Along with that, Databricks is releasing several open-source connectors to Databricks SQL, for languages including Node.js, Python and Go. Databricks SQL is also getting its own command line interface (CLI) and will now sport a query federation feature, allowing it to join tables/data sets from different sources in the same query. The latter feature leverages Spark's own ability to query multiple data sources.

One interesting thing about Databricks SQL is that it supports different cluster types than are made available for other Databricks workloads. The special clusters, called SQL warehouses (and formerly called SQL endpoints), are "T-shirt-sized" and feature cloud server instances that are optimized for business intelligence-style queries. Now, however, a new option is launching in preview on AWS: Databricks SQL Serverless, which will allow customers to query their data via Databricks SQL without creating a cluster at all.

Delta Live Tables

Want more? Delta Live Tables, the Databricks platform's SQL-based declarative facility for ETL and data pipelines, is getting several enhancements, including new performance optimizations, Enhanced Autoscaling and change data capture (CDC). The CDC support makes the platform compatible with slowly changing dimensions, allowing them to be updated incrementally, rather than from scratch, when dimensional hierarchies change.

The last of these is important -- it allows analytical queries to run undisrupted when, for example, a certain branch office is reclassified as being in a different regional division. Queries covering a timespan when it was in its original division will attribute sales at that office to that division; queries covering a later time span will attribute sales to the new division, and queries spanning both will allocate the correct sales amounts to each of the respective divisions.
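For readers who want to see the branch office scenario in concrete terms, here is a minimal sketch in plain Python. It is not Delta Live Tables code and does not use any Databricks API; the office name, division names, dates, sales figures and function names are all hypothetical, and the approach shown (closing out the old dimension row and appending a new one, often called a "Type 2" slowly changing dimension) is just one common way to get the behavior described above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    """One version of an office's division assignment (hypothetical schema)."""
    office: str
    division: str
    valid_from: date            # inclusive
    valid_to: Optional[date]    # exclusive; None marks the current row


def reclassify(history, office, new_division, effective):
    """Incremental update: close the office's current row, append a new one."""
    updated = []
    for row in history:
        if row.office == office and row.valid_to is None:
            updated.append(DimRow(row.office, row.division, row.valid_from, effective))
        else:
            updated.append(row)
    updated.append(DimRow(office, new_division, effective, None))
    return updated


def division_on(history, office, day):
    """Return the division the office belonged to on a given day."""
    for row in history:
        if (row.office == office and row.valid_from <= day
                and (row.valid_to is None or day < row.valid_to)):
            return row.division
    raise LookupError(f"no division recorded for {office} on {day}")


def sales_by_division(history, sales, start, end):
    """Total sales per division for sales dated within [start, end]."""
    totals = {}
    for office, day, amount in sales:
        if start <= day <= end:
            division = division_on(history, office, day)
            totals[division] = totals.get(division, 0) + amount
    return totals


# Made-up data: one office, moved from "South" to "West" on Jan. 1, 2022.
HISTORY = reclassify(
    [DimRow("Austin", "South", date(2021, 1, 1), None)],
    "Austin", "West", date(2022, 1, 1),
)
SALES = [
    ("Austin", date(2021, 6, 1), 100),
    ("Austin", date(2021, 11, 15), 50),
    ("Austin", date(2022, 3, 1), 70),
]

if __name__ == "__main__":
    print(sales_by_division(HISTORY, SALES, date(2021, 1, 1), date(2021, 12, 31)))
    print(sales_by_division(HISTORY, SALES, date(2022, 1, 1), date(2022, 12, 31)))
    print(sales_by_division(HISTORY, SALES, date(2021, 1, 1), date(2022, 12, 31)))
```

Note that the reclassification touches only the one affected row and adds one new row; nothing else in the dimension is rebuilt, which is the "incrementally, rather than from scratch" part.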

Catalog, Cleanrooms and Marketplace

Databricks Unity Catalog will be released to GA later this summer, complete with new lineage capabilities that were just recently added. A new "Data Cleanrooms" feature will allow queries that span data from two different parties to be performed in the cloud without either party needing to send its data to the other. Instead, each party's data will be put into a kind of digital escrow and, provided both parties grant approval, jobs using both their data will be executed in Databricks' cloud, from which the data will subsequently be deleted.

Finally, Databricks is starting up its own marketplace, but with a couple of differences from typical data marketplace offerings. To begin with, Databricks Marketplace offerings can consist of whole solutions, including applications and examples, rather than datasets alone. And because the product is based on Delta Sharing, Databricks says it can be used by clients that are not actually using the Databricks platform itself.

Where this leads us

As the data and analytics space consolidates and the new generation of leaders emerges, the competition is getting fierce. The customer benefits as major players start to play in each other's territory, all looking to service analytical, operational, streaming, data engineering and machine learning workloads in a multicloud fashion. Databricks has doubled down on investments in certain of these areas and has expanded investments to others. What's especially nice about that is the cascading effect it has on several open source projects, including Spark, Delta Lake and MLflow.

Will Databricks eventually allow single clusters to span multiple clouds, or even turn its focus to on-premises environments? Will Delta Lake or Apache Iceberg emerge as the standard lakehouse storage technology? Will the Databricks feature store component get open sourced to round out MLflow's appeal versus commercial MLOps platforms? Will Databricks SQL Serverless slay Amazon Athena's business franchise? Watch this data space. Customers will place their bets in the next couple of years, as the lakehouse standard bearers build their momentum and map out their territory.