Treeverse raises $23M to bring Git-like version control to data lakes


While data lakes and data warehouses are conceptually similar, they are ultimately very different beasts. If a company is looking to house easy-to-query structured data for anyone to use, then a data warehouse is likely its best bet. Conversely, if the company wants to leverage big data in its purest, most flexible form, it is most likely looking for a data lake -- because the data is kept in its native, unprocessed format, there are unlimited ways to query it as the business's needs evolve.

However, massive data lakes constituting petabytes of different datasets can become unwieldy and difficult to manage. This is the problem that fledgling startup Treeverse wants to solve with an open source platform called LakeFS, which is designed to help enterprises manage their data lake in a way similar to how they manage their code -- "transform your object storage into a Git-like repository," as the company puts it. This means version control and other Git-like operations such as branch, commit, merge, and revert, as well as full reproducibility of all data and code.

"The number one problem LakeFS solves is the manageability of large-scale data lakes featuring many datasets that are maintained by lots of different people -- at this scale, a lot of the workflows people are familiar with start to break," Treeverse cofounder and CEO Einat Orr told VentureBeat. "The Git-like operations exposed by LakeFS can solve these problems, similar to the way Git allows many developers to collaborate over a large codebase without causing code quality issues."

Founded out of Tel Aviv in 2020, Treeverse has largely flown under the radar before now, but today the Israeli company revealed that it has raised $23 million in a series A round of funding from Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures. The funding will be used to expedite both the development and adoption of LakeFS in enterprise data teams. The company already lays claim to users at companies such as Slice, Similarweb, and Karius.

But where does LakeFS sit in the data stack, exactly? And what other tools might fit into that stack?

A modern enterprise data stack typically comprises various tools, including data ingestion smarts from companies such as Fivetran and cloud-based data lakes or data warehouses like Snowflake or Google's BigQuery. The process of pooling data from multiple sources (e.g. CRM and marketing tools) and unifying it into a standard format so that it's easy to run queries and analytics against is usually done via "extract, transform, and load" (ETL), where the data is transformed before entry to the warehouse, or through "extract, load, and transform" (ELT), where the data is transformed on-demand within a warehouse or lake.
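The difference between the two approaches comes down to where the transformation step runs. The following is a minimal sketch of that ordering only; the `normalize` transform, the record shape, and the in-memory "store" are all hypothetical and do not correspond to any vendor's product.

```python
# Hypothetical illustration: every name and record here is invented for the sketch.
from typing import Callable

Record = dict[str, str]


def normalize(record: Record) -> Record:
    """Unify a raw record into a standard format (lowercase keys, trimmed values)."""
    return {key.lower(): value.strip() for key, value in record.items()}


def etl(source: list[Record], transform: Callable[[Record], Record]) -> list[Record]:
    """ETL: transform before loading, so the store only ever holds processed data."""
    return [transform(record) for record in source]


def elt(source: list[Record]) -> list[Record]:
    """ELT: load the raw records untouched; transformation is deferred."""
    return list(source)


def query(store: list[Record], transform: Callable[[Record], Record]) -> list[Record]:
    """With ELT, the transform is applied on demand when the data is read."""
    return [transform(record) for record in store]


raw = [{"Email": " a@example.com "}]
warehouse = etl(raw, normalize)  # processed at load time
lake = elt(raw)                  # kept raw; processed later via query()
```

In this sketch, `query(lake, normalize)` yields the same records as `warehouse`, but the lake still holds the raw input, so a different transform can be applied later.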

LakeFS sits between the ELT technology and the data lake. "Integrating ELT technologies with LakeFS enables writing new data to a designated branch, and testing it to ensure quality before exposing to consumers," Orr explained. "This workflow provides important guarantees about production data to consumers of the data."
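The workflow Orr describes can be sketched as a toy in-memory model: new data lands on a separate branch, is checked, and is merged into the production branch only if the check passes. This is not the LakeFS API; the `Repo` class, the `ingest` helper, and the quality rule (every row must carry a `user_id`) are assumptions made purely for illustration.

```python
# Toy model of "write to a branch, test, then expose". Not the LakeFS API:
# Repo, ingest, and the quality rule below are hypothetical.

class Repo:
    """In-memory stand-in for a versioned object store."""

    def __init__(self):
        # branch name -> {object path: content}
        self.branches = {"main": {}}

    def create_branch(self, name, source="main"):
        # A new branch starts as a copy of the source branch's view.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch, path, content):
        self.branches[branch][path] = content

    def read(self, branch, path):
        return self.branches[branch].get(path)

    def merge(self, source, target="main"):
        # Simplified merge: additions and overwrites only, no conflict handling.
        self.branches[target].update(self.branches[source])


def passes_quality_check(rows):
    """Hypothetical rule: every row must have a non-empty user_id."""
    return all(row.get("user_id") for row in rows)


def ingest(repo, path, rows, branch="ingest"):
    """Write to an isolated branch and merge into main only if the check passes."""
    repo.create_branch(branch)
    repo.write(branch, path, rows)
    if passes_quality_check(repo.read(branch, path)):
        repo.merge(branch)
        return True
    return False  # consumers reading main never see the bad data
```

Calling `ingest(repo, "events/day1", rows)` with valid rows makes them visible on `main`; with invalid rows the data stays on the ingest branch and `main` is unchanged.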

Existing products on the market comparable to LakeFS include machine learning operations (MLOps) tools such as DVC, which is developed by a company called Iterative.ai that raised $20 million just last month, and Pachyderm. However, they are aimed chiefly at data scientists building machine learning models. "LakeFS takes an holistic infrastructure approach and provides data version control capabilities across all providers and consumers of data through the applications they use," Orr said.

Elsewhere, open table storage formats such as Databricks' Delta Lake offer something similar in terms of allowing "time travel" (reverting to data in a previous form) on a per-table basis, though LakeFS enables this over an entire data repository that could stretch across thousands of different tables.

Data play

There has been significant activity across the broader data engineering space of late. Fishtown Analytics recently rebranded as dbt Labs and raised $150 million in funding at a $1.5 billion valuation to help analysts transform data in the warehouse, while Airbyte also secured venture backing this year before opening up its data integration platform to support data lakes. And GitLab recently spun out a new data integration platform called Meltano as an independent company.

One thing all these commercial companies have in common is that they are built on open source projects. And so the most obvious outstanding question when any young VC-backed company pitches its open source wares is this: What's your business model? For Treeverse, the answer is that there are no immediate plans to monetize, though of course the longer-term plan is to build a commercial product on top of LakeFS.

"Our goal is to develop the open source project and foster a vibrant community around it," Orr explained. "Once we achieve our targets there, we'll shift focus to providing an enterprise version of LakeFS that offers common premium features like managed-hosting and predefined workflows that bring best practices and ensure high quality data and resilient pipelines."
