LakeFS brings branching to data lakes

Can enterprises find a better way to organize the relentless onslaught of data? LakeFS thinks the answer is versioning à la Git. LakeFS lets teams create and track different versions of data, essentially imitating the process that developers use to organize their code.

On June 27, the company announced general availability of their service, LakeFS Cloud. Teams will be able to use it to follow the evolution of various versions of their data just as they do with different versions of their code.

“LakeFS is actually an infrastructure. It sits on top of the data,” explains Einat Orr, a cofounder and CEO of LakeFS. “It is an interface between the data lake and the applications. So any application can enjoy the Git-like operations that LakeFS offers, and the data is managed through one consistent interface for the organization.”

For a long time, developers have treated software and data differently. Programmers created versioning systems like Git to help organize software development by tracking changes both small and large. Teams rely on the tool to keep the work of different programmers separate until it’s time to merge and ship a final version. Software teams routinely work with dozens, hundreds or even thousands of different versions arranged in a metaphorical tree with branches.

Data, though, has generally been stored in separate chunks. Developers often make complete copies of different snapshots or backups taken at different times. Tracking the differences was difficult and the proliferation of copies created confusion and large bills for storage.

“The cloud never warned us about the data getting clouded. As the blessing of infinite storage quickly became an unmanageable mess, there is a need for technologies like LakeFS to make data accessible again,” explained Sivan Bercovici, CTO at medical diagnostics company Karius, which has been testing the product with its work on artificial intelligence and data collection.

LakeFS: Systems and services

LakeFS is designed to work with object stores like S3 and different data management systems like Snowflake or BigQuery. The service offers one interface for storage and retrieval and then passes the data on to a backend service like AWS while tracking the current branching. LakeFS imagines that groups may work with several different storage providers. A demonstration playground offers users the chance to try the code without installing it.

The system would help teams by tracking the different branches and merging them only when necessary. A developer might start experimenting with a new feature by creating a branch of the main dataset that’s currently in production. There would be no need to make a complete copy for testing and any changes introduced by the new experiments would be kept in a separate branch that wouldn’t affect the main production version.
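The copy-free branching described above can be illustrated with a toy model. This is a minimal sketch and not LakeFS's implementation or API: the `ToyLake` class, its method names and the file paths are all hypothetical, and it assumes that a branch is simply a map from paths to immutable, content-addressed objects.

```python
import hashlib


class ToyLake:
    """Hypothetical toy model: immutable objects plus per-branch pointer maps."""

    def __init__(self):
        self.objects = {}              # content hash -> bytes (stands in for a bucket)
        self.branches = {"main": {}}   # branch name -> {path: content hash}

    def put(self, branch, path, data):
        # Store the bytes once, keyed by their hash, and point the branch at them.
        key = hashlib.sha256(data).hexdigest()
        self.objects[key] = data
        self.branches[branch][path] = key

    def get(self, branch, path):
        return self.objects[self.branches[branch][path]]

    def create_branch(self, name, source="main"):
        # Only the pointer map is copied; no object data is duplicated.
        self.branches[name] = dict(self.branches[source])

    def merge(self, source, dest):
        # Naive merge: the source's pointers win. No conflict detection,
        # and deletions on the source branch are not propagated.
        self.branches[dest].update(self.branches[source])


lake = ToyLake()
lake.put("main", "events/day1.csv", b"a,b\n1,2\n")

# Experiment on a branch without touching production data.
lake.create_branch("experiment")
lake.put("experiment", "events/day1.csv", b"a,b\n1,3\n")
main_before_merge = lake.get("main", "events/day1.csv")

# Publish the experiment only when it is ready.
lake.merge("experiment", "main")
main_after_merge = lake.get("main", "events/day1.csv")
```

In this sketch the main branch keeps serving the original file until the merge, and the store holds only two objects in total, one per distinct version, rather than a full copy of the dataset for each branch.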

“It's very easy to create a mess in S3 and have copies lying around for years that no one deletes,” said Orr. “When you work with LakeFS, you have the transparency to manage your data properly and the ability to have your retention tied to your business needs because you know that this branch is not being used. You know that this file is not being pointed to by any LakeFS branch.”
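Orr's point about retention can be made concrete with a short sketch. It assumes, hypothetically, that each branch is a map from paths to object keys; the function name and sample keys are invented for illustration, and a real system would also need to account for commit history before deleting anything.

```python
def unreferenced_objects(stored_keys, branches):
    """Return the object keys that no branch points to (hypothetical helper)."""
    referenced = set()
    for pointers in branches.values():
        referenced.update(pointers.values())
    return set(stored_keys) - referenced


# Three objects in the bucket, but only two are reachable from a branch.
stored = {"obj-a", "obj-b", "obj-c"}
branches = {
    "main": {"sales.parquet": "obj-a"},
    "experiment": {"sales.parquet": "obj-b"},
}
orphans = unreferenced_objects(stored, branches)

# Deleting the experiment branch makes its private object a cleanup candidate too.
del branches["experiment"]
orphans_after_delete = unreferenced_objects(stored, branches)
```

Because every live object is reachable from some branch, anything outside that set is a candidate for deletion under whatever retention policy the business sets.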

LakeFS offers developers the option to create different branches and then merge them or delete them as necessary. It also offers webhooks so the operations can be integrated with a number of other development pipelines used for continuous integration and deployment.

“Since introducing LakeFS to our production data environment, we’ve enjoyed the benefits of atomic and isolated operations in our data pipelines. This has allowed us to spend more time improving other aspects of our data platform, and less time dealing with the fallout from race conditions and partially failed operations,” writes Lior Resisi, data platform team lead at Windward.

Data lake competitors

Several other database companies are starting to roll out similar approaches. Both PlanetScale and Neon, for example, offer the opportunity to branch or fork data stored in their systems, which are built around open-source databases like MySQL or PostgreSQL. They launched their versions recently and concentrated on offering the same database interface that developers have grown accustomed to over the years.

LakeFS is designed to work at a lower level with arbitrary object storage. The API accepts calls for blocks of data that are stored in buckets. Branching information is stored alongside as metadata and used, when necessary, to merge or delete objects.

“I think it's important to emphasize that we are format agnostic and that we are very complementary to open table formats such as Delta Lake or Iceberg,” explained Orr. This allows developers to work with larger, more diverse datasets that are often spread out between different products and silos.

The company promises, though, that it will expand its interfaces to work with other storage options. It imagines that LakeFS can become a common API for developers to use, and argues that the savings in time and in storage fees for extra copies will justify the extra cost.

“That’s our vision,” says Orr. “At the end of the day it's not to work only over object storages, but on all data sources that you have.”

The product began as an open-source project sponsored by Treeverse, a U.S. company founded in 2020 by Orr and Oz Katz. Investors include Dell Technologies Capital, Norwest Venture Partners and Zeev Ventures.
