Onehouse brings a fully-managed lakehouse to Apache Hudi

Much has been written about the respective benefits of data warehouses and data lakes. The former is generally a better solution for housing transformed, structured data that's easier to query. The latter, however, is best for unstructured data in its purest, most flexible form that can be queried in more-or-less limitless ways as a company's needs evolve.

There are pros and cons to both, of course, but a third data management architecture has recently emerged that meshes the best of lakes and warehouses -- it's called, somewhat predictably, a data lakehouse. Built on an open data architecture, a data lakehouse can manage all data formats, including structured, unstructured, and semi-structured. Additionally, data lakehouses can support multiple data workloads, and can be deployed on top of low-cost cloud storage solutions, similar to a data lake.

With that in mind, a new company called Onehouse emerged from stealth this week with a mission to bring the benefits of data lakehouses to the enterprise. It plans to do this by selling a managed service on top of the Apache Hudi open source project, which was developed internally at Uber back in 2016 to bring data warehouse-like functionality to data lakes. In the intervening years, Hudi has been adopted by major companies such as Amazon, Disney, and ByteDance.

Former Uber data architect and Hudi creator Vinoth Chandar set up Onehouse in early 2021, building on the community that has sprung up around the open source project over the previous five years -- it now claims in the region of 1 million downloads each month.

Onehouse officially launched out of stealth yesterday with $8 million in seed funding from Greylock Partners and Addition.

"By combining breakthrough technology and a fully-managed easy-to-use service, organizations can build data lakes in minutes, not months, realize large cost savings and still own their data in open formats," Chandar wrote. "Onehouse aims to be the bedrock of your data infrastructure as the one home for all of your data."

Data challenges

While data has often been referred to as the "new oil," companies frequently have difficulty scaling their data architectures as they grow. They may start with a data warehouse for simpler business intelligence and analytics use cases, but as their data increases and their needs evolve -- particularly relating to AI and machine learning workloads -- they typically turn to a data lake, given that it's cheaper to store data there and it can support more complex and advanced queries. But this comes at a cost.

"The investment in a lake comes with a whole new set of challenges around concurrency, performance, and a lack of mature data management," Chandar said. "Most companies end up living between a rock and a hard place, juggling data across both a lake and a warehouse."

Hudi goes some way toward solving that problem by bringing key warehouse features to data lakes, such as transactions, indexing, and scalable metadata. And that, essentially, is what Onehouse is looking to capitalize on.

While any well-resourced company could take Hudi and deploy it itself, doing so requires a lot of time and effort -- building a data lake, or a data lakehouse, can take months. Onehouse takes much of the spadework out of that by offering a cloud-native managed service that ingests, self-manages, and optimizes the data automatically.

"While a warehouse can just be ‘used’, a lakehouse still needs to be ‘built’," Chandar noted. "Having worked with many organizations on that journey for four years in the Apache Hudi community, we believe Onehouse will enable easy adoption of data lakes and future-proof the data architecture for machine learning and data science down the line."
