ETL company Airbyte raises $5.2M to integrate open source data

Airbyte today announced it has raised $5.2 million in seed funding as part of an effort to make open source tools for managing and integrating data more accessible.

The company, which offers an open source extract, transform, and load (ETL) tool used to create data pipelines, is now seeking to further democratize that process. This includes, for example, building complementary open source tools to govern and secure data, Airbyte cofounder and CEO Michel Tricot told VentureBeat.

Internal IT teams have historically employed ETL tools to move data between repositories. In recent years, however, data analysts have been using these tools to load data into warehouses without requiring any intervention on the part of an IT team.
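For readers unfamiliar with the pattern, the three ETL stages can be sketched in a few lines of Python. This is a generic illustration built only on the standard library, not Airbyte's code or API; the sample data, the `users` table, and the function names are hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real pipeline would read from an API, file, or database.
RAW_CSV = """id,email,signup_date
1, Alice@Example.com ,2021-01-05
2,bob@example.com,2021-01-17
3,,2021-01-20
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize emails and drop rows that lack one."""
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        if email:
            cleaned.append((int(row["id"]), email, row["signup_date"]))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform, and load in order, then read back the result."""
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute("SELECT id, email FROM users ORDER BY id").fetchall()


if __name__ == "__main__":
    # SQLite in memory stands in for the warehouse (an assumption for this sketch).
    print(run_pipeline(sqlite3.connect(":memory:")))
```

In practice, the extract and load steps are what a connector would handle for a specific source or destination, which is why the availability of connectors matters so much to teams choosing an ETL tool.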

All of those tools are licensed by individuals. Airbyte plans to eventually provide versions of its tools licensed by organizations, along with an option to access those tools via a service hosted by Airbyte. Tricot said the company is also planning a managed integration service. "We won’t be focusing on monetization until 2022," he said.

Accel led the current round of funding, with participation from Y Combinator; 8VC; Segment cofounder Calvin French-Owen; former Cloudera GM Charles Zedlewski; Datavant cofounder and CEO Travis May; Machinify president Alain Rossmann; and Auren Hoffman, cofounder and CEO of LiveRamp and CEO of Safegraph.

As of the end of January, more than 600 organizations were using Airbyte ETL tools, including Safegraph, Dribbble, Mercato, GraniteRock, Agridigital, and Cart.com. Many of those organizations are attracted to Airbyte because they don't have to wait for a provider of commercial ETL tools to create connectors for various data sources. Instead, the community collaboratively builds and supports the connectors it deems most critical, Tricot said. The community has thus far certified 50 connectors. Those connectors are encapsulated in Docker containers, which enables them to be deployed on any platform.

ETL processes, along with other classes of data preparation tools, are being reevaluated as organizations increasingly realize that the quality of any AI model they build depends on the reliability of the data used to train it. Data scientists also want to be able to easily update the data needed to retrain models as business conditions evolve, which usually entails having more direct control over which data sources are employed to train those models.

As critical as control over any dataset may be, data scientists are discovering that much of the data stored within enterprise systems is not all that consistent or reliable. Data science teams can easily find themselves spending more time addressing data plumbing issues than they do constructing AI models. Successfully building an AI model, as a consequence, can often require months of time and effort.

ETL tools are not going to resolve that issue on their own. But the easier data becomes to manipulate, the less time it will take to build an AI model and then continuously maintain it as new data sources become available.

It's not clear what impact the availability of open source ETL tools is having on providers of the rival commercial offerings some organizations have been employing for decades. But at a time when many organizations are under pressure to reduce the total cost of IT, the allure of open source software has proven undeniable.
