Automating data pipelines: How Upsolver aims to reduce complexity

Upsolver’s value proposition is interesting, particularly for organizations with streaming data needs, data lakes and data lakehouses, and a shortage of accomplished data engineers. It’s also the subject of a recently published book by Upsolver's CEO, Ori Rafael, titled Unlock Complex and Streaming Data with Declarative Data Pipelines.

Instead of manually coding data pipelines and their plentiful intricacies, you can simply declare what sort of transformation is required from source to target. The underlying engine then handles the logistics in a largely automated fashion (with user input as desired), pipelining source data into a format useful for targets.

Some might call that magic, but it’s much more practical.

"The fact that you’re declaring your data pipeline, instead of hand coding your data pipeline, saves you like 90% of the work," Rafael said.

Consequently, organizations can spend less time building, testing and maintaining data pipelines, and more time reaping the benefits of transforming data for their particular use cases. With today’s applications increasingly involving low-latency analytics and transactional systems, the reduced time to action can significantly impact the ROI of data-driven processes.
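
The difference between declaring a pipeline and hand coding one can be illustrated with a minimal Python sketch. This is not Upsolver's API or syntax; `PipelineSpec`, `run`, and the field names in the sample events are hypothetical, chosen only to show a declaration (what the target should contain) being interpreted by a generic engine (how it gets produced).

```python
from dataclasses import dataclass, field
from typing import Any

Record = dict[str, Any]


@dataclass
class PipelineSpec:
    """Hypothetical declaration: WHAT the target should contain, not HOW."""
    select: dict[str, str]                                 # target column -> source field
    where: dict[str, Any] = field(default_factory=dict)    # equality filters on source fields


def run(spec: PipelineSpec, source: list[Record]) -> list[Record]:
    """Toy engine: applies any declared spec to a batch of source records."""
    return [
        {target: rec.get(src) for target, src in spec.select.items()}
        for rec in source
        if all(rec.get(key) == value for key, value in spec.where.items())
    ]


# Illustrative source events (made-up data).
events = [
    {"user_id": "u1", "event_type": "click", "ts": 1},
    {"user_id": "u2", "event_type": "view", "ts": 2},
]

# The user only writes this declaration; the engine code above is reused as-is.
spec = PipelineSpec(
    select={"user": "user_id", "time": "ts"},
    where={"event_type": "click"},
)
result = run(spec, events)
print(result)
```

Changing the target shape here means editing only the declaration, while the engine stays untouched, which is the kind of saving the declarative approach is meant to offer.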

Underlying complexity of data pipelines

To the uninitiated, numerous aspects of data pipelines may seem convoluted or complicated. Organizations have to account for different facets of schema, data models, data quality and more, often with real-time event data such as that used for ecommerce recommendations. According to Rafael, these complexities fall into three categories: orchestration, file system management, and scale. Upsolver provides automation in each of these areas.

Integrating data

Other than the advent of cloud computing and the distribution of IT resources outside organizations' four walls, the most significant data pipeline driver is data integration and data collection. Typically, no matter how effective a streaming source of data is (such as events in a Kafka topic illustrating user behavior), its true merit is in combining that data with other types for holistic insight. Use cases for this span anything from adtech to mobile applications and software-as-a-service (SaaS) deployments. Rafael articulated a use case for a business intelligence SaaS provider, "with lots of users that are generating hundreds of billions of logs. They want to know what their users are doing so they can improve their apps."

Data pipelines can combine this data with historic records for a comprehensive understanding that fuels new services, features, and points of customer interaction. Automating the orchestration, file system management, and scaling of those data pipelines lets organizations move between sources and business requirements to spur innovation. Another facet of automation that Upsolver handles is the indexing of data lakes and data lakehouses to support real-time data pipelining between sources.

"If I’m looking at an event about a user in my app right now, I’m going to go to the index and tell the index what do I know about that user, how did that user behave before?" Rafael said. "We get that from the index. Then, I’ll be able to use it in real time."
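
The lookup Rafael describes can be pictured with a small Python sketch. It assumes a simple in-memory index keyed by user ID; `UserIndex`, `enrich`, and the event fields are hypothetical stand-ins for illustration and do not represent how Upsolver implements its index.

```python
from collections import defaultdict
from typing import Any

Event = dict[str, Any]


class UserIndex:
    """Hypothetical in-memory stand-in for an index over historical events."""

    def __init__(self) -> None:
        self._history: dict[str, list[Event]] = defaultdict(list)

    def add(self, event: Event) -> None:
        """Record a past event under its user ID."""
        self._history[event["user_id"]].append(event)

    def lookup(self, user_id: str) -> list[Event]:
        """Return a copy of what is known about this user's earlier behavior."""
        return list(self._history.get(user_id, []))


def enrich(event: Event, index: UserIndex) -> Event:
    """Combine a live event with the user's history from the index."""
    past = index.lookup(event["user_id"])
    return {
        **event,
        "prior_events": len(past),
        "last_action": past[-1]["action"] if past else None,
    }


# Illustrative history (made-up data).
index = UserIndex()
index.add({"user_id": "u1", "action": "view"})
index.add({"user_id": "u1", "action": "add_to_cart"})

# A new event arrives and is enriched with historical context.
enriched = enrich({"user_id": "u1", "action": "checkout"}, index)
print(enriched)
```

The point of the sketch is the access pattern: the incoming event carries a key, the index answers "how did that user behave before?", and the combined record is available immediately for downstream use.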

Data engineering

Upsolver’s major components for making data pipelines declarative instead of complicated include its streaming engine, indexing and architecture. Its cloud-ready approach encompasses "a data pipeline platform for the cloud and… we made it decoupled so compute and storage would not be dependent on each other," Rafael remarked.

That architecture, with the automation furnished by the other aspects of the solution, has the potential to reshape data engineering from a tedious, time-consuming discipline to one that liberates data engineers.
