Modern enterprises tend to have data in multiple different locations, which makes querying data for analytics and data science a challenge.
Today at its Datanova conference, Boston-based Starburst announced a series of updates to its Starburst Galaxy cloud and on-premises Enterprise platforms intended to help enterprises organize and query data.
Starburst’s tech leadership includes creators of the open-source Trino SQL query engine, which got its start as the Presto query engine at Facebook in 2013. Trino is also the foundation of Starburst’s commercial products, which help organizations query and organize data found in data lakes, an approach that today is commonly referred to as a data lakehouse.
Among the updates coming to the Starburst portfolio is a concept known as a “data product,” a curated collection of data that can come from different sources. Grouping data this way makes it easier to use for analytics and data science.
Starburst is also adding a new global search capability to help enterprises find data assets, as well as introducing a new data query acceleration capability called “Warp Speed.”
“Data lakes, in general, have gotten significantly better over the years, especially with the new table formats like Apache Iceberg, which solve a lot of the problems of the old-school data lakes,” Matt Fuller, cofounder and VP of product at Starburst, told VentureBeat.
What is a data product anyway?
Apache Iceberg is a data lake table format, which provides some structure to content found in a data lake, making it easier to query. But what happens when an organization has multiple data lakes, or other data sources including databases? That’s where the data product concept fits in.
Starburst had been providing the data product capability in its Enterprise edition and is now bringing that capability to the Starburst Galaxy cloud. Fuller explained that a data product is a highly curated dataset.
The dataset could be something as simple as a table in the data lake that has been configured with permissions so that users see only the subset of data pertinent to a certain use case. A data product could also, Fuller explained, combine data from the data lake with customer information located in a database. The end result is that users see all the data they need in one location, collected into the data product.
Beyond just collating data, Fuller said that the Starburst data product concept will also package the data with metadata, which provides ownership and lineage to help users feel confident in the quality of the data that has been collected.
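To make the idea concrete, here is a minimal sketch of a data product as a curated query bundled with ownership and lineage metadata. It is not Starburst's API: the `DataProduct` class, the table names, and the sample rows are all hypothetical, and two attached in-memory SQLite databases stand in for a data lake and a customer database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class DataProduct:
    """Hypothetical container: a curated query plus descriptive metadata."""
    name: str
    owner: str            # who is responsible for the dataset
    sources: list[str]    # lineage: the tables the rows come from
    sql: str              # the query that defines the curated dataset

    def rows(self, conn: sqlite3.Connection) -> list[tuple]:
        return conn.execute(self.sql).fetchall()


# Stand-ins for two separate systems (assumption: SQLite, not Trino).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("CREATE TABLE lake.orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO lake.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 12.5)],
)
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)

customer_spend = DataProduct(
    name="customer_spend",
    owner="analytics-team",
    sources=["lake.orders", "crm.customers"],
    sql="""
        SELECT c.name, SUM(o.amount) AS total
        FROM lake.orders AS o
        JOIN crm.customers AS c ON c.id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """,
)

# The consumer sees one joined result, not two separate systems.
result = customer_spend.rows(conn)
```

The consumer only calls `rows()`, while the `owner` and `sources` fields carry the kind of ownership and lineage information Fuller describes.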
Before organizations can build data products, they need to understand what data they have. That’s where the new global search capability being added to Starburst will help. Fuller explained that global search lets organizations discover data through a search interface that can then be connected to a Starburst cluster.
Warp Speed ahead for data queries
Back in June 2022, Starburst acquired Israeli Trino vendor Varada, which had been building a data query accelerator technology.
The Varada technology has been integrated into the Starburst platform under the product name Warp Speed. Fuller noted that even prior to the acquisition, Starburst had been partnering with Varada to help joint customers accelerate queries with an advanced data indexing and caching capability.
“It should just make everything faster now,” Fuller said.
That said, he noted that Warp Speed will benefit some workloads more than others. For example, complex queries that involve data aggregation where there are lots of input/output (I/O) operations will experience the greatest benefit.
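The article does not detail how Warp Speed works internally. As a generic illustration of why indexing helps I/O-heavy aggregations, the toy sketch below compares a full scan with an index lookup and counts how many rows each one reads. The data, function names, and index layout are assumptions for illustration only and are not Starburst's implementation.

```python
from collections import defaultdict

# Toy table; in a real system these rows would live in files on storage.
rows = [
    {"region": "EMEA", "amount": 10.0},
    {"region": "APAC", "amount": 7.5},
    {"region": "EMEA", "amount": 2.5},
    {"region": "AMER", "amount": 4.0},
]


def total_full_scan(region):
    """Aggregate by reading every row; returns (total, rows_read)."""
    total, rows_read = 0.0, 0
    for row in rows:
        rows_read += 1
        if row["region"] == region:
            total += row["amount"]
    return total, rows_read


# Index built once: region value -> positions of the matching rows.
index = defaultdict(list)
for position, row in enumerate(rows):
    index[row["region"]].append(position)


def total_with_index(region):
    """Aggregate by reading only indexed rows; returns (total, rows_read)."""
    total, rows_read = 0.0, 0
    for position in index.get(region, []):
        rows_read += 1
        total += rows[position]["amount"]
    return total, rows_read


scan_total, scan_reads = total_full_scan("EMEA")
index_total, index_reads = total_with_index("EMEA")
```

Both functions return the same total, but the indexed version touches only the matching rows, which is the kind of saving that grows with the amount of I/O a query would otherwise perform.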
Python support comes to Starburst
Trino is a SQL query engine, which generally means organizations must use SQL to work with it. That has been a challenge for some, because the open-source Python programming language is extremely popular in data science.
To that end, Starburst is expanding its Python support, enabling organizations to migrate PySpark workloads to Starburst and Trino. PySpark is the Python API for Apache Spark, a popular open-source data processing engine.
“The two languages that are really important for data engineers are SQL, of course, and Python, too,” Fuller said. “People are going to use Python and we want to make sure that we can work really well with both a SQL and a Python interface to Starburst.”
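The article does not describe how Starburst's Python interface is built. One common pattern for putting a Python front end on a SQL engine is to let chained method calls build a SQL string that the engine then runs. The sketch below illustrates that pattern only; the `Query` class, its methods, and the table name are hypothetical and are not Starburst's or PySpark's API.

```python
class Query:
    """Hypothetical DataFrame-like builder that compiles to a SQL string."""

    def __init__(self, table, columns=("*",), conditions=()):
        self._table = table
        self._columns = tuple(columns)
        self._conditions = tuple(conditions)

    def select(self, *columns):
        # Return a new Query so chained calls never mutate earlier steps.
        return Query(self._table, columns, self._conditions)

    def where(self, condition):
        return Query(self._table, self._columns, self._conditions + (condition,))

    def to_sql(self):
        sql = f"SELECT {', '.join(self._columns)} FROM {self._table}"
        if self._conditions:
            sql += " WHERE " + " AND ".join(f"({c})" for c in self._conditions)
        return sql


# Python-style method chaining on the outside, plain SQL on the inside.
sql = (
    Query("lake.sales.orders")
    .where("amount > 100")
    .where("region = 'EMEA'")
    .select("customer_id", "amount")
    .to_sql()
)
```

A design like this lets the SQL engine do the actual work, so Python users and SQL users end up running the same kind of query.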