Dremio launches data lake service running on AWS cloud

Dremio today launched a cloud service that provides a data lake built on an in-memory SQL engine, which runs queries against data stored in an object-based storage system.

The goal is to make it easier for organizations to take advantage of the data lake, dubbed Dremio Cloud, without having to employ an internal IT team to manage it, said Tomer Shiran, chief product officer for Dremio. An organization can now start accessing Dremio Cloud in as little as five minutes, he said.

Based on Dremio's existing SQL Lakehouse platform, the Dremio Cloud service runs on the Amazon Web Services (AWS) public cloud. It provides all the benefits of a data warehouse on a platform that employs an object-based storage system to reduce the total cost of building a data lake, noted Shiran.

Building the Dremio Cloud

Dremio Cloud is based on a microservices architecture that includes a service mesh to make infrastructure resources available on-demand via the Dremio Cloud control plane. As a result, customers incur no Dremio or AWS costs when the platform is idle, said Shiran.

That approach also eliminates the need to aggregate tables, extract data, or employ a separate online analytical processing (OLAP) cube to structure data in a way that is compatible with SQL, he added. Nor do organizations need to copy data stored in an object-based storage system into a proprietary data warehouse to give SQL-based applications access to it, Shiran said.

Data is encrypted both at rest and in transit using key management tools that ensure secure communication between the clients, control plane, and data plane. Role-based access controls (RBAC) enable companies to define privileges on every dataset and object in the system. In addition, companies can invoke existing user and group definitions in Dremio using identity management platforms such as Okta to enforce zero-trust security policies, said Shiran. Dremio Cloud has already achieved SOC 2 compliance, he added.

Dremio recently launched a Dart Initiative to improve the performance of SQL queries by a factor of five over the next 12 months with proprietary acceleration technologies it has developed. At the core of that effort is Gandiva, a toolkit that enables vectorized execution on modern processors using the in-memory buffers within Apache Arrow, an open source columnar data format Dremio co-created.

The company also maintains physically optimized representations of source data known as Data Reflections. The query optimizer can then accelerate a query by using one or more Data Reflections to partially or entirely surface query results without having to process raw data for every query launched.
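The general idea of answering a query from a precomputed representation rather than from raw data can be sketched in a few lines of Python. This is only an illustration of the concept, not Dremio's implementation: the sample rows, `build_reflection`, and `total_for_region` are hypothetical names invented for this sketch.

```python
from collections import defaultdict

# Hypothetical raw data: (region, amount) rows standing in for files in object storage.
RAW_ROWS = [
    ("east", 120.0),
    ("west", 80.0),
    ("east", 30.0),
    ("north", 50.0),
]


def build_reflection(rows):
    """Precompute SUM(amount) per region once, ahead of query time."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)


def total_for_region(region, rows, reflection=None):
    """Answer SUM(amount) WHERE region = ?; returns (result, raw_rows_scanned)."""
    if reflection is not None:
        # A matching precomputed representation exists: no raw rows are scanned.
        return reflection.get(region, 0.0), 0
    scanned = 0
    total = 0.0
    for row_region, amount in rows:
        scanned += 1
        if row_region == region:
            total += amount
    return total, scanned


if __name__ == "__main__":
    print(total_for_region("east", RAW_ROWS))
    print(total_for_region("east", RAW_ROWS, build_reflection(RAW_ROWS)))
```

Both calls return the same total, but the second does so without touching the raw rows, which is the trade-off such precomputed representations make: extra storage and maintenance in exchange for cheaper queries.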

Dremio also provides support for query plan caching, which eliminates both overhead and latency for repeated queries. It also offers a high-performance compiler that enables much larger and more complex SQL statements while employing machine learning algorithms to reduce the compute resources required to run SQL queries. Cloud storage read operations make up 30% to 60% of query execution costs in some workloads, Dremio says, and the company is reducing the amount of data read from cloud object storage by enhancing the scan filter pushdown capabilities it provides.
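To make the scan filter pushdown idea concrete, here is a minimal sketch in Python. It assumes data is stored in chunks that each carry minimum and maximum statistics, as columnar file formats commonly do, and it skips any chunk that cannot contain a match. The `RowGroup` class, `scan_greater_than` function, and sample values are hypothetical and do not reflect Dremio's actual engine.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical chunk of stored data with min/max statistics."""
    min_value: int
    max_value: int
    values: list


def scan_greater_than(row_groups, threshold):
    """Return (matching values, number of row groups actually read)."""
    matches = []
    groups_read = 0
    for group in row_groups:
        # Pushdown step: the statistics show no value can exceed the
        # threshold, so this chunk is never fetched from storage.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


ROW_GROUPS = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]

if __name__ == "__main__":
    print(scan_greater_than(ROW_GROUPS, 18))
```

With a threshold of 18, the first chunk is skipped entirely, so only two of the three chunks are read. The more selective the filter, the more storage reads this kind of check can avoid.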

Making data lakes simpler

While the concept of a data lake has been around for some time now, many organizations have faltered when it comes to deploying one because managing petabytes of data has proven too challenging. A data lake based on Hadoop, for example, often quickly became a data swamp as more data was added. "Data teams are in a tough spot," said Shiran.

Dremio is addressing that issue by embedding a range of SQL acceleration and data management tools within its platform to optimize queries across a data lake based on object-storage systems that are readily available in cloud computing environments. The challenge now is convincing organizations that have historically relied on a traditional data warehouse to reconsider a data lake approach based on a platform that promises to make it simpler to access petabytes of data in the cloud.
