Apache Software Foundation updates Drill for broader SQL queries


The Apache Software Foundation (ASF) this week updated Apache Drill, an open source tool that enables end users to query multiple data sources using SQL without waiting for enterprise IT teams to create schemas and set up pipelines.

End users can download Drill 1.19 to launch queries against Apache Cassandra, Elasticsearch, and Splunk platforms. They can also query XML files and REST application programming interfaces (APIs), with no schema required.

Other additions include support for the Avro format in the plugin for the Apache Kafka messaging platform; integration with Apache Airflow software for managing workflows; integrated password vaults to secure credentials; and support for Linux ARM64 systems.

Trajectory

Apache Drill first emerged as a SQL-based query engine designed to enable end users to interrogate data stored in NoSQL Apache Hadoop platforms. Since then, the number of data sources has steadily increased to the point that end users are employing the tool to interrogate data wherever it resides, said Charles Givre, vice president of Apache Drill and CEO of DataDistillr, a provider of SQL query tools based on Apache Drill.

That's critical because organizations struggle to aggregate all their data within a single data warehouse, Givre added. "It's practically impossible to get all your data in a data lake," he said.

Just as problematic, there's usually a significant time delay between when new data is created by an application and when that data becomes available in a data warehouse or data lake, Givre said. But Apache Drill makes it easier to launch SQL queries against the freshest set of data available, regardless of where it resides, he said.

In some cases, data science teams are setting up complex processes to analyze datasets when they could accomplish the same tasks more easily using Apache Drill to join two or more datasets without having to ever move any data, he added.
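
As a rough illustration of the kind of cross-source join Givre describes, the sketch below assembles a single SQL statement that joins a MongoDB collection to a JSON file on disk. The storage plugin names (`mongo`, `dfs`), the table and file paths, and the column names are all hypothetical placeholders, and the helper function is not part of Drill; it only shows the shape such a query might take.

```python
def build_join_query(left_table, right_table, key_left, key_right, columns):
    """Compose a SQL join across two plugin-qualified tables.

    All table and column names passed in are illustrative assumptions,
    not names taken from a real deployment.
    """
    select_list = ", ".join(columns)
    return (
        f"SELECT {select_list} "
        f"FROM {left_table} AS l "
        f"JOIN {right_table} AS r "
        f"ON l.{key_left} = r.{key_right}"
    )


# Hypothetical sources: a MongoDB collection and a JSON file read in place.
query = build_join_query(
    left_table="mongo.sales.orders",
    right_table="dfs.`/data/customers.json`",
    key_left="customer_id",
    key_right="id",
    columns=["l.order_id", "r.name"],
)
print(query)
```

Neither dataset is copied anywhere in this sketch; each side of the join is addressed where it already lives, which is the point Givre is making.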

How it works

Apache Drill is designed to be deployed either on a single laptop or across a 1,000-node cluster that is processing trillions of records. It uses a JavaScript Object Notation (JSON) data model to eliminate the need to define schemas beforehand or normalize data. Beyond Hadoop, it's compatible with Apache HBase, MongoDB, Elasticsearch, Cassandra, REST APIs, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, and network-attached storage (NAS), among other data sources. Apache Drill is also designed to be integrated with business intelligence tools, such as Apache Superset, Tableau, MicroStrategy, QlikView, and Excel.
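
To make the schema-free approach more concrete, here is a minimal sketch of how a script might submit a query to a Drill instance over its REST interface. It assumes a Drill instance listening locally on port 8047 and a hypothetical file at `/tmp/events.json`; the helper name `build_drill_request` is invented for this example. The code only constructs the HTTP request, so it runs without a Drill server.

```python
import json
import urllib.request


def build_drill_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request carrying a SQL query.

    The base URL is an assumption: a Drill instance running locally
    with its web port left at 8047.
    """
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Query a JSON file directly; no table definition or schema is declared first.
# The file path is a hypothetical placeholder.
request = build_drill_request("SELECT * FROM dfs.`/tmp/events.json` LIMIT 5")

# Sending it requires a running Drill instance, for example:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)["rows"]
```

The query names the file itself in the FROM clause, so there is no separate step in which a table or schema is defined before the data can be read.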

IT organizations have for some time been trying to strike a balance between centrally managing data and enabling end users to interactively query data as they see fit. In many cases, end users have gotten around IT departments by setting up their own platforms and query tools. Beyond the governance issues that creates, the data a business unit is employing to make decisions is usually out of sync with the data the rest of the business relies on.

Most enterprise IT teams don't have the political capital required to ban business units from using a given tool, however. Instead, Givre said they should focus on striking a balance between end users' need to easily query data as it becomes available and the need to manage terabytes of historical data that might reside in a data warehouse.

Regardless of the path organizations opt for when it comes to managing data, the number of tools and platforms for querying data is continuing to explode. The issue now is determining to what degree organizations should restrict end users to the tools their IT teams have sanctioned.
