Amazon taps Hadoop-based Impala to speed up customers' big data queries

Above: Werner Vogels, chief technology officer of

Image Credit: Andrew Mager/Flickr

Since 2009, Amazon Web Services has offered customers the capability to run queries on large data sets with an open-source tool called Hive, although since then many of the largest, hippest companies on the Web have adopted their own tools to do this sort of thing more quickly.

Now many other companies will be better able to keep up. While cloud observers were talking about Dell partnerships yesterday, Amazon announced support for Impala, one of the few tools out there that move considerably faster than Hive.

Impala is an interactive query engine for data that sits in servers running the Hadoop Distributed File System, an open-source program for storing large quantities of data and keeping it available. The engine was developed by Hadoop distribution vendor Cloudera, which saw that Google was thinking up a somewhat similar technology.

As is the case with Hive, Impala supports a widely known SQL-style query language, meaning that users need not learn any special commands to use it.

In bolstering its Elastic MapReduce product with Impala support, Amazon is racing ahead of competitors to make it easier to handle big supplies of data with computers available in a public cloud. Rackspace, Microsoft, and IBM have all added support for Hadoop this year, and now Amazon is taking a big step ahead.

Amazon’s moves with Impala come a month after the company announced Kinesis, a big data service that takes data from multiple sources and immediately shoots it over to other Amazon cloud services, so applications can incorporate real-time data.

More generally, Amazon continues to add flavors of computing, enhance existing services, and lower prices. All these kinds of steps help keep Amazon ahead in the public cloud, while others are pushing hard to grow their own clouds.

Now Amazon customers can use Impala to load data into business-intelligence software, and to pick up on trends, much more quickly than they could with Hive. One thing to watch out for, though, is that Impala is more memory-intensive than Hive.

Facebook went off on its own and built a tool that could run interactive queries on data at its massive scale. Last month it released the engine, Presto, under an open-source license.

Engineers could stitch together such open-source tools with their existing data sources, but the process could take time and effort. By supporting Impala, Amazon is once again attempting to make life easier, and raising the bar for other public cloud providers.
