At the AWS Summit in San Francisco today, public cloud infrastructure provider Amazon Web Services (AWS) announced the launch of Redshift Spectrum, an extension of AWS’ Redshift managed data warehousing service that enables querying on data that sits inside of the longstanding AWS S3 storage service.
The introduction of Redshift Spectrum will make certain types of queries on data more economical, because Redshift, which includes computing and storage capabilities, is a more complex and costly service especially for number crunching on lots of data.
As an example, Amazon chief technology officer Werner Vogels provided an example query that would run on an exabyte of data with the Apache Hive open-source data querying software. It would take five years and 1,000 nodes — that is, it would be quite expensive. With Spectrum it would take 155 seconds and a few hundred dollars, Vogels said.
“When you issue a query, Redshift rips it apart and generates a query plan that minimizes the amount of S3 data that will be read, taking advantage of both column-oriented formats and data that is partitioned by date or another key,” AWS chief evangelist Jeff Barr wrote in a blog post. “Then Redshift requests Spectrum workers from a large, shared pool and directs them to project, filter, and aggregate the S3 data. The final processing is performed within the Redshift cluster and the results are returned to you.”
Already Docomo, Time, Edmunds, Redfin, and Yelp are using AWS Redshift Spectrum, he said.
Competitors include startup Snowflake’s cloud data warehouse. Microsoft Azure and IBM’s public cloud also offer data warehousing services.
AWS introduced Amazon Redshift in 2012. S3 itself dates to 2006.