Facebook tops itself with an even faster tool for querying big data in Hadoop

Updated at 1 p.m. Pacific time on Nov. 7.

Just when big data vendors got used to Hive, the Facebook-created open-source tool for querying big datasets on Hadoop, here comes an even faster alternative.

Called Presto, the new tool also comes from Facebook -- and, like Hive, it too has now been released under an open-source license, a few months after it was publicly disclosed at a Facebook conference.

You'd think that a new, faster, open-source tool would be cause for celebration, right? Or at least you'd expect companies commercializing Hive and similar tools to stop what they were doing and immediately start supporting Presto, or even building on what they have.

Not exactly. Existing open-source interactive querying engines are plenty fast, said executives from two companies offering support for parts of the Hadoop ecosystem. Still, the companies could glean a few things from Presto.

After all, Facebook is a heavy-duty user of Hadoop, a family of open-source technologies that includes a file system well suited for large data sets and several analytical tools. Hive is among the most popular of those tools, enabling users to ask questions of data in the Hadoop Distributed File System with a modified version of the well-established SQL query language. Facebook pioneered Hive and open-sourced it in 2008.

But Hive, relying as it does on the powerful but generally slow batch-processing system MapReduce, is not the ideal program for multiple users to scan across an ever-growing data warehouse. It's not fast enough. So Facebook engineers began developing Presto, even as Cloudera was building a new query engine from the ground up called Impala. (An earlier version of this story called Impala "a sped-up version of Hive," but a Cloudera spokesperson informed us that the technology is a Hive substitute with differences that provide it with certain "performance, security, and ANSI SQL capabilities.") A few months later, Hortonworks said it would accelerate Hive in new versions.

It turns out Presto isn't just something Facebook analysts have been using. The new Presto website shows use of the technology by two well-known companies that have taken on plenty of venture capital and could conceivably pay money to get support for similar products: Airbnb and Dropbox.

Christopher Gutierrez, Airbnb's manager of online analytics, provides a quote suggesting certain advantages over Amazon Web Services' Redshift data warehouse service. And Fred Wulff, a Dropbox software engineer, is quoted as saying Presto has been "rock solid and extremely fast when applied to some of our most important ad hoc use cases."

One would think rhetoric like that might make Hadoop distribution vendors tremble out of fear that companies would just bypass the open-source-with-support option and go directly for Presto.

But Dave McJannet, the vice president of marketing at Hortonworks, didn't sound nervous about the early interest in Presto. Now, if staid enterprises start clamoring for it, that would be a different story.

"Our whole approach is about ensuring 100 percent open-source Hadoop is enterprise-grade for everybody," McJannet said. "If and when, you know, commercial enterprises show interest in these new and emerging technologies, we'll absolutely investigate the potential and include them in our distribution, because that is very consistent with our approach."

In the latest version of its Hortonworks Data Platform, or HDP, Hortonworks supports version 0.12 of Hive. Hortonworks has been working on vastly speeding up Hive from where it was in version 0.10, as part of the company's Stinger project.

In the future, Hortonworks could integrate Presto with other pieces of the Hadoop ecosystem and offer support in the next HDP release. A similar evolution happened with Storm for stream processing, Hive for SQL querying, and Pig for scripting, McJannet said.

Then again, Presto could turn out to be something only webscale companies will want to use, in which case Hortonworks could leave it alone.

As for Cloudera, its engineers made a lot of the same decisions as Presto's architects when they were designing the Impala interactive query engine, said Cloudera's vice president of products, Charles Zedlewski. Like Impala, Presto doesn't use MapReduce, supports queries in good-old SQL style, and aims to be flexible in terms of how others store data, Zedlewski said. Plus, both engines enable lots of queries to run at the same time.

Cloudera could bring certain elements of Presto into Impala, which continues to gain new features, Zedlewski said.

One aspect of Impala that's important to enterprise customers is compatibility with most of the business-intelligence tools on the market today, Zedlewski said, and that could be one strength Impala has over Presto in its current form. It's unclear how well Presto can tie in with such software at this point.

And with more users querying data in Hadoop, Cloudera found questions about security -- such as who gets to access what data -- cropping up more often, prompting the company to release features that grant administrators fine-grained control. "The Presto team is going to run into the same issue," Zedlewski said.

But even with those potential shortcomings, Presto could have an advantage over existing SQL-on-Hadoop tools out there, and it's something only a company with as much analytical data as Facebook could have managed: the ability to run simultaneous queries on many, many petabytes of data. The Presto site states that "Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse." Without a doubt, that's big data territory.

"Presto is, I think, different in some ways, because right from the outset it's claiming that it's (capable of querying) petabytes," said Ben Lorica, chief data scientist at O'Reilly Media. Hadoop distribution vendors might say their products can handle petabytes, but really the optimal use case might be querying hundreds of terabytes.

Perhaps that's what Cloudera, Hortonworks, and other vendors will want to add to their offerings. "Maybe if that proves to be Presto's winning feature, I think they'll try to figure it out," he said.

The vendors could go still further if they want to. In a Sunday blog post, Lorica wrote that Facebook will incorporate into Presto a query engine under development called BlinkDB. That query engine "will tell you, 'How fast do you want this back? If you want it fast, I'll give you an approximate answer,'" Lorica said.

Such functionality could make some Presto users even more productive, assuming they're OK with rough answers. If the idea appeals to lots of enterprises, Hadoop vendors might end up supporting Presto after all.
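
To make the speed-for-accuracy trade-off Lorica describes concrete, here is a minimal Python sketch of one common way to get an approximate answer: compute an aggregate over a random sample instead of the full dataset, and report an error margin alongside it. This is an illustration only, not how BlinkDB or Presto is actually implemented; the function name `approximate_mean`, the `sample_fraction` parameter, and the toy dataset are all assumptions made for the example.

```python
import math
import random
import statistics


def approximate_mean(values, sample_fraction, seed=0):
    """Estimate the mean of `values` from a uniform random sample.

    Hypothetical helper for illustration; not part of any real query engine.
    Returns (estimate, margin), where margin is a rough 95% error bound.
    """
    population = len(values)
    if population < 2:
        raise ValueError("need at least two values")
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")

    # A smaller fraction means less data to scan, so a faster answer.
    n = max(2, min(population, round(population * sample_fraction)))
    rng = random.Random(seed)
    sample = rng.sample(values, n)  # uniform sampling without replacement

    estimate = sum(sample) / n
    # Standard error of the mean, with a finite-population correction
    # so the margin shrinks to zero when the whole dataset is scanned.
    correction = math.sqrt((population - n) / (population - 1))
    margin = 1.96 * statistics.stdev(sample) / math.sqrt(n) * correction
    return estimate, margin


# Toy "table" standing in for a large dataset (assumed data).
values = list(range(100_000))

fast, fast_margin = approximate_mean(values, 0.01)    # scan about 1% of rows
exact, exact_margin = approximate_mean(values, 1.0)   # scan every row

print(f"approximate: {fast:.1f} +/- {fast_margin:.1f}")
print(f"exact:       {exact:.1f} +/- {exact_margin:.1f}")
```

Scanning 1 percent of the rows gives an answer with a nonzero error margin, while scanning everything gives the exact mean with a margin of zero. A user who is fine with a rough answer can choose the cheaper query.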
