Pinterest has big data, and it knows how to share

Pinterest is no National Security Agency, but the company, which identifies itself as a "visual discovery tool," has grown into a collector of plenty of information. Like Twitter, Facebook, Google, and other web giants, Pinterest has developed sophisticated systems for storing that data, but it's also built a tool that lets lots of employees get at it.

In a blog post today, Pinterest data engineer Mohammad Shahangian sheds light on the "self-serve platform" he and his colleagues have created for accessing data in Pinterest's Hadoop clusters, which sit in the Amazon Web Services public cloud.

That storage system "enables us to put the most relevant and recent content in front of users through features such as Related Pins, Guided Search, and image processing," Shahangian wrote. "It also powers thousands of daily metrics and allows us to put every user-facing change through rigorous experimentation and analysis."

But the team's self-serve tool is much more than just the widely used Hadoop open-source technology for storing and analyzing lots of different kinds of data. It's the sort of thing other companies might want to try out, so that more employees in more departments can use data to improve products and make smarter decisions. That concept has gained credence as startups like Platfora and Trifacta have gotten funding while seeking to simplify various stages of the Hadoop data analysis workflow.

Thanks to the efforts of Shahangian and his team, different people at Pinterest can create Hadoop clusters for different needs. That way precious Pinterest data scientists can focus on things other than just getting data out of Hadoop for their colleagues.

"While it’s possible to scale a single Hadoop cluster horizontally, we’ve found that a) getting perfect isolation/elasticity can be difficult to achieve and b) business requirements such as privacy, security and cost allocation make it more practical to support multiple clusters," Shahangian wrote.

And even though companies can pay for services that set up Hadoop clusters, those services don't always fit the needs of a company that's frequently adding features and moving into more and more countries.

"Trying to use EMR [Amazon's Elastic MapReduce service] out of the box and install all of those things, and making sure all of those things are available, is not easy," Shahangian said in an interview with VentureBeat.

Hadoop jobs run through startup Qubole's Hadoop-as-a-service product on the Amazon cloud, Shahangian wrote in the blog post.

Oh, and if you're wondering how much data Pinterest is dealing with, the company currently throws in 20 TB of new data every day, and about 10 PB of data are in Amazon's S3 service for persistent storage.
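To put those two figures side by side, here is a rough back-of-envelope sketch. It assumes decimal units (1 PB = 1,000 TB) and a flat ingest rate, neither of which Pinterest has stated, and the function names are purely illustrative.

```python
# Back-of-envelope only: assumes decimal units (1 PB = 1,000 TB) and a
# constant ingest rate. Neither assumption comes from Pinterest.
DAILY_INGEST_TB = 20   # "20 TB of new data every day"
STORED_PB = 10         # "about 10 PB" in S3
TB_PER_PB = 1_000


def days_to_accumulate(stored_pb: float, daily_tb: float) -> float:
    """Days needed to reach stored_pb at a flat rate of daily_tb per day."""
    return stored_pb * TB_PER_PB / daily_tb


def yearly_ingest_pb(daily_tb: float) -> float:
    """Petabytes added over 365 days at a flat rate of daily_tb per day."""
    return daily_tb * 365 / TB_PER_PB


if __name__ == "__main__":
    print(days_to_accumulate(STORED_PB, DAILY_INGEST_TB))  # 500.0
    print(yearly_ingest_pb(DAILY_INGEST_TB))               # 7.3
```

In other words, at today's rate the current 10 PB would take roughly 500 days to pile up, and a year of ingest would add about 7.3 PB. The real history almost certainly looks different, since the daily volume has presumably grown over time.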

And over time, Pinners have been processing more and more data in Hadoop.

"The engineering team is focused on open sourcing many of our technologies this year (like Secor), and we hope to do the same for this, but don't have a specific date," a Pinterest spokeswoman told VentureBeat in an email.

Read the blog post for more detail on Pinterest's self-serve tool for handling big data.
