In a big data-themed talk on its campus, Facebook revealed its latest infrastructure project. Codenamed Prism, the project aims to solve one of the biggest problems Facebook has faced at its massive scale: how to make server clusters operate as a unit even when they’re geographically distributed.
“With 950 million users, every problem is a big data problem,” said Facebook infrastructure VP Jay Parikh in a press-only meeting. “And one of the big challenges … is with MapReduce.”
MapReduce is a model for processing large data sets using clusters of servers and distributed computing; one of its most common implementations is Apache Hadoop. In the beginning, MapReduce was a great way for Facebook to handle its huge body of user and site performance data.
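For readers new to the model, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is an illustration only: it is not Hadoop code and not anything Facebook showed, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are our own. In a real cluster, the map and reduce calls would run in parallel on many servers.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for each word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this across the network between servers;
    here it is just an in-memory dictionary.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over a list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

The shuffle step is the part that moves data between machines, which gives a rough sense of why the distance between servers matters so much to this model.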
“But as we got more data and servers over the years, we said, ‘Oh, crap, this isn’t going to fit in our data center. We’re running out of space,’” Parikh said. “We updated the drives to 2TBs, 3TBs,” but that kind of solution wasn’t going to work for long.
“Ultimately, we were physically limited,” Parikh continued.
“One of the big limitations in Hadoop today is, for the whole thing to work, the servers have to be next to each other. They can’t be geographically disperse. … The whole thing comes crashing to a halt.”
And that’s exactly the challenge Facebook is tackling with Prism.
“Project Prism allows us to take this monolithic warehouse … and actually physically separate the warehouse but still maintain the logical single view of the data,” Parikh said.
Prism, he said, “basically institutes namespaces, allowing anyone to access the data regardless of where the data actually resides. … We can move the warehouses around, and we get a lot more flexibility and aren’t bound by the amount of power we can wire up to a single cluster in a data center.”
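Facebook has not published how Prism works, so the following is only a toy sketch of the general idea Parikh describes: a logical name that stays stable while the physical location behind it changes. Everything here is hypothetical, including the `NamespaceCatalog` class, its methods, and the site labels; none of it reflects Prism's actual design.

```python
class NamespaceCatalog:
    """Toy catalog mapping logical dataset names to physical sites.

    Hypothetical illustration only; not based on Prism's real design.
    """

    def __init__(self):
        # logical name -> set of site labels that hold a copy
        self._locations = {}

    def register(self, name, site):
        """Record that `site` holds a copy of the dataset `name`."""
        self._locations.setdefault(name, set()).add(site)

    def move(self, name, source, destination):
        """Relocate one copy; the logical name callers use is unchanged."""
        sites = self._locations[name]
        if source not in sites:
            raise ValueError(f"{name!r} has no copy at {source!r}")
        sites.remove(source)
        sites.add(destination)

    def resolve(self, name):
        """Return the sites currently holding `name`, in sorted order."""
        return sorted(self._locations.get(name, ()))


if __name__ == "__main__":
    catalog = NamespaceCatalog()
    catalog.register("warehouse/user_events", "site-a")
    catalog.register("warehouse/user_events", "site-b")  # a second copy
    catalog.move("warehouse/user_events", "site-a", "site-c")
    print(catalog.resolve("warehouse/user_events"))
```

A caller only ever asks for `warehouse/user_events`; which data center answers is a detail of the catalog, which is the kind of flexibility Parikh says the namespaces provide.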
And with Prism freeing up more room for Facebook’s big data to grow, Parikh said, “Things will automatically replicate; we can restore things; we can have multiple copies.”
We understand these tidbits of information are still quite vague; the Facebook team is still cooking up documentation for the project and will be making an official engineering blog post soon.
But one question is already answered: Prism will probably be open-sourced sometime soon. That’s just part of Facebook’s version of “the hacker way.”
“Given the other things we’ve done, we want to open source this stuff. These are the next scaling challenges other folks are going to face,” Parikh said.
Stay tuned for more big data news from Facebook.
Image of Facebook’s Prineville data center courtesy of Jolie O’Dell, Flickr