It’s unlikely that you, Dear Reader, will ever experience the big-data challenges or infrastructure demands of a Facebook-scale, billion-user software platform. But if you do, you’ll be happy to know that the company is sharing a sip of its secret sauce today.
Called Corona, that sip is a more efficient way to handle scheduling for Apache Hadoop MapReduce jobs.
Facebook has been Hadoop/MapReduce-powered for some time now, “and that served us well for several years,” writes the Corona team today on the company’s dev blog. “But by early 2011, we started reaching the limits of that system. … It was pretty clear that we would ultimately need a better scheduling framework.”
The team briefly flirted with YARN as an alternative, but there were too many incompatibilities with Facebook’s version of HDFS, and the team was fairly sure YARN couldn’t handle Facebook-scale workloads.
So the Facebook infrastructure team started working on a MapReduce scheduler that would scale better, be easier to upgrade, have lower latency for small jobs, make better use of clusters, and schedule “based on actual task resource requirements rather than a count of map and reduce tasks.”
The result is Corona, a push-based scheduling framework that separates cluster resource management from job coordination and continuously tracks the cluster’s nodes and free resources to keep latency low.
And if that sentence totally made sense to you, you can check out Corona on GitHub now — that’s the version Facebook is currently running in production.
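And if it didn’t, a toy example may help. What follows is a minimal Python sketch of the general idea, not Corona’s actual code: every name in it (`ClusterManager`, `Node`, `Request`, `on_grant`) is made up for illustration, and the first-fit placement is an assumption chosen for brevity. It shows the two points from the team’s description: the manager only tracks nodes and their free resources, and it pushes a grant to the job as soon as a request fits, sizing the decision on the CPU and memory a task asks for rather than on a count of slots.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    name: str
    free_cpus: int
    free_mem_mb: int


@dataclass
class Request:
    job_id: str
    cpus: int
    mem_mb: int


class ClusterManager:
    """Hypothetical resource tracker; it knows nothing about job progress."""

    def __init__(self, on_grant: Callable[[str, str], None]) -> None:
        self.nodes: Dict[str, Node] = {}
        self.pending: List[Request] = []
        self.on_grant = on_grant  # called as on_grant(job_id, node_name)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node
        self._schedule()

    def request(self, req: Request) -> None:
        self.pending.append(req)
        self._schedule()  # try to place immediately instead of waiting for a poll

    def release(self, node_name: str, cpus: int, mem_mb: int) -> None:
        node = self.nodes[node_name]
        node.free_cpus += cpus
        node.free_mem_mb += mem_mb
        self._schedule()  # freed resources may unblock a waiting request

    def _find_fit(self, req: Request) -> Optional[Node]:
        # First-fit on actual resource needs (an assumption of this sketch).
        for node in self.nodes.values():
            if node.free_cpus >= req.cpus and node.free_mem_mb >= req.mem_mb:
                return node
        return None

    def _schedule(self) -> None:
        waiting, self.pending = self.pending, []
        for req in waiting:
            node = self._find_fit(req)
            if node is None:
                self.pending.append(req)  # keep waiting
                continue
            node.free_cpus -= req.cpus
            node.free_mem_mb -= req.mem_mb
            self.on_grant(req.job_id, node.name)  # push the grant to the job


def demo() -> List[Tuple[str, str]]:
    grants: List[Tuple[str, str]] = []
    cm = ClusterManager(on_grant=lambda job, node: grants.append((job, node)))
    cm.add_node(Node("node-a", free_cpus=4, free_mem_mb=8192))
    cm.request(Request("small-job", cpus=1, mem_mb=1024))  # fits right away
    cm.request(Request("big-job", cpus=4, mem_mb=4096))    # must wait for CPUs
    cm.release("node-a", cpus=1, mem_mb=1024)              # small-job finished
    return grants
```

Calling `demo()` grants the small job at once and the big job the moment the small job’s resources are released, with no polling interval in between. In this toy version the job side is just a callback; in a real system the job’s own coordinator would receive the grant and launch its tasks.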
“Corona has allowed us to achieve our initial goals of greater scalability, lower latency, no-downtime upgrades, and better resource management,” the team concluded.
“It has also helped us achieve better scheduling fairness, faster job restartability, a cleaner code base, and the ability to integrate with other systems for scheduling.”
Facebook plans to keep upgrading Corona, since it’s now a core part of the company’s infrastructure.