Netflix explains outage, wants to hire engineers who can help prevent a repeat

(Updated 7/9/2012 with comment from Netflix engineer Ariel Tseitlin.)

Netflix has published an explanation of the Amazon EC2 outage that took down the video-streaming service for hours on June 29 and 30.

In addition to Netflix, the Amazon outage took down Instagram, Pinterest, and other sites that rely on Amazon's cloud-based computing service. On July 4, Amazon published its own explanation of the snafu, which started with a power failure, continued with a failure of the backup generators, and ended in a series of cascading failures in its load-balancing hardware.

Netflix's analysis found that the company, too, had problems with its own load-balancing service. "This caused unhealthy instances to fail to deregister from the load-balancer which black-holed a large amount of traffic into the unavailable zone," Netflix engineers Greg Orzell and Ariel Tseitlin wrote in the post.
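To see why a missed deregistration "black-holes" traffic, consider a toy simulation. This is only an illustrative sketch, not Netflix's or Amazon's actual load-balancing code: the `Instance` and `LoadBalancer` classes, the zone names, and the simple round-robin policy are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    zone: str
    healthy: bool = True


class LoadBalancer:
    """Hypothetical round-robin balancer, for illustration only."""

    def __init__(self, instances):
        self.registered = list(instances)
        self._next = 0

    def deregister_unhealthy(self):
        # Health-check sweep: drop instances that are no longer healthy.
        self.registered = [i for i in self.registered if i.healthy]

    def route(self):
        # Send one request to the next registered instance.
        # Returns True if that instance could serve it, False if it was lost.
        instance = self.registered[self._next % len(self.registered)]
        self._next += 1
        return instance.healthy


def simulate(deregistration_works, requests=1000):
    # Two instances in each of two made-up zones.
    fleet = [Instance(f"a{i}", "zone-a") for i in range(2)]
    fleet += [Instance(f"b{i}", "zone-b") for i in range(2)]
    balancer = LoadBalancer(fleet)

    # Simulate zone-a becoming unavailable.
    for instance in fleet:
        if instance.zone == "zone-a":
            instance.healthy = False

    # In the failure case described above, this sweep never takes effect.
    if deregistration_works:
        balancer.deregister_unhealthy()

    served = sum(balancer.route() for _ in range(requests))
    return served / requests


if __name__ == "__main__":
    print("served with deregistration:   ", simulate(True))
    print("served without deregistration:", simulate(False))
```

In this sketch, when the dead instances stay registered, the balancer keeps handing them their share of requests, so the fraction of traffic lost matches the fraction of the fleet sitting in the unavailable zone.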

Despite the "storm" of failures, Netflix remains confident about its decision to move to cloud-based services in late 2010. The company's resiliency to cloud failures has actually gotten better, the post argues.

"While it’s easy and common to blame the cloud for outages because it’s outside of our control, we found that our overall availability over the past several years has steadily improved," Orzell and Tseitlin wrote.

However, the company is looking for help. If you're an experienced cloud operations engineer, Netflix wants to see your résumé, as it is continuing to hire people for its Cloud Operations and Reliability Engineering team.

"We're still actively beefing up our reliability engineering team...always looking for good people," Tseitlin tweeted to VentureBeat.

Story via TNW.

Photo credit: Arthur40A via photo pin
