Today, Facebook is showing off an internal tool it built to help it quickly find and fix caching issues.
Engineers can use Claspin*, built by Facebooker Sean Lynch, to scan and suss out performance problems from TCP retransmits to timeouts using a heatmap that makes troubleshooting much simpler.
Lynch noted in a blog post today that when he first came to the company, Facebook was working with two caching systems, Memcache and TAO, with thousands of charts and an array of dashboards within the company’s operations data store.
“This worked well at first, but as Facebook grew both in size and complexity, it became more and more difficult to figure out which piece was broken when something went wrong,” said Lynch.
So the engineer thought of heatmaps as a better way to visualize the data, with each square in the heatmap representing a host and with racks grouped together. The heatmap shows green squares for fully functioning hosts and red squares for hosts with problems. Yellow squares show a metric heading into a less-than-optimal range.
Also, said Lynch, the visual display relates to the racks’ physical layout. “The rack names naturally sort by datacenter, then cluster, then row, so problems common at any of these levels are readily apparent.”
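To see why sorted rack names keep related hardware next to each other, consider the rough Python sketch below. The naming scheme (datacenter, cluster, row, then rack, with zero-padded fields) and the host names are hypothetical, not Facebook's actual convention. The sketch only shows that a plain alphabetical sort on such names groups hosts by datacenter, then cluster, then row.

```python
from collections import defaultdict

# Hypothetical (host, rack) pairs. Rack names are assumed to encode
# datacenter, cluster, and row, in that order, with zero-padded numbers.
hosts = [
    ("host-a", "dc2-c1-r03-k07"),
    ("host-b", "dc1-c2-r01-k02"),
    ("host-c", "dc1-c1-r02-k05"),
    ("host-d", "dc1-c1-r02-k05"),
    ("host-e", "dc1-c1-r01-k09"),
]

def group_by_rack(pairs):
    """Return a list of (rack, [hosts]) tuples sorted by rack name."""
    racks = defaultdict(list)
    for host, rack in pairs:
        racks[rack].append(host)
    # Sorting the rack names alphabetically orders them by datacenter,
    # then cluster, then row, because of the assumed naming scheme.
    return [(rack, sorted(racks[rack])) for rack in sorted(racks)]

layout = group_by_rack(hosts)
for rack, members in layout:
    print(rack, members)
```

Laid out in this order, a failure affecting a whole row or cluster would show up as a contiguous block of squares rather than scattered dots.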
With Claspin up and running, Lynch continued, “On a 30-inch screen, we could easily fit 10,000 hosts at the same time, with 30 or more stats contributing to their color, updated in real time.”
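One plausible way for many stats to feed a single square's color is to let the worst stat win. The Python sketch below illustrates that idea. The stat names, the thresholds, and the worst-stat rule itself are assumptions for illustration, not details confirmed by Lynch's description.

```python
# Hypothetical (warning, critical) thresholds per stat; higher is worse.
THRESHOLDS = {
    "tcp_retransmits": (10, 100),
    "timeouts": (5, 50),
}

# Ordering used to pick the worst color among several stats.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}

def stat_color(name, value):
    """Map one stat value to green, yellow, or red using its thresholds."""
    warn, crit = THRESHOLDS[name]
    if value >= crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"

def host_color(stats):
    """Color a host by its worst stat (an assumed aggregation rule)."""
    colors = [stat_color(n, v) for n, v in stats.items() if n in THRESHOLDS]
    # A host with no known stats defaults to green in this sketch.
    return max(colors, key=SEVERITY.get, default="green")

print(host_color({"tcp_retransmits": 20, "timeouts": 0}))
```

With a rule like this, a host turns yellow or red as soon as any one of its stats crosses a threshold, so a single bad metric is never averaged away by dozens of healthy ones.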
*Fascinating side note: Claspin is named after CLSPN, a protein-coding gene in the human genome that acts as a sensor to monitor the integrity of DNA replication forks, checking for damage and replication.