It just became easier to diagnose runtime performance issues at scale, thanks to Twitter. The tech giant today open-sourced Rezolus, a “high-resolution” telemetry agent designed to uncover anomalies and utilization spikes too brief to be captured by conventional observability and metrics systems. Twitter says it has been running Rezolus in production for over a year and will continue development in the public GitHub repository.
“Rezolus provides a collection of signals to help us make sense of fine-grained runtime behavior. We’ve found it particularly helpful in understanding and optimizing performance,” wrote Twitter staff site reliability engineer Brian Martin in a blog post. “With a single agent, we’re able to get telemetry from a wide range of sources. To our knowledge, no other open source project offers such comprehensive insight in a single package.”
According to Martin, Rezolus arose from an internal need to observe systems performance on a “fine-grained” timescale. Twitter engineers running high-throughput synthetic benchmarks frequently ran into seconds-long performance anomalies, which the company’s existing telemetry solutions failed to reflect because their sampling rate was too low relative to the length of those anomalies. The laws of digital signal processing (specifically, the Nyquist-Shannon sampling theorem) dictate that a system must take at least two samples within the span of the shortest burst in order to accurately reflect that burst’s intensity.
By contrast, Rezolus can precisely measure performance degradation on a fine timescale.
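To make the sampling argument concrete, here is a minimal Python sketch. It is not Rezolus code: the utilization trace, the 20% baseline, and the two-second spike are made-up values chosen purely for illustration. It compares a single once-a-minute reading and a one-minute average against the peak seen when sampling every 100 milliseconds.

```python
# Hypothetical trace: 20% CPU baseline with a 2-second spike to 100%.
# All numbers here are illustrative assumptions, not Twitter's data.
SPIKE_START_MS = 30_000
SPIKE_END_MS = 32_000
WINDOW_MS = 60_000  # one minute


def utilization(t_ms):
    """Return the made-up CPU utilization (percent) at time t_ms."""
    return 100.0 if SPIKE_START_MS <= t_ms < SPIKE_END_MS else 20.0


def sample(period_ms):
    """Read the trace once every period_ms over the one-minute window."""
    return [utilization(t) for t in range(0, WINDOW_MS, period_ms)]


fine = sample(100)       # 10 readings per second (10 Hz)
coarse = sample(60_000)  # a single reading per minute

minutely_average = sum(fine) / len(fine)
peak_at_10hz = max(fine)

print(f"once-a-minute reading: {coarse[0]:.1f}%")
print(f"one-minute average:    {minutely_average:.1f}%")
print(f"peak at 10 Hz:         {peak_at_10hz:.1f}%")
```

In this toy trace the once-a-minute reading misses the spike entirely and the one-minute average barely moves off the baseline, while the 10 Hz samples show the full saturation.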
Rezolus supports a configurable sampling rate or per-minute aggregation, letting developers match the resolution to the length of the spikes they want to catch. Plug-in samplers, which can be toggled on and off, enable it to collect telemetry from a variety of sources, including counters and gauges exposed by the Linux kernel that report CPU usage, network utilization, and disk utilization. Additionally, Rezolus can tap hardware and software performance counters to measure things like the number of cycles per instruction, cache hit rates, and branch predictor performance. The tool also supports eBPF (Extended Berkeley Packet Filter) for kernel instrumentation using kprobes and tracepoints, allowing it to capture metrics like scheduler latency, block IO size distribution, file system latency, and more.
At 10Hz sampling, Rezolus can reflect consecutive bursts running 200 milliseconds or more while requiring no more than 15% processor utilization and 60MB of memory. In one recent incident in which several Twitter products were throttled by a backend service, it revealed bursts of over five times the baseline traffic during which processor utilization hit 100%.
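The 200-millisecond figure follows from the two-samples-per-burst rule: a 10Hz sampler takes a reading every 100 milliseconds, so the shortest burst guaranteed to contain two readings is 200 milliseconds long. The short Python sketch below works through that arithmetic; the function name is hypothetical and is not part of Rezolus.

```python
def shortest_burst_ms(sample_period_ms):
    """Shortest burst (ms) that is guaranteed to span two samples.

    Hypothetical helper for illustration: applies the rule that at least
    two samples must fall within a burst for it to be reflected.
    """
    return 2 * sample_period_ms


period_10hz_ms = 1000 // 10  # 10 samples per second -> 100 ms apart
period_minutely_ms = 60_000  # one sample per minute

print(shortest_burst_ms(period_10hz_ms))      # burst length in ms at 10 Hz
print(shortest_burst_ms(period_minutely_ms))  # burst length in ms at 1/min
```

By the same rule, a system that samples once a minute can only be relied on to reflect bursts lasting two minutes or longer.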
“Open-sourcing Rezolus marks an important milestone for the project,” wrote Martin. “We hope that Rezolus will be useful to others outside of Twitter, and look forward to building a community around it.”