This article was contributed by Nikolay Ganyushkin, CEO and founder of Monq Lab

When tech companies scale up, it’s vital for them to use metrics to monitor growth. But as the volume of data and the complexity of operations increase, the metrics themselves may become a source of confusion. You end up with dozens of indicators, which you start prioritizing and dividing into metrics for business and tech units, and you may lose track of customer needs. In this article, I propose a way to calculate an ideal metric that can reliably reflect the availability of an IT service or object from a business standpoint.

Why do we need yet another metric?

The idea of an “ideal metric” came to me at the early stages of my career, when I noticed that the metrics for the IT unit of a company where I worked didn’t accurately correlate with the situation on our business side. Later I saw more examples of this problem among the IT units of our corporate customers. It made me realize that many KPI and SLA calculations were obscure for IT units, and that they often didn’t let business and tech teams find a common language. I decided to create a single synthetic metric that would be clear and understandable to all parties. In short, this metric should be:

  • Business-oriented. It should show how our IT environment functions, not from the standpoint of server performance, but from the standpoint of what matters to our customers.
  • Comprehensible. It should be easy for both IT staff and managers to interpret unambiguously.
  • Decomposable. The metric structure should let us decompose it into components, or factors, and ideally enable us to do root cause analysis as the output.

There are two ways to derive this metric:

1) To make a separate metric for the availability of each service or object (Service Availability)

2) To build a general health map for the system as a whole, which is more complicated but can be our ultimate goal.

What is the Service Availability metric?

I defined Service Availability as a state of the IT environment where customers of a business want to and are able to use a service, and are satisfied with its quality. This should be a Yes/No or 1/0 metric, since anything in between blurs the picture.

An example: imagine that a customer wants to apply online for insurance, but the system is malfunctioning. The server takes a long time to respond, so the application form keeps resetting or showing errors, and submitting an application takes 30 minutes instead of 10. As a result, the customer goes elsewhere, and the conversion rate in the sales funnel drops. From the engineering point of view, the service is degraded but still available, so things are fine. From the business point of view, a state of the IT environment in which you’re losing hot leads is unacceptable. In that case, Service Availability is equal to 0.

An opposite example: imagine that due to unforeseen circumstances, a data center is completely shut down, but the company’s IT systems switch to backup servers. To customers, everything looks fine, even if a bit slower than usual, so they can apply for insurance relatively quickly, and the conversion rate in the sales funnel does not decline. From the business point of view, things are fine, and Service Availability is equal to 1. From the engineering point of view, though, half of the company’s servers and communication channels are unavailable, so things are not fine.

As you see, the service is available if it’s accessible to customers, so Service Availability is defined exclusively by the business side of the company. Here, however, it’s important not to extend business metrics to IT and not to set conversion as a KPI for the IT unit. The IT unit’s job is to keep the IT infrastructure functioning, not to attract or retain customers.

Specifics of implementation for the Service Availability metric

So, how do we calculate Service Availability? A service often consists of a complex of information systems and can represent a chain of smaller services. (For example, a data center is responsible for providing virtual machines, a cloud for system component services, information systems for application services, and so on.)

Here I should clarify that, in umbrella monitoring systems, we frequently deal not with ready-made metrics, but with specific events or alerts that have emerged in various monitoring systems. In the Service Topology, they are considered to be coming from different configuration items, or CIs (services, servers, virtual machines, clusters, etc.). Given the above, the overall Service Availability is essentially the cumulative availability of a CI group. The availability of a given CI is an assessment of its performance from the standpoint of the availability of the ultimate service to which this CI contributes. This interconnection allows us to carry out factor analysis and determine the contribution of each CI to the overall result, thus identifying bottlenecks.

When building a report on Service Availability, we first need to define a list of emergency situations, or combinations of them, that indicate a dysfunction of the service. We should also think of additional parameters, such as:

  • Service working hours. For example, does the service need to be available only during daytime hours, or only on holidays?
  • RTO (recovery time objective). Do we have a maximum allowable time during which our object can be in an emergency state?
  • Service windows. Do we take the agreed service windows into account?

Monitoring systems also make mistakes sometimes, so we should consider whether emergencies should be verified by engineers (if we have such a mechanism).

The method itself

Firstly, let’s calculate Service Availability for a single CI. By this stage, we have already configured all the problem filters and decided on the parameters of our calculations.

To calculate the Service Availability (SA) for a particular period, it is necessary to construct a function of the CI problem status versus time, fProblem(t), which can take one of four values at any moment:

  1. The value (0) means that presently the CI has no problems that correspond to the filter;
  2. The value (1) means that the CI has a problem which passes the filter conditions;
  3. The value (N) says that the CI is in an unattended state;
  4. The value (S) says that the CI is in an agreed maintenance window.
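To make the four states concrete, here is a minimal Python sketch of fProblem(t) evaluated at a single moment. The `Interval` type, the hour-based time unit, and the precedence order (N over S over 1 over 0) are assumptions made for illustration; the method itself does not prescribe them.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "0"          # no problem matching the filter
    PROBLEM = "1"     # a problem passes the filter conditions
    UNATTENDED = "N"  # outside the service working hours
    SERVICE = "S"     # inside an agreed maintenance window


@dataclass(frozen=True)
class Interval:
    """Half-open time span [start, end), in hours (hypothetical unit)."""
    start: float
    end: float

    def contains(self, t: float) -> bool:
        return self.start <= t < self.end


def f_problem(t, working, maintenance, problems):
    """Return the CI status at time t.

    Assumed precedence: unattended time wins over maintenance,
    which wins over problems. Adjust to your own SLA rules.
    """
    if not any(w.contains(t) for w in working):
        return Status.UNATTENDED
    if any(m.contains(t) for m in maintenance):
        return Status.SERVICE
    if any(p.contains(t) for p in problems):
        return Status.PROBLEM
    return Status.OK
```

With working hours of 8 to 20, a maintenance window of 12 to 13, and a problem lasting from 12.5 to 15, this sketch reports S at 12.5 (maintenance masks the problem) and 1 at 14.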

As a result, we get the following indicators:

  • timeNonWorking – the aggregate CI non-working time span in the considered period. The function value was “N”.
  • timeWorkingProblem – the time spent by the CI in a state that does not meet the SLA requirements in the investigated period of time. The function value was “1”.
  • timeWorkingService – agreed idle time when the CI was in a service mode during working hours. The function value was “S”.
  • timeWorkingOK – the time span during which the CI satisfied the SLA requirements. The fProblem(t) function had state “0”.

The calculation of Service Availability (SA) for a single CI for a given period is carried out according to the formula:

SA = timeWorkingOK / (timeWorkingOK+timeWorkingProblem) * 100%
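As a rough illustration of the formula, the sketch below sums the time spent in each state and computes SA. The input shape (a list of state and duration pairs) and the sample numbers are hypothetical, and returning `None` when there is no counted working time is a design choice of this sketch, not part of the method.

```python
from collections import defaultdict


def service_availability(segments):
    """Compute SA in percent from (state, duration) pairs.

    States use the article's codes: "0", "1", "N", "S".
    Returns None when the period contains no OK or problem time.
    """
    totals = defaultdict(float)
    for state, duration in segments:
        totals[state] += duration

    time_working_ok = totals["0"]
    time_working_problem = totals["1"]
    counted = time_working_ok + time_working_problem
    if counted == 0:
        return None  # only N and S time: SA is undefined for this period
    return time_working_ok / counted * 100


# Hypothetical week (168 hours): 108 h unattended, 4 h maintenance,
# 3 h of problems, and 53 h of normal operation.
week = [("N", 108.0), ("S", 4.0), ("1", 3.0), ("0", 53.0)]
sa = service_availability(week)
```

Note that the unattended and maintenance hours do not enter the ratio at all: here SA is 53 / (53 + 3), or about 94.6%.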

Fig. 1. An example of possible distributions of time intervals when calculating SA (Service Availability) for a single CI. Image credit: Nikolay Ganyushkin

Fig. 2. An example of the influence of RTO on the calculation of the function fProblem(t). Image credit: Nikolay Ganyushkin

To calculate the availability of a CI group, which we call Service Availability Group (SAG), it is necessary to build the function fProblem(t) for each CI included in the group. Next, we superimpose the resulting fProblem(t) functions on top of each other, using certain rules (see Table 1).

Table 1. Image credit: Nikolay Ganyushkin

In the end, we get the function fGroupProblem(t). We sum up the duration of the segments of this function as follows:

  • timeGroupService – time when fGroupProblem(t) = S,
  • timeGroupOK – time when fGroupProblem(t) = 0,
  • timeGroupProblem – time when fGroupProblem(t) = 1.

Thus, the metric we’ve been discussing is defined as:

SAG = timeGroupOK / (timeGroupOK+timeGroupProblem) * 100%
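The superposition step can be sketched in Python as follows. The contents of Table 1 are not reproduced in this text, so the combining rule below (`combine_any_problem`) is a purely hypothetical stand-in and should be replaced with the real rules; the timeline format and all names are likewise assumptions.

```python
def combine_any_problem(states):
    """Hypothetical combining rule; NOT the rules from Table 1."""
    if "1" in states:
        return "1"  # any CI with a problem marks the group as a problem
    if "0" in states:
        return "0"
    if "S" in states:
        return "S"
    return "N"


def state_at(timeline, t):
    """Look up the state of one CI at time t."""
    for start, end, state in timeline:
        if start <= t < end:
            return state
    return "N"  # assumption: uncovered time counts as unattended


def group_availability(timelines, combine=combine_any_problem):
    """Build fGroupProblem(t) piecewise and return (SAG percent, totals)."""
    # Every segment boundary of every CI is a point where the group state may change.
    cuts = sorted({edge for tl in timelines
                   for start, end, _ in tl
                   for edge in (start, end)})
    totals = {"0": 0.0, "1": 0.0, "S": 0.0, "N": 0.0}
    for left, right in zip(cuts, cuts[1:]):
        group_state = combine([state_at(tl, left) for tl in timelines])
        totals[group_state] += right - left
    counted = totals["0"] + totals["1"]
    sag = totals["0"] / counted * 100 if counted else None
    return sag, totals


# Two hypothetical CIs over one day, as (start_hour, end_hour, state).
ci_a = [(0, 8, "N"), (8, 10, "1"), (10, 20, "0"), (20, 24, "N")]
ci_b = [(0, 8, "N"), (8, 12, "0"), (12, 13, "S"), (13, 15, "1"),
        (15, 20, "0"), (20, 24, "N")]
sag, totals = group_availability([ci_a, ci_b])
```

Passing the rule in as a parameter keeps the sweep over time boundaries separate from the business decision of how CI states combine, which is the part that Table 1 defines.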

Fig. 3. An example of possible distributions of time intervals for calculating the availability of a CI group. Image credit: Nikolay Ganyushkin

Business impact analysis

It is important not only to get the Service Availability metric, but also to be able to decompose it into components. This will enable us to understand which problems became critical, and which made the smallest contribution to the current situation. This set of activities is called Business Impact Analysis (BIA), and it lets us identify how each particular IT component supports each particular business service of our company. Knowing these dependencies will make our business steadier and more resilient, and help us understand which areas of the IT environment need more attention or investments.

This approach, however, has some limitations:

  1. The method for determining Service Availability cannot assign a weight to an individual problem when several problems occur simultaneously. The only available parameter is the duration of the problem.
  2. Consequently, if two or more problems occur simultaneously, we count the duration of each over that period with a weight of 1/N, where N is the number of problems that occurred simultaneously.

Calculation method:

  1. We should take the function fProblem(t) that was built when calculating SA.
  2. For each segment where the final function fProblem(t) = 1, we make a list of the problems of this CI that caused the segment to be assigned the value of 1. When compiling the list, it is necessary to include problems that started or ended outside the time span of the function.
  3. Assign to each problem a metric of influence. It is equal to the duration of the problem in the segment multiplied by the corresponding weight. If there was only one problem in the segment, the problem is assigned a weight of 1. In the case of multiple problems, the weight is equal to 1/N, where N is the number of simultaneously occurring problems.
  4. When calculating, keep two points in mind. First, the weight of a problem can change within the same segment as new problems appear or existing ones end. Second, the same problem can be present in several different segments where fProblem(t) = 1. For example, a problem emerged on Friday and ended on Tuesday, but on weekends the CI is not serviced according to the SLA.
  5. Eventually, you should form a list of problems that were taken into account in the calculation of the function fProblem(t). At the same time, a metric of influence on Service Availability should be calculated for each problem.
  6. It is imperative to verify the calculation. The sum of the impact metrics for all problems must be equal to timeWorkingProblem.
  7. Users usually want to see the relative influence as a percentage. To get it, the impact metric should be divided by timeWorkingProblem and multiplied by 100%.
  8. If we need to group problems and show the influence of the group, it is enough to sum up the metrics of all the problems included in the group. This statement is true only if the following condition is met: each problem is included in only one group at a time.
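The calculation steps above can be sketched in Python as follows. The function name, the input format, and the sample problems `A` and `B` are hypothetical; the sketch assumes that every segment where fProblem(t) = 1 is covered by at least one listed problem, and the verification from step 6 will fail if that is not the case.

```python
def impact_metrics(problem_segments, problems):
    """Split timeWorkingProblem among problems using 1/N weights.

    problem_segments: (start, end) spans where fProblem(t) = 1.
    problems: dict of problem id -> (start, end); a problem may start
              or end outside the reporting period.
    Returns (impact per problem, timeWorkingProblem).
    """
    impact = {pid: 0.0 for pid in problems}
    time_working_problem = 0.0
    for seg_start, seg_end in problem_segments:
        # Cut the segment wherever a problem starts or ends,
        # so that N is constant within each slice.
        cuts = {seg_start, seg_end}
        for p_start, p_end in problems.values():
            cuts.update(c for c in (p_start, p_end) if seg_start < c < seg_end)
        cuts = sorted(cuts)
        for left, right in zip(cuts, cuts[1:]):
            active = [pid for pid, (s, e) in problems.items()
                      if s <= left and e >= right]
            time_working_problem += right - left
            for pid in active:
                impact[pid] += (right - left) / len(active)
    return impact, time_working_problem


# Problem A began before the period; B spans a 12-14 maintenance window.
segments = [(9, 12), (14, 15)]
problems = {"A": (-5, 11), "B": (10, 30)}
impact, total = impact_metrics(segments, problems)

# Step 6: the impacts must add up to timeWorkingProblem.
assert abs(sum(impact.values()) - total) < 1e-9

# Step 7: relative influence in percent.
share = {pid: value / total * 100 for pid, value in impact.items()}
```

In this example A and B overlap for one hour, so each receives half of it: A ends up with 1.5 hours (37.5%) and B with 2.5 hours (62.5%) of the 4 problem hours. For step 8, a group’s influence is simply the sum of `impact` over its members, provided each problem belongs to exactly one group.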

Conclusion

We have derived and calculated a Service Availability metric that is simple, business-oriented, and decomposable. It enables us to assess the state of the IT environment of a company not from the purely technical standpoint, but from the standpoint of what service said environment actually provides to the company’s customers. However, we should keep in mind that this metric is purely retrospective and cannot be used for predictions in isolation from component health metrics and plans for infrastructural changes.

