OpsRamp’s Ciaran Byrne on managing multicloud and hybrid environments

While cloud was supposed to simplify IT management, it also complicated other parts, such as monitoring what was going on within a sprawling environment. It was hard enough keeping track of all the different systems, applications, and tools when everything was dispersed through the corporate network and data centers. With the cloud making it easier to spin up server instances, it was easy to overlook what was running. The challenges multiplied with the shift to hybrid -- the mix of cloud and on-premises -- and multicloud -- with applications living in different clouds.

Hybrid and multicloud environments make sense for redundancy or resiliency. If something goes wrong in one environment, the other one is still available. In some cases, it may be as simple as picking the infrastructure best suited for a specific application. Whatever the reason, the idiosyncrasies of each cloud infrastructure mean IT departments can't just take a one-size-fits-all approach to IT management and monitoring.

Ciaran Byrne, vice president of product strategy at OpsRamp, spoke with VentureBeat about the complexities of managing cloud/multicloud hybrid environments.

This interview has been edited for clarity.

VentureBeat: What are the top challenges associated with managing IT infrastructure in mixed (cloud, multicloud, and hybrid) environments?

Ciaran Byrne: The biggest challenge is dealing with the complexity. It’s not just a matter of cloud and on-premises; you have networks, servers, storage, virtual environments, containers, and applications that you have to discover and collect metrics on, and those are running in both cloud and on-premises environments.

In most cases, you’ll be managing these mixed environments with multiple monitoring tools, leading to tool sprawl. You’ll have to make sense of large volumes of data coming from these mixed environments managed by a diverse toolset. The mixed environments will likely have interdependencies, which can make it difficult to detect and troubleshoot issues. Troubleshooting may also be more complicated, as each environment will have its own nuances for investigating and resolving issues, which requires operators and admins to have a broad range of skills.

Once you’ve “solved” the problem of monitoring these hybrid environments, you have to understand which parts of this hybrid infrastructure are supporting which application services. Then you have to respond effectively when a problem is detected, to triage and manage the incident, consolidate alerts to the same event, then route the incident for remediation. If you’re doing this all manually, it’s a long and cumbersome process that will take too much of your IT Ops team’s time, so you need to be able to automate it as much as possible.
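
As an illustration of the alert consolidation step described above (not something taken from the interview), here is a minimal Python sketch. It assumes alerts have already been normalized to a common shape; the `Alert` fields, the tool names, and the five-minute window are all hypothetical choices. It groups alerts that share a resource and symptom and arrive close together in time, which is only one simple way to consolidate alerts into a single incident.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source_tool: str  # hypothetical tool label, e.g. "apm" or "infra"
    resource: str     # the resource the alert refers to
    symptom: str      # a normalized symptom label
    timestamp: float  # seconds since some reference point


def consolidate(alerts, window_seconds=300):
    """Group alerts on the same resource and symptom into incidents.

    An alert joins the current incident if it arrives within
    window_seconds of the previous alert in that incident.
    """
    by_key = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        by_key[(alert.resource, alert.symptom)].append(alert)

    incidents = []
    for group in by_key.values():
        current = [group[0]]
        for alert in group[1:]:
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)
            else:
                # Gap is too large: close this incident and start a new one.
                incidents.append(current)
                current = [alert]
        incidents.append(current)
    return incidents
```

Two alerts about the same database from different tools a minute apart would land in one incident here, while an alert on the same database much later would open a new one. A production platform would likely use richer correlation than a fixed time window.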

VentureBeat: What are some idiosyncrasies of each platform that make them challenging?

Byrne: If you’re talking about the three major public cloud platforms -- AWS, Azure, and Google Cloud Platform -- they all have their own sets of services that have to be monitored. Each one has more than 50 different services across compute, storage, network, database, containers, security, IoT, etc. So there’s no one-size-fits-all approach that you can take.

That said, these services are fairly similar so a user familiar with one should be able to quickly learn another. They all have their own monitoring tools like Amazon CloudWatch, Azure Monitor, and Google Cloud Operations Suite, that provide data that you need to aggregate and integrate. You’ll want to use an agent-based approach for some services, an API-based approach for others.

You’ll want to use a query tool that’s designed for these environments like PromQL rather than using the same query tool you’d use for your internal environment, like SQL.

VentureBeat: What are some best practices associated with managing mixed environments?

Byrne: For starters, eliminate the “swivel chair” effect. Even if you take a best-of-breed approach to monitoring, you’ll want to bring all of those metrics -- server, storage, virtualized environments, network, cloud, containers, applications -- into one place for aggregation and integration.

These are dynamic and changing environments. You'll want to use ML/AI techniques to manage them, so that your management system is continually learning and updating. This makes incident management, event correlation, and alert consolidation so much easier.

Also, be careful to avoid duplication of effort across your monitoring stack. Your APM [application performance management] tools are great at application metrics, but probably aren’t telling you anything about your infrastructure that you don't already get from your ITIM [IT infrastructure monitoring] stack.

Another tip: Deploy service-aware topology maps to understand the dependencies between your business services and IT services. Once you understand the business service dependencies, you can get a better handle on how much your cloud services and other IT services are costing and which services can be shut off or retired without impacting your business.

VentureBeat: You mentioned the “swivel chair” effect, but you also noted that there are tools specific for each cloud (and on-prem) environment. How do I manage to get metrics in one place if all the tools are separate? And can we get an example of what automation looks like once the problem of monitoring multiclouds has been solved?

Byrne: Tool sprawl is a fact of life in IT Operations today. Half of all survey respondents in our 2021 State of Digital Operations Management report use between six and 10 IT Ops tools. Another third use between 11 and 20. Regardless of how many tools you’re using, if you take a platform approach, you can aggregate your monitoring data from different tools into events, then apply machine learning to gain insights into those events, whether that’s anomaly detection, predictive alerting, alert rationalization, or probable root cause.

If I have a problem with my application, I can look at data from my APM tool in context with data from the ITIM and NPM [network performance management] tools that are monitoring the server and network that are supporting that application. My Digital Operations Management platform aggregates, contextualizes, and analyzes all this data, so I can see everything in one place, rather than toggling back and forth between these different applications. The metrics can be in one place because Digital Ops Management systems have agents, gateways, and other mechanisms to collect this data in a central location. This provides unified visibility, monitoring and automation based on that data. You can then pinpoint where the problem is occurring, and you’ve eliminated the swivel chair.
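
To make the idea of bringing metrics into one place more concrete, here is a minimal Python sketch added for illustration. It is not OpsRamp's implementation. The payload shapes for the APM and infrastructure tools are hypothetical, and it assumes both tools tag their records with a shared application name so the data can be viewed in context.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    app: str       # application the metric belongs to
    resource: str  # service or host that produced it
    name: str
    value: float
    source: str    # which tool reported it


def from_apm(record):
    # Hypothetical APM payload: {"app", "service", "metric", "val"}
    return Metric(record["app"], record["service"], record["metric"],
                  float(record["val"]), "apm")


def from_infra(record):
    # Hypothetical infrastructure payload: {"app", "host", <metric>: <value>, ...}
    return [
        Metric(record["app"], record["host"], key, float(value), "infra")
        for key, value in record.items()
        if key not in ("app", "host")
    ]


def aggregate(apm_records, infra_records):
    """Normalize records from both tools and group them by application."""
    metrics = [from_apm(r) for r in apm_records]
    for record in infra_records:
        metrics.extend(from_infra(record))

    by_app = {}
    for metric in metrics:
        by_app.setdefault(metric.app, []).append(metric)
    return by_app
```

With a common schema like this, an operator looking at one application sees its latency next to the CPU and memory of the hosts behind it, instead of switching between tools.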

As for the automation question, oftentimes it’s as simple as understanding which cloud resources support which applications, then which actions should be taken if an event is detected and which on-call teams need to be notified. Maybe there’s a patch vulnerability on the compute instance that needs to be fixed or an incident remediation policy that needs to be invoked. Maybe secure remote access needs to be granted so support teams can see what’s going on in the cloud instance for themselves. Those are all processes that can be automated in multicloud environments.
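
The automation described above (which resources support which applications, which action to take, and which on-call team to notify) can be sketched as a simple lookup. This Python example is illustrative only; every resource name, event type, action, and team name in it is hypothetical.

```python
# Hypothetical mapping of cloud resources to the applications they support.
RESOURCE_TO_APP = {
    "vm-123": "order-management",
    "db-7": "order-management",
    "vm-456": "reporting",
}

# Hypothetical playbook: (application, event type) -> (action, on-call team).
PLAYBOOK = {
    ("order-management", "vulnerability"): ("apply_patch", "platform-oncall"),
    ("order-management", "outage"): ("run_remediation_policy", "orders-oncall"),
    ("reporting", "outage"): ("grant_remote_access", "reporting-oncall"),
}

# Fallback used when no specific rule matches.
DEFAULT_RESPONSE = ("open_ticket", "it-ops-oncall")


def plan_response(resource, event_type):
    """Decide which action to take and whom to notify for an event."""
    app = RESOURCE_TO_APP.get(resource)
    action, team = PLAYBOOK.get((app, event_type), DEFAULT_RESPONSE)
    return {"application": app, "action": action, "notify": team}
```

For example, `plan_response("vm-123", "vulnerability")` selects the patch action and the platform on-call team, while an event on an unknown resource falls back to opening a ticket.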

VentureBeat: Can you elaborate a bit more about duplication of effort? Are you suggesting reducing the number of tools being used?

Byrne: Tool sprawl is a fact of life in IT Ops as I mentioned above, but it’s also true that IT Ops teams are looking to reduce the number of tools they use. In our survey this year, we didn't just ask people how many IT Ops tools they used but how much they were looking to reduce that tool sprawl: 55% of respondents said they were looking to reduce the number of IT Ops tools they use by 50% or more. That’s not a universal sentiment; another 28% said they were looking to maintain the same number of tools or even increase the number of tools they use. But IT Ops teams clearly have a lot of interest in reducing the number of tools they use. Business and technological needs are always evolving, as is the IT Ops tools market. IT Ops teams should evaluate their tool usage and eliminate tools that offer duplicate metrics. You’ll reduce your monitoring data overhead as well as your tool spend in most cases.

VentureBeat: What kind of insights do we get if we can map IT services against business service dependencies? You talk about shutting down services; can you tell us a story of that?

Byrne: Let’s take an easy-to-understand example. My business service is order management. E-commerce can’t operate without it. That business service is dependent on an order management application, which in turn runs on a server, makes calls to a database on another server or in the cloud, interacts with an inventory management application to make sure the product you’re ordering is in stock, and also invokes a payment service to process your credit card payment. There are any number of network switches used to make all of these connections. If any of these IT services fail, your order management business service will slow to a crawl or fail altogether.

In addition, business service monitoring may be in the form of a synthetic test to measure the e-commerce site performance, while underlying applications and infrastructure can map to the associated IT services. That gives you a holistic view of the business/user perspective and how it is impacted by the underlying IT services, making it easy to determine the relationship and perform troubleshooting and remediation. A service-aware topology map shows you all the IT services that support your business service and how critical they are to the business service functioning.

When I spoke of shutting down services, this is what I was getting at. Seeing which IT services support which business services helps you to triage and remediate problems faster before users, and your business, are impacted. But a secondary benefit is that you can see where you may have over-provisioned IT services for a particular business service. Maybe you have more compute resources, more network switches, more database servers than you really need. Those IT services can either be retired or reprovisioned elsewhere where there’s a greater need for them. The end result is that you have a more efficient system that’s costing you less to maintain and is easier to manage.
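
The order management example can be modeled as a small dependency graph. The Python sketch below is an illustration added here, not a description of how any particular product builds topology maps. The service names are hypothetical labels for the components in the example. Given a failed IT service, it walks the graph to find everything that depends on it, directly or indirectly.

```python
# Each service maps to the services it depends on (names are illustrative).
DEPENDS_ON = {
    "order-management": ["order-app"],  # the business service
    "order-app": ["app-server", "orders-db", "inventory-app", "payment-service"],
    "inventory-app": ["network-switch"],
    "payment-service": ["network-switch"],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph so we can walk from a dependency up to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted = set()
    stack = [failed]
    while stack:
        node = stack.pop()
        for parent in dependents.get(node, ()):
            if parent not in impacted:
                impacted.add(parent)
                stack.append(parent)
    return impacted
```

Calling `impacted_by("network-switch", DEPENDS_ON)` would report the inventory application, the payment service, the order application, and the order management business service. The same graph can be read the other way: an IT service that nothing depends on is a candidate for review before being retired or reprovisioned.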
