Adopting Hybrid Cloud? Adapt Your Incident Monitoring


Public, private, and hybrid cloud implementations each have distinct characteristics and advantages, but more than 70 percent of enterprises now have a hybrid cloud strategy in place. Industry analyst firm Gartner predicts that the hybrid cloud market will reach $80 billion by 2018. For companies using a custom mix of public and private clouds, there’s a lot to love about the ability to choose the right approach for each application workload. But it can come with a big drawback: visibility is slow and oblique when something goes wrong in a hybrid cloud environment, as you might have heard from a recent Microsoft Azure outage.
You can’t manage what you can’t see. So if your enterprise depends on a hybrid cloud to support internal-facing and customer-facing production systems, you must ensure that you can track real-time event streams from every software and infrastructure component across the entire “cloud stack,” intelligently sift the incident “signals” from all the noise, and then separate the causal indicators from the collateral ones. This is the basis for automating incident detection and remediation in cloud environments. Some IT folks are starting to call this “Cloud Situational Awareness.”
Situational awareness, however, is a unique challenge in hybrid cloud environments (and increasingly in wider enterprise IT environments). This is because hybrid clouds are heterogeneous, highly abstracted, and dynamic. It also means that adding workload mobility is operationally complicated. With a hybrid cloud, you just can’t monitor everything the way you used to. The lack of “wide and deep” visibility when something goes wrong is extending mean time to remediate, affecting everything from development to functional testing and, ultimately, production. To support hybrid cloud in production, IT operations teams often assemble multiple monitoring tools, sometimes adding home-grown tools such as dashboards. But this often just spreads the monitoring data across a proliferation of screens that teams must still search to gain situational awareness when something goes wrong, eating precious time before incident remediation can actually start. As organizations make their hybrid cloud plans a reality, here are three recommendations that will help IT teams detect and respond to service-affecting incidents much earlier, before customers complain:
Connect views for public and private to see deep across. Enterprises are weaving together multiple cloud services (public and private) to optimize performance, flexibility, and costs. Public cloud services offer monitoring portals and APIs so you can get some insight into what’s going on, but that covers only part of your total hybrid cloud architecture. To gain full awareness of incidents when they occur, you’ll need to connect the monitoring data from public cloud domains with the data streams from your private cloud domains. If you have to flip between numerous dashboards and tool screens (just think about workloads on a VMware hypervisor vs. KVM), it’s going to take you a long time to triangulate an incident’s cause. Ideally, you should aggregate it all (events and alerts from each cloud domain in your environment) with automated incident detection across the whole, enhancing situational awareness for your DevOps and production support teams.
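As a rough illustration of connecting public and private monitoring data, the sketch below normalizes events from two domains into one schema and merges them into a single time-ordered stream. The payload field names (`time`, `resource`, `detail`, `ts`, `host`, `msg`) and the `Event` type are hypothetical, invented for this example; real monitoring APIs will have their own shapes.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since epoch
    domain: str       # "public" or "private" (labels assumed for this sketch)
    source: str       # component that raised the event
    message: str


def normalize_public(raw: dict) -> Event:
    # Hypothetical payload shape for a public cloud monitoring API.
    return Event(raw["time"], "public", raw["resource"], raw["detail"])


def normalize_private(raw: dict) -> Event:
    # Hypothetical payload shape for an on-premises monitoring tool.
    return Event(raw["ts"], "private", raw["host"], raw["msg"])


def merge_streams(public_raw, private_raw):
    """Return one time-ordered list of events from both domains.

    Each input must already be sorted by time, as heapq.merge requires.
    """
    public = (normalize_public(r) for r in public_raw)
    private = (normalize_private(r) for r in private_raw)
    return list(heapq.merge(public, private, key=lambda e: e.timestamp))
```

With both domains in one stream, a single view (or a downstream detection step) can work from the same ordered timeline instead of separate dashboards.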
Connect views for apps through infrastructure to see deep down. Hybrid clouds are designed in layers, and when something goes wrong, you need to trace down through them. But new technologies like OpenStack for open clouds and Docker for containerization aren’t making this easier. Like any cloud technology, they create additional layers of abstraction, layers that “cloud” the ability to see the source of a problem. Finding out where in the entire “cloud stack” an anomaly originated is step one of your incident resolution process. But it’s not an easy step unless you have some way of analyzing event streams in real time across all the layers to pinpoint the incident’s cause. You need to create intra-cloud visibility that spans all layers of your cloud implementation, starting with the application layers, then the cloud layers, and finally the infrastructure layers. Connecting it all will accelerate incident response.
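One way to picture tracing down through the layers is a dependency map that records what each component runs on. The sketch below, under that assumption, walks downward from an affected application and returns the alerting components that have no alerting component beneath them, as candidate origins. The component names and the `DEPENDS_ON` map are hypothetical; a real environment would need to discover and maintain this topology dynamically.

```python
# Hypothetical dependency map: each component lists what it runs on.
DEPENDS_ON = {
    "checkout-app": ["container-7"],
    "container-7": ["vm-12"],
    "vm-12": ["hypervisor-3"],
    "hypervisor-3": [],
}


def descendants(node, depends_on):
    """Return every component reachable beneath `node`."""
    seen = set()
    stack = list(depends_on.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def likely_origins(component, alerting, depends_on):
    """Return alerting components under `component` (inclusive) that have
    no alerting component anywhere beneath them."""
    reachable = descendants(component, depends_on) | {component}
    candidates = reachable & set(alerting)
    return {
        c for c in candidates
        if not (descendants(c, depends_on) & candidates)
    }
```

For example, if the application, its container, and the virtual machine are all alerting but the hypervisor is not, `likely_origins("checkout-app", {"checkout-app", "container-7", "vm-12"}, DEPENDS_ON)` points at the virtual machine as the deepest alerting layer. This is only a heuristic: the deepest alerting layer is a starting point for investigation, not proof of cause.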
Automate correlation to see root cause faster. Cloud environments are inherently dynamic, scaling capacity up and down and adapting configurations on the fly to meet your business applications’ needs. That’s great for the business, but it is an incredibly complex and interrelated process, meaning that a lot can go wrong. It’s often difficult to see an incident unfolding and find its root cause quickly. Are these alerts causal or collateral? Wading through all the event and alert streams when something goes wrong is like finding a needle in a haystack. Adding correlation analysis across all event streams can result in big savings in time and resources. In particular, real-time, automated correlation based on big data analytics is now the preferred approach, given the dynamic nature of cloud environments. Old methods based on static rules and models don’t work for the hybrid cloud. Automation that cleans and clusters the event streams as they occur, grouping them into incident “situations,” reduces the volume of operational alerts that have to be handled and accelerates root cause detection, so that incidents get resolved faster and service is restored sooner.
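To make the idea of grouping events into “situations” concrete, here is a deliberately simple sketch that clusters events by time proximity: an event joins the current situation if it arrives within a fixed gap of the previous event, otherwise it starts a new one. This time-gap heuristic is a stand-in chosen for illustration, not the analytics-based correlation described above; the function name and the 60-second default are assumptions.

```python
def cluster_into_situations(events, max_gap_seconds=60.0):
    """Group (timestamp, source, message) tuples into situations.

    An event joins the current situation when it occurs within
    `max_gap_seconds` of the previous event; otherwise a new
    situation begins. Events are sorted by timestamp first.
    """
    situations = []
    for event in sorted(events, key=lambda e: e[0]):
        previous = situations[-1][-1] if situations else None
        if previous is not None and event[0] - previous[0] <= max_gap_seconds:
            situations[-1].append(event)
        else:
            situations.append([event])
    return situations


# Example: five raw alerts collapse into two situations to triage.
alerts = [
    (0.0, "vm-12", "disk latency"),
    (10.0, "container-7", "slow responses"),
    (30.0, "checkout-app", "timeouts"),
    (500.0, "lb-1", "health check failed"),
    (520.0, "web-2", "connection refused"),
]
situations = cluster_into_situations(alerts)
```

Even this crude grouping shows the payoff: operators handle two situations instead of five separate alerts, and the earliest event in each group is a natural first place to look when separating causal from collateral indicators.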
No question: hybrid cloud environments are here to stay and are increasingly vital for the competitive success of enterprises going forward. But at the moment, they are evolving more rapidly than the tools to manage and monitor them. Don’t assume that your legacy approach to incident detection and remediation can adapt to your new cloud environment. Invest now in building the operational tool architecture needed to handle the support challenges of hybrid cloud, and you’ll avoid getting mired in cloud incident fog.
Disclaimer: This article was written by a guest contributor in his/her personal capacity. The opinions expressed in this article are the author’s own and do not necessarily reflect those of the editorial team at