3.4 Automated RCA

When an IT system or service becomes unavailable, how can we automatically identify where the problem lies and how it can be fixed?

3.4.1 Problem

In most enterprises, Mean Time To Resolve (MTTR) is one of the key measures that service owners use in the service level agreements (SLAs) for the services they offer to their clients. MTTR is the average time a service takes to recover from a failure.

To reduce MTTR, we must identify problems faster. That is why Mean Time To Identify (MTTI), also referred to as Mean Time To Detect (MTTD), is an important KPI as well. MTTI is the average time between the onset of a service outage and the detection of its root cause. A low MTTI does not by itself protect you from downtime, but it focuses the recovery effort on where the problem lies, directly reducing MTTR. It also helps in enforcing SLAs.
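The relationship between the two measures can be illustrated with a small calculation. The following is a minimal sketch, assuming each incident record carries three timestamps; the Incident type, its field names and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    onset: datetime       # when the outage began
    identified: datetime  # when the root cause was identified
    resolved: datetime    # when the service was restored


def mean_minutes(durations):
    """Average a collection of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mtti_minutes(incidents):
    """Mean time from outage onset to root cause identification."""
    return mean_minutes(i.identified - i.onset for i in incidents)


def mttr_minutes(incidents):
    """Mean time from outage onset to service recovery."""
    return mean_minutes(i.resolved - i.onset for i in incidents)


# Illustrative data only, not taken from a real system.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=30)),
    Incident(t0, t0 + timedelta(minutes=20), t0 + timedelta(minutes=50)),
]

print(mtti_minutes(incidents))  # 15.0
print(mttr_minutes(incidents))  # 40.0
```

Because identification always precedes resolution, every minute removed from MTTI in this model is also removed from MTTR.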

Therefore, in the digital world of IT operations, the speed with which we identify, categorize, prioritize and respond to an issue has never been more important.

3.4.2 Solution

To see all problems as they occur, we must design and instrument each component of the system so that it generates sufficient telemetry, allowing us to understand how the system behaves as a whole. When every level of the application stack has instrumentation and telemetry, along with health rules that raise events, automated RCA becomes more likely to succeed and MTTI is reduced accordingly.
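A health rule that turns telemetry into events might look like the following minimal sketch. It assumes a simple threshold rule over telemetry samples represented as dictionaries; the HealthRule class, the sample keys and the metric name are hypothetical and do not correspond to any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class HealthRule:
    """Hypothetical threshold rule attached to one metric of one component."""
    component: str
    metric: str
    threshold: float

    def evaluate(self, sample):
        """Return an event dict if the sample violates the rule, else None."""
        if sample["component"] != self.component:
            return None
        if sample["metric"] != self.metric:
            return None
        if sample["value"] <= self.threshold:
            return None
        return {
            "component": self.component,
            "metric": self.metric,
            "value": sample["value"],
            "threshold": self.threshold,
            "timestamp": sample["timestamp"],
        }


# Illustrative rule and samples; names and values are assumptions.
rule = HealthRule(component="app-server", metric="latency_ms", threshold=500.0)
samples = [
    {"component": "app-server", "metric": "latency_ms", "value": 120.0, "timestamp": 100},
    {"component": "app-server", "metric": "latency_ms", "value": 750.0, "timestamp": 160},
]

# Keep only the samples that produced an event.
events = [e for e in (rule.evaluate(s) for s in samples) if e is not None]
print(len(events))  # 1
```

Events produced this way carry the component name and a timestamp, which is the minimum a later correlation step would need.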

3.4.3 Application

It is important to distinguish between machine-learning-assisted RCA (Root Cause Analysis) and machine-learning-performed RCA. By Automated RCA, we mean that machine learning assists in detecting the root cause and leaves the fix to a human-centric process or an <AutonomicManager>. The key focus of <AutomatedRCA> is to significantly reduce MTTI in the current DevOps world.

3.4.4 Examples / Use-cases

In real-life implementations, the business and operational telemetry may be spread across more than one database, since many enterprises deal with multiple APM (Application Performance Management) and infrastructure monitoring tools. For example, an IT service comprising an application running in an OpenStack environment might leverage OpenStack’s telemetry service at the infrastructure layer, and an APM tool to instrument and enable telemetry at the application layer, together with any of the open source instrumentation tools such as Zipkin, OraOpenSource or OpenTracing-based instrumentation. The health rules and business rules will therefore be attached to the respective telemetry services.

An automated RCA involves correlating events across the different components of the stack, and their interactions with their dependent services, over a period of time. The effectiveness of an automated RCA comes from well-defined decision trees designed from past experience.
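The two steps described above, correlation over a time window followed by a decision tree, can be sketched as follows. This is only an illustration under stated assumptions: events are dictionaries with a component name and a numeric timestamp in seconds, and the tree is a small hand-written one. The Node class, the component names and the tree itself are hypothetical, not a prescribed design.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """Decision node asking: did this component raise an event?"""
    component: str
    if_yes: Union["Node", str]  # subtree, or a verdict string at a leaf
    if_no: Union["Node", str]


def correlate(events, start, window_seconds):
    """Return the set of components that raised events within the window."""
    end = start + window_seconds
    return {e["component"] for e in events if start <= e["timestamp"] <= end}


def diagnose(tree, failing):
    """Walk the decision tree until a verdict (a string) is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.if_yes if node.component in failing else node.if_no
    return node


# Hypothetical tree: check the deepest dependency first.
tree = Node(
    "database",
    if_yes="database",
    if_no=Node("app-server", if_yes="app-server", if_no="unknown"),
)

# Illustrative events; the cache event falls outside the window.
events = [
    {"component": "database", "timestamp": 95},
    {"component": "app-server", "timestamp": 98},
    {"component": "web", "timestamp": 100},
    {"component": "cache", "timestamp": 500},
]

failing = correlate(events, start=90, window_seconds=30)
print(diagnose(tree, failing))  # database
```

Ordering the tree so that the deepest dependency is checked first encodes the kind of operational experience the text refers to: when a database and the services in front of it all raise events in the same window, the database is the more likely origin.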

It therefore serves well to keep the telemetry store architecturally separate from the real-time operational intelligence store.