3.3 Service Monitor

How do you dynamically monitor an IT service so that operational policies support the Service Level Objectives (SLOs) and indicate which actions should be taken to remediate a problem?

3.3.1 Problem

Information Technology (IT) departments are required to deliver increasing levels of service and availability while reducing costs. Service level management is an accepted method to ensure that IT services meet the business requirements. The service level agreement (SLA) defines, in language that has meaning to the customer, precisely what is to be delivered and when and where it is to be delivered. It also defines the standard of quality to be delivered, usually in terms of performance and availability.

Operational policies are the actionable elements that translate the SLA into IT operations. Availability management, performance management, capacity management, and service continuity management are all concerned with ensuring that services continue to deliver the service levels included in the relevant SLAs.

It is imperative that an IT system is monitored and managed with appropriate thresholds and health rules to meet the SLA operational and business requirements.

3.3.2 Solution

The <ServiceMonitor> translates the operational policies into operational health rules, which contain the attributes detailing the “what”. The operational health rules become the standardized directives for availability, performance, and service continuity management.

Based on the telemetry data persisted in the database, the operational health rules establish the status of a component, establish threshold criteria and automate alerts.
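The following is a minimal sketch of how a health rule might turn telemetry samples into a status and an alert. The HealthRule class, its two-threshold shape, the metric name, and the component name are all assumptions made for illustration. The source does not prescribe a rule schema, and the samples are passed in directly rather than read from a database.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean
from typing import List, Optional


class Status(Enum):
    HEALTHY = "healthy"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class HealthRule:
    """Hypothetical operational health rule: one metric, two thresholds."""
    name: str
    metric: str
    warning_above: float
    critical_above: float

    def evaluate(self, samples: List[float]) -> Status:
        """Classify the average of recent telemetry samples for the metric."""
        if not samples:
            # Assumption: missing telemetry is flagged rather than treated as healthy.
            return Status.WARNING
        value = mean(samples)
        if value > self.critical_above:
            return Status.CRITICAL
        if value > self.warning_above:
            return Status.WARNING
        return Status.HEALTHY


def alert_for(rule: HealthRule, component: str, status: Status) -> Optional[str]:
    """Build an alert message for a violated rule, or None when healthy."""
    if status is Status.HEALTHY:
        return None
    return f"[{status.value.upper()}] {component}: rule '{rule.name}' violated on {rule.metric}"


# Illustrative values only; real thresholds would come from the operational policy.
rule = HealthRule("slow-response", "response_time_ms", warning_above=500, critical_above=1000)
status = rule.evaluate([420.0, 610.0, 650.0])
print(alert_for(rule, "checkout-service", status))
```

Averaging the samples is only one possible aggregation. A real rule might instead use a percentile or require a sustained violation over a time window.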

3.3.3 Application

An operational health rule violation represents an event – such as an error or exception generated by the application, the crossing of a performance threshold, or an operational change in the application, such as a JVM restart.

An operational health rule can evaluate metrics associated with an entire application or with a limited set of entities. For example, you can create business transaction performance health rules that evaluate certain metrics for all business transactions in the application, or health rules that cover all the components in the application or all the components in a specified application tier (web tier, application tier, database tier, infrastructure tier).
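The scoping described above can be sketched as a rule that either covers every entity or is restricted to one tier. The Entity and ScopedRule types, the tier labels, and the sample metric values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entity:
    """Hypothetical monitored component with its latest metric values."""
    name: str
    tier: str  # e.g. "web", "application", "database", "infrastructure"
    metrics: Dict[str, float]


@dataclass(frozen=True)
class ScopedRule:
    """Hypothetical health rule whose scope is one tier or the whole application."""
    metric: str
    threshold: float
    tier: Optional[str] = None  # None means the rule covers every entity

    def applies_to(self, entity: Entity) -> bool:
        return self.tier is None or entity.tier == self.tier

    def violations(self, entities: List[Entity]) -> List[str]:
        """Return the names of in-scope entities whose metric exceeds the threshold."""
        violated = []
        for entity in entities:
            value = entity.metrics.get(self.metric)
            # Entities that do not report the metric are skipped, not flagged.
            if self.applies_to(entity) and value is not None and value > self.threshold:
                violated.append(entity.name)
        return violated


entities = [
    Entity("web-1", "web", {"cpu_percent": 91.0}),
    Entity("app-1", "application", {"cpu_percent": 95.0}),
    Entity("db-1", "database", {"cpu_percent": 40.0}),
]

web_only = ScopedRule("cpu_percent", threshold=80.0, tier="web")
whole_app = ScopedRule("cpu_percent", threshold=80.0)

print(web_only.violations(entities))   # only the web tier is evaluated
print(whole_app.violations(entities))  # every entity is evaluated
```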

An event can also be used to trigger a policy, which can initiate automatic actions, such as sending alerting emails or running remedial scripts.
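As a rough sketch of an event triggering a policy, the code below maps event kinds to a list of actions. The Event and Policy types and the event kind names are assumptions. The two actions are stubs that only return a description of what they would do, so no email is sent and no script is run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Event:
    """Hypothetical event raised by a health rule violation."""
    kind: str    # e.g. "threshold_crossed", "jvm_restart"
    source: str  # component that raised the event


Action = Callable[[Event], str]


def send_email(event: Event) -> str:
    # Stub: a real action would call a mail service here.
    return f"email: {event.kind} on {event.source}"


def run_remedial_script(event: Event) -> str:
    # Stub: a real action would launch a remediation script here.
    return f"script: remediate {event.source}"


@dataclass
class Policy:
    """Maps event kinds to the actions that should start automatically."""
    actions_by_kind: Dict[str, List[Action]] = field(default_factory=dict)

    def on(self, kind: str, *actions: Action) -> "Policy":
        self.actions_by_kind.setdefault(kind, []).extend(actions)
        return self

    def handle(self, event: Event) -> List[str]:
        """Run every action registered for the event kind, in registration order."""
        return [action(event) for action in self.actions_by_kind.get(event.kind, [])]


policy = (
    Policy()
    .on("threshold_crossed", send_email, run_remedial_script)
    .on("jvm_restart", send_email)
)

print(policy.handle(Event("threshold_crossed", "checkout-service")))
```

Events with no registered actions are ignored in this sketch. A production policy engine would more likely log them.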

3.3.4 Examples / Use-cases

An operational level agreement (OLA) is an agreement between two teams or functions within an IT service provider. It supports the IT service provider’s delivery of IT services to the customers and the service levels contained in the corresponding SLA. The OLA defines the items or services to be provided and the responsibilities of each party.

The service operational policies are the ones that typically enable most, if not all, of the OLAs. An enterprise may have deployed multiple monitoring tools to monitor different components of the technology stack. Service operational policies should be independent of the monitoring tools and should inform an autonomic manager what actions should happen for each event.
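One way to keep policies independent of the monitoring tools is to normalize each tool's output into a common event shape before the policy is consulted. The sketch below assumes two made-up tools, "tool_a" and "tool_b", with invented payload formats. The policy table and action names are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class NormalizedEvent:
    """Tool-independent event shape that the policies are written against."""
    kind: str
    component: str
    severity: str  # "warning" or "critical"


def from_tool_a(payload: dict) -> NormalizedEvent:
    # Hypothetical "tool A" payload: {"alert": ..., "host": ..., "level": ...}
    return NormalizedEvent(payload["alert"], payload["host"], payload["level"].lower())


def from_tool_b(payload: dict) -> NormalizedEvent:
    # Hypothetical "tool B" payload: {"type": ..., "target": ..., "critical": bool}
    severity = "critical" if payload["critical"] else "warning"
    return NormalizedEvent(payload["type"], payload["target"], severity)


ADAPTERS: Dict[str, Callable[[dict], NormalizedEvent]] = {
    "tool_a": from_tool_a,
    "tool_b": from_tool_b,
}

# The policy refers only to normalized fields, never to a tool's own format.
POLICY: Dict[Tuple[str, str], str] = {
    ("disk_full", "critical"): "run_cleanup_script",
    ("disk_full", "warning"): "notify_operations",
}


def decide(tool: str, payload: dict) -> str:
    """Return the action name an autonomic manager would be told to start."""
    event = ADAPTERS[tool](payload)
    return POLICY.get((event.kind, event.severity), "no_action")


print(decide("tool_a", {"alert": "disk_full", "host": "db-1", "level": "CRITICAL"}))
print(decide("tool_b", {"type": "disk_full", "target": "db-2", "critical": False}))
```

With this arrangement, adding a new monitoring tool means adding one adapter, and the policy table stays unchanged.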

Policies provide a mechanism for automating monitoring and problem remediation. Instead of continually scanning metrics and events for the many conditions that could suggest problems, you can proactively define the events that are of greatest concern for keeping your applications running smoothly and then create policies that specify actions to start automatically when those events occur.