3.2 Autonomic Telemetry

How can actionable operational and business data be collected across the different components of an IT system for business and operational monitoring?

3.2.1 Problem

Telemetry is an automated communications process in which measurements and other data are collected at remote or inaccessible points and transmitted to receiving equipment for monitoring.

3.2.2 Solution

The <AutonomicInstrumenter> connects the IT system's instrumentation agents through a Message Channel: an agent sends information to the channel, and a controller reads that information from the channel and decides whether to persist it in a database.
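
The agent, channel, and controller roles described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class names (TelemetryEvent, InstrumentationAgent, Controller), the in-process queue standing in for the Message Channel, the list standing in for the database, and the persistence rule (keep latency readings above 500 ms) are all assumptions made for the example.

```python
import queue
import time
from dataclasses import dataclass, field


@dataclass
class TelemetryEvent:
    """One measurement sent by an agent (hypothetical shape)."""
    source: str
    name: str
    value: float
    timestamp: float = field(default_factory=time.time)


class InstrumentationAgent:
    """Sends events to the channel; knows nothing about storage."""

    def __init__(self, source, channel):
        self.source = source
        self.channel = channel

    def emit(self, name, value):
        self.channel.put(TelemetryEvent(self.source, name, value))


class Controller:
    """Reads events from the channel and decides which ones to persist."""

    def __init__(self, channel, should_persist):
        self.channel = channel
        self.should_persist = should_persist
        self.store = []  # stand-in for the telemetry database

    def drain(self):
        """Process every queued event; return how many were persisted."""
        persisted = 0
        while True:
            try:
                event = self.channel.get_nowait()
            except queue.Empty:
                return persisted
            if self.should_persist(event):
                self.store.append(event)
                persisted += 1


# An in-process queue plays the role of the Message Channel here.
channel = queue.Queue()
agent = InstrumentationAgent("checkout-service", channel)
agent.emit("latency_ms", 120.0)
agent.emit("latency_ms", 950.0)

# Assumed rule for the sketch: persist only slow requests.
controller = Controller(channel, should_persist=lambda e: e.value > 500)
persisted_count = controller.drain()
```

Because the agent only knows about the channel, the persistence rule or the storage backend can change without touching the instrumented components.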

Create a telemetry layer in your app. This allows you to easily switch instrumentation providers at any time, if the need arises, and encapsulates any computation or bucketing logic behind the layer’s interface so that it doesn’t interfere with the rest of the app. A simple enablement flag that you check within each method of your layer also makes it very easy to turn telemetry on and off (as when the user opts out) without touching any other part of your code.
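
One way such a telemetry layer could look is sketched below. The names (TelemetryProvider, InMemoryProvider, Telemetry, track_page_load), the event name, and the bucket boundaries are hypothetical and chosen only for illustration; a real provider would forward events to whichever instrumentation service is in use.

```python
from typing import Protocol


class TelemetryProvider(Protocol):
    """Anything that can receive an event; swap implementations freely."""

    def send(self, event_name: str, properties: dict) -> None: ...


class InMemoryProvider:
    """Hypothetical provider that just records events, useful for tests."""

    def __init__(self):
        self.events = []

    def send(self, event_name, properties):
        self.events.append((event_name, properties))


class Telemetry:
    """The telemetry layer: the rest of the app calls only these methods."""

    def __init__(self, provider, enabled=True):
        self.provider = provider
        self.enabled = enabled  # flip to False when the user opts out

    @staticmethod
    def _bucket_duration(seconds):
        # Bucketing logic stays behind the layer's interface.
        if seconds < 1:
            return "<1s"
        if seconds < 10:
            return "1-10s"
        return ">=10s"

    def track_page_load(self, page, seconds):
        if not self.enabled:  # enablement flag checked in each method
            return
        self.provider.send(
            "page_load",
            {"page": page, "duration": self._bucket_duration(seconds)},
        )


provider = InMemoryProvider()
telemetry = Telemetry(provider)
telemetry.track_page_load("home", 0.4)   # recorded
telemetry.enabled = False                # user opts out
telemetry.track_page_load("settings", 3.2)  # silently dropped
```

Switching instrumentation providers then means passing a different object to Telemetry, and opting out is a single flag change.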

3.2.3 Application

The telemetry database is where events, health rules for operations, business rules for metering and usage patterns are stored.
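
As a rough sketch of how events and health rules might sit side by side in such a database, the example below uses an in-memory SQLite database. The table names, columns, and sample values are assumptions made for illustration and are not prescribed by the pattern.

```python
import sqlite3

# In-memory database standing in for the telemetry database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        id     INTEGER PRIMARY KEY,
        source TEXT NOT NULL,
        metric TEXT NOT NULL,
        value  REAL NOT NULL
    );
    CREATE TABLE health_rules (
        metric    TEXT PRIMARY KEY,
        max_value REAL NOT NULL
    );
    """
)

# Sample events (hypothetical data).
conn.executemany(
    "INSERT INTO events (source, metric, value) VALUES (?, ?, ?)",
    [
        ("checkout", "latency_ms", 120.0),
        ("checkout", "latency_ms", 950.0),
        ("search", "cpu_pct", 99.0),
    ],
)

# One operational health rule: latency should stay at or below 500 ms.
conn.execute("INSERT INTO health_rules VALUES (?, ?)", ("latency_ms", 500.0))

# Events that violate a stored health rule; metrics without a rule are ignored.
violations = conn.execute(
    """
    SELECT e.source, e.metric, e.value
    FROM events AS e
    JOIN health_rules AS r ON r.metric = e.metric
    WHERE e.value > r.max_value
    ORDER BY e.id
    """
).fetchall()
```

Keeping the rules in the same store as the events lets an operational check be expressed as a single query.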

The business and operational questions that you want to answer through telemetry help decide which types of telemetry events to track.

3.2.4 Examples / Use-cases

Examples of business questions that the telemetry pattern should answer include:

  • How are customers really engaging with the product or service? Do people use the features you thought they wanted?

  • In what areas of the product are users spending (or not spending) their time? What do the usage patterns look like over a period of time?

  • What is the conversion rate of users from trial version to purchased product?

  • What form factors do the users prefer?

  • Are there usage patterns that you can reward or discourage in some way to drive behavior?

Examples of operational questions we can answer during problem resolution include:

  • What evidence do we have from our monitoring that a problem is actually occurring?

  • What are the relevant events and changes in our applications and environments that could have contributed to the problem?

  • What hypotheses can we formulate to confirm the link between the proposed causes and effects?

  • How can we prove which of these hypotheses are correct and successfully effect a fix?