The Shining Path of Least Resistance

LeastResistance.Net

##MonitoringSucks Terminology (first stab)

Posted by mattray on July 12, 2011

Inspired by the recent ##monitoringsucks discussions, I thought I’d add my thoughts on creating a common set of terminology so we can start making progress.

There are a multitude of monitoring solutions out there, but most can be categorized and described with the following basic terminology and components:

Each of the major components could be a separate, single-purpose application. With consistent APIs and interchangeable implementations, best-of-breed solutions could arise. A catalog of monitoring tools could be cultivated and maybe monitoring wouldn’t suck as much.

Collection

This is the gathering of raw data that we care about for monitoring. There are 3 components to Collection:

Metrics

The data points that you want monitored. These can be OIDs, metrics, REST calls or whatever. They may be performance and/or availability, active and/or passive. This is the raw data.

Thresholds

Metrics have a range of legitimate values; thresholds are the limits on those values. A threshold may apply to an individual metric or to a combination of metrics.
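To make the idea concrete, here is a minimal sketch of thresholds on individual metrics and on a combination of metrics. The names (Threshold, violated_by, combination_violated) and the "all members must be violated" rule for combinations are my own assumptions for illustration, not taken from any existing tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Threshold:
    """Limits on the legitimate values of one metric (illustrative names)."""
    metric: str
    minimum: Optional[float] = None  # None means no lower limit
    maximum: Optional[float] = None  # None means no upper limit

    def violated_by(self, value: float) -> bool:
        """Return True when value falls outside the legitimate range."""
        if self.minimum is not None and value < self.minimum:
            return True
        if self.maximum is not None and value > self.maximum:
            return True
        return False


def combination_violated(thresholds: List[Threshold],
                         sample: Dict[str, float]) -> bool:
    """Assumed rule: a combination fires only if every member is violated.

    A metric missing from the sample counts as not violated.
    """
    return all(
        t.metric in sample and t.violated_by(sample[t.metric])
        for t in thresholds
    )
```

A different implementation could just as reasonably treat a combination as "any member violated"; the point is only that the limits live apart from the raw data.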

Collecting

The actual process of gathering data varies depending on the metrics. There is a wide variety of monitoring protocols (SNMP, WMI, Syslog, JMX, etc.), so we need to document how we collect each metric.
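One way to keep collection interchangeable is a small common interface with one implementation per protocol. The sketch below is hypothetical: Collector, StaticCollector and describe are names I made up, and StaticCollector just serves canned values where a real SNMP, WMI or JMX collector would talk to the network.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Collector(ABC):
    """Hypothetical common interface; one subclass per protocol."""

    protocol = "unknown"

    @abstractmethod
    def collect(self, target: str, metrics: List[str]) -> Dict[str, float]:
        """Return the current value of each requested metric on target."""


class StaticCollector(Collector):
    """Stand-in for a real protocol collector: serves canned values."""

    protocol = "static"

    def __init__(self, values: Dict[str, Dict[str, float]]) -> None:
        self._values = values

    def collect(self, target: str, metrics: List[str]) -> Dict[str, float]:
        known = self._values.get(target, {})
        # Metrics the target does not expose are simply left out.
        return {name: known[name] for name in metrics if name in known}


def describe(collector: Collector, metrics: List[str]) -> str:
    """Document how a set of metrics is collected."""
    return f"{', '.join(metrics)} via {collector.protocol}"
```

For example, describe(StaticCollector({}), ["cpu_percent", "load1"]) gives a one-line record of which protocol gathers which metrics.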

Model

This is the representation of what you are collecting: a collection of metrics and thresholds. The Model is a collection of Nodes. A Node is typically a single machine, but may cover multiple metrics from separate machines or services (think services and clusters), depending on the implementation. There may be no Model whatsoever (just lists of metric checks).
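As a rough sketch of that structure, assuming a Node is just a named bundle of (source, metric) pairs plus thresholds, the Model could look like this. The class and field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A named group of metrics; usually one machine, sometimes a cluster."""
    name: str
    # (source, metric) pairs, where source is a machine or service name
    metrics: List[Tuple[str, str]] = field(default_factory=list)
    # metric name -> (low, high) range of legitimate values
    thresholds: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def sources(self) -> List[str]:
        """Distinct machines or services this Node draws metrics from."""
        return sorted({source for source, _ in self.metrics})


@dataclass
class Model:
    """The Model is a collection of Nodes, keyed by name."""
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add(self, node: Node) -> None:
        self.nodes[node.name] = node
```

A single-machine Node has one source; a Node for a cluster lists the same metric from several sources.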

Events

Events are what happens when a threshold is violated. They may be suppressed, de-duplicated and possibly correlated with other Events. There may also be dependencies between Nodes; implementations vary.
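Here is a minimal sketch of event creation with de-duplication, assuming a single upper limit per metric and that repeated violations of the same (node, metric) pair collapse into one open Event with a counter. Event, EventStore and record are hypothetical names, and suppression and correlation are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Event:
    node: str
    metric: str
    value: float
    count: int = 1  # how many times this violation has been seen


class EventStore:
    """Keeps one open Event per (node, metric) pair."""

    def __init__(self) -> None:
        self._open: Dict[Tuple[str, str], Event] = {}

    def record(self, node: str, metric: str, value: float,
               maximum: float) -> Optional[Event]:
        key = (node, metric)
        if value <= maximum:
            # Back within the legitimate range: clear any open Event.
            self._open.pop(key, None)
            return None
        existing = self._open.get(key)
        if existing is not None:
            # De-duplicate: update the open Event instead of adding one.
            existing.count += 1
            existing.value = value
            return existing
        event = Event(node, metric, value)
        self._open[key] = event
        return event

    def open_events(self) -> List[Event]:
        return list(self._open.values())
```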

Alerting

Separate from Events, alerting is the means to notify people and systems that an Event requires attention. There are numerous mechanisms for alerting (email, paging, Asterisk, log, etc.), and ideally the Alerting component has the concept of users, schedules and escalation rules.
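Users, schedules and escalation rules can be sketched in a few lines. Everything here is an assumption for illustration: each rule names a user, a channel, the Event age at which it kicks in, and the hours that user is on duty.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EscalationRule:
    after_minutes: int   # Event age at which this rule kicks in
    user: str
    channel: str         # e.g. "email" or "page"
    start_hour: int = 0  # schedule: on duty from start_hour ...
    end_hour: int = 24   # ... up to but not including end_hour

    def applies(self, age_minutes: int, hour: int) -> bool:
        on_duty = self.start_hour <= hour < self.end_hour
        return on_duty and age_minutes >= self.after_minutes


def notifications(rules: List[EscalationRule], age_minutes: int,
                  hour: int) -> List[Tuple[str, str]]:
    """Who to notify, and how, for an Event of the given age at this hour."""
    return [(r.user, r.channel) for r in rules if r.applies(age_minutes, hour)]
```

With one rule emailing a user immediately and another paging a second user after 15 minutes during office hours, the list of recipients grows as the Event ages and shrinks again outside the second user's schedule.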

Presentation

There are 2 pieces to the Presentation component:

UI

The monitoring solution may or may not have or need a UI. This is the visual representation of the Model, Events and possibly Alerts. There may be a Dashboard rolling up different views of the information captured by the monitoring solution.

Reporting

Ideally the data captured by the monitoring solution is available for whatever reporting you want to do. It may be in SQL databases, RRD or some other format, but the ability to access the data and create new reports is essential.
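As an example of why open access to the data matters, here is a sketch of an ad hoc report over samples kept in SQL, using Python's built-in sqlite3 module. The samples table and its columns are an assumed schema, not one from any particular monitoring tool.

```python
import sqlite3
from typing import Dict, Iterable, Tuple

Sample = Tuple[str, str, int, float]  # (node, metric, timestamp, value)


def build_store(samples: Iterable[Sample]) -> sqlite3.Connection:
    """Load samples into an in-memory database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (node TEXT, metric TEXT, ts INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", samples)
    return conn


def average_by_node(conn: sqlite3.Connection, metric: str) -> Dict[str, float]:
    """A new report nobody planned for: average of one metric per node."""
    rows = conn.execute(
        "SELECT node, AVG(value) FROM samples "
        "WHERE metric = ? GROUP BY node ORDER BY node",
        (metric,),
    )
    return dict(rows.fetchall())
```

If the data is reachable like this, a new report is one query away; if it is locked inside the tool, it is not.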

Cross-cutting Concerns

API

Ideally every component should have a published API for interacting with it programmatically and/or remotely. Without an API, monitoring tools become less and less relevant in the face of increasing automation.

Configuration

As with APIs, all monitoring framework components need to be easily automated by configuration tools.

Storage

Where metrics are stored. There are lots of choices, but whichever is used, the stored data should be accessible for reports and via an API.

Posted in monitoring, monitoringsucks, zenoss