##MonitoringSucks Terminology (first stab)
Posted by mattray on July 12, 2011
Inspired by the recent ##monitoringsucks discussions, I thought I’d add my thoughts on creating a common set of terminology so we can start making progress.
There are a multitude of monitoring solutions out there, but most can be categorized and described with the following basic terminology and components:
Each of the major components could be a separate, single-purpose application. With consistent APIs and interchangeable implementations, best-of-breed solutions could arise. A catalog of monitoring tools could be cultivated and maybe monitoring wouldn’t suck as much.
This is the gathering of raw data that we care about for monitoring. There are 3 components to Collection:
The data points that you want monitored. These can be OIDs, metrics, REST calls or whatever. They may be performance and/or availability, active and/or passive. This is the raw data.
Metrics have a range of legitimate values, thresholds are the limits on the legitimate values. These may be on individual or combinations of metrics.
The actual process of gathering data varies depending on the metrics. There are a wide variety of monitoring protocols (SNMP, WMI, Syslog, JMX, etc.), we need to document how we collect the metrics.
This is the representation of what you are collecting, a collection of metrics and thresholds. The Model is a collection of Nodes. A Node is typically a single machine, but may cover multiple of metrics from separate machines or services (think services and clusters) depending on the implementation. There may be no Model whatsoever (lists of metrics checks).
Events are what happens when a threshold is violated. They may be suppressed, de-duplicated and possibly correlated with other events. There may be dependencies between Nodes or correlations with other Events, implementations may vary.
Separate from Events, alerting is the means to notify people and systems that an Event requires attention. There are numerous mechanisms for alerting (email, paging, asterisk, log, etc.) and ideally the Alerting component has the concept of users, schedules and escalation rules.
There are 2 pieces to the Presentation component:
The Monitoring solution may or may not have/need a UI. This is visual representation of the Model, Events and possibly Alerts. There may be a Dashboard rolling up different views into the information captured by the monitoring solution.
Ideally the data captured by the monitoring solution is available for whatever reporting you want to do. It may be in SQL databases, RRD or some other format but the ability to access the data and create new reports is essential.
Ideally every component should have published APIs for interacting with programatically and/or remotely. Without an API, monitoring tools become less and less relevant in the face of increasing automation.
As with APIs, all monitoring framework components need to be easily automated by configuration tools.
Where metrics are stored. There are lots of choices, they should be accessible for reports and via an API.