Following up on the 07/21/11 ##monitoringsucks IRC discussion on terminology, I thought I’d break down Zenoss as an example of how I believe the terminology applies.
Primitives
- metrics: This is the raw monitoring data. Zenoss supports a wide variety of collection techniques, and metrics are stored as “Data Points” in RRD.
- context: Zenoss has “Thresholds” attached to the “Data Points” which trigger “Events”. Thresholds may be based on exceeding a value, matching a specific value, falling within (or outside) a range, or Holt-Winters forecasting. The Event context contains the originating resource (device and IP), event state (new, acknowledged, suppressed), severity (0-5), event summary, specific details (message) and an event id.
- resource: The source of a metric. In Zenoss, Devices are the direct source of the metrics.
- event: These map directly to Zenoss’ Events, with the context and actions handled as part of the Event subsystem.
- action: Zenoss has a fairly rich Event system, with a wide variety of possible ‘actions’ when an Event enters the system (whether by a Threshold or some other source). It may be dropped, deduplicated, transformed, sent to history, trigger event commands or generate alerts. Correlation may be done with transforms in Python.
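To make the relationship between these primitives concrete, here is a minimal sketch of a metric crossing a range threshold and producing an event. The class names (`Metric`, `Event`, `RangeThreshold`) and their fields are hypothetical and chosen for illustration only; they are not Zenoss classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:
    """A raw data point: which resource it came from, what was measured, and when."""
    device: str
    name: str
    value: float
    timestamp: int


@dataclass
class Event:
    """Context attached to a metric that violated a threshold."""
    device: str
    severity: int        # 0-5, matching the severity range described above
    summary: str
    state: str = "new"   # new, acknowledged or suppressed


@dataclass
class RangeThreshold:
    """Hypothetical min/max threshold: violated when a value falls outside the range."""
    metric_name: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None
    severity: int = 3

    def check(self, metric: Metric) -> Optional[Event]:
        # Ignore metrics this threshold is not attached to.
        if metric.name != self.metric_name:
            return None
        too_low = self.minimum is not None and metric.value < self.minimum
        too_high = self.maximum is not None and metric.value > self.maximum
        if too_low or too_high:
            return Event(
                device=metric.device,
                severity=self.severity,
                summary=f"{metric.name} is {metric.value}, outside the allowed range",
            )
        return None


threshold = RangeThreshold("cpu_load", maximum=4.0, severity=4)
ok = threshold.check(Metric("web01", "cpu_load", 1.5, 1311206400))
bad = threshold.check(Metric("web01", "cpu_load", 7.25, 1311206460))
```

Here `ok` is `None` because the value is inside the range, while `bad` is an `Event` carrying the resource, severity and summary that an action could then act on.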
Components
Model
Zenoss tries to create a model of all the monitored infrastructure.
Individual resources are presented as “Devices”, something with an IP address that may or may not map to a single node.
Each Device belongs to a single “Device Class”, which determines how it is modeled and how and what metrics are collected.
“Modeling” in Zenoss is the attempt to discover all the attributes of a device (network interfaces, filesystems, installed hardware and software, etc.).
Modeling is performed by “Modeling Plugins” (attached to Device Classes or individual devices) which may use a variety of protocols to discover what is on a Device (SNMP, SSH, WMI, etc.).
Device Classes have “Monitoring Templates” attached to them that define how and what to monitor.
Modeling Plugins and Monitoring Templates may be reused, overwritten and extended by Device Classes.
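The reuse-and-override behavior can be sketched as a lookup that walks the Device Class hierarchy from the root down, letting more specific classes replace templates of the same name. This is a simplified illustration under my own assumptions: the `DeviceClass` type, the template names and their descriptions are all hypothetical, and the real resolution rules in Zenoss may differ.

```python
from typing import Dict, Optional


class DeviceClass:
    """Hypothetical stand-in for a node in the Device Class hierarchy."""

    def __init__(self, name: str, parent: Optional["DeviceClass"] = None):
        self.name = name
        self.parent = parent
        self.templates: Dict[str, str] = {}  # template name -> description

    def path(self) -> str:
        """Build a slash-separated path such as /Devices/Server/Linux."""
        if self.parent is None:
            return "/" + self.name
        return self.parent.path() + "/" + self.name

    def effective_templates(self) -> Dict[str, str]:
        """Merge templates from the root down so that children override parents."""
        merged = dict(self.parent.effective_templates()) if self.parent else {}
        merged.update(self.templates)
        return merged


devices = DeviceClass("Devices")
server = DeviceClass("Server", parent=devices)
linux = DeviceClass("Linux", parent=server)

devices.templates["Device"] = "generic SNMP uptime"
server.templates["FileSystem"] = "disk usage via SNMP"
linux.templates["Device"] = "load average and memory via SSH"  # overrides the root

resolved = linux.effective_templates()
```

A device placed in the Linux class would pick up the inherited `FileSystem` template unchanged and the overridden `Device` template, which is the main reason configuration is done per class rather than per device.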
Zenoss may be configured to automatically discover the nodes on a network range or subnet and create a network map of all the devices.
Devices may be added to a single “Location”, which may be mapped and presented in the UI with a Google map.
Devices may also belong to multiple Groups and/or Systems (essentially 2 separate tag hierarchies).
Collection
Zenoss supports a wide variety of availability and performance monitoring, from both active and passive sources.
Most protocols map to a specific daemon, responsible for collecting the data and pushing it into the system to be stored in RRD files.
RRD has a variety of ways for storing data, but the metrics are represented numerically with a timestamp.
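The core idea behind round-robin storage is that the store has a fixed size and the oldest samples are overwritten as new ones arrive. The sketch below shows only that idea with timestamped numeric values; it deliberately leaves out the consolidation and multiple archives that real RRD files provide, and the `RoundRobinSeries` class is a hypothetical name of my own.

```python
from collections import deque
from typing import Deque, List, Tuple


class RoundRobinSeries:
    """Keeps only the most recent `size` (timestamp, value) samples."""

    def __init__(self, size: int):
        # A deque with maxlen silently drops the oldest entry when full.
        self.samples: Deque[Tuple[int, float]] = deque(maxlen=size)

    def update(self, timestamp: int, value: float) -> None:
        self.samples.append((timestamp, value))

    def fetch(self, start: int, end: int) -> List[Tuple[int, float]]:
        """Return the samples whose timestamp lies in [start, end]."""
        return [(t, v) for t, v in self.samples if start <= t <= end]

    def average(self) -> float:
        """Average of the retained values; assumes at least one sample."""
        return sum(v for _, v in self.samples) / len(self.samples)


series = RoundRobinSeries(size=3)
for ts, value in [(0, 10.0), (300, 20.0), (600, 30.0), (900, 40.0)]:
    series.update(ts, value)
```

After the fourth update the sample taken at timestamp 0 has been overwritten, so storage stays constant no matter how long collection runs.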
Out of the box, Zenoss monitors:
- ICMP: ping (zenping)
- JMX: performance monitoring (zenjmx via the zenjmx ZenPack)
- TCP: port checks (zenstatus)
- SNMP: performance, process-monitoring and receive traps (zenperfsnmp, zenprocess, zentrap)
- SSH/Telnet: v1/v2 (zencommand)
- Syslog: receive syslog messages (zensyslog)
- WMI: Windows event log (zeneventlog)
- Zenoss can reuse Nagios and Cacti plugins as well
There are quite a few community extensions (ZenPacks) providing additional collection features.
Event Processing
As mentioned in the section on primitives, Zenoss has an Event system that handles context, events and actions.
Events may use their Devices, Device Classes, Locations, Systems and Groups for additional context.
Zenoss Events are stored in a MySQL database.
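Of the actions listed under primitives, deduplication is the easiest to illustrate: repeated occurrences of the same event collapse into one row with a counter instead of piling up. The following is a minimal in-memory sketch. The fingerprint fields used here (device, summary, severity) are an assumption made for illustration, and `EventStore` is a hypothetical class rather than anything in Zenoss.

```python
from typing import Dict, Tuple

Fingerprint = Tuple[str, str, int]


class EventStore:
    """In-memory stand-in for an event table; not the real MySQL schema."""

    def __init__(self) -> None:
        self.rows: Dict[Fingerprint, dict] = {}

    def add(self, device: str, summary: str, severity: int, timestamp: int) -> dict:
        # Assumed fingerprint: events agreeing on these fields are duplicates.
        key = (device, summary, severity)
        row = self.rows.get(key)
        if row is None:
            row = {
                "device": device,
                "summary": summary,
                "severity": severity,
                "count": 1,
                "first_seen": timestamp,
                "last_seen": timestamp,
            }
            self.rows[key] = row
        else:
            # Duplicate: bump the counter and remember the latest occurrence.
            row["count"] += 1
            row["last_seen"] = timestamp
        return row


store = EventStore()
store.add("web01", "ping failed", 5, 100)
store.add("web01", "ping failed", 5, 160)
store.add("db01", "disk 91% full", 3, 170)
```

The store ends up with two rows, and the `web01` row records a count of 2 along with its first and last timestamps.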
Analytics
Correlation of events is done in the Event system, written in Python.
Graphing of metrics is available with RRD graphs and all the variations supported therein (single/multiple values, stacked graphs, multiple devices).
The Event Console makes it easy to quickly search and filter specific event values.
Example reports are included but writing custom reports is difficult because of the disparate storage mechanisms for metrics, events and configuration.
Presentation
Zenoss has a featureful UI with an emphasis on monitoring thousands of nodes at a time and rolling up events in the Event Console.
There is a configurable dashboard with a number of portlets that may be added (reports, events, graphs, web sites, etc.).
It is a web app built mostly with JavaScript (ExtJS) on top of the Python Zope application server.
Lightweight ACLs are available and multiple users are supported for
Configuration
The user interface for Zenoss is focused on making it easy to manage monitoring thousands of devices by configuring their Device Classes and applying Devices to them (as opposed to individual devices).
While configuration is primarily through the UI, there are tools for bulk-loading devices from files or scripting as well.
There is a command-line interactive interface to the object database (zendmd) that can be used to query and alter the monitored infrastructure.
Storage
Metrics are stored in RRD.
Events are stored in MySQL.
Configuration and relationships between objects are stored in the Zope Object Database (ZODB).
API
Zenoss has a published JSON API for remote interaction, with examples in Python and Ruby (most of the UI uses these APIs).
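As a rough idea of what a remote call looks like, the sketch below only builds a request (URL and JSON body) and does not send it. The envelope fields, the router-to-endpoint mapping, the `getDevices` method and the host name are assumptions based on my recollection of the published examples, so check them against the API documentation for your version before relying on them. `build_request` is a hypothetical helper.

```python
import json
from typing import Any, List, Tuple

# Assumed router-to-endpoint mapping; verify against the published API docs.
ROUTER_ENDPOINTS = {
    "DeviceRouter": "device_router",
    "EventsRouter": "evconsole_router",
}


def build_request(base_url: str, router: str, method: str,
                  data: List[Any], tid: int = 1) -> Tuple[str, str]:
    """Return the (url, json_body) pair for one remote call, without sending it."""
    url = f"{base_url.rstrip('/')}/zport/dmd/{ROUTER_ENDPOINTS[router]}"
    body = json.dumps([{
        "action": router,   # which router handles the call
        "method": method,   # which method on that router
        "data": data,       # positional arguments for the method
        "type": "rpc",
        "tid": tid,         # transaction id, used to match the response
    }])
    return url, body


# Hypothetical host; the uid points at the root of the device hierarchy.
url, body = build_request(
    "http://zenoss.example.com:8080",
    "DeviceRouter",
    "getDevices",
    [{"uid": "/zport/dmd/Devices", "params": {}}],
)
```

The body would then be sent as an authenticated POST with a JSON content type using whatever HTTP client you prefer.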
There is also published Developer Documentation for extending and writing plugins.
The zendmd tool may also be used to interact with Zenoss programmatically via scripting.
Conclusions
Zenoss tries to provide a framework for monitoring thousands of machines that is flexible enough to contain network devices, servers and services. The terminology and taxonomy that emerged from the IRC discussion fit fairly well; hopefully we can at least attempt to compare apples to apples when it comes to discussing different monitoring implementations.
It would probably be worthwhile to make a future post breaking down the strengths and weaknesses of Zenoss’ approach as well as which components would be easiest to reuse within other systems.