The Shining Path of Least Resistance

LeastResistance.Net

Chef for OpenStack Status 11/2

Posted by mattray on November 2, 2012

Getting back into the swing of semi-regular updates. Last week was the Chef Developer Summit, with lots of great conversations and quite a few people interested in OpenStack. This week was mostly catching up, trying to clean up a few Essex leftovers before moving to Folsom.

  • I bumped the Essex cookbook versions to 2012.1.0 to sync with OpenStack versioning, per feedback at the OpenStack Summit. I tagged everything for Essex and merged to master.
  • Added an ‘lxc’ role to enable using LXC. It’s just an attribute and it just worked, so awesome (there’s a rough sketch of such a role after this list).
  • Added placeholder cookbooks for quantum, cinder and ceilometer. These have the suffix “-cookbook” in my GitHub; there was some discussion about renaming the cookbook repos for the other 5 projects, so does anyone feel strongly?
  • Updated all the Community cookbook dependencies and retested (apt, erlang, database, ntp, apache2, mysql, rabbitmq, openssh).
  • Released a new version of pxe_dust which enforces assigning the PXE-booted NIC as eth0.
  • Trying to coordinate Chef support for the bare-metal provisioning tool Razor, ping me if you’re interested.
  • Canceled the NYC Chef for OpenStack Hack Day and NYC Chef Meetup.
  • Preparing for the Opscode/DreamHost webinar “Automating OpenStack and Ceph at DreamHost with Private Chef”.
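
For reference, flipping a hypervisor with a role really can be a single attribute. Here is a rough sketch of what such a role could look like; the attribute path (node['nova']['libvirt']['virt_type']) is my assumption about the cookbook’s naming, not pulled from the actual repo:

name "lxc"
description "Switch nova-compute from KVM to LXC"

override_attributes(
  "nova" => {
    "libvirt" => {
      "virt_type" => "lxc"    # assumed attribute path; adjust to the cookbook's actual attribute
    }
  }
)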

Next week I’ll be in Chicago presenting at the CME Group Technology Conference, ping me if you’re in Chicago and want to catch up. My OpenStack goals are to merge in the outstanding pull requests and resync with the latest Folsom work from rcbops, hopefully merging in some more branches.

Getting Started with the Chef for OpenStack docs

Posted by mattray on October 23, 2012

I have primarily been focused on documentation lately, and http://github.com/mattray/openstack-chef-docs is the repository. Since there is so much interaction between the various components, prerequisites and cookbooks, I felt a unified document format would best serve our needs. The various markdown READMEs and other documentation are slowly migrating to this single repository so everything can be kept updated in one location and linked to the various components.

The docs are in reStructuredText and use Sphinx, which is compatible with the http://docs.openstack.org source docs. The license matches the OpenStack documentation’s: Apache V2 and the Creative Commons Attribution ShareAlike 3.0 License. Opscode has standardized on this format for our own documentation, and in the near future it will be merged upstream with the official Opscode documentation.
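
Building the docs locally is a standard Sphinx workflow. A minimal sketch, assuming the repository follows the usual sphinx-quickstart layout with a Makefile (adjust if the layout differs):

git clone http://github.com/mattray/openstack-chef-docs
cd openstack-chef-docs
pip install sphinx
make html

The rendered HTML ends up under the Sphinx build directory (typically _build/html).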

The evolving document is currently broken into these 6 components:
* Architecture – overview of the architecture for Chef for OpenStack.
* Prerequisites – the hardware, network and operating system requirements.
* Installation – how to install Chef for OpenStack.
* Example Deployment – example configuration of a small test lab.
* Knife-OpenStack – using the OpenStack plugin for Knife for provisioning and managing instances.
* Additional Resources – additional useful information and links related to Chef for OpenStack.

The docs are just getting started, with lots of placeholders, but I’m actively writing. Please feel free to send corrections and additional details to help fill things out. There will be a permanent URL for the docs online soon; here is a temporary link:
http://15.185.230.54/

Chef for OpenStack Status 10/22

Posted by mattray on October 23, 2012

I’ve decided to start cross-posting my status emails for the Chef for OpenStack project to help spread the word. The Chef for OpenStack mailing list is here; please join: http://groups.google.com/group/opscode-chef-openstack

I apologize for the lack of updates, but I come bearing lots of news. For a quick summary of the state of Chef for OpenStack, check out this deck from my presentation at the OpenStack Summit:

http://www.slideshare.net/mattray/chef-for-openstack-openstack-fall-2012-summit

Speaking of the OpenStack Summit, it was quite productive despite my not getting to attend enough sessions due to meetings and booth duty. Monday there was a session on “Upstreaming Chef Cookbooks”, which was essentially a meetup of folks working on Chef for OpenStack. We compared notes, and there is quite a lot of work being done in the various branches maintained outside the Opscode one; I’m looking forward to merging as much of that work as possible. Tuesday I gave my general Chef for OpenStack presentation (linked above), and later that day we had a “DevOps Panel” with an engaging discussion of the various issues facing deployers of OpenStack. I’ll link up videos as they become available.

Some short-term takeaways from the Summit were that there is a tremendous amount of development effort I was unaware of, and that the pace is about to pick up substantially. DreamHost and AT&T have a number of patches to be merged, and work has already started on Folsom by several folks. The general consensus was to move the focus to Folsom now that it’s out; the cookbooks have been tagged and the repos have all been merged back to master. The ‘essex’ branches are working and have been pushed to the Community site for direct download, and of course they are still available if you want to continue development.

There were so many great discussions and ideas shared; I’m really looking forward to the work ahead. I’ll try to post more frequently so the level of engagement continues to improve.

New Chef BitTorrent Cookbook

Posted by mattray on January 9, 2012

BitTorrent is a well-established protocol and tool for peer-to-peer distribution of files. It is frequently used in large-scale infrastructures for distributing content in a highly efficient and exceptionally fast manner. I decided to write a general-purpose Chef bittorrent cookbook providing BitTorrent resources via Lightweight Resource Providers (LWRPs). While there was already a very useful transmission cookbook for downloading torrent files, I wanted to create LWRPs and a set of recipes that made it simple to seed and peer a file with minimal interaction.

Even though there are a tremendous number of BitTorrent applications available, I had 2 requirements: trackerless seeding and the ability to be easily daemonized. After researching quite a few clients, aria2 was found to have the required features and turned out to work quite well. Trackerless torrents proved to be a poorly supported and/or documented feature in most tools; the key to using them with aria2 was understanding that the seeding node needs to expose the distributed hash table (DHT) on a single, known port for ease of use, and that this needs to be included in the creation of the torrent itself.

TESTING

In testing and benchmarking with a 4.2 gigabyte file (CentOS 6.2 DVD 1) on EC2 with 11 m1.smalls (1 seed and 10 peers), there were a number of interesting results. The chef-client run averaged about 8 and a half minutes (with download speeds around 11.4MiB/s) for the 10 nodes, and this stayed fairly constant when moving to 20 nodes.

A separate test was done distributing the file with Apache as well. Not surprisingly, the more nodes that were added, the slower the downloads became, to the point where some failed because of timeouts. Apache could probably be configured to handle the scenario better, but avoiding a single source is exactly why we use a peer-to-peer solution. The average chef-client run was about 20 minutes for 10 nodes, twice as slow as the same test with 3 nodes.

There are definite bottlenecks on EC2, either in the filesystem or at the network level, which is to be expected in a virtualized environment. File allocations on some machines take an order of magnitude longer than on others, and some nodes are extremely slow at networking. With the larger test case of 20 nodes, some were even faster than with 10 nodes, while a few outliers were exceptionally slow (as seen with any large sample of EC2 nodes). On my gigabit network with physical nodes, depending on the downloading drive (SSD or RAID-0), I doubled these speeds with just 5 nodes. This would indicate that the drives or filesystems are the bottleneck.

TRACKERLESS TORRENTS WITH DHT

The first use case I wanted to get working was trackerless torrents. To create the torrent we use the mktorrent package. For trackerless seeding from 10.0.0.10, we use the following command:

mktorrent -d -c "Generated with Chef" -a node://10.0.0.10:6881 -o test0.torrent mybigfile.tgz

To run aria2 as a trackerless seeder in the foreground on 10.0.0.10, it is important to identify the DHT and listening ports (UDP and TCP respectively).

aria2c -V --summary-interval=0 --seed-ratio=0.0 --dht-file-path=/tmp/dht.dat --dht-listen-port 6881 --listen-port 6881 --dir=/tmp/ test0.torrent

To run aria2 as a peer of a trackerless torrent seeded from 10.0.0.10, you have to specify “--dht-entry-point”.

aria2c -V --summary-interval=0 --seed-ratio=0.0 --dht-file-path=/tmp/dht.dat --dht-listen-port 6881 --listen-port 6881 --dht-entry-point=10.0.0.10:6881 --dir=/tmp/ test0.torrent

This technique works, but has the limiting factor of needing to transfer the torrent file between machines. This is solved in the bittorrent cookbook by storing the torrent file in a data bag (future versions of the cookbook may support magnet URIs to remove the need for a file completely).
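
To illustrate the data bag idea (this is not the cookbook’s actual code; the bag name, item id and base64 encoding are assumptions for the sketch), the torrent file can be shuttled between nodes roughly like this:

# Seeder side: stash the generated .torrent in a data bag item (names assumed)
require "base64"

item = Chef::DataBagItem.new
item.data_bag("bittorrent")
item.raw_data = {
  "id" => "mybigfile",
  "torrent" => Base64.encode64(File.read("/tmp/test0.torrent"))
}
item.save

# Peer side: pull the item back down and write the .torrent locally
torrent = data_bag_item("bittorrent", "mybigfile")
file "/tmp/test0.torrent" do
  content Base64.decode64(torrent["torrent"])
end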

LIGHTWEIGHT RESOURCE PROVIDERS

Once I had identified the commands that worked for these operations, they needed to be encapsulated in a bittorrent cookbook with LWRPs for creating torrents, seeding and peering the files.

bittorrent_torrent: Given a file, it creates a .torrent file for sharing a local file or directory via the BitTorrent protocol (http://en.wikipedia.org/wiki/BitTorrent).

bittorrent_seed: Shares a local file via the BitTorrent protocol.

bittorrent_peer: Downloads the file or files specified by a torrent via the BitTorrent protocol. Update notifications are triggered when a blocking download completes and on the initiation of seeding. There are also options for whether to block on the download and whether to continue seeding after the download.
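
A rough sketch of how these resources might be wired into a recipe; the attribute names, defaults and the use of the file path as the resource name are assumptions based on the descriptions above rather than the cookbook’s actual interface:

# create a .torrent for the file and start seeding it from this node
bittorrent_torrent "/data/mybigfile.tgz"
bittorrent_seed "/data/mybigfile.tgz"

# on the downloading nodes: fetch the file, then stop once the download completes
bittorrent_peer "/data/mybigfile.tgz" do
  seed false    # assumed option, mirroring the recipe attribute described below
end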

RECIPES

The recipes are provided as an easy way to use BitTorrent to share and download files simply by passing the path and filename via the node['bittorrent']['file'] and node['bittorrent']['path'] attributes. There are recipes to seed, to peer and to stop the seeding and peering.

bittorrent::seed: given the node['bittorrent']['file'] and node['bittorrent']['path'] attributes, it will create a .torrent file for the file(s) to be distributed, store it in the `bittorrent` data bag and start seeding the distribution of the file(s).

bittorrent::peer: given the node['bittorrent']['file'] and node['bittorrent']['path'] attributes, it will look for a torrent in the bittorrent data bag that provides that file. If one exists, it will download the file and continue seeding depending on the value of the node['bittorrent']['seed'] attribute (false by default).

bittorrent::stop: stops either the seeding or peering of a file.
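
As a usage example, a role that points a group of nodes at a shared file might look something like this (a sketch using the attribute names above; the file and path values are placeholders):

name "bittorrent-peer-example"
description "Fetch mybigfile.tgz over BitTorrent and keep seeding it"

override_attributes(
  "bittorrent" => {
    "file" => "mybigfile.tgz",
    "path" => "/data",
    "seed" => true    # keep seeding after the download completes
  }
)

run_list(
  "recipe[bittorrent::peer]"
)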

FUTURE

I plan on continuing development on this cookbook as it gets used in production environments, and would appreciate any feedback or patches. Right now it is Ubuntu-only, but adding support for RHEL/CentOS is on the short-term roadmap and requires finding sources for the `mktorrent` and `aria2` packages. Using magnet URIs instead of a torrent file would probably be more efficient as well, since it would remove the distribution of the torrent file and allow the use of Chef search to specify multiple seeders to prime the DHT.

Spiceweasel 1.0

Posted by mattray on January 3, 2012

One of the more useful things I’ve written since I’ve been at Opscode (over a year now) is a tool called Spiceweasel. Spiceweasel processes a simple YAML (or JSON) manifest that describes how to deploy Chef-managed infrastructure with the command-line tool knife. It is fairly simple, but it fills a useful niche by making it easy to document your infrastructure’s dependencies and how to deploy it, in a file that may be managed with version control. Spiceweasel also attempts to validate that the dependencies you list in your manifest all exist in the repository and that all of their dependencies are included as well.

Examples

The https://github.com/mattray/ravel-repo repository provides a working example of bootstrapping a Chef repository with Spiceweasel. The examples directory on GitHub is slowly gaining more examples based on the Chef Quick Starts.

Given the example YAML file example.yml:

cookbooks:
- apache2:
- apt:
  - 1.2.0
- mysql:

environments:
- development:
- qa:
- production:

roles:
- base:
- iisserver:
- monitoring:
- webserver:

data bags:
- users:
  - alice
  - bob
  - chuck
- data:
  - "*"
- passwords:
  - secret secret_key
  - mysql
  - rabbitmq

nodes:
- serverA:
  - role[base]
  - -i ~/.ssh/mray.pem -x user --sudo -d ubuntu10.04-gems
- serverB serverC:
  - role[base]
  - -i ~/.ssh/mray.pem -x user --sudo -d ubuntu10.04-gems -E production
- ec2 4:
  - role[webserver] recipe[mysql::client]
  - -S mray -i ~/.ssh/mray.pem -x ubuntu -G default -I ami-7000f019 -f m1.small
- rackspace 3:
  - recipe[mysql],role[monitoring]
  - --image 49 --flavor 2
- windows_winrm winboxA:
  - role[base],role[iisserver]
  - -x Administrator -P 'super_secret_password'
- windows_ssh winboxB winboxC:
  - role[base],role[iisserver]
  - -x Administrator -P 'super_secret_password'

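Running Spiceweasel against this manifest is a one-liner (assuming the spiceweasel gem is installed and example.yml sits in the root of your Chef repository):

gem install spiceweasel
spiceweasel example.yml
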
Spiceweasel generates the following knife commands:

knife cookbook upload apache2
knife cookbook upload apt
knife cookbook upload mysql
knife environment from file development.rb
knife environment from file qa.rb
knife environment from file production.rb
knife role from file base.rb
knife role from file iisserver.rb
knife role from file monitoring.rb
knife role from file webserver.rb
knife data bag create users
knife data bag from file users alice.json
knife data bag from file users bob.json
knife data bag from file users chuck.json
knife data bag create data
knife data bag create passwords
knife data bag from file passwords mysql.json --secret-file secret_key
knife data bag from file passwords rabbitmq.json --secret-file secret_key
knife bootstrap serverA -i ~/.ssh/mray.pem -x user --sudo -d ubuntu10.04-gems -r 'role[base]'
knife bootstrap serverB -i ~/.ssh/mray.pem -x user --sudo -d ubuntu10.04-gems -E production -r 'role[base]'
knife bootstrap serverC -i ~/.ssh/mray.pem -x user --sudo -d ubuntu10.04-gems -E production -r 'role[base]'
knife ec2 server create -S mray -i ~/.ssh/mray.pem -x ubuntu -G default -I ami-7000f019 -f m1.small -r 'role[webserver],recipe[mysql::client]'
knife ec2 server create -S mray -i ~/.ssh/mray.pem -x ubuntu -G default -I ami-7000f019 -f m1.small -r 'role[webserver],recipe[mysql::client]'
knife ec2 server create -S mray -i ~/.ssh/mray.pem -x ubuntu -G default -I ami-7000f019 -f m1.small -r 'role[webserver],recipe[mysql::client]'
knife ec2 server create -S mray -i ~/.ssh/mray.pem -x ubuntu -G default -I ami-7000f019 -f m1.small -r 'role[webserver],recipe[mysql::client]'
knife rackspace server create --image 49 --flavor 2 -r 'recipe[mysql],role[monitoring]'
knife rackspace server create --image 49 --flavor 2 -r 'recipe[mysql],role[monitoring]'
knife rackspace server create --image 49 --flavor 2 -r 'recipe[mysql],role[monitoring]'
knife bootstrap windows winrm winboxA -x Administrator -P 'super_secret_password' -r 'role[base],role[iisserver]'
knife bootstrap windows ssh winboxB -x Administrator -P 'super_secret_password' -r 'role[base],role[iisserver]'
knife bootstrap windows ssh winboxC -x Administrator -P 'super_secret_password' -r 'role[base],role[iisserver]'

Cookbooks

The `cookbooks` section of the manifest currently supports `knife cookbook upload FOO` where `FOO` is the name of the cookbook in the `cookbooks` directory. The default behavior is to download the cookbook as a tarball, untar it and remove the tarball. The `--siteinstall` option allows the use of `knife cookbook site install` with the cookbook and the creation of a vendor branch if git is the underlying version control. If a version is passed, it is validated against the existing cookbook’s `metadata.rb` and must match the `metadata.rb` version string exactly. Validation is also done to ensure the dependencies listed in the cookbooks’ metadata exist.
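
For example, to have the generated commands use the community site rather than plain uploads, you would pass the flag on the command line (a hedged example; the flag spelling is taken from the option described above):

spiceweasel --siteinstall example.yml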

Environments

The `environments` section of the manifest currently supports `knife environment from file FOO` where `FOO` is the name of the environment file ending in `.rb` or `.json` in the `environments` directory. Validation is done to ensure the filename matches the environment and that any cookbooks referenced are listed in the manifest.

Roles

The `roles` section of the manifest currently supports `knife role from file FOO` where `FOO` is the name of the role file ending in `.rb` or `.json` in the `roles` directory. Validation is done to ensure the filename matches the role name and that any cookbooks or roles referenced are listed in the manifest.

Data Bags

The `data bags` section of the manifest currently creates the data bags listed with `knife data bag create FOO` where `FOO` is the name of the data bag. Individual items may be added to the data bag as part of a JSON or YAML sequence; the assumption is made that they are `.json` files in the proper `data_bags/FOO` directory. You may also pass a wildcard as an entry to load all matching data bag items (i.e. `*`). Encrypted data bags are supported by listing `secret filename` as the first item (where `filename` is the secret key to be used). Validation is done to ensure the JSON is properly formatted, the id matches and any secret keys are in the correct locations.

Nodes

The `nodes` section of the manifest bootstraps a node for each entry, where the entry is either a hostname or a provider and count. There is a shortcut syntax for bulk-creating nodes with various providers, where the line starts with the provider and ends with the number of nodes to be provisioned. Windows nodes need to specify either `windows_winrm` or `windows_ssh` depending on the protocol used, followed by the name of the node(s). Each node requires 2 items after it in a sequence. You may also use the `--parallel` flag from the command line, allowing provider commands to run simultaneously for faster deployment.

The first item after the node is the run_list and the second is the CLI options used. The run_list may be space- or comma-delimited. Validation is performed on the run_list components to ensure that only cookbooks and roles listed in the manifest are used. Validation on the options ensures that any Environments referenced are also listed. You may specify multiple nodes to have the same configuration by listing them separated by a space.

Status and Roadmap

Spiceweasel has now hit version 1.0; it’s fairly complete and there are no open issues. I’ll continue to track changes in Chef and fix any issues that arise. If I start feeling ambitious (or someone sends patches), I may turn it into a knife plugin. I’ve also considered having Spiceweasel “extract” existing infrastructure by parsing a list of nodes and documenting their cookbooks, environments, roles and run lists. I’ll write another post soon about using it in the “real world”.

Updates to the Opscode Chef drbd cookbook

Posted by mattray on September 30, 2011

I’ve recently been working on some updates to the OpenStack Cookbooks with Dell & Rackspace for use with Crowbar and we wanted to make a few of the services more fault-tolerant. The first step was to add drbd-based drive mirroring, so it was time to update the existing drbd cookbook.

I tested with Ubuntu 10.04 and 10.10 server installations. You must have the ‘linux-server’ and ‘linux-headers-server’ packages installed to properly support the drbd module. The drbd cookbook does not partition the drives; you’ll need to have that configured in advance or with another cookbook. It will format partitions given a filesystem type: set node['drbd']['fs_type'] to ‘xfs’, ‘ext4’ or whatever you need. I used ‘xfs’ in my example roles.

The drbd::pair recipe can be used to configure 2 nodes to mirror a mount point. The master node (identified by node['drbd']['master'] = true) will claim the primary, format the filesystem and mount the partition. The slave will simply mirror without mounting. It currently takes 2 chef-client runs to ensure the pair is synced properly; there is some timing issue with the mount resource that I haven’t identified yet.

Setting up the 2 boxes is fairly simple; just update the 2 included example roles.

The drbd-pair-master role:
name "drbd-pair-master"
description "DRBD pair role."

override_attributes(
  "drbd" => {
    "remote_host" => "ubuntu2-1004.vm",
    "disk" => "/dev/sdb1",
    "fs_type" => "xfs",
    "mount" => "/shared",
    "master" => true
  }
  )

run_list(
  "recipe[xfs]",
  "recipe[drbd::pair]"
  )
The drbd-pair role:

name "drbd-pair"
description "DRBD pair role."

override_attributes(
  "drbd" => {
    "remote_host" => "ubuntu1-1004.vm",
    "disk" => "/dev/sdb1",
    "fs_type" => "xfs",
    "mount" => "/shared"
  }
  )

run_list(
  "recipe[xfs]",
  "recipe[drbd::pair]"
  )

Add the roles to your 2 nodes with the proper node['drbd']['remote_host'] values set, and the next chef-client run will configure drbd mirroring. Run chef-client again on the master and the share will be mounted. I’ll be hooking this up to pacemaker next.
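
For example, attaching the roles with knife (the node names here are the ones used in the example roles above; swap in your own):

knife node run_list add ubuntu1-1004.vm 'role[drbd-pair-master]'
knife node run_list add ubuntu2-1004.vm 'role[drbd-pair]'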

##MonitoringSucks Terminology: Zenoss Breakdown

Posted by mattray on August 5, 2011

Following up on the 07/21/11 ##monitoringsucks IRC discussion on terminology, I thought I’d break down Zenoss as an example of how I believe the terminology applies.

Primitives

  • metrics: This is the raw monitoring data. Zenoss supports a wide variety of collection techniques, and metrics are stored as “Data Points” in RRD.
  • context: Zenoss has “Thresholds” attached to the “Data Points” which trigger “Events”. Thresholds may fire on exceeding a value, matching a specific value, falling within (or outside of) a range, or a Holt-Winters prediction. The Event context contains the originating resource (device and IP), event state (new, acknowledged, suppressed), severity (0-5), event summary, specific details (message) and an event id.
  • resource: Zenoss has Devices, which are the direct source of the metrics.
  • event: Maps directly to Zenoss’ Events, with the context and actions part of the Event subsystem.
  • action: Zenoss has a fairly rich Event system, with a wide variety of possible ‘actions’ when an Event enters the system (whether by a Threshold or some other source). It may be dropped, deduplicated, transformed, sent to history, trigger event commands or generate alerts. Correlation may be done with transforms in Python.

Components

Model

Zenoss tries to create a model of all the monitored infrastructure.
Individual resources are presented as “Devices”, something with an IP address that may or may not map to a single node.
Devices are organized in a single “Device Class” which determines how they are modeled and how and what metrics are collected.
“Modeling” in Zenoss is the attempt to discover all the attributes of a device (network interfaces, filesystems, installed hardware and software, etc.).
Modeling is performed by “Modeling Plugins” (attached to Device Classes or individual devices) which may use a variety of protocols to discover what is on a Device (SNMP, SSH, WMI, etc.).
Device Classes have “Monitoring Templates” attached to them that define how and what to monitor.
Modeling Plugins and Monitoring Templates may be reused, overwritten and extended by Device Classes.
Zenoss may be configured to automatically discover the nodes on a network range or subnet and create a network map of all the devices.
Devices may be added to a single “Location”, which may be mapped and presented in the UI with a Google map.
Devices may also belong to multiple Groups and/or Systems (essentially 2 separate tag hierarchies).

Collection

Zenoss supports a wide variety of availability and performance monitoring, from both active and passive sources.
Most protocols map to a specific daemon, responsible for collecting the data and pushing it into the system to be stored in RRD files.
RRD has a variety of ways for storing data, but the metrics are represented numerically with a timestamp.
Out of the box, Zenoss monitors:

  • ICMP: ping (zenping)
  • JMX: performance monitoring (zenjmx via the zenjmx ZenPack)
  • TCP: port checks (zenstatus)
  • SNMP: performance, process-monitoring and receive traps (zenperfsnmp, zenprocess, zentrap)
  • SSH/Telnet: v1/v2 (zencommand)
  • Syslog: receive syslog messages (zensyslog)
  • WMI: Windows event log (zeneventlog)
  • Zenoss can reuse Nagios and Cacti plugins as well

There are quite a few community extensions (ZenPacks) providing additional collection features.

Event Processing

As mentioned in the section on primitives, Zenoss has an Event system that handles context, events and actions.
Events may use their Devices, Device Classes, Locations, Systems and Groups for additional context.
Zenoss Events are stored in a MySQL database.

Analytics

Correlation of events is done in the Event system, written in Python.
Graphing of metrics is available with RRD graphs and all the variations supported therein (single/multiple values, stacked graphs, multiple devices).
The Event Console makes it easy to quickly search and filter specific event values.
Example reports are included but writing custom reports is difficult because of the disparate storage mechanisms for metrics, events and configuration.

Presentation

Zenoss has a featureful UI with an emphasis on monitoring thousands of nodes at a time and rolling up events in the Event Console.
There is a configurable dashboard with a number of portlets that may be applied (reports, events, graphs, web sites, etc.).
It is a webapp mostly using JavaScript (ExtJS) on top of the Python Zope application server.
Lightweight ACLs are available and multiple users are supported.

Configuration

The user interface for Zenoss is focused on making it easy to manage monitoring thousands of devices by configuring Device Classes and applying Devices to them (as opposed to configuring individual devices).
While configuration is primarily through the UI, there are tools for bulk-loading devices from files or scripting as well.
There is a command-line interactive interface to the object database (zendmd) that can be used to query and alter the monitored infrastructure.

Storage

Metrics are stored in RRD.
Events are stored in MySQL.
Configuration and relationships between objects are stored in the Zope Object Database (ZODB).

API

Zenoss has a published JSON API for interacting with it remotely, with examples in Python and Ruby (most of the UI uses these APIs).
There is also published Developer Documentation for extending and writing plugins.
The zendmd tool may be used to interact with Zenoss programmatically as well via scripting.

Conclusions

Zenoss tries to provide a framework for monitoring thousands of machines that is flexible enough to cover network devices, servers and services. The terminology and taxonomy that emerged from the IRC discussion fit fairly well; hopefully we can at least attempt to compare apples to apples when it comes to discussing different monitoring implementations.

It would probably be worthwhile to make a future post breaking down the strengths and weaknesses of Zenoss’ approach as well as which components would be easiest to reuse within other systems.

##MonitoringSucks Terminology (first stab)

Posted by mattray on July 12, 2011

Inspired by the recent ##monitoringsucks discussions, I thought I’d add my thoughts on creating a common set of terminology so we can start making progress.

There are a multitude of monitoring solutions out there, but most can be categorized and described with the basic terminology and components below.

Each of the major components could be a separate, single-purpose application. With consistent APIs and interchangeable implementations, best-of-breed solutions could arise. A catalog of monitoring tools could be cultivated and maybe monitoring wouldn’t suck as much.

Collection

This is the gathering of raw data that we care about for monitoring. There are 3 components to Collection:

Metrics

The data points that you want monitored. These can be OIDs, metrics, REST calls or whatever. They may be performance and/or availability, active and/or passive. This is the raw data.

Thresholds

Metrics have a range of legitimate values; thresholds are the limits on those legitimate values. They may apply to individual metrics or to combinations of metrics.

Collecting

The actual process of gathering data varies depending on the metrics. There are a wide variety of monitoring protocols (SNMP, WMI, Syslog, JMX, etc.), and we need to document how we collect the metrics.

Model

This is the representation of what you are collecting: a collection of metrics and thresholds. The Model is a collection of Nodes. A Node is typically a single machine, but may cover multiple metrics from separate machines or services (think services and clusters), depending on the implementation. There may be no Model whatsoever (just lists of metric checks).

Events

Events are what happen when a threshold is violated. They may be suppressed, de-duplicated and possibly correlated with other events. There may be dependencies between Nodes or correlations with other Events; implementations vary.

Alerting

Separate from Events, alerting is the means to notify people and systems that an Event requires attention. There are numerous mechanisms for alerting (email, paging, asterisk, log, etc.) and ideally the Alerting component has the concept of users, schedules and escalation rules.

Presentation

There are 2 pieces to the Presentation component:

UI

The monitoring solution may or may not have or need a UI. This is the visual representation of the Model, Events and possibly Alerts. There may be a Dashboard rolling up different views into the information captured by the monitoring solution.

Reporting

Ideally, the data captured by the monitoring solution is available for whatever reporting you want to do. It may be in SQL databases, RRD or some other format, but the ability to access the data and create new reports is essential.

Cross-cutting Concerns

API

Ideally, every component should have published APIs for interacting with it programmatically and/or remotely. Without an API, monitoring tools become less and less relevant in the face of increasing automation.

Configuration

As with APIs, all monitoring framework components need to be easily automated by configuration tools.

Storage

Where metrics are stored. There are lots of choices, but they should be accessible for reports and via an API.
