What a n00b!

Documentation and Monitoring

At DevOpsDays Minneapolis a few weeks ago, we were discussing documentation in the context of Operations/DevOps/IT/whatever. After I talked a bit about what we do at our company, I realized that our approach was fairly unusual and that others found the technique useful, so I thought I'd share it here. We're certainly not all the way there yet, but we're striving to improve over time.

Playbooks

For any check that gets added to the monitoring system, what we call a "playbook" must be written before the pull request is approved and merged. A playbook is essentially a document describing the following things:

  1. What this check is checking - this seems obvious, but should include things like data sources; the "what"
  2. What the impact of this alert could be - what services would be affected; the "why"
  3. Where to go digging for more info on what could've caused this state; at a minimum, start with a log file
  4. Bonus: an idea of what "normal" looks like

A playbook doesn't necessarily have to have all that much info, just enough to give the person who's on call a fighting chance. This also seems to be a fairly nice format for starting to document something: it gives practical knowledge at the time it's needed.
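The "no playbook, no merge" rule can also be checked mechanically. Here is a minimal sketch, assuming the check definitions are available as a plain Ruby hash keyed by check name. The `checks_missing_playbook` helper, the sample data, and the command names are all hypothetical and not part of Sensu or our Puppet code:

```ruby
# Hypothetical helper: returns the sorted names of checks with no playbook link.
# Assumes each check is a Hash and the link lives under the 'playbook' key.
def checks_missing_playbook(checks)
  checks.select { |_name, check| check['playbook'].to_s.strip.empty? }.keys.sort
end

# Illustrative data only; the check names and commands are placeholders.
checks = {
  'elasticsearch' => {
    'command'  => 'placeholder-es-check',
    'playbook' => 'http://wiki.example.com/Elastic_Search#Dealing_with_Pages'
  },
  'disk_usage' => { 'command' => 'placeholder-disk-check' }
}

missing = checks_missing_playbook(checks)
puts "Checks without a playbook: #{missing.join(', ')}" unless missing.empty?
```

Something like this could run as part of reviewing a pull request, though a human still has to judge whether the playbook is actually useful.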

Notifications

We've recently migrated to Sensu, which helps us out a lot here. Since it acts as a 'monitoring router' and lets us set everything up the way we want, we can easily attach arbitrary data to a check and display it in alerts however we like. All checks are defined in Puppet, and we add a playbook to the custom data on a check like so:

    $wiki = 'http://wiki.example.com'

    check 'elasticsearch' {
      ...
      custom => {
        ...
        playbook => "${wiki}/Elastic_Search#Dealing_with_Pages",
        ...
      }
    }

Once the playbook is defined as an attribute on the check, you can easily add it to the message that goes out. In this case, we're using the standard mailer handler from the community repo, with lines added something like this:

    playbook = "Playbook: #{@event['check']['playbook']}" if @event['check']['playbook']
    ...
    body = <<-BODY.gsub(/^ {14}/, '')
                  #{@event['check']['output']}
                  Host: #{@event['client']['name']}
                  Timestamp: #{Time.at(@event['check']['issued'])}
                  Address: #{@event['client']['address']}
                  Check Name: #{@event['check']['name']}
                  Command: #{@event['check']['command']}
                  Status: #{@event['check']['status']}
                  Occurrences: #{@event['occurrences']}
                  #{playbook}
              BODY

This adds a link in the email body to the playbook if it exists (like I said, not perfect yet :)).
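To make the conditional behavior concrete, here is a small standalone sketch of the same idea outside of a Sensu handler. The `playbook_line` and `build_body` helpers and the sample event are hypothetical. The event shape simply mirrors the keys used in the handler lines above:

```ruby
# Returns the "Playbook: ..." line, or nil when the check has no playbook.
def playbook_line(event)
  url = event['check']['playbook']
  "Playbook: #{url}" if url
end

# Builds a trimmed-down message body; nil lines (no playbook) are dropped.
def build_body(event)
  [
    event['check']['output'],
    "Host: #{event['client']['name']}",
    "Check Name: #{event['check']['name']}",
    playbook_line(event)
  ].compact.join("\n")
end

# Sample event with made-up values, for illustration only.
event = {
  'client' => { 'name' => 'es01.example.com' },
  'check'  => {
    'name'     => 'elasticsearch',
    'output'   => 'CRITICAL: cluster status is red',
    'playbook' => 'http://wiki.example.com/Elastic_Search#Dealing_with_Pages'
  }
}

puts build_body(event)
```

Unlike the heredoc version, which leaves an empty line when a check has no playbook, this sketch drops the line entirely.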

Conclusion

When faced with the challenge of building out documentation for an environment, writing down the 'what', 'why', and 'where to start digging' for anything that pages is an excellent (and seemingly often overlooked) first step. No one has time to read a 10-page manual while responding to a page, which forces the writing to be concise and as helpful as possible. Obviously, the implementation of this concept will vary wildly depending on which monitoring solution you use.
