White-box monitoring setup as a 1st class citizen in your code

There's an interesting trend: the range of artifacts developers deliver as the outcome of their work keeps expanding. In software development prehistory it was just functional code (the code that's supposed to provide business value directly). After some time automated tests joined it - as a parallel code construct, meant to be submitted within the same commit. That was just the beginning, as the same has happened to (among others):

  • data schema modifications (DDLs & equivalents)
  • deployment routines (automated provisioning, configuration mgmt, etc.)
  • infrastructure setup (infrastructure-as-code)

All of these are now key code constructs (as important as functional code itself) - vital elements of the solution that should stay synchronized & be released together in a coherent state.

I'm extremely happy to notice a new (& more & more frequent) addition to the list above: monitoring setup.

Now & then

Monitoring today is significantly different from what was considered an industry standard even a few years ago:

  1. Being reactive (monitoring only what has caused trouble before, as a result of corrective actions after an outage) is not good enough anymore - proactive monitoring is a must to achieve high levels of availability
  2. Monitoring was considered to be the Ops domain (& their problem ;>), Devs didn't care as it was beyond their reach - that had several serious implications, e.g.:
    • favoring so-called black-box monitoring (based not on internal, app-specific metrics, but on what could be read on OS/platform level)
    • monitoring setup was constantly playing catch-up with the most recent version of functional code
  3. Synchronous, pull-based monitoring (JMX, WMI) was considered the standard -> sadly, this option was not only more likely to impact the overall performance of the monitored resource, but also scaled poorly (more querying parties = more load due to monitoring itself)
  4. Ignoring domain-specific (business-specific) metrics & focusing on low-level tech metrics only (% of CPU utilization, memory used, etc.) - well, I'm not saying these are useless - obviously that's not the case - I mean that they are highly insufficient, as they don't really tell you how your system behaves & what the actual impact is (you usually learned that from other sources, when it was already too late, like your call center ...)
  5. Monitoring has always been a goldmine for companies delivering software to the enterprise sector - they were producing heavy, centralized, closed (not only in terms of source code, but also knowledge, documentation & API) & pricey SPoF behemoths fit not only for monitoring but also for permanent vendor lock-in :)

Fortunately, the industry has not only started to fully appreciate monitoring, but many new tools (based on a more modern paradigm) have emerged to fill the gap: open (& open-sourced), push-based, asynchronous, lightweight, horizontally scalable & suited for white-box-style solutions. What is more (& maybe even more important), they are very DevOps-friendly - these days monitoring for a service / component can easily be set up / updated directly in code, as an element of the automated deployment pipeline:

  • in an isolated, decoupled way (w/o strict dependency on any "central" component)
  • fully testable, in an automated way as well
  • w/o a need to bother Ops or anyone with "special privileges"
  • repeatable & idempotent
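
To make the list above a bit more tangible, here's a minimal Python sketch of monitoring rules kept right next to the functional code. Everything in it is an assumption made for illustration only: the `AlertRule` type, the `apply_rules` helper, the metric names & the plain dict that stands in for the real monitoring backend are all hypothetical & don't come from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """One monitoring rule, versioned in the same repo as the functional code."""
    name: str
    metric: str            # app-specific (white-box) metric name
    threshold: float
    comparison: str = ">"  # alert when the metric is above (">") or below ("<") the threshold


# Hypothetical rules for an imaginary checkout service.
CHECKOUT_RULES = [
    AlertRule("slow-checkout", "checkout.latency_ms.p95", 800.0),
    AlertRule("few-orders", "checkout.orders_per_min", 5.0, "<"),
]


def is_breached(rule: AlertRule, value: float) -> bool:
    """Pure function, so the rule itself can be unit-tested like any other code."""
    if rule.comparison == ">":
        return value > rule.threshold
    return value < rule.threshold


def apply_rules(store: dict[str, AlertRule], desired: list[AlertRule]) -> list[str]:
    """Bring `store` in line with `desired` & return the names that changed.

    `store` stands in for the monitoring backend. Running this twice with the
    same input changes nothing the second time, which is the idempotence part.
    """
    changed = []
    for rule in desired:
        if store.get(rule.name) != rule:
            store[rule.name] = rule
            changed.append(rule.name)
    return changed
```

A deployment step could call `apply_rules` on every release: the first run registers both rules, any later run with unchanged rules is a no-op, & the whole thing can be exercised in an ordinary unit test w/o touching a real monitoring system.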

If you're wondering what such monitoring looks like in practice, you may want to check out Riemann or even read a neat book on monitoring that's built around Riemann as the main tool (my review here).

Frankly, I know some people who are not enthusiasts of the white-box monitoring approach. Their aversion is caused by the belief that it couples application logic to monitoring (processing logic explicitly pushes the business-meaningful event to a log/monitoring sink, so the logic is aware of the sink). To some extent I agree they are right, but keeping in mind the benefits mentioned above, I think the sacrifice is justified, as long as sink communication is non-blocking, has a minimal time-out (maybe also some sort of circuit-breaker) & failing to send an event doesn't stop business processing (a lack of events in the monitoring infrastructure is a clear enough indication that something is wrong).
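
The conditions from the paragraph above (non-blocking, a circuit-breaker, failures that never reach business logic) could be sketched in Python roughly like this. It's only an illustration under stated assumptions: `NonBlockingSink`, `place_order` & the event fields are hypothetical names, & the `send` callable is assumed to wrap whatever client really talks to the monitoring sink & to enforce its own short time-out.

```python
import queue
import threading
import time


class NonBlockingSink:
    """Fire-and-forget event sink: emit() never blocks & never raises."""

    def __init__(self, send, max_queue=1000, failure_limit=3, cooldown_s=30.0):
        self._send = send  # does the real I/O; should enforce its own short time-out
        self._queue = queue.Queue(maxsize=max_queue)
        self._failure_limit = failure_limit
        self._cooldown_s = cooldown_s
        self._failures = 0        # consecutive send failures (worker thread only)
        self._open_until = 0.0    # circuit is open while time.monotonic() < this
        self._lock = threading.Lock()
        self.dropped = 0          # events that never reached the sink
        threading.Thread(target=self._worker, daemon=True).start()

    def emit(self, event):
        """Called from business code: enqueue or drop, never wait."""
        try:
            self._queue.put_nowait(event)
        except queue.Full:
            self._count_drop()

    def flush(self):
        """Wait until the queue is drained (handy in tests & at shutdown)."""
        self._queue.join()

    def _count_drop(self):
        with self._lock:
            self.dropped += 1

    def _worker(self):
        while True:
            event = self._queue.get()
            try:
                if time.monotonic() < self._open_until:
                    self._count_drop()  # circuit open: don't even try to send
                    continue
                self._send(event)
                self._failures = 0
            except Exception:
                self._count_drop()
                self._failures += 1
                if self._failures >= self._failure_limit:
                    # Too many failures in a row: open the circuit for a while.
                    self._open_until = time.monotonic() + self._cooldown_s
                    self._failures = 0
            finally:
                self._queue.task_done()


def place_order(sink, order_id, amount):
    """Hypothetical business operation: its result never depends on the sink."""
    sink.emit({"metric": "order.placed", "order_id": order_id, "amount": amount})
    return {"order_id": order_id, "status": "accepted"}
```

With a `send` that always fails, `place_order` still returns normally & the only visible effect is a growing `dropped` counter. The coupling is still there (business code knows about `sink.emit`), but it's reduced to a single call that can't slow down or break the processing.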

Examine your conscience

It seems to be a good opportunity to ask yourself a few questions: if you're a developer in 2016 & your monitoring ...

  1. ... is so far none of your concern ...
  2. ... rules are always set up only as a consequence of post-mortem activities ...
  3. ... is just a blinking red-green dashboard of stuff that is always ignored (because there's always something red there anyway) until there's a call from support / angry customer ...
  4. ... is being set up manually for each environment, but only for the few that have Splunk licenses acquired ...

... there's something wrong, very WRONG going on in your workshop.

Proper, aware, proactive monitoring can help prevent outages entirely or at least minimize troubleshooting time & MTTR. Why don't you check the times in a recent Stack Exchange outage post-mortem:

  • 10 mins to identify issue
  • 14 mins to fix it
  • 10 mins to roll it out & bring services back up again

How close to that can you claim to be?

Pic: © surasakpd - Fotolia.com