How are things? Is the business going right? Nothing unexpected going on RIGHT NOW? Or let me ask in a slightly different way: if something were going wrong with your platform, how would you know? How quickly would you be able to spot the issue? Who'd be the first to notice?
You probably have an answer ready:
- You have monitoring: for both infrastructure and applications.
- You have a shiny dashboard with gauges and meters that show CPU utilization and how much disk space is left.
- You have some alerting set up as well, so you'll get an automated notification once your HTTP API response error rate jumps above the 5% threshold.
- You have operational procedures in place, e.g., to periodically check the top 5 longest-running queries (and their trends).
So you feel safe and quite certain that things are OK.
The malignant nature of inanimate objects
Maybe you're right - because TECHNICALLY everything is indeed fine. The cogs and wheels under the hood are moving in unison. But imagine what would happen if ...
- ... today's release had a slight change in the visual styling library that unfortunately hides your 'submit' button in the order form (your most important transaction view), but only in one particular browser (with a 15% market share)
- ... a 3rd-party payments API unexpectedly changed its authentication certificate, so the parties can't complete the handshake (and it doesn't get escalated via an alert, because there was just one error message during the service's init procedure)
- ... mysteriously, the config flag responsible for asset compression and minification got turned off (on production), so loading your web application suddenly became an annoying and lengthy process, discouraging a lot of new users ...
- ... a typo in the business configuration pretty much removed the most prominent merchant (your most important B2B customer) from your platform's marketplace - but only them; everything else works as intended, every other merchant is visible
- ... out of all your translated resource packages, only one (e.g., Polish) got nullified - the application works properly as long as the end-user has any locale other than pl/PL; in that specific case, however, all the labels, buttons, and descriptions are literally empty (and hence unusable)
All these sample scenarios have a few things in common:
- they happened due to a bug, error, or mistake, either within the team or out of the team's reach (an integration problem)
- those defects weren't trivial to detect via automated testing or other gating conditions (in some cases, the code wasn't even modified ...)
- additionally, they probably wouldn't be detected by either infrastructure or application monitoring, so from a technical standpoint, everything is fine ;D ...
... but I'm afraid that end-users (at least some of them) would not agree.
Monitoring, but different
To know your platform's actual health, you need to dig deeper. To be precise, you need to set up proper business (activity) monitoring.
By 'business monitoring', I don't mean end-of-month or even end-of-day analysis (in DWH or any other reporting database). That could have been fine in the late 90s, but the world has leaped forward since then - if things go awry, you need to know (and react!) NOW, not by the end of next week.
The 'business monitoring' I have in mind is near-real-time analysis of a stream of business-meaningful events, looking for deviations from the norm, irregularities, odd patterns, and suspicious behaviors. Let's follow with some examples to make sure we're on the same page here:
- How many users are logged in (right now)? Hasn't this number dropped in the last hour? Does the change (drop) follow the typical daily pattern? Weekly pattern?
- How many successful payments happened in the last 5 minutes? Why did this number suddenly drop to zero?
- What's the percentage of users who have successfully completed our transactional wizard (all four steps) within the last quarter of an hour? Previous quarter? And the one before? Did those numbers significantly decrease when compared to the ones observed before the morning's deployment?
- What is the mean response time of a search API call in our marketplace? Is it within an acceptable UX threshold? Could that value be correlated with how many offers users browse (on average) per session?
- Who are the top 10 most popular merchants in the marketplace? How has that list changed since the morning? Compared to the same time yesterday? Last week? Can the difference be justified by a long-term trend, the market situation, or ...?
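To make the "deviation from the norm" idea a bit more tangible, here is a minimal sketch of the payments question from the list above. It is an illustration under assumptions, not a recommended implementation: the `Event` type, the `payment_succeeded` event name, the 7-day baseline, and the 3-sigma tolerance are all made up for this example, and a real setup would read from an event stream rather than from an in-memory list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, pstdev


@dataclass(frozen=True)
class Event:
    name: str            # hypothetical event name, e.g. "payment_succeeded"
    timestamp: datetime


def count_in_window(events, name, end, width):
    """Count events called `name` with end - width <= timestamp < end."""
    start = end - width
    return sum(1 for e in events if e.name == name and start <= e.timestamp < end)


def check_deviation(events, name, now, width=timedelta(minutes=5),
                    days_back=7, max_sigma=3.0):
    """Compare the current window with the same time window on previous days."""
    current = count_in_window(events, name, now, width)
    baseline = [count_in_window(events, name, now - timedelta(days=d), width)
                for d in range(1, days_back + 1)]
    avg, spread = mean(baseline), pstdev(baseline)
    # Assumed floor of 1.0 on the spread, so that a perfectly flat
    # baseline does not turn every tiny change into an alert.
    tolerance = max_sigma * max(spread, 1.0)
    return {"current": current, "expected": avg,
            "anomaly": abs(current - avg) > tolerance}
```

Calling `check_deviation(events, "payment_succeeded", datetime.now())` every minute or so would flag the "why did this number suddenly drop to zero?" case, as long as the same time slot on previous days usually saw a meaningful number of payments. Comparing against the same slot on earlier days (rather than against the previous hour) is what lets the check respect the typical daily pattern.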
In Support We Trust (?)
Hold on. All of those cases could indeed have a strong negative impact on business, so detecting them is definitely a high priority. But there will be escalations, right? Our end-users will raise the issue once they spot it: contact support, send a ticket, call the call center, etc.
Well, it depends.
If your platform is a B2B business and it's critical for your customers' everyday operations, you'll definitely get notified one way or another. But if your offering is B2C and/or there's fierce competition (with minimal switching inertia), you may not know until shit has already hit the fan. Furious people could simply uninstall your app, leave a negative rating in an app store, and post some hateful, sarcastic comments online, not to mention the dramatic drop in NPS ...
Would you dare to risk it?
Hmm, how to start then?
OK, so let's assume you're convinced already. But the general idea of near-real-time business monitoring is new to you, and you have tons of questions:
- For whom (is that monitoring)? Who is its recipient/requestor?
- Who should define what has to be monitored? (it's not possible to identify all potential crises a priori, right?)
- Where to get the data from? Is it even present?
- Who should be responsible for setting up and maintaining the monitoring mechanisms? (incl. keeping them up to date with all the changes)
- Who should interpret the data? (as clearly not all the deviations are unwelcome)
- Are there any tools that can be used for that?
These are all valid questions. I'll do my best to address them in the next post in the series. Stay tuned!