The prose of life

Test automation ain't trivial & there are several reasons for that, including:

  • need for scalable structure (hierarchy, naming, etc.)
  • dealing with dependencies (coupling)
  • insufficient speed of execution affecting dev agility
  • shared resources impairing isolation
  • inertia of data (incl. configuration, parameterization, etc.)

But there are two even bigger issues I'd like to mention (& elaborate a bit more on one of them):

  • it's tricky to say whether you've automated enough of what actually should be automated
  • it's even harder to determine whether it has been automated in a proper way (one that prevents bugs from happening, instead of just covering the code)

Skimming the cream

But let's assume you've made quite an effort to automate the regression testing of your solution (yes, usually it's a continuous effort that spans quite a long time, but let's simplify for the sake of brevity). Now you should check whether you've actually done a good job - in other words: measure the outcome of automating the tests.

Obviously, metrics like code coverage are completely useless here, but you could monitor the actual number of errors (raised by users, on official test environments and / or in production) - a comparison sketched in the snippet after the list below:

  • NO (OR SIGNIFICANTLY FEWER) ERRORS = "Well Done Son" (/ Daughter)
  • ERRORS = Bah, something went wrong
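To make that comparison concrete, here's a minimal sketch in Python - the incident records, the environments and the cut-over date are made-up placeholders, not data from any real tracker:

```python
# Minimal sketch: compare user-reported incident volume before and after
# the test automation effort went live. The incident list and the cut-over
# date are hypothetical placeholders - in practice you'd export this data
# from your incident tracker.
from datetime import date

AUTOMATION_LIVE = date(2016, 6, 1)  # hypothetical cut-over date

incidents = [  # (reported_on, environment) - placeholder data
    (date(2016, 3, 14), "production"),
    (date(2016, 5, 2), "uat"),
    (date(2016, 7, 21), "production"),
    (date(2016, 9, 9), "production"),
]

before = sum(1 for reported_on, _ in incidents if reported_on < AUTOMATION_LIVE)
after = sum(1 for reported_on, _ in incidents if reported_on >= AUTOMATION_LIVE)

print(f"incidents before automation went live: {before}")
print(f"incidents after automation went live:  {after}")
```

In practice you'd normalize those counts (per release, per month, per amount of delivered change) before drawing any conclusions from them.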

And now I'm getting to the actual point (& the reason for this blog post) - THERE'S QUITE A BIG CHANCE YOU WON'T NOTICE ANY DIFFERENCE - after making a huge effort to properly (tech-wise) auto-test your codebase, you'll still see the same number of incidents.

What the f...

People tend to forget that errors may have several causes other than the code itself:

  • incorrect configuration / parameterization (especially when you've got some redundancy & manual modifications are allowed)
  • immature environment deployment processes and / or tools - when deployment methods are not repeatable & allow some degree of "manual adjustment", you're living in the world of snowflake environments (just like "snowflake servers")
  • insufficient control over access to environments (people modifying shared test data / resources, uploading their own artifacts, etc.)
  • flawed cooperation between business & IT people who have a different understanding of what's to be done (for whatever reason)

Seriously, before you start addressing the alleged problem (low code quality), prove that this is actually the spot worth addressing.

How? Measure & analyze:

  1. Get the SIR database for a reasonable time period (& an environment with adequate code maturity).
  2. Filter out the stuff that is not relevant to your interest.
  3. Classify the incidents based on their resolution, splitting them across several "root causes" (a rough tally of this kind is sketched after the list):
    • code issue
    • test data issue (incl. parameterization)
    • human issue (miscommunication, human error during specification)
    • configuration issue
    • infrastructure / hardware issue (non-artifact, out of the teams' control)
    • deployment issue (difference between the setup of dev & test environments)
    • etc.
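As an illustration, here's a minimal sketch of that classification step in Python - the file name, the column name and the keyword-to-category mapping are assumptions made for the sake of the example, not the shape of any particular SIR tool's export:

```python
# Minimal sketch: bucket resolved incidents into root-cause categories and
# count them. The file name, the "resolution" column and the keyword mapping
# are illustrative assumptions - adapt them to however your SIR tool exports data.
import csv
from collections import Counter

# keyword -> category; first match wins
KEYWORDS = [
    ("test data", "test data issue (incl. parameterization)"),
    ("misunderstand", "human issue"),
    ("config", "configuration issue"),
    ("infra", "infrastructure / hardware issue"),
    ("deploy", "deployment issue"),
    ("code", "code issue"),
]

def categorize(resolution_note: str) -> str:
    note = resolution_note.lower()
    for keyword, category in KEYWORDS:
        if keyword in note:
            return category
    return "other"

counts = Counter()
with open("sir_export.csv", newline="", encoding="utf-8") as f:  # hypothetical export
    for row in csv.DictReader(f):
        counts[categorize(row["resolution"])] += 1

for category, count in counts.most_common():
    print(f"{category:40} {count}")
```

In reality the classification is usually a manual review of each resolution note rather than a keyword match, but even a rough first pass like this makes it obvious where the weight really sits.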

It's the simplest & most effective way to find out what you're really struggling with. And believe me, the results may surprise you a lot: in the end you may find out that it's possible to reduce the number of SIRs significantly within weeks, if not days (I've got real-life cases where it was literally hours) - by doing something completely different from writing automated tests.

Pic: "The Matrix" (1999), Warner Bros.
