TL;DR The concepts of fragility & antifragility apply to software engineering as well - but contrary to common belief, a system's fragility is not about the number of bugs or its resilience (e.g. MTTR). It's far more about the dichotomies between high inertia & high operational agility, openness to change & design brittleness, synchronous sequentiality & asynchronous monitoring - fragile systems (/ methods / tools) used in building software tend to increasingly restrict, constrain & lengthen the development life-cycle, while non-fragile ones preserve their "lightness", "leanness" & ease of use.
This is the 2nd post in the "(Anti-)fragility in Software Development" series. The previous post can be found here.
After getting through the basic intro to the theory of (anti-)fragility, let's get back to software engineering - how do its concepts apply to this particular industry? What are some (practical) examples of fragile, non-fragile & antifragile systems?
To make things easier, we'll start with fragile systems, but before we get to actual cases, please remember that fragility is a systemic term, so it applies not only to end-products (software deployed to production & its qualities), but also to delivery processes, all kinds of tooling used throughout the delivery life-cycle, maintenance & operations and - last but not least - a software-based service's role in the value stream (in simpler words: whether it provides actual value).
Examples of fragile systems
To make it more interesting, let's skip all the trivial examples of systems that are obviously fragile, e.g.:
- detailed long-term (1+ year ahead) planning ("we have Hadoop deployment in our IT strategy for 2019 ...")
- BDUFs for large, waterfall projects (/ programs) - or in fact, anything with a long feedback cycle
- manual execution of repeatable, multi-step, boring activities in highly dynamic environments - e.g. full manual regression, manual deployment / operations
- shared database (across several services / applications / products)
Instead, we can take a look at some less apparent, but equally important cases:
Example #1: "Great unification"
AKA Too-Common Denominator.
This practice is very popular these days - companies of all sizes look for cost savings, unified governance / reporting models, ways to limit the skill-sets required to run the business. The typical way to achieve this is to trigger a "great unification" project: either external consultants (who always know "best") or content-free analysts filter, slice, classify, group & look for patterns to project a snapshot of the as-is state onto ... well, usually something much worse:
- imposed, "optimized" processes (usually just stripped of iteratively introduced local optimisations)
- a narrowed-down toolset (e.g. all kinds of data in one type of persistent data store, one tool / language / library allowed per category, etc.)
- artificial restrictions (fine-grained roles, strict permission sets, formalized & blocking approvals) that are supposed to tighten / streamline the workflow, but introduce delays & bottlenecks instead
Such unified models are doomed from the very beginning - their life-cycle always consists of the following two phases:
- an artificially preserved "Frankenstein" that introduces more limitations & restrictions than it provides value ...
- ... and that eventually (sooner or later) breaks apart like a house of cards under the accumulated mass of diverging needs / expectations
Frankly, the faster it gets to the ultimate stage, the better - as lingering in the former one makes people ...:
- ... follow stupid procedures instead of thinking on their own (& adapting to evolving conditions) - "we just follow what the JIRA task flow setup enforces upon us"
- ... treat all the problems as nails, because all they have is a hammer
- ... waste effort on fighting the over-zealous system ("hacking the system") instead of on actual problems
- ... stop looking for improvements, because improvements are deviations from the approved standard
Of course, pointing at this as an example of fragility doesn't mean that gullible, carefree fragmentation (or simply "reinventing the wheel" in every organizational unit) is good - quite the contrary - but there are far better ways to deal with that than a "great unification" project.
Example #2: Mixed Domain + Infrastructure
A common practice, because it feels so natural & its consequences are not easily perceivable in work environments with low technical excellence standards (no test automation, irrelevant scalability, unclear availability expectations, etc.). Domain logic gets "tainted" from the first lines of code written, as it's either:
- split between several conceptual layers (potentially including ideas like: stored procedures, DB triggers, manual corrections / interventions)
- dependent on implementation details (e.g. database schema details or SMTP as a protocol for notifications)
- or simply mixed with direct infrastructure service calls (e.g. 3rd party SDK calls, because business rules refer directly to data in DynamoDB)
This works quite fine when your product is small & the goal is to prove something in a quick (even if dirty) way. But shit hits the "fragility fan" once:
- change requests collide with the redundant dependencies on infrastructure (e.g. there's a need for a notification that has nothing in common with e-mail, but your notification data format is SMTP-specific)
- the number / size / complexity of business changes grows - it gets harder & harder to abstract out (from the tangled mass of multiple-responsibility code) the minimal subset of functionality that should be affected by the change
- you'd like to adjust the format of the data (e.g. a cross-entity relationship), move from one 3rd party service vendor to another or even adapt to a non-trivial upgrade of an external service
Simply speaking - the lack of domain | infrastructure separation (both sides behind proper abstractions, e.g. interfaces) causes overall solution brittleness (the lexical association with fragility is absolutely in place).
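To make it more tangible, here's a minimal sketch (in TypeScript; all names are hypothetical, invented purely for illustration) of keeping a business rule behind such an abstraction, so that the SMTP (or DynamoDB, or any other) detail stays replaceable at the edge:

```typescript
// Domain-side abstraction (a "port"): the business rule only knows it can notify
// someone - not that e-mail, SMTP or any particular vendor is involved.
interface NotificationSender {
  notify(recipientId: string, message: string): Promise<void>;
}

// Pure domain logic - it depends on the abstraction only, so a change request like
// "notify via SMS instead of e-mail" never touches this code.
class OverdueInvoicePolicy {
  constructor(private readonly sender: NotificationSender) {}

  async apply(invoice: { customerId: string; daysOverdue: number }): Promise<void> {
    if (invoice.daysOverdue > 30) {
      await this.sender.notify(invoice.customerId, "Your invoice is over 30 days overdue.");
    }
  }
}

// The infrastructure detail lives at the edge & can be swapped (SMTP, SMS, push, ...)
// without rippling through the business rules.
class SmtpNotificationSender implements NotificationSender {
  async notify(recipientId: string, message: string): Promise<void> {
    // ... resolve the recipient's e-mail address & talk SMTP here ...
    console.log(`(SMTP) to=${recipientId}: ${message}`);
  }
}

// Wiring happens in one place, outside the domain.
const policy = new OverdueInvoicePolicy(new SmtpNotificationSender());
```

The exact pattern name (ports & adapters, hexagonal, onion - pick your favourite) matters less than the direction of the dependency: domain code never reaches "down" into infrastructure details.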
Example #3: Indefinitely growing (single) context
The complexity of business domains tends to increase exponentially with time - I've seen (several times) this complexity quickly exceed the intellectual capacity of individuals & potentially even of whole teams accountable for the development (business & technical) of a particular system. This is natural & consistent with the Cynefin framework - of which I am a zealous follower.
But think about it for a while: systems are built by people for other people - users who are supposed to utilize them in order to gain / generate some value. A good system has to be comprehensible & intelligible (for the end-user) - so does an easy-to-use, fit-for-purpose system have to be super-complex in development? Of course not - I dare say that taming the complexity "on both ends" is (among other things, but most of all) a matter of proper composition.
A wrong composition approach may cause fragility to skyrocket - a single, ever-growing context (a typical symptom of neglected composition) is a perfect example of that:
- even the simplest changes to central (key) entities require a gargantuan effort (to both develop & test), as these entities tend to become clusters of coupling, especially in systems with highly normalized data models (no redundancy)
- different contexts can have different ubiquitous languages (because of different actors & scenarios) - imagine the increasing complexity of reconciling these differences within one "super-context"; it will get harder & harder with each subsequent change
- in "super-context" there's no need for boundaries & programming against abstractions - which is a natural way to start ignoring both Open-Close principle & Dependency Inversion principle; in more simple words: as anything can freely access anything, its the fastest way to achieve Big Ball of Mud status
... to be continued (very soon) ...
Pic: © Séa - Fotolia.com