I have been fiddling a lot with the concept of Chaos Engineering (CE) recently. I find it VERY useful when it comes to building the resilience of platforms. But what if we could extend its applicability to other parts of our business/company?
This idea popped up after one of the recent discussions in "CTO Morning Coffee" (periodic LIVE events for technical leaders in Poland). We were debating the differences between leadership in times of "war" (crisis) and in times of "peace" (stable growth). It wasn't hard to agree on the conclusions:
- not everyone is prepared to handle a crisis once it actually happens (crisis management is a skill)
- a long period of "peace" can lull our attention
- skills that go unused for a long time are prone to atrophy
So maybe we could apply the concept of Chaos Engineering to battle-proof and harden our leaders by running controlled crisis experiments?
Or wait - let's widen the scope. What about ... the whole organization? Something like that has been well known and successfully practiced in the security space for a long time now (typically under the name "red teaming"): penetration tests, simulated social engineering attacks, security audits. Could that be applied even without the presence of a traditional malicious actor - to address inconsistencies/gaps/mistakes?
- Are the processes and mechanisms in our org fault-tolerant?
- Is there a clear escalation path for each kind of issue?
- Can we contain a crisis (limit its blast radius) when it occurs?
Hmm, let's try to find out.
Of course, it's far from being as easy as it sounds ("just lift the concept and shift it to the new area"). I have tons of doubts at this point.
- In chaos engineering, the experiments are about REAL failures (or as close to "real" as possible). The idea is to (eventually) run chaos tests continuously on your production environments. Do we want to cause REAL organizational crises ("so right now you're supposed to harass him", "leak this confidential salary information to the whole team", "fail the project on purpose and don't communicate it until it's over") as well? I don't think that's feasible.
- So, if the crises are to be simulated - isn't that a distraction from more important (real) work? As long as we are learning, I believe it is not. The same argument was raised against Chaos Engineering itself, btw.
- But the idea of Chaos Engineering was all about automation - and organizational topics (like human relationships/interactions) can't be automated! Well, that's only partially right. The experiments were indeed supposed to be automated, but observing the outcomes and drawing conclusions (e.g., proposing improvements) never were.
- If everyone knows the issue is simulated, wouldn't that affect the quality of the experiment? In CE, you cause a technical problem and have quantifiable metrics to tell you its effects (on the rest of the system). For a social/organizational issue, it would be people evaluating the outcomes (incl. non-immediate consequences), based on their subjective judgment. That's probably the biggest weakness of the whole idea.
- Organizational/people-related issues typically take some time to "develop" and cause harm - could that be somehow simulated? Or would we have to start at a "terminal" stage? (wouldn't it be too much of an oversimplification?)
- What about "realism"? Expect comments like "this would never happen; not on my watch!" - people don't engage fully if they don't feel the scenario makes sense.
- and so on, and so forth - there are many questions and doubts
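For contrast, here is what the technical (CE) side of this looks like. Below is a minimal, hypothetical sketch of a chaos experiment loop: define a steady-state metric, inject a fault, and check whether the hypothesis ("the system stays within its SLO") still holds. All names, rates, and thresholds are made up for illustration - real tools operate on live infrastructure, not toy functions:

```python
import random

def measure_success_rate(service, requests=1000):
    """Steady-state metric: the fraction of requests the service answers."""
    ok = sum(1 for _ in range(requests) if service())
    return ok / requests

def healthy_service():
    """A toy stand-in for a production service that always responds."""
    return True

def with_injected_faults(service, failure_rate):
    """Fault injection: randomly drop a fraction of calls to the service."""
    def wrapped():
        if random.random() < failure_rate:
            return False  # simulated failure (dropped request)
        return service()
    return wrapped

def run_experiment(service, failure_rate, slo=0.9):
    """Hypothesis: even under injected faults, success rate stays >= SLO."""
    baseline = measure_success_rate(service)
    under_fault = measure_success_rate(with_injected_faults(service, failure_rate))
    return {
        "baseline": baseline,
        "under_fault": under_fault,
        "hypothesis_holds": under_fault >= slo,
    }

result = run_experiment(healthy_service, failure_rate=0.05)
print(result)
```

Note that the loop only automates the experiment and the measurement; deciding what to do when the hypothesis fails is still a human job - which is exactly the part that survives the translation to the organizational domain.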
Besides: who'd be responsible for conducting such "fault-proofing" of the organization? The most obvious answer is Human Resources (or People Ops, or Talent Management, or whatever you want to call them) - but in how many organizations does HR have enough real "power" (influence) and experience to conduct such experiments? Such a role would require a very in-depth understanding of how (nearly) every unit operates (within the whole organization).
Of course, everyone can come up with a bunch of "generic" problems (e.g., a conflict in the team, a major offense/crossing a line, a challenging hiring decision, testing the boundaries of autonomy, an accountability crisis - lack of ownership). But the idea is to build resilience against "the unknown unknowns", "black swans" - unexpected/rare events we tend to overlook/forget about.
So, what do you think? Is the idea even feasible?
Did any of the organizations you know try something like that in practice?
Maybe there were some dedicated people on the payroll whose sole responsibility was to harden & improve the organization by trying to break it from the inside?
Please feel free to share your opinions/experiences in the comments below. Thanks in advance.