What I'm going to write below is nothing new for anyone who has done fully fledged distributed systems development. BEAM veterans, hAkkers, Rafters, ZooKeeper consensus pursuers or the Paxos tribe would probably call me Cpt. Obvious, but surprisingly (or not) the majority of .NET people I meet & talk to still live the myth of ...
Developers worship Defensive Programming. Once they realize (some never do) that there's more to life than the Happy Path, they do whatever they can to get back to it ASAP when derailed a bit. They are fully convinced that they are not only capable of identifying all potential faults (that can, and will, happen), but (what is more) of handling them gracefully, so processing continues as if nothing had happened.
Such pride doesn't end well:
- excessively long time-outs
- complex, scattered & unreadable exception handling code
- half of the codebase is there just to deal with various compensation scenarios
And all of that for nothing - it's just naivety. It's pretty much impossible to write systems that won't err. And what's more important, being highly available doesn't mean that particular requests / operations never fail. The key point is that the system AS A WHOLE doesn't fail (it stays available, responsive, resilient, etc.).
The idea is not to prevent crashing at all costs, but to ...
Let it crash
If you haven't heard this statement before, start by reading a primer like this one. Yes, you read that right - the idea is to crash when you encounter problems, but:
Fail Fast, because low responsiveness (high latency) is far worse than a prompt notice about the error. Long time-outs don't just hide actual problems & make troubleshooting harder: if they get out of sync across the layers (tiers), you may even get time-outs for requests that were actually processed properly.
Fail locally - use bulkheading to isolate failures within the nearest context, so the remaining part of the system (the part that operates properly) is unaffected. By 'unaffected' I don't just mean proper (& uninterrupted) execution, but also load handling, resource sharing, overall stability, etc.
Avoid the overload caused by errors or any other kind of unusual behavior - if processing an error takes longer than processing a normal request & the number of incoming requests / transactions stays pretty much constant, repeated errors may effectively (intentionally or not) DoS your app's services. To prevent that, use some kind of Circuit Breaker.
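To make the Circuit Breaker idea a bit more tangible, here's a minimal sketch in Python (picked just for brevity, it's not tied to any stack mentioned here). The names `CircuitBreaker`, `CircuitOpenError`, `max_failures` & `reset_after` are made up for this illustration & don't come from any particular library. It's also not thread-safe, so treat it as a picture of the state machine rather than something to ship.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    """Illustrative breaker: closed -> open -> half-open -> closed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable, which keeps tests simple
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: don't burn time on a dependency known to be failing.
                raise CircuitOpenError("circuit open, call rejected")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # A success closes the circuit & resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping a flaky dependency then looks like `breaker.call(fetch_price, order_id)`, where `fetch_price` stands for whatever remote call you want to protect. While the circuit is open, callers get a `CircuitOpenError` immediately instead of queuing up behind a long time-out, which is the Fail Fast behavior described above.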
All techniques mentioned above are important & valuable, but the first & the most important step is to accept the fact that errors & failures will keep happening and ...
It's absolutely normal
The best thing you can do is to embrace this fact & ... befriend them :) Instead of denying their existence, invest in:
- common logging format (semantic logging ain't a stupid idea)
- integrated log browsing with fast access & real-time search
- proper monitoring (both real-time & historical analysis) of crucial metrics
- log archiving (so your infrastructure isn't weighed down by piles of old, rarely needed logs)
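As a rough illustration of the first bullet, here's what a common, machine-readable log format could look like, sketched with nothing but Python's standard `logging` & `json` modules. The `JsonFormatter` & `build_logger` names, as well as the convention of passing structured data under a `fields` key, are assumptions made up for this example, not an established standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Assumed convention: structured data arrives via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines (to stderr when stream is None)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

A call like `build_logger("orders").error("payment_failed", extra={"fields": {"order_id": 42}})` then emits one JSON line with `event`, `level` & `order_id` as separate keys, so a log browser can filter on them directly instead of regex-ing free text.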
Having tools like the ELK stack deployed and set up in all your environments is not an extravagance anymore. It's a crucial element of your development architecture & if you're missing it, you're not doing your stuff properly. Not being able to efficiently troubleshoot your application(s) is unacceptable for a professional software developer / team. It's the first step towards getting overburdened with support / maintenance duties & crippling the capacity you could otherwise use to create something that adds value.
Pic: © Maxim_Kazmin - Fotolia.com