The silent assassin that knows no mercy - how *coupling* chokes software (part III)

This is part III of my series on coupling. If you want to read the earlier coupling-related posts, you can find them here:

Part I - the theory of coupling
Part II - practical examples of coupling and how it really affects your apps (& services!)

Now it's time for Part III - let's bring up some ideas about how to fight coupling off. To make things more spicy, let's start with the trickiest kind - run-time coupling.

What do we want to achieve?

We want to make sure that:

Disruption of one service has a minimal impact on other services.
An upgrade, or any other maintenance action that isn't related to a failure, can be performed on one service without disrupting what that service provides.
If one of the services fails, even one that is only called indirectly, it's still possible to quickly find the actual point of failure.

The obvious obviousness

The points above can be easily addressed using the following ideas:

Design for failure - never assume that anything you depend on works properly: services may be unreachable, the database may be down, files may be missing or the disk may be full. Sure, handling all of that will bloat your code-base, but if you do it in a smart way (AOP, handlers, interceptors, filters, etc.), the burden will be barely noticeable
Disconnect your components - use a brokered MOM (message-oriented middleware) or a service bus to minimize how much components rely on each other:
make sure a component doesn't even know who handles the messages it sends
model each message as a request for a particular business service, not as a call to a pre-defined component
Fail fast - avoid long timeouts: in the vast majority of cases, a timeout longer than 5 seconds means you've designed something badly; it's far better to fail fast and deal with the error in a clear and serviceable way than to hope in vain that a long timeout will solve it for you
Work out the retry strategy - things will fail, especially in a distributed scenario, and they will fail constantly and continuously. That's why you should:
make your operations repeatable
make your operations idempotent or ...
... make your operations identifiable, so you can avoid double execution
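
The last two points are easier to see in code. Below is a minimal sketch (not a production recipe) of an identifiable operation: the caller attaches a request ID, so a retry after a lost response doesn't execute the side effect twice. `PaymentService`, `TransientError` and `call_with_retry` are hypothetical names made up for this illustration, and the "service" is just an in-memory object.

```python
import uuid


class TransientError(Exception):
    """Raised when a call failed but may succeed if it is repeated."""


class PaymentService:
    """Hypothetical callee that remembers which request IDs it already handled."""

    def __init__(self, drop_first_response=False):
        self._results = {}                # request ID -> result of the first execution
        self._drop_next = drop_first_response
        self.executions = 0               # how many times the side effect really ran

    def charge(self, request_id, amount):
        if request_id in self._results:
            # Already executed: return the stored result instead of charging again.
            return self._results[request_id]
        self.executions += 1
        result = f"charged {amount}"
        self._results[request_id] = result
        if self._drop_next:
            # Simulate a lost response: the work is done, but the caller sees a failure.
            self._drop_next = False
            raise TransientError("response lost")
        return result


def call_with_retry(operation, attempts=3):
    """Call `operation` up to `attempts` times, re-raising the last transient error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientError as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    service = PaymentService(drop_first_response=True)
    request_id = str(uuid.uuid4())        # one ID for all attempts of the same operation
    print(call_with_retry(lambda: service.charge(request_id, 10)))
    print("side effect executed", service.executions, "time(s)")
```

The important detail is that the request ID is generated once, outside the retry loop. A real implementation would also need a short per-call timeout and a durable store for the handled IDs, both left out here.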

Awesome, but it won't work here ...

Unfortunately, in many cases brokered MOMs are out of the question. UI-centric solutions usually require immediate responses, systems are not designed for asynchronous operations and the calls have to be request-response. What then?

Routing (aka dynamically changing your connection endpoint details in real time) is the word you're looking for.

If you have to maintain point-to-point connections, you need to find an agile way to re-route the communication (in run-time, in a matter of seconds after making the decision about change) once:

you've detected that a service provider is off-line / doesn't work properly
the service provider itself asks to be switched off (whatever the reason is)
you (as an admin) want to upgrade the service smoothly (by powering up a new, upgraded instance)

So, in short - what you need is a run-time configuration server that:

serves the addresses of particular services, identified by a commonly known key (like a contract name)
is durable and failure-resilient (well, we're talking about a potential Single Point of Failure, aren't we?)
serves data in a generic way, to satisfy all the endpoints regardless of the tech they use
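
To make the idea more tangible, here is a rough sketch of the client side of such a setup. It assumes an in-memory `ServiceRegistry` standing in for the configuration server; all names here (`ServiceRegistry`, `call_contract`, `EndpointDown`, the `billing.v1` contract and the addresses) are hypothetical. A real registry would be a separate, replicated service, not a dictionary.

```python
class EndpointDown(Exception):
    """Raised by the transport when a provider cannot be reached."""


class ServiceRegistry:
    """In-memory stand-in for the run-time configuration server."""

    def __init__(self):
        self._endpoints = {}  # contract name -> ordered list of addresses

    def register(self, contract, address):
        addresses = self._endpoints.setdefault(contract, [])
        if address not in addresses:
            addresses.append(address)

    def deregister(self, contract, address):
        addresses = self._endpoints.get(contract, [])
        if address in addresses:
            addresses.remove(address)

    def resolve(self, contract):
        """Return a snapshot of the addresses currently serving the contract."""
        return list(self._endpoints.get(contract, []))


def call_contract(registry, contract, send):
    """Try each registered provider in turn and drop the ones that are down.

    `send` is a caller-supplied function that takes an address and returns
    a response, or raises EndpointDown.
    """
    for address in registry.resolve(contract):
        try:
            return send(address)
        except EndpointDown:
            # Re-route: remove the dead provider so later calls skip it.
            registry.deregister(contract, address)
    raise LookupError(f"no reachable provider for contract {contract!r}")


if __name__ == "__main__":
    registry = ServiceRegistry()
    registry.register("billing.v1", "10.0.0.1:8080")
    registry.register("billing.v1", "10.0.0.2:8080")

    def send(address):
        if address == "10.0.0.1:8080":
            raise EndpointDown(address)
        return f"ok from {address}"

    print(call_contract(registry, "billing.v1", send))
    print(registry.resolve("billing.v1"))
```

The same two registry operations cover all three re-routing triggers listed earlier: a detected failure calls `deregister`, a provider that asks to be switched off calls `deregister` on itself, and an admin doing an upgrade calls `register` for the new instance before calling `deregister` for the old one.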

Don't worry, we've got something for you

Fortunately, you don't have to create such a configuration server on your own - there are solutions that fit that purpose perfectly (or almost perfectly):

Apache ZooKeeper - a so-called distributed synchronization service, the most mature of the solutions mentioned here. At first glance, it may look a bit complicated, but the idea is quite simple:
you can publish data into a hierarchical, technology-agnostic namespace (a tree of nodes, each holding a small chunk of raw bytes)
the access is not limited to one particular platform / programming language: there are various clients for different platforms
service distribution is not based on a simple gossip protocol - it's more complicated under the hood (deployment-wise it's still just one type of service on all nodes), but it's also far more resistant to issues like network partitions
the key point is that ZooKeeper isn't much more than a distributed storage with a reasonable architecture and a nice set of client libraries - if you want to use it for a particular purpose (like service discovery), you have to wrap it properly on your own

If you want to check out ZooKeeper's features, start here or get this book - a recommended read!

Serf - a very simple (but still powerful) approach, clearly aimed at DevOps as it has no programming API, just executables meant to be called in scripts. What's so special about Serf?
it's more of a discovery service than a configuration service: it lets you react (by associating a handler with an event) to membership changes in a distributed system. For instance, if starting a new instance of a service also makes its Serf agent join the Serf cluster, the join event could trigger appending the new instance's address to the configuration files on the machines of the other cluster members. A perfect tool for distributed system admins, isn't it? :)
due to its simplicity, its error-proneness is quite limited, but OTOH it's fully based on gossip without a 'leader' node, so it may behave weirdly in network partition scenarios
Serf itself doesn't store any data, its power is based on membership, not distributed storage - more details to be found here.
Consul - the new sheriff in town, a serious contender brought to you by HashiCorp, the creators of Vagrant, Packer and Serf. How does it differ from ZooKeeper and Serf?
in short: it's like a ZooKeeper dedicated specifically to configuration management - no wrapping is needed
it has a very nice set of HTTP & DNS (!) APIs: HTTP for both publishing and reading configuration information, DNS for lookups
it uses Serf as a low-level foundation, but its actual architecture is more similar to ZooKeeper's (there's actual leader election, based on Raft), and it's supposed to deal with multi-datacenter scenarios as well
there are some health check features as well, but to be honest I haven't had a chance to give them a go yet - there are more details to be found on Consul's website.
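
To give a feel for what "no wrapping needed" means, here is a small sketch of a service lookup over Consul's HTTP API. It assumes a local agent on the default port 8500 and the `/v1/catalog/service/<name>` endpoint, and the field names (`Address`, `ServiceAddress`, `ServicePort`) are taken from my reading of Consul's catalog documentation, so check them against the version you run. The helper names `parse_catalog_response` and `lookup_service` are made up for this example.

```python
import json
import urllib.request


def parse_catalog_response(body):
    """Turn the JSON body of a catalog lookup into a list of "host:port" strings."""
    endpoints = []
    for entry in json.loads(body):
        # ServiceAddress may be missing or empty (assumption: depends on the
        # Consul version and registration); fall back to the node's address.
        host = entry.get("ServiceAddress") or entry["Address"]
        endpoints.append(f"{host}:{entry['ServicePort']}")
    return endpoints


def lookup_service(name, consul_url="http://127.0.0.1:8500", timeout=2.0):
    """Ask a Consul agent which nodes provide the service called `name`."""
    url = f"{consul_url}/v1/catalog/service/{name}"
    # Short timeout on purpose: fail fast instead of hanging on a dead agent.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_catalog_response(response.read())
```

With an agent running, `lookup_service("redis")` would return the addresses of the nodes registered for a service named `redis` (a hypothetical service name). The parsing is kept in a separate function so it can be exercised without a live agent.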

Consul is still very fresh (its first release was published just a few days ago), but even before it was officially released, it had been tested & proven in production scenarios.

In the next episode: my own hands-on experience with all three tools mentioned above. Stay tuned.
