In 2016, pretty much every commercial software system built with modern tech is a distributed system. Yes, we all build distributed systems these days. Obviously, the term "distributed" is very broad - I'm not saying we're all creating stuff like eventually consistent, sharded data stores (these should be left to the specialists who excel at them), but having 2+ layers of independently scaled, communicating processes in production is today's bread'n'butter.
I'll save you all the standard talk about the usual implications, like increased complexity, the need for being asynchronous & message-driven, etc. - if you follow my blog posts at least once in a while (or just share an interest in distributed systems), you most likely know those by heart (& hopefully apply this knowledge in your everyday work).
Let's get into something that I've barely mentioned before ...
Message-driven ain't enough
So, you've ...
- got rid of RPC or secured it with circuit-breakers & short time-outs
- applied bulk-heading to isolate the components
- deployed message queues (either persistent or not) to off-load the spikes
And you think you're good & done? Great, except for the fact that you're not.
It's not that I don't believe you ...
I assume you're able to cut off failing components, insta-report/log faults, avoid starvation due to locking shared resources, etc. Everything done by the book. It means that your components will more likely survive the crisis when it happens (which is already an achievement, no doubt about that), but ...
- they won't be able to give early warning to consumers (so the consumers can adjust their behaviour in a way that prevents the problem from happening)
- they won't have any means to give hints (of how to react) to consumers (retry? wait? reduce request numbers?)
- using queues to isolate communicating parties is OK, but if the consumers' throughput is consistently lower than the producers', it will work only for a limited time - queues can't grow indefinitely ...
- queues don't really solve the balancing problems (except for short-term spikes), they just provide a (time) buffer, so some other "causative agent" deals with them - either by manual or automated re-configuration (adjusting the number of component instances on various layers & traffic between them)
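To make the "queues can't grow indefinitely" point tangible, here's a minimal Python sketch of a bounded buffer. The `offer` helper and the capacity of 3 are made up for illustration; the only real API used is the standard library's `queue.Queue`. Once the buffer is full, the producer gets an explicit "no" instead of the queue silently growing.

```python
import queue


def offer(buffer, item):
    """Try to enqueue without blocking; report whether the item was accepted."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        # The buffer is at capacity: the caller has to slow down, retry or drop.
        return False


# Capacity of 3 is an arbitrary illustrative value.
buffer = queue.Queue(maxsize=3)

# Nobody is draining the queue here, so only the first three offers succeed.
results = [offer(buffer, n) for n in range(5)]
print(results)
```

The rejected offers are exactly the moment where some "causative agent" has to step in, which is what the rest of this post is about.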
Hence the idea of backpressure.
The shortest possible version - it's a feedback mechanism used by producers to let consumers know how they should adjust their "appetite", e.g.:
- if producers can't satisfy consumers' needs, the consumers should restrain themselves a bit
- if the load from different consumers is equal, but processing some requests is "lighter" (some resources are wasted), the consumers who make these requests can increase their load / batch size / etc. (while the others should stay at the current level or even reduce it)
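A rough Python sketch of the consumer side of such a dialogue. Everything here is an assumption made for illustration: the hint vocabulary ("reduce" / "hold" / "increase"), the `next_batch_size` name and the halve-on-reduce, add-one-on-increase policy are not taken from any particular library or protocol.

```python
def next_batch_size(current, hint, minimum=1, maximum=100):
    """Return the batch size a consumer should use after a producer's hint.

    `hint` is a hypothetical signal: "reduce", "hold" or "increase".
    """
    if hint == "reduce":
        proposed = current // 2   # back off quickly when the producer struggles
    elif hint == "increase":
        proposed = current + 1    # probe for spare capacity slowly
    else:
        proposed = current        # "hold" or an unknown hint: change nothing
    # Keep the result within the configured bounds.
    return max(minimum, min(maximum, proposed))


size = 16
for hint in ["reduce", "reduce", "hold", "increase"]:
    size = next_batch_size(size, hint)
print(size)
```

Backing off fast and ramping up slowly is just one possible policy; the point is only that the consumer reacts to what the producer tells it.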
As should be more than clear by now, this is quite a change in the communication paradigm -> now it starts to look like a true dialogue, where proper, healthy, balanced system behaviour is shaped by the mutual cooperation of all communicating parties.
I'm not saying that the additional meta-information returned ("meta", as it's standardised & not related to the actual business domain) is complex or something, it just should be agreed upon as part of some generic protocol, to make backpressure a "feature" of every inter-component interaction, not something that has to be re-invented / manually added for each new chunk of functionality.
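One way to picture "generic, not re-invented per feature" is a wrapper that attaches the same small piece of meta-information to every reply, whatever the business handler does. This is only a sketch under assumptions: the `Envelope` type, the `with_backpressure` wrapper, the `load_probe` callback and the 0.8 / 0.3 thresholds are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Envelope:
    payload: Any   # the actual business result
    hint: str      # standardised meta-information: "reduce", "hold" or "increase"


def with_backpressure(handler: Callable[[Any], Any],
                      load_probe: Callable[[], float],
                      high: float = 0.8,
                      low: float = 0.3) -> Callable[[Any], Envelope]:
    """Wrap any handler so that every reply carries a load hint.

    `load_probe` is assumed to return the current utilisation in the 0.0-1.0 range.
    """
    def wrapped(request: Any) -> Envelope:
        load = load_probe()
        if load >= high:
            hint = "reduce"
        elif load <= low:
            hint = "increase"
        else:
            hint = "hold"
        return Envelope(payload=handler(request), hint=hint)
    return wrapped


# Fake load readings stand in for a real metric source.
loads = iter([0.1, 0.5, 0.95])
echo = with_backpressure(lambda text: text.upper(), lambda: next(loads))
replies = [echo("a"), echo("b"), echo("c")]
print([(r.payload, r.hint) for r in replies])
```

The handler itself knows nothing about load; the hint comes for free with every interaction that goes through the wrapper.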
Backpressure ain't a new idea, but it got more attention when it was included in the definition of a reactive application (within the Reactive Manifesto) & (especially) when the Reactive Streams specification was born (yes, I know it has roots in Rx, but I won't be going through the whole story this time ;>).
I find Reactive Streams a brilliant idea & I couldn't help being enthusiastic about this initiative from the very beginning - I think it's a great example of what the .NET Community (& many other ones) can envy the JVM Community for. And to prove that it's not just a theory, there are actual, mature implementations behind it already - you can find the list of them here: https://en.wikipedia.org/wiki/Reactive_Streams.
But what about user-facing applications? Users tend to generate load in quite a consistent way (like a bloody, never-ending avalanche ...) - if latency rises or errors start to appear, it doesn't reduce the inflow -> actually, the inflow may even go up, because people keep refreshing pages, retrying operations, etc.
In the case of a "cyber" consumer (a process) it's easy - we assume it follows the convention set by the backpressure (e.g. reduces the load, if asked), but what can we do to "force" actual humans to collaborate? My preferred solution is to decide "for them" - adjust the client app logic to respond to feedback (e.g. by disabling some functionality, introducing local caching, debouncing or throttling).
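As an illustration of deciding "for them", here's a small Python sketch of a client-side throttle that lets through at most one call per interval and stretches or shrinks that interval based on feedback from the backend. The class name, the hint values and the doubling / halving policy are assumptions made up for this example, and the clock is injectable only so that the sample is deterministic.

```python
import time


class AdaptiveThrottle:
    """Client-side gate: rejects calls arriving sooner than `interval` seconds
    after the last accepted one."""

    def __init__(self, interval=0.5, clock=time.monotonic):
        self.interval = interval
        self._clock = clock
        self._last = None  # time of the last accepted call

    def allow(self):
        now = self._clock()
        if self._last is not None and now - self._last < self.interval:
            return False   # too soon: ignore the click / refresh / retry
        self._last = now
        return True

    def on_feedback(self, hint):
        # Hypothetical hints coming back from the backend.
        if hint == "reduce":
            self.interval = min(self.interval * 2, 30.0)
        elif hint == "increase":
            self.interval = max(self.interval / 2, 0.1)


# Scripted timestamps replace the real clock for the sake of the example.
ticks = iter([0.0, 0.2, 0.6, 1.0, 1.7])
throttle = AdaptiveThrottle(interval=0.5, clock=lambda: next(ticks))

before = [throttle.allow() for _ in range(3)]   # calls at t=0.0, 0.2, 0.6
throttle.on_feedback("reduce")                  # interval grows from 0.5 to 1.0
after = [throttle.allow() for _ in range(2)]    # calls at t=1.0, 1.7
print(before, after)
```

The user can keep hammering the refresh button as much as they like; what actually reaches the backend is governed by the feedback.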
GenStage enters the stage
But I have another reason (apart from the pure fact that backpressure is crucial for the resilience & scalability of apps/services) to explore this topic today: just a few days ago Jose Valim announced the official release of GenStage, which is ... yes, you've guessed it right: Elixir's idea of how to implement inter-process event exchange with backpressure (obviously on OTP).
Jose's introduction (link above) is just perfect to get familiar with the basics (no worries about the steep learning curve, it's actually very approachable), but I'd recommend grokking one of the details in particular - GenStage.Flow. Plainly speaking - Flow is a conceptual extension of backpressure-aided event communication to unbounded event streams!
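To be clear, what follows is not GenStage or Flow code (and it's not Elixir either) - it's a tiny Python sketch of the underlying idea, using plain generators: the source is unbounded, but it only produces as many events as the consumer explicitly asks for. The `producer`, `stage` and `request` names are made up for this example.

```python
import itertools


def producer():
    """Unbounded event source; it advances only when the consumer pulls."""
    for n in itertools.count(1):
        yield n


def stage(events):
    """An intermediate step: transforms events one by one, on demand."""
    for event in events:
        yield event * event


def request(stream, demand):
    """Pull at most `demand` events from the stream."""
    return list(itertools.islice(stream, demand))


pipeline = stage(producer())
first = request(pipeline, 3)    # the consumer asks for three events
second = request(pipeline, 2)   # ... and later for two more
print(first, second)
```

Nothing is computed ahead of demand, so an infinite stream is perfectly fine: the pace is set by whoever is pulling.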
How cool is that?! Yes, exactly.
For now it's still very experimental (& marked as such), but I've decided to give it a try within the next couple of days anyway - you can expect some written impressions soonish. In the meantime, if you want to check out a general GenStage usage sample, here's a concise & clear piece of code that actually does something: http://www.elixirfbp.org/2016/07/genstage-example.html
Pic: © nikolayn - Fotolia.com