Lambda Architecture - why should we care?

Have you heard about Lambda Architecture (LA)? It's an interesting concept introduced by Nathan Marz ("the father" of Apache Storm), and it's basically about processing massive quantities of data using two parallel streams:

  • batch stream - this one can do pretty much anything with an unlimited quantity of data, but it comes with latency: stored data has its inertia, and crunching it (even when distributed) can take a lot of time
  • real-time stream - this one doesn't really care about all the data you've ever collected, only about the most recent part of it; it aims for minimum latency, but the range of data being processed is limited up front. It brings the highest benefit for data whose value vanishes over time, in scenarios that require pretty much immediate interaction
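To make the split a bit more tangible, here is a minimal sketch in Python. It assumes a toy event-counting use case; `Event`, `batch_view`, `realtime_view`, `query` and the `cutoff` value are all hypothetical names made up for illustration, and merging both results at query time is how LA is commonly described rather than something spelled out above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int  # hypothetical unit: seconds since some starting point
    key: str        # what happened, e.g. "login"


def batch_view(events, cutoff):
    """Batch stream: recompute counts from ALL stored events up to the cutoff.

    Thorough but slow; in real life this would be a long-running distributed job.
    """
    return Counter(e.key for e in events if e.timestamp <= cutoff)


def realtime_view(events, cutoff):
    """Real-time stream: only look at events newer than the last batch run."""
    return Counter(e.key for e in events if e.timestamp > cutoff)


def query(key, batch, realtime):
    """Answer a query by combining the batch result with the recent part."""
    return batch[key] + realtime[key]


# Toy data: the last batch run covered everything up to timestamp 10.
events = [
    Event(1, "login"),
    Event(2, "login"),
    Event(3, "purchase"),
    Event(11, "login"),
    Event(12, "purchase"),
]
cutoff = 10

batch = batch_view(events, cutoff)
realtime = realtime_view(events, cutoff)

print(query("login", batch, realtime))     # 2 from batch + 1 recent
print(query("purchase", batch, realtime))  # 1 from batch + 1 recent
```

The point of the sketch is only the division of labour: the batch side owns the full history, the real-time side owns whatever arrived since the last batch run.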

You can find more information about Lambda Architecture here.

Why does that deserve a blog post? Because I've been speaking with several people about LA and usually the response is pretty much the same:

Real-time? We don't need real-time. Currently we have all our data analysis in an enterprise data warehouse that's able to provide us with results that are T-24h at best. And we're good with that. Real-time sounds like overkill for now - our company's inertia makes us unable to react in a shorter period of time anyway.

Even if that looks reasonable at first glance, there's something missing. Real-time data processing isn't really about typical reporting: P&L, balances, etc. It's more about:

  • early warnings
  • monitoring
  • security
  • detecting fraud
  • identifying opportunities & threats

Yes, sometimes it doesn't matter whether you detect an issue after 1 second or after 1 hour, but the massive quantity of input data may make you UNABLE to detect it after 1 hour anyway! Big Data isn't just about Volume, it's also about Velocity - the speed at which you absorb information. Sometimes you have to do it ridiculously fast, because there's already shitloads of data waiting 'in the queue', and the sooner you start, the smoother the flow is.
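A back-of-the-envelope sketch of that queue effect, in Python. The numbers and the `drain_time` helper are purely illustrative assumptions (constant arrival and processing rates, nothing measured): whatever piles up while you wait has to be cleared using only the spare capacity you have on top of the incoming flow.

```python
def drain_time(arrival_rate, processing_rate, start_delay):
    """Seconds needed to clear the backlog built up during start_delay.

    arrival_rate / processing_rate are events per second (assumed constant).
    Returns None when processing can never catch up with the input.
    """
    backlog = arrival_rate * start_delay      # events waiting 'in the queue'
    spare = processing_rate - arrival_rate    # capacity left for the backlog
    if spare <= 0:
        return None
    return backlog / spare


# Hypothetical system: 10,000 events/s coming in, 12,000 events/s capacity.
print(drain_time(10_000, 12_000, 1))      # start after 1 second  -> 5.0 s to catch up
print(drain_time(10_000, 12_000, 3600))   # start after 1 hour    -> 18000.0 s (5 hours)
print(drain_time(10_000, 9_000, 1))       # too slow              -> None, never catches up
```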

Apart from the technical rationale (based on the specifics of streaming data processing) - is there really added value in having immediate feedback like the one described above? Well, that kinda depends on what you expect of your IT systems:

  • if you prefer a proactive approach instead of handling consequences ...
  • if you don't like leaving anything to chance ...
  • if you want your applications to truly adapt to end-user needs instead of guessing / spamming them ...
  • if you're not satisfied with mediocrity, but aim for a high-quality service your customers will admire & recommend to others
  • if you believe in continuous improvement (which is dependent on the short feedback loop)

... then I believe you're already convinced. Otherwise, sorry for wasting your time.

Pic: © Les Cunliffe - Fotolia.com