Bakin', bakin', bakin' - logging for enterprise scale apps (Part 1)

Story: large, on-line app that servers hundreds of thousands users in the real-time (via web and mobile) has some issues with proper managing events and logs. Logging is centralized (in relational DB), has a SPoF (http://en.wikipedia.org/wiki/Single_point_of_failure) and its load has an impact on the overall solution performance. Additionally, the time gap between requesting access to the logs and actually getting into it physically is nothing close to instant.

Task: make it better.

Here we go. Let me present some brainstorming we’ve went through - there were several interesting options considered, so I guess it’s worth reading.

Idea #1 - Root of all evil, yarr

First, let’s take it out of relational DB. There’s really no point in putting it there as:

application runs on several parallel hosts, so storing the logs in the DB kills the out-scalability that was pretty much there
relational DB for querying is neat, but you won’t let users make query on the DB you’re logging into in the same time, righto?
logging is synchronous, so on-line replication can make things even worse

Agreed.

Idea #2 - Lord of the Flies, erm… Files

Why won’t each node log into the files locally (on the same node) instead of DB? Ol’ good log4net file appenders should do well, don’t you think? And once the information is within files, we could setup either LogStash (http://logstash.net/) or even Splunk (http://www.splunk.com) to rip through that.

But flatfiles have some obvious cons - you can get your data as soon as logging process switched to another file (rolling) and it’s very inconvenient for any analytical operations. So, files as a mean of transport = yes. But basic files as a repository of logged information = no.

Nice try though.

Idea #3 - Hadoop (HDFS)

If we have already parallelized application, why won’t we go Hadoop? Each node could write to the separate storage (HDFS) and we could run map-reduce jobs to bucket the events into organized bundles (for instance - sorted by user & date). It looks appealing, but … Hadoop sucks for streamlined data. It’s awesome once your files are closed and complete, but it means that the latency between actual event and its availability to show up in the analysis will remain.

Kinda sucks, let’s leave Hadoop be. For a while.

Idea #4 - Asynchronous logging

So, what about … - instead of logging into DB, put the logged events into some kind of persistent queue (MOM, brokered - http://en.wikipedia.org/wiki/Message-oriented_middleware). Ol’righty, performance-wise it’s a good step, but it’s just a pre-step to the actual solution. And we have to keep in mind, that if we decide to make logging asynchronous and logging is still critical (every operation has to be traceable in logs), we have to address it somehow.

Idea #5 - What to do with the queue?

Well, if something goes in, it ought to go out. But what do we expect of the reader who receives the logging events? First, that it has to be out-scalable, to make sure that we don’t get overflooded with messages. Next, I would expect it to do some preliminary event classification (initial segmentation) to separate the information with different level pf importance (some may be needed for a short period of time, some for longer - that would require different archiving strategy). That sounds like a Hadoop in real-time, but how to do that?

[ To be continued - it will get a bit Stormy and you can expect some hipster databases inbound ]