Where did we stop? As far as I remember, we were brainstorming about the shape of an enterprise-ready, large-scale logging solution for an IT landscape running several applications for 1M+ users. And we left off on a bit of a cliffhanger while dealing with queues of events …
Idea #6 - It’s all about events
What would we like to do with all those nice, shiny events we’ve received from the source systems? Well, we’d like to have them written to our storage, that’s for sure. But do we need all of them in one particular storage? What if we want to classify the logged events? Segment them according to their importance or level of abstraction?
That’s not the only idea - what about transforming events into another kind of “aggregated” events? For instance, the system could compute average response times within 5-minute windows (time frames) and generate events carrying that information, so it can be stored for future analysis. The system could analyze other event patterns as well: look for suspicious behavior, or just analyze usage patterns to learn something about how end-users actually work.
This is called Complex Event Processing (http://en.wikipedia.org/wiki/Complex_event_processing), and Apache Storm (http://storm-project.net/) is the tool we’ve picked for it.
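To make the 5-minute averaging idea a bit more tangible, here is a minimal sketch in plain Python. It is not Storm code and not our actual topology; the event classes, field names and the aggregate function are all made up for illustration. It only shows the core trick: round each timestamp down to the start of its window, group by application and window, and emit one aggregated event per group.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_SECONDS = 5 * 60  # 5-minute windows, as in the text


@dataclass(frozen=True)
class ResponseEvent:
    """Hypothetical raw event: one request served by an application."""
    app: str
    timestamp: float  # seconds since the epoch
    response_ms: float


@dataclass(frozen=True)
class AvgResponseEvent:
    """Hypothetical aggregated event, one per application and window."""
    app: str
    window_start: int
    avg_response_ms: float
    count: int


def aggregate(events):
    """Group raw events into fixed windows and average their response times."""
    sums = defaultdict(lambda: [0.0, 0])  # (app, window_start) -> [total, count]
    for e in events:
        # Round the timestamp down to the beginning of its window.
        window_start = int(e.timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        bucket = sums[(e.app, window_start)]
        bucket[0] += e.response_ms
        bucket[1] += 1
    return [
        AvgResponseEvent(app, start, total / count, count)
        for (app, start), (total, count) in sorted(sums.items())
    ]
```

This version works on a finished batch of events. A streaming engine would have to do the same thing incrementally and decide when a window is closed, which is exactly the part I'd rather leave to a dedicated tool.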
Idea #7 - Storage should be tailored
In a perfect world I’d like a storage that never loses any information, keeps the data efficiently and without redundancy, scales out out-of-the-box and allows all possible selection criteria without additional costs (either space or load). Too bad the perfect world doesn’t exist: you should match your storage to your needs, as each one is different. And what if there’s none that would meet all your requirements? Go for more than one.
I can imagine optimizing one type of storage for immediate queries that work on real-time data (they lack agility due to limited querying criteria - that’s the price of immediate availability) and another storage that lets us perform any kind of analysis, but where data is chunked and free-roaming queries may take some time.
Duplicating information ain’t perfect (it makes agreeing on a Single Version of Truth more than critical), but thanks to that:
- we can provide up-to-date and immediately queryable (using the set criteria) data in Apache Cassandra (http://cassandra.apache.org/)
- all the data will be kept for any other kind of analysis in the well-known Apache Hadoop (http://hadoop.apache.org/) HDFS - waiting for Pig (http://pig.apache.org/) digestion :)
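The two-storage split can be sketched like this. It is only a toy model in plain Python: HotStore and ColdStore are hypothetical in-memory stand-ins, not Cassandra or HDFS clients, and the event fields ("app", "timestamp", "level") are assumptions of mine. The point is the trade-off: the hot store answers instantly, but only for the key chosen up front, while the cold store keeps everything in chunks and answers any question by scanning.

```python
class HotStore:
    """Stand-in for the real-time store: fast, but queryable by one preset key."""

    def __init__(self):
        self._rows = {}  # (app, hour) -> list of events

    def write(self, event):
        key = (event["app"], int(event["timestamp"] // 3600))
        self._rows.setdefault(key, []).append(event)

    def query(self, app, hour):
        # The only supported criterion is the key fixed at design time.
        return list(self._rows.get((app, hour), []))


class ColdStore:
    """Stand-in for the archive: keeps everything in chunks, scans on demand."""

    def __init__(self, chunk_size=1000):
        self._chunk_size = chunk_size
        self._chunks = [[]]

    def write(self, event):
        if len(self._chunks[-1]) >= self._chunk_size:
            self._chunks.append([])  # start a new chunk when the last is full
        self._chunks[-1].append(event)

    def scan(self, predicate):
        # Any criterion works, at the cost of reading every chunk.
        return [e for chunk in self._chunks for e in chunk if predicate(e)]


def store(event, hot, cold):
    """Write the same event to both stores (the duplication discussed above)."""
    cold.write(event)
    hot.write(event)
```

Usage would look like calling store(event, hot, cold) for each incoming event, then hot.query("shop", 0) for the dashboard and cold.scan(lambda e: e["level"] == "ERROR") for the occasional deep dive.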
Idea #8 - Visualize, visualize, visualize!
Although Pig or Hive (http://hive.apache.org/) are truly awesome, one could use something that won’t scare a non-techie away in 5 seconds. Something more user-friendly than a REPL or any other console window. There are several interesting options here - starting with the popular LogStash (http://logstash.net/), through the awesome, but quite expensive in an enterprise environment, Splunk (http://www.splunk.com), and ending with the open-source Kibana (http://www.elasticsearch.org/overview/kibana/).
[ To be continued - we’ll be using really sharp tools and we’ll burn some bridges behind us. ]