Within just the last week I’ve encountered two interesting (and very similar) design cases (two different projects, two different teams):
- A system will be flooded with a constant stream of simple data messages in a consistent format. Each data sample will contain a few simple metrics (mainly currency, price and quantity). The project team has to propose a solution that detects suspicious cases (for example: a new order for a given currency whose price is more than 20% higher than the average price for the last hour). Their plan is to record the data in SQL and run suitable queries at pre-set time intervals.
- A CRM system will detect customer behavior patterns in order to spot business opportunities and react as quickly as possible with a new offer, a contact recommendation, etc. Opportunity detection will be based on a prepared set of queries the team plans to execute against a SQL database at pre-set time intervals.
The problems are quite similar and the proposed solutions are pretty much the same. Why do I bother you with them, then? Because there’s one fundamental mistake being made here: instead of plain BI, both project teams should be using CEP (Complex Event Processing). Why?
Let’s start with the simplest definition of BI that is possible (while remaining valid), looking at it from the technical perspective and forgetting for a moment about the value it provides to users. BI is about a closed set of data and various dynamic queries executed at particular moments in time. The data structure is rigid, the data is closed (aggregated / transformed / whatever), the queries can be changed (adjusted), but until you run a query you don’t get any knowledge (and it may happen that the query takes a long time to complete).
Is this the way of doing things we want in these two particular cases? Not really. Our data will change every second (or even more frequently) and our data set will never be “complete” / “ready for processing” (or rather: it’s ready all the time, but every second “ready in a different way”, because it keeps changing). To stay accurate and be able to react immediately, we’d have to run the queries continuously, more than once per second. What’s even worse, look back at scenario #1: we have to compare each event’s data to averages from the last hour (a timeframe that is constantly moving), which would require creating and constantly updating the relevant average aggregates. That’s a huge load on the database.
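To make that load concrete, here is a rough sketch of what the “run queries at pre-set intervals” plan amounts to for scenario #1. Everything specific here is an assumption made up for illustration: the orders table with currency, price and created_at columns, the PostgreSQL-style interval syntax and the JDBC URL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PollingSpikeDetector {

    public static void main(String[] args) {
        // Hypothetical schema: orders(currency, price, created_at).
        // For every order seen since the previous poll, compare its price to the
        // per-currency average over the last hour, recomputed from scratch on every poll.
        String sql =
            "select o.currency, o.price " +
            "from orders o " +
            "where o.created_at > now() - interval '1 second' " +
            "  and o.price > 1.2 * (select avg(h.price) from orders h " +
            "                        where h.currency = o.currency " +
            "                          and h.created_at > now() - interval '1 hour')";

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        // To stay anywhere near "immediate", the whole thing has to run every second.
        scheduler.scheduleAtFixedRate(() -> {
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/orders_db");
                 PreparedStatement stmt = conn.prepareStatement(sql);
                 ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("Suspicious order: %s at %.2f%n",
                            rs.getString("currency"), rs.getDouble("price"));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 1, TimeUnit.SECONDS);
    }
}
```

Every tick repeats a full aggregation over the last hour of data, and even a one-second interval is still only an approximation of “react immediately”.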
That’s why we have CEP, Complex Event Processing. It’s a variant of BI where the queries are the rigid part and the data just “flows” through them, so they can analyze the data (or rather “the events”), categorize them, organize them into hierarchies, find the patterns being searched for, etc. CEP is a foundation of real-time systems: systems able to respond to complex business conditions immediately (bringing incredible value to our clients). According to David Luckham (“Event Processing for Business: Organizing the Real-Time Enterprise”), since 2009 we have been entering the maturity era of CEP, so why not use it when we have the opportunity?
Where to start? Are there any particular tools we can use? You could begin by looking at these two:
- If you’re already working in .NET and you have SQL Server “under the hood”, check out StreamInsight (http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/complex-event-processing.aspx).
- If you’re not bound to any particular database, but you’re still in the world of .NET (or Java), try Esper / NEsper (http://esper.codehaus.org/about/nesper/nesper.html); a minimal example follows right after this list.
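To give you a taste of what this looks like in practice, here is a minimal, self-contained Esper sketch of the scenario #1 rule. The OrderEvent class, the hard-coded 20% threshold and the sample values are my own assumptions, not code from either project (and note that in this simple form the one-hour average includes the arriving order itself):

```java
import com.espertech.esper.client.Configuration;
import com.espertech.esper.client.EPServiceProvider;
import com.espertech.esper.client.EPServiceProviderManager;
import com.espertech.esper.client.EPStatement;
import com.espertech.esper.client.EventBean;

public class PriceSpikeCep {

    // Hypothetical event for scenario #1: one incoming data sample.
    public static class OrderEvent {
        private final String currency;
        private final double price;
        private final int quantity;

        public OrderEvent(String currency, double price, int quantity) {
            this.currency = currency;
            this.price = price;
            this.quantity = quantity;
        }

        // Esper reads event properties through JavaBean getters.
        public String getCurrency() { return currency; }
        public double getPrice()    { return price; }
        public int getQuantity()    { return quantity; }
    }

    public static void main(String[] args) {
        // Register the event type so EPL statements can refer to "OrderEvent".
        Configuration config = new Configuration();
        config.addEventType("OrderEvent", OrderEvent.class);
        EPServiceProvider engine = EPServiceProviderManager.getDefaultProvider(config);

        // The standing query: per currency, keep a sliding one-hour window and
        // report any order priced more than 20% above that window's average.
        EPStatement statement = engine.getEPAdministrator().createEPL(
            "select currency, price, avg(price) as avgPrice " +
            "from OrderEvent.win:time(1 hour) " +
            "group by currency " +
            "having price > avg(price) * 1.2");

        // The listener fires the moment a matching order flows through the query.
        statement.addListener((newEvents, oldEvents) -> {
            if (newEvents == null) {
                return;
            }
            for (EventBean event : newEvents) {
                System.out.printf("Suspicious order: %s at %.2f (1h avg %.2f)%n",
                        event.get("currency"), event.get("price"), event.get("avgPrice"));
            }
        });

        // Events are pushed straight into the engine as they arrive; no database involved.
        engine.getEPRuntime().sendEvent(new OrderEvent("EUR", 100.0, 5));
        engine.getEPRuntime().sendEvent(new OrderEvent("EUR", 101.0, 2));
        engine.getEPRuntime().sendEvent(new OrderEvent("EUR", 150.0, 1)); // well above the window average
    }
}
```

The query is registered once and keeps standing; every event pushed into the engine updates the per-currency one-hour window, and the reaction happens immediately, with no polling and no database in the loop. NEsper offers essentially the same API on the .NET side.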
I hope this short intro has whetted your appetite for CEP; I’ll definitely write more about it in the near future.