Some time ago, an interesting article was published in Harvard Business Review - "Data Scientist: The Sexiest Job of the 21st Century". Some nodded politely, some had a bit of a laugh - very few took it seriously. Ironically, it seems that the words in the article's title were truly prophetic ... How did that happen?

What do you know about your users / customers?

So, your Enterprise has an IT system for everything, right? They are integrated, share data & provide joint interactions? Each of them keeps its own data repository & they provide star-schema-based reporting on transactional data via some BI? That's really great, but ... it's so 90s.

You can do proper P&L, balance, cashflow reporting that way - that's right. You can make drill-downs using customer / product dimensions as well. But it's just a record of dry, bare facts - focused on what has already been achieved (sold products, active customers, etc.). The point is that there's far more data to be collected & utilized than pure transactional data - for instance - all the users' actions & interactions, including:

  • transactions that were started but not finished (product sales, etc.)
  • reactions to direct marketing
  • recognized product interest

This data represents users' unexpressed intentions, unarticulated needs, decisions that are yet to be made, situations when a user hesitates & needs help to make a decision. The only way to get this extremely valuable information (& know your customer better!) is to digest this data.

Data != Information != Knowledge

Collecting data is not the goal, it's barely the beginning:

Raw data doesn't bring value by itself - it's rubbish unless further processed. Sadly, this is the step at which most Enterprises stop, satisfied with what they already have. Some don't even get that far - they don't collect non-transactional data at all.

The next step is turning data into information: a structured form that can be processed. In theory, this step doesn't require much effort, but it does require deep, detailed knowledge of the source system (& of how the data is generated), and long-term it has to be properly maintained.

Information is not knowledge yet. Knowledge is buried deep within the piles of information, waiting to be properly identified, distilled & validated using scientific methods.

Let me give you an example:

  1. A user's raw geo-location, once collected, is just data.

  2. This geo-data, when parsed & put in proper context (matched with time of day, estimated speed), is information that may later be put to use.

  3. Using this information you can:

    • learn an individual's everyday routes - for instance, to predict his next actions (and remind him or warn him about something)
    • match this information with the routes & timetables of public transportation to shorten the travel time
    • turn an individual's attention towards a particular point of interest (one located nearby, for example)
    • find groups of users that behave in a similar way (have similar preferences, needs, habits)
    • predict the effects of the accumulated activity of a population of users (for instance: traffic jams)

& much more.

This is what Data Scientists do.
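To make steps 1 and 2 of the geo-location example a bit more tangible, here is a minimal Python sketch of turning raw fixes into information. It is only an illustration under stated assumptions: the Fix record, the to_information function and the field names hour_of_day / speed_kmh are all made up for this post, the coordinates are sample values, and speed is estimated naively from the great-circle (haversine) distance between consecutive fixes.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a sketch


@dataclass
class Fix:
    """One raw geo-location sample (hypothetical shape, not a real API)."""
    lat: float
    lon: float
    ts: datetime


def haversine_km(a: Fix, b: Fix) -> float:
    """Great-circle distance between two fixes, in kilometres."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = (sin(dlat / 2) ** 2
         + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def to_information(fixes):
    """Step 2: put raw fixes in context (time of day, estimated speed)."""
    ordered = sorted(fixes, key=lambda f: f.ts)
    records = []
    for prev, cur in zip(ordered, ordered[1:]):
        hours = (cur.ts - prev.ts).total_seconds() / 3600
        if hours <= 0:
            continue  # skip duplicate timestamps, no speed can be estimated
        records.append({
            "hour_of_day": cur.ts.hour,
            "speed_kmh": haversine_km(prev, cur) / hours,
        })
    return records


# Step 1: raw data - two made-up fixes, six minutes apart.
raw = [
    Fix(52.2297, 21.0122, datetime(2015, 3, 2, 8, 0)),
    Fix(52.2397, 21.0122, datetime(2015, 3, 2, 8, 6)),
]
print(to_information(raw))
```

Everything listed in step 3 (routes, similar groups, traffic predictions) would start from records like these, which is exactly where the real modelling work begins.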

New era, new tools, new techniques

And all this crunching ain't that easy. Due to the massive volume of data and / or its value decaying over time (at least some aspects of its value), the tools of yesterday are not applicable anymore. As nature abhors a vacuum, a new wave of completely different tools has risen like a tsunami:

  • BigQuery
  • Hadoop
  • Spark
  • Impala
  • Samza
  • Storm
  • Druid
  • Mahout

& many more. They aren't just new, they are DIFFERENT. You need different skills, a different approach, a different mindset to utilize them properly. What is more, these tools didn't have decades to mature - the market is different & ambitious companies that really crave knowledge adopt them even before they reach 1.0 RTM.

Forget GUIs. Yeah, just like that - what you can expect is a (rather low-level) API and / or more or less friendly DSL(s). This comes with great power & potential for things that were out of reach so far, but it needs much more effort up-front. For some it may be a show-stopper, but let me quote W. Edwards Deming here:

Learning is not compulsory... neither is survival.

End of part I, to be continued soon.