'Big data' is no longer among the catchiest buzzwords. That's probably a good thing - after the initial hype comes the proper time to assess practical applicability and tangible benefits.
Big data's epicenter has always been the Hadoop ecosystem. It has several advantages, but TBH working with it was always a huge pain:
- it consists of many different, relatively independent projects ... with separate release cycles and versioning - setting up a working combination of compatible versions is no small challenge ...
- these projects are in constant development, never truly complete - there's always something that feels half-baked (or 1/32-baked ...)
- its roots are deep in the Java ecosystem, which is not exactly everyone's favorite these days ...
- it's a distributed ecosystem by design (no surprise ofc), and that adds even more complexity already at the level of initial setup and configuration ("so, now you need a Zookeeper cluster to keep the consensus on metadata")
- it's resource-demanding, and those resources can be really expensive (that was always an issue when I wanted to try out something quickly)
Stream fixation
Nevertheless, such pain points never truly discouraged me from experimenting. Frankly though, the products I was always most interested in were the ones that process streams of data in near real time ('fast data' instead of 'big data'): Apache Storm, Apache Spark (Streaming), Apache Flink.
What's so alluring about data stream processing? Why the fixation on that concept? Because of the following perpendicularity, which has strongly resonated with me:
Typical analytical queries (run on any analytical DB, e.g., a data warehouse) work like this:
- data is static
- queries are dynamic - run ad-hoc, on a given state of data
Typical streaming queries, meanwhile, are perpendicular to that model:
- data is in constant motion (as it's streaming in) and ...
- ... it flows "through" static, pre-determined queries (which represent known conditions to be detected/collected)
Both scenarios have their practical applications - but whereas the former is well known and commonly used, the latter hasn't been embraced by many yet, despite its tremendous potential: mixing stateful, contextual aggregates with a real-time flow of hot data (relevant precisely because of its freshness).
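To make the contrast concrete, here's a minimal sketch in plain SQL (the `orders` table, the `orders_stream` stream, and their columns are made up for illustration):

```sql
-- Analytical model: data is static, the query is dynamic.
-- This runs once, over the data as it exists right now, and terminates.
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id;

-- Streaming model: the query is static, the data is dynamic.
-- Registered against a stream (e.g., in Flink), the very same query
-- never terminates - it keeps emitting updated aggregates as new
-- events flow "through" it.
SELECT customer_id, SUM(amount) AS total_spent
FROM orders_stream
GROUP BY customer_id;
```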
An opportunity
The good news is that the barrier to entry for working with streaming data has dropped significantly. Not because the Hadoop ecosystem has matured so much (which it did), but thanks to cloud providers, who can take care of the setup and all the cumbersome operations for you.
In the last few months, I've had plenty of occasions to work with such a solution - Kinesis Data Analytics (KDA) on AWS. I've acquired a lot of exciting experience and knowledge along the way, which is definitely worth sharing :) Hence, I've decided to blog about it.
Disclaimer: I work for AWS. No, this post is not "sponsored". I created it outside of work, in my spare time, because I wanted to (not because someone told me to or asked me to). This is not an official statement or any kind of official AWS communication/content. Enjoy :)
What is KDA?
It's an oversimplification, but you can think of it as managed Flink on AWS cloud infrastructure. The service itself is not that new: it was released in 2016, and AWS added Flink support in 2018 (initially only for Java). At that point, the programming model wasn't all that convenient - you had to build your application locally and deploy it to the cloud as a JAR file. Not the most suitable workflow for rapid experimentation, indeed.
Fortunately, the service is very actively developed. A few months ago, PyFlink support was added (alongside the already existing Java and Scala APIs), and recently AWS introduced an 'overlay' service called Kinesis Data Analytics Studio, which (IMHO) truly changes the rules of the game.
KDA Studio is pretty much a Zeppelin notebook running on top of a Flink cluster. Confused? These names don't ring a bell?
OK, let me help you figure out what it means in practice.
- Apache Zeppelin notebooks are an interactive, web-based environment for rapid experimentation, with a plethora of interpreters you can use to connect to various DBs (JDBC, Cassandra, Kylin), platforms (Elasticsearch, Flink, Scalding), and languages (Python, Scala, Angular). If you've ever seen Jupyter notebooks in action, the idea is the same - just the implementation is different.
- In this case, the Zeppelin notebook acts as a UI for Apache Flink. Without any IDE, without worrying about connecting (and securing the connection), you immediately get convenient access to an automatically provisioned Flink cluster. In a notebook, you can use Scala, Java, Python, or ... SQL - all the necessary libraries and packages are already loaded.
- The SQL option is available thanks to the integrated Apache Calcite, which acts as an intermediate layer between the notebook and Flink's APIs. And again - you don't need to worry about setting it up, as it's available out of the box (there's a small example right after this list).
- OK, but what is Apache Flink?! It's a distributed data processing engine that works on bounded and unbounded data collections - conceptually, both are treated as data streams.
- Flink offers three different APIs, built upon different sets of abstractions - DataStream API, DataSet API, and Table API. All of them are accessible via Kinesis DA Studio.
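To give you a taste of how that feels in practice, here's roughly what a single notebook paragraph can look like - a sketch, assuming the `%flink.ssql(type=update)` streaming-SQL interpreter and a made-up `sensor_events` table that has already been registered:

```sql
%flink.ssql(type=update)

-- A continuous query typed straight into a notebook cell: it counts
-- readings per sensor and keeps refreshing the result as events arrive.
SELECT sensor_id, COUNT(*) AS readings
FROM sensor_events
GROUP BY sensor_id;
```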
Kinesis DA Studio combines all these technologies in one 'package' that you can provision in less than 5 minutes. Then, using a lightweight, user-friendly web interface, you can:
- easily connect your Flink application to an already existing data stream (e.g., an MSK - managed Kafka - topic or a Kinesis Data Stream)
- immediately start crafting stateless or stateful (!) transformations/projections for your streams - a minimal sketch of both follows below
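Here's that sketch (the stream name, region, and schema are made-up assumptions; the connector options follow the Flink Kinesis SQL connector):

```sql
%flink.ssql(type=update)

-- Expose an already existing Kinesis Data Stream as a Flink table.
CREATE TABLE orders_stream (
  order_id    STRING,
  customer_id STRING,
  amount      DOUBLE
) WITH (
  'connector' = 'kinesis',
  'stream' = 'input-stream',         -- hypothetical stream name
  'aws.region' = 'us-east-1',        -- hypothetical region
  'scan.stream.initpos' = 'LATEST',  -- read only newly arriving records
  'format' = 'json'
);

-- A stateful aggregation: running totals per customer,
-- continuously updated as new orders stream in.
SELECT customer_id, SUM(amount) AS total_amount
FROM orders_stream
GROUP BY customer_id;
```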
What's next?
If this sounds interesting, make sure you don't miss the subsequent episodes in the series - there are many more topics I'll cover in the forthcoming posts:
- full setup of Kinesis DA Studio from scratch
- the basic concepts of Flink
- the differences and specifics of particular APIs
- the deployment model of Kinesis DA applications
- how queries on streams differ from typical SQL queries
- window-based aggregations: what's that?
- stateful processing: how to do it, and what's possible
- practical scenarios you can tackle with Flink and Kinesis Data Analytics - with actual examples you can run in a notebook
Stay tuned, streams are coming.
The next post in the series (setting up an input Kinesis stream) is available here.