This seems to be one of today's most interesting problems:
How can we process an enormously high rate of incoming data such that processing latency is minimal?
What are your thoughts?
Some points to consider:
1. High rate of inserts
If the incoming rate of new data is very high, and if we store this data for any length of time, then databases tend to struggle. Why? Because inserting individual datapoints becomes a significant bottleneck once the database is large and the insert rate is high: each insert pays for index maintenance, transaction bookkeeping and random I/O (see the sketch after this list).
2. Computation across full dataset
Any cleverness on our part to discern patterns in the data may require regularly crunching sequentially through all (or at least a significant fraction) of our data. Databases are optimised for random access; the indexes and page structures that make random access fast add overhead that works against large sequential reads.
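To make point 1 concrete, here's a minimal sketch comparing per-row commits with a single batched insert. SQLite is used purely because it ships with Python, and the schema and row count are invented; a disk-backed database under real load would show a far bigger gap, but the shape of the overhead is the same.

```python
# Toy illustration of point 1: per-row inserts vs. one batched insert.
import sqlite3
import time

def timed_insert(rows, batched):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts REAL, value REAL)")
    start = time.time()
    if batched:
        # One statement, one transaction: the per-insert bookkeeping
        # (logging, index maintenance) is amortised over the whole batch.
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        conn.commit()
    else:
        # One transaction per datapoint: every insert pays the full
        # bookkeeping cost, which is what hurts at high incoming rates.
        for row in rows:
            conn.execute("INSERT INTO events VALUES (?, ?)", row)
            conn.commit()
    elapsed = time.time() - start
    conn.close()
    return elapsed

rows = [(float(i), float(i) * 0.5) for i in range(50_000)]
print("per-row commits:", timed_insert(rows, batched=False))
print("single batch   :", timed_insert(rows, batched=True))
```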
It is because of points 1 and 2 that industry folk loudly sing the praises of Hadoop for high rates of incoming data that need persistent storage: HDFS, with its low-overhead chunk-based archiving, paired with its MapReduce computation framework.
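For anyone who hasn't met MapReduce, here's a toy sketch of the map → shuffle → reduce flow in plain Python - not the real Hadoop API, just the shape of the computation over chunked input (word counting, with made-up data):

```python
# Toy sketch of map -> shuffle -> reduce (word counting).
from collections import defaultdict

def map_phase(chunk):
    # Each mapper sees one chunk of the input and emits (key, value) pairs.
    for line in chunk:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Group all values by key. In Hadoop this is the sort/shuffle between
    # map and reduce, and it is also where the synchronisation barrier sits:
    # no group is complete until every mapper has finished.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer collapses one key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

chunks = [
    ["the quick brown fox", "jumps over the lazy dog"],
    ["the dog barks", "the fox runs"],
]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle(mapped)))
```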
So, problem's solved? Not so fast.
Let's consider:
A. Inherent latency in HDFS and MapReduce
We also know that HDFS and MapReduce carry an associated latency which prevents real-time analysis. This is for four reasons:
i) Data is written to HDFS in chunks. Until a chunk has been assembled, it cannot be written - and, accordingly, it cannot be analysed (see the sketch after this list).
ii) MapReduce has a synchronisation barrier after the map phase, so no reduce output can be produced until all maps are complete.
iii) Reducers pull their input from the mappers' output, so pipelining is inherently difficult - even without the synchronisation barrier.
iv) Hadoop processing only begins after data has been written to disk.
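Here's a toy sketch of reason (i), with an invented chunk size and record count - real HDFS blocks are tens of megabytes, so the wait before data becomes analysable is far worse than this suggests:

```python
# Toy sketch of chunk-based ingestion: records only become visible for
# analysis once a full chunk has been assembled. CHUNK_SIZE is hypothetical.
import time

CHUNK_SIZE = 5              # records per chunk (invented for illustration)
pending, chunks = [], []

def ingest(record):
    """Buffer a record; flush to 'storage' only when a chunk is full."""
    pending.append(record)
    if len(pending) >= CHUNK_SIZE:
        chunks.append(list(pending))   # only now can downstream jobs see it
        pending.clear()

for i in range(12):
    ingest({"seq": i, "arrived": time.time()})

print("records analysable:", sum(len(c) for c in chunks))    # 10
print("records still waiting in the buffer:", len(pending))  # 2
```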
B. Order
We also know that whilst HDFS with MapReduce is great for unstructured data, many data streams are ordered - at least chronologically. This order can (and often should?) be exploited when processing. Exploiting this order across HDFS is not trivial.
It is because of points A and B that we should probably be thinking about in-memory stream processing in addition to Hadoop - assuming, of course, that the community is right that Hadoop is the best fit for high-volume data which must be archived and also processed.
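As a sketch of what that in-memory alternative might look like (the event fields, window length and synthetic stream are all invented for illustration), here's a rolling aggregate that updates on every event and leans on the stream's chronological order to expire old data cheaply:

```python
# Minimal sketch of in-memory stream processing: a rolling mean over the
# last WINDOW_SECONDS of a time-ordered stream, updated per event.
from collections import deque

WINDOW_SECONDS = 60.0

class SlidingWindowMean:
    """Keeps a running mean over the most recent WINDOW_SECONDS of events."""

    def __init__(self):
        self.events = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def on_event(self, timestamp, value):
        # Because events arrive in time order, expiring old ones is just
        # popping from the left of the deque - no scan, no sort, no batch.
        self.events.append((timestamp, value))
        self.total += value
        while self.events and self.events[0][0] < timestamp - WINDOW_SECONDS:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)   # result available immediately

window = SlidingWindowMean()
for t in range(0, 300, 10):                    # a synthetic ordered stream
    print(t, window.on_event(float(t), value=t % 7))
```

The point of the sketch is the contrast with A and B above: each result is available the moment its event arrives, and the chronological order is what makes expiry an O(1) operation rather than a re-scan.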
What are your thoughts?
What's the business benefit? I don't think enough happens in 'real time' for it to make much of a difference - reading a few blogs, news websites, Digg/Reddit and the trending topics on Twitter is enough to catch up on what happened in the world in a day.
Has anyone actually thought of anything that could make this useful?