Ask HN: How to analyse the real-time Web?
17 points by dvd03 on Dec 22, 2009 | 15 comments
This seems to be one of today's extremely interesting problems:

How can we process an enormously high rate of incoming data such that processing latency is minimal?

What are your thoughts?

Some points to consider:

1. High rate of inserts

If the incoming rate of new data is very high, and if we store this data for any length of time, then databases tend to struggle. Why? Because index maintenance makes inserting individual datapoints a significant bottleneck once the database is large and the insert rate is high.
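To make the insert-rate point concrete, here is a minimal sketch using SQLite as a stand-in for any indexed datastore (the table and row counts are purely illustrative). It contrasts row-at-a-time commits with a single batched transaction:

```python
import sqlite3
import time

# In-memory SQLite stands in for any datastore that maintains an index
# on insert; per-statement commit plus index update is what makes
# row-at-a-time ingestion a bottleneck as volume grows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (ts INTEGER, value REAL)")
conn.execute("CREATE INDEX idx_ts ON points (ts)")

rows = [(i, float(i)) for i in range(20_000)]

# Row-at-a-time: one INSERT and one commit per datapoint.
start = time.perf_counter()
for row in rows[:10_000]:
    conn.execute("INSERT INTO points VALUES (?, ?)", row)
    conn.commit()
per_row = time.perf_counter() - start

# Batched: one executemany call and a single commit for the same volume.
start = time.perf_counter()
conn.executemany("INSERT INTO points VALUES (?, ?)", rows[10_000:])
conn.commit()
batched = time.perf_counter() - start

print(f"row-at-a-time: {per_row:.3f}s  batched: {batched:.3f}s")
```

On a disk-backed database the gap is far larger still, since each commit forces a sync to durable storage.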

2. Computation across full dataset

Any cleverness on our part, to discern patterns in the data, may require regularly crunching sequentially through all (or at least significant amounts of) our data. Databases are optimised for random access, and the index structures which make random access fast carry overheads that leave them poorly suited to sequential scans of the full dataset.
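For illustration, the map/shuffle/reduce pattern Hadoop applies across a full dataset can be sketched on a single machine (function names here are illustrative, not Hadoop's API):

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs; here, one count per term occurrence.
    for record in records:
        for term in record.split():
            yield term, 1

def shuffle(pairs):
    # Group values by key -- the step Hadoop performs between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Collapse each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

records = ["real time web", "real time analysis"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'real': 2, 'time': 2, 'web': 1, 'analysis': 1}
```

Note how reduce cannot emit anything until shuffle has seen every map output: that is the synchronisation barrier referred to in point A below.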

It is because of points 1 and 2 that industry folk are loudly singing the praises of Hadoop for high rates of incoming data requiring persistent storage: HDFS, with its low-overhead chunk archiving, and the corresponding MapReduce computation framework.

So, problem's solved? Not so fast.

Let's consider:

A. Inherent latency in HDFS and MapReduce

We know also that HDFS has an associated latency which prevents real-time analysis, for four reasons:

i) Data is inserted into HDFS in chunks. Until a chunk is formed, it cannot be inserted - and, accordingly, it cannot be analysed.

ii) MapReduce has a synchronisation barrier after the map function, so no reduce outputs can be produced until all maps are complete.

iii) Reduce pulls its input from Map, so pipelining is made inherently difficult - even without the synchronisation barrier.

iv) Hadoop processing only begins after data has been written to disk.

B. Order

We also know that whilst HDFS with MapReduce is great for unstructured data, many data streams are ordered - at least chronologically. This order can (and often should?) be exploited when processing. Exploiting this order across HDFS is not trivially done.
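As a sketch of what exploiting chronological order buys you: over a time-ordered stream, a windowed aggregate needs only a single pass and a small buffer (illustrative code, not tied to any particular framework):

```python
from collections import deque

def windowed_average(stream, window_seconds):
    # Because events arrive in timestamp order, one pass suffices:
    # append new events on the right, evict expired ones from the left.
    # No sorting, no random access to the archive.
    window = deque()   # (timestamp, value) pairs still inside the window
    total = 0.0
    for ts, value in stream:
        window.append((ts, value))
        total += value
        while window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)

events = [(0, 10.0), (1, 20.0), (2, 30.0), (10, 40.0)]
print(list(windowed_average(events, 5)))
# [(0, 10.0), (1, 15.0), (2, 20.0), (10, 40.0)]
```

The same computation over unordered chunks in HDFS would need a sort (or a shuffle by time bucket) first, which is exactly the non-trivial part.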

It is because of points A and B that we should probably be thinking of in-memory stream processing in addition to Hadoop - if, indeed, the community is correct in thinking that Hadoop is the right fit for high-volume data which needs both archiving and processing.

What are your thoughts?



Can I ask a non-technical question about analysing the real-time web...

What's the business benefit? I don't think enough happens 'real time' for it to make much of a difference - reading a few blogs, news websites, digg / reddit & trending topics on twitter is enough to catch up on what happened in the world in a day.

Has anyone actually thought of anything that could make this useful?


If you can correlate "real-time web" trends to market movement there is a lot of money to be made.


The realtime web is basically about breaking news. That's a significant proportion of search queries, perhaps 20-30% (you can quantify with the AOL query dataset).

I don't personally care so much about 5000 Tweets saying "Michael Jackson RIP". That's ultra-mainstream breaking news that you would hear about anyway.

Instead the interesting thing about Twitter is that it's gotten big enough to explore the tail of breaking news in areas I care about, e.g. conferences and events that I want to follow without being present. A lot of that stuff would never make a full news piece or journal article, but is important to gauge attitudes and trends.

Basically it reduces the threshold of a "least publishable unit", and in doing so unlocks a large amount of breaking news that would not otherwise get out there.


Has anyone actually thought of anything that could make this useful?

Yes. This is part of what we are doing at http://causata.com/. Understanding the actions of people in near-real time is important if you want to target them with the right information, ads, offers, etc.


But why does this have to be real time and not, say, 4 hours delayed? If a person is interested in soccer at 8am, then they probably still are interested in soccer later in the day.

I'd be interested to hear any counter examples.


It means you can display a soccer-related ad at 8am and know that he will be watching it.


But this occurs already with ad networks and has been going on for years: they track your previous behaviours and deliver 'relevant' ads.


There are two companies which seek to provide real-time high volume analysis of data streams as a service: Truviso and StreamBase.

The target clients for both appear to be, principally, companies seeking to crunch financial data in quasi real time.


Hadoop is designed as a batch processing framework, not a real-time analysis framework.

If your analysis functions are idempotent with regard to future data in the time domain, simply compute summaries for each "block", down to the resolution supported for that age of data, such that your summaries fit in memory. Save ALL the data to Hadoop in case you need to replay it later. The answer to your question depends very much on whether you can summarize your data in the time domain. E.g. if computing an average, store block summaries as the average AND the number of items, so that future summaries can be easily integrated.

There is no one answer that will solve every possible analysis function; you'll need to optimize your system around the analysis function you want to perform, and perhaps have a few different systems purpose-built for different types of analysis.
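A minimal sketch of the average example above: store (count, mean) per block, so blocks at any resolution can be merged later without touching the raw data (names are illustrative):

```python
def summarize(values):
    # Per-block summary: keep the count alongside the mean so that
    # blocks can be combined later without re-reading raw datapoints.
    n = len(values)
    return (n, sum(values) / n)

def merge(a, b):
    # Combine two (count, mean) summaries into one exact summary,
    # weighting each mean by its count.
    n = a[0] + b[0]
    mean = (a[0] * a[1] + b[0] * b[1]) / n
    return (n, mean)

block1 = summarize([1.0, 2.0, 3.0])   # (3, 2.0)
block2 = summarize([10.0, 20.0])      # (2, 15.0)
print(merge(block1, block2))          # (5, 7.2)
```

The same trick works for any aggregate with a mergeable summary (count, sum, min/max, variance via count/sum/sum-of-squares); medians and other holistic aggregates don't decompose this way, which is why the choice of analysis function matters so much.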


Let's consider a particular problem domain: analysis of global financial data - fixed income, stocks, derivatives, etc.

Agreed, Hadoop is a batch processing framework across a chunked archive. Work has been done recently to bring Hadoop "out of the past" - http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-13.... However, even with these latest amendments, the latency can prove more than a little troublesome for trading strategies which require rapid execution.

It is for this reason that companies like Truviso and StreamBase - both born out of highbrow academic research - have built in-memory stream-processing frameworks in addition to persistent data stores.

If we assume, for the sake of argument, that analysis of historical data is important and that Hadoop is fit for that purpose, and if we assume also that a distributed in-memory processing facility is important, then which in-memory solution ought we to employ, and how ought we to relate it to the Hadoop solution which we'll also be using?


You should try a few of them and see which one actually works better. Probably more importantly, you should figure out more specifically what you're trying to accomplish.


I believe you mean "invariant with regard to future data in the time domain", i.e. f(x_1,..,x_t) == f(x_1,..,x_t,x_{t+1})

"Idempotent" means that f(f(x)) == f(x), which wouldn't apply unless the output of a given analysis had to be fed back into it. Most outputs are going to be tables of counts with the input being raw text data, so that wouldn't apply in this case.


Have a look at what the financial industry uses; they've been tackling high-volume, low-latency, real-time data feed analysis problems for the last couple of decades. Time series databases (such as kx/kdb) are popular, as are proprietary in-memory databases (often based on something like BerkeleyDB).


What's the objective? We need to decide on what we want "analyze" to mean before we can evaluate/debate the merits of different techniques.


Although I haven't really delved into it that much, HBase (and other "NoSQL" databases) supposedly address the latency and structured data points. I brought up HBase as opposed to Cassandra or Redis because HBase sits on top of HDFS and has a Hadoop/MapReduce API.



