This seems to be one of today's most interesting problems:
How can we process an enormously high rate of incoming data such that processing latency is minimal?
What are your thoughts?
Some points to consider:
1. High rate of inserts
If the incoming rate of new data is very high, and if we store this data for any length of time, then databases tend to struggle. Why? Because inserting individual datapoints becomes a significant bottleneck once the database is large and the insert rate is high: each insert pays for index maintenance, transaction bookkeeping and random I/O (see the sketch after this list).
2. Computation across full dataset
Any cleverness on our part to discern patterns in the data may require regularly crunching sequentially through all (or at least a significant fraction) of our data. Databases are optimised for random access; the indexes and page structures that make random access fast add overhead that works against large sequential reads.
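To make point 1 concrete, here's a minimal sketch comparing per-row commits with a single batched insert. SQLite is used purely because it ships with Python, and the schema and row count are invented; a disk-backed database under real load would show a far bigger gap, but the shape of the overhead is the same.

```python
# Toy illustration of point 1: per-row inserts vs. one batched insert.
import sqlite3
import time

def timed_insert(rows, batched):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts REAL, value REAL)")
    start = time.time()
    if batched:
        # One statement, one transaction: the per-insert bookkeeping
        # (logging, index maintenance) is amortised over the whole batch.
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        conn.commit()
    else:
        # One transaction per datapoint: every insert pays the full
        # bookkeeping cost, which is what hurts at high incoming rates.
        for row in rows:
            conn.execute("INSERT INTO events VALUES (?, ?)", row)
            conn.commit()
    elapsed = time.time() - start
    conn.close()
    return elapsed

rows = [(float(i), float(i) * 0.5) for i in range(50_000)]
print("per-row commits:", timed_insert(rows, batched=False))
print("single batch   :", timed_insert(rows, batched=True))
```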
It is because of points 1 and 2 that industry folk loudly sing the praises of Hadoop for high rates of incoming data that need persistent storage: HDFS, with its low-overhead chunk-based archiving, paired with its MapReduce computation framework.
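For anyone who hasn't met MapReduce, here's a toy sketch of the map → shuffle → reduce flow in plain Python - not the real Hadoop API, just the shape of the computation over chunked input (word counting, with made-up data):

```python
# Toy sketch of map -> shuffle -> reduce (word counting).
from collections import defaultdict

def map_phase(chunk):
    # Each mapper sees one chunk of the input and emits (key, value) pairs.
    for line in chunk:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Group all values by key. In Hadoop this is the sort/shuffle between
    # map and reduce, and it is also where the synchronisation barrier sits:
    # no group is complete until every mapper has finished.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer collapses one key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

chunks = [
    ["the quick brown fox", "jumps over the lazy dog"],
    ["the dog barks", "the fox runs"],
]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(shuffle(mapped)))
```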
So, problem's solved? Not so fast.
Let's consider:
A. Inherent latency in HDFS and MapReduce
We also know that HDFS and MapReduce carry an associated latency which prevents real-time analysis. This is for four reasons:
i) Data is written to HDFS in chunks. Until a chunk has been assembled, it cannot be written - and, accordingly, it cannot be analysed (see the sketch after this list).
ii) MapReduce has a synchronisation barrier after the map phase, so no reduce output can be produced until all maps are complete.
iii) Reducers pull their input from the mappers' output, so pipelining is inherently difficult - even without the synchronisation barrier.
iv) Hadoop processing only begins after data has been written to disk.
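Here's a toy sketch of reason (i), with an invented chunk size and record count - real HDFS blocks are tens of megabytes, so the wait before data becomes analysable is far worse than this suggests:

```python
# Toy sketch of chunk-based ingestion: records only become visible for
# analysis once a full chunk has been assembled. CHUNK_SIZE is hypothetical.
import time

CHUNK_SIZE = 5              # records per chunk (invented for illustration)
pending, chunks = [], []

def ingest(record):
    """Buffer a record; flush to 'storage' only when a chunk is full."""
    pending.append(record)
    if len(pending) >= CHUNK_SIZE:
        chunks.append(list(pending))   # only now can downstream jobs see it
        pending.clear()

for i in range(12):
    ingest({"seq": i, "arrived": time.time()})

print("records analysable:", sum(len(c) for c in chunks))    # 10
print("records still waiting in the buffer:", len(pending))  # 2
```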
B. Order
We also know that whilst HDFS with MapReduce is great for unstructured data, many data streams are ordered - at least chronologically. This order can (and often should?) be exploited when processing. Exploiting this order across HDFS is not trivial.
It is because of points A and B that we should probably be thinking about in-memory stream processing in addition to Hadoop - assuming, of course, that the community is right that Hadoop is the best fit for high-volume data which must be archived and also processed.
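As a sketch of what that in-memory alternative might look like (the event fields, window length and synthetic stream are all invented for illustration), here's a rolling aggregate that updates on every event and leans on the stream's chronological order to expire old data cheaply:

```python
# Minimal sketch of in-memory stream processing: a rolling mean over the
# last WINDOW_SECONDS of a time-ordered stream, updated per event.
from collections import deque

WINDOW_SECONDS = 60.0

class SlidingWindowMean:
    """Keeps a running mean over the most recent WINDOW_SECONDS of events."""

    def __init__(self):
        self.events = deque()   # (timestamp, value), oldest first
        self.total = 0.0

    def on_event(self, timestamp, value):
        # Because events arrive in time order, expiring old ones is just
        # popping from the left of the deque - no scan, no sort, no batch.
        self.events.append((timestamp, value))
        self.total += value
        while self.events and self.events[0][0] < timestamp - WINDOW_SECONDS:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)   # result available immediately

window = SlidingWindowMean()
for t in range(0, 300, 10):                    # a synthetic ordered stream
    print(t, window.on_event(float(t), value=t % 7))
```

The point of the sketch is the contrast with A and B above: each result is available the moment its event arrives, and the chronological order is what makes expiry an O(1) operation rather than a re-scan.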
What are your thoughts?
What's the business benefit? I don't think enough happens in 'real time' for it to make much of a difference - reading a few blogs, news websites, Digg/Reddit and the trending topics on Twitter is enough to catch up on what happened in the world in a day.
Has anyone actually thought of anything that could make this useful?