Hadoop is designed as a batch processing framework not a real-time analysis fram...

dvd03 · on Dec 22, 2009

Let's consider a particular problem domain: analysis of global financial data - fixed income, stocks, derivatives, etc.

Agreed, Hadoop is a batch processing framework across a chunked archive. Work has been done recently to bring Hadoop "out of the past" - http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-13.... However, even with these latest amendments, the latency can prove more than a little troublesome for trading strategies which require rapid execution.

It is for this reason that companies like Truviso and StreamBase - both born out of highbrow academic research - have built in-memory stream-processing frameworks in addition to persistent data stores.

Suppose, for the sake of argument, that analysis of historical data is important and that Hadoop is fit for that purpose. Suppose also that a distributed in-memory processing facility is important. Which in-memory solution ought we to employ, and how ought we to relate it to the Hadoop solution we'll also be using?

japherwocky · on Dec 22, 2009

You should try a few of them and see which one actually works better. More importantly, you should figure out more specifically what you're trying to accomplish.

ramanujan · on Dec 22, 2009

I believe you mean "invariant with regard to future data in the time domain", i.e. f(x_1,..,x_t) == f(x_1,..,x_t,x_{t+1})

"Idempotent" means that f(f(x)) == f(x), which would only matter if the output of a given analysis had to be fed back into it. Most outputs are going to be tables of counts with the input being raw text data, so idempotence isn't the relevant property in this case.
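
To make the distinction concrete, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the function names `count_before` and `dedupe_sorted`, the cutoff parameter, and the sample data. It assumes events arrive as (timestamp, symbol) pairs.

```python
from collections import Counter

def count_before(events, cutoff):
    """Count symbols among events whose timestamp is before `cutoff`.

    Events at or after the cutoff are ignored, so appending future data
    cannot change the result: f(x_1,..,x_t) == f(x_1,..,x_t,x_{t+1}).
    """
    return Counter(symbol for ts, symbol in events if ts < cutoff)

def dedupe_sorted(values):
    """Idempotent: applying it twice gives the same result as applying it once."""
    return sorted(set(values))

# Hypothetical sample data: (timestamp, symbol) pairs.
events = [(1, "IBM"), (2, "MSFT"), (3, "IBM")]
future = events + [(10, "GOOG")]  # one more observation arriving later

# Invariance with regard to future data.
invariant_holds = count_before(events, cutoff=5) == count_before(future, cutoff=5)

# Idempotence: f(f(x)) == f(x).
once = dedupe_sorted([3, 1, 3, 2])
idempotent_holds = dedupe_sorted(once) == once
```

Note that `count_before` takes a list of events and returns a table of counts, so its output cannot be fed back in as input. That is the sense in which idempotence does not come up for this kind of analysis.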