It seems to me that you are driving from the wrong direction. Given that you hav...

danso · on June 18, 2015

If I could upvote your comment 10 times, I would. It doesn't seem that the OP knows what exactly they want, and so the question of whether they should be doing things in Node/Python/Bash/SQL/NoSQL is putting the cart way before the horse. It's quite possible grep -- and some regex, and maybe some scripting logic for tokenizing -- is all that's needed, if the hypothesis is clear enough. But right now, it just sounds like OP has a ton of text and is hoping that some tool can somehow produce an unanticipated, useful insight...in my experience, that rarely (OK, never, actually) happens.
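To make the "grep plus some regex plus a bit of tokenizing" idea concrete, here is a minimal Python sketch. It assumes a hypothesis as simple as "which words show up on lines that mention a given term". The function names, the sample lines, and the pattern are all made up for illustration; nothing here comes from the OP's data.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits, and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def grep_lines(lines, pattern):
    """Return the lines matching a regex, roughly like `grep -i -E pattern`."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line for line in lines if rx.search(line)]


def tokenize(line):
    """Split a line into lowercase tokens."""
    return TOKEN_RE.findall(line.lower())


def cooccurring_terms(lines, pattern, top=3):
    """Count tokens on matching lines and return the most common ones."""
    counts = Counter()
    for line in grep_lines(lines, pattern):
        counts.update(tokenize(line))
    return counts.most_common(top)


# Invented sample text, standing in for the real corpus.
sample = [
    "Refund requested: the blender arrived broken.",
    "Great blender, works as advertised.",
    "Requested a refund after the blender stopped working.",
]

matches = grep_lines(sample, r"\brefund\b")
```

The point is that the code is trivial once the question is stated; without a question, there is no pattern to pass in.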

ganeshkrishnan · on June 18, 2015

I was wondering what exactly OP is looking for. Most probably he forgot to state his problem.

If he just wants to parse the text, a bare-minimum Ubuntu install with grep, cat, etc. is enough.

contingencies · on June 18, 2015

Yep.

Success really depends on the conception of the problem, the design of the system, not in the details of how it's coded. - Leslie Lamport

You're not going to come up with a simple design through any kind of coding techniques or any kind of programming language concepts. Simplicity has to be achieved above the code level before you get to the point which you worry about how you actually implement this thing in code. - Leslie Lamport

You're not going to find the best algorithm in terms of computational complexity by coding. - Leslie Lamport

Sometimes the problem is to discover what the problem is. - Gordon Glegg, 'The Design of Design' (1969)

The besetting mistake of expert designers is not designing the thing wrong, but designing the wrong thing. - Frederick P. Brooks, 'The Design of Design: Essays from a Computer Scientist' (2010)

... then again ...

In practice, designing seems to proceed by oscillating between sub-solution and sub-problem areas, as well as by decomposing the problem and combining sub-solutions. - Nigel Cross

... but did you want re-use? ...

A general-purpose product is harder to design well than a special-purpose one. - Frederick P. Brooks, 'The Design of Design: Essays from a Computer Scientist' (2010)

Design up front for reuse is, in essence, premature optimization. - AnimalMuppet

... lines of thought from my fortune clone @ https://github.com/globalcitizen/taoup

quizotic · on June 18, 2015

Man, this comment is simultaneously both SO right and SO wrong. I guess the tools we have today are set up for knowing what you want to learn. Named Entity Recognition? Topic classification? Sentiment Extraction?

OTOH, why should OP have to know what's interesting a priori? The mantra of big data is "listen to the data, let it tell you, leave your preconceived notions at the door." The fault is with our current tool chain. We need something that tells us what about the text is interesting before we dive in for a closer look.

I'm imagining a tool that told me: "this text seems to have a lot of opinions and sentiment," "it's about a product that was returned," "a number of people's names are mentioned," "this text was loaded along with some structured data that appears to reference a price, a location, and a date."

Why is it such a stretch to combine the tools we already have to generate and push summarizations? Maybe it's just the cost of computation? If you know you're looking for topic and don't care about sentiment, then you can avoid paying for it?
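Roughly what I have in mind, as a Python sketch. The probes below are deliberately crude regex and word-list stand-ins, not real sentiment analysis or entity recognition, and every name here (`PROBES`, `summarize`, the word list, the sample text) is hypothetical. The `enabled` argument is the "don't pay for what you don't care about" part.

```python
import re

# Toy word list; a real system would use a trained sentiment model.
OPINION_WORDS = {"love", "hate", "great", "terrible", "awful", "excellent"}


def probe_sentiment(text):
    """Report opinion-laden words, if any appear."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    hits = sorted(words & OPINION_WORDS)
    return f"opinion words present: {', '.join(hits)}" if hits else None


def probe_names(text):
    """Report pairs of capitalized words as a crude proxy for person names."""
    names = re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)
    return f"possible names: {', '.join(names)}" if names else None


def probe_prices(text):
    """Report dollar amounts such as $49.99."""
    prices = re.findall(r"\$\d+(?:\.\d{2})?", text)
    return f"prices mentioned: {', '.join(prices)}" if prices else None


PROBES = {
    "sentiment": probe_sentiment,
    "names": probe_names,
    "prices": probe_prices,
}


def summarize(text, enabled=None):
    """Run the selected probes (all by default) and collect their findings."""
    selected = PROBES if enabled is None else {k: PROBES[k] for k in enabled}
    report = {}
    for name, probe in selected.items():
        finding = probe(text)
        if finding:
            report[name] = finding
    return report


# Invented sample text.
sample_text = "Maria Lopez returned the blender for $49.99 and said it was terrible."
report = summarize(sample_text)
names_only = summarize(sample_text, enabled=["names"])
```

Swapping each toy probe for a real model is where the computation cost comes in, which is why being able to switch probes off matters.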

abannin · on June 18, 2015

I'd like to understand a bit more about your data structures. Raw text should live in a file system, but if you're creating highly structured data, a SQL database could be a good fit. I'd also suggest taking a look at Spark; you can run it locally on your machine (it does not need a Hadoop cluster), and it can work with flat files. If you're finding that your machine is lacking in horsepower or space, AWS could be your friend.