I’m astounded by the number and quality of responses for appropriate tools. Thank you HN! To shed a little more light on the project:
I’m compiling research for a sociology / gender studies project that takes a national snapshot of romantic posts from classifieds sites / sources across the country, and then uses that data to try and draw meaningful insights about how different races, classes, and genders of Americans engage romantically online.
I’ve already run some basic text-processing algorithms (tf-idf, certain term-frequency lists, etc.) on smaller txt files representing the content for a few major U.S. metros, and discovered some surprises that I think warrant a larger follow-up. So I have a few threads I want to investigate already, but now that I have more information, I also don’t want to be blind to interesting comparisons that can be drawn between data sets (that’s why I’m asking for a bit of a grab-bag of text-processing abilities).
My problem is that the techniques from the first phase (analyzing a few metros) didn’t scale to the larger data set: the entire data set is only 2GB of text, but it started maxing out my memory as I recopied the text files over and over into different groupings. Starting with a datastore would also have worked, but it just wasn’t necessary at the outset of the project.
My current setup:
Python’s Beautiful Soup + CasperJS for scripting (which is done)
Node, relying primarily on the excellent NLP package “natural,” for analysis
Bash to tie things together
My personal MBP as the environment
So, given the advice expressed in the thread (and despite my love of Shiny New Things), a combination of shell scripts and awk (a command-line language designed specifically for processing structured text files!), which I had heard of before but thought was a networking tool, will probably work best, backed up by a 1TB or similar external drive, which I could use anyway (and which would be more secure). I have the huge luxury, of course, that this is a one-time research-oriented project, not something whose performance I need to worry about.
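For the awk approach, a term-frequency pass over a corpus can be a one-liner. This is a hypothetical sketch (the filename posts.txt is made up, and it assumes whitespace-separated tokens):

```shell
# Count every whitespace-separated word in posts.txt,
# then print the most frequent ones first.
awk '{ for (i = 1; i <= NF; i++) counts[$i]++ }
     END { for (w in counts) print counts[w], w }' posts.txt | sort -rn | head
```

Because awk streams the file line by line, this stays in constant memory regardless of corpus size, which sidesteps the memory issue above.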
I will of course look into a lot of the solutions provided here regardless, as something (especially along the visualizations angle) could prove more useful, and it’s all fascinating to me.
You don't need to copy around your data into different groupings. Just group the doc IDs and rerun the analysis pipeline each time. If a part of the analysis pipeline is slow, it can be cached. But that's that module's business. Don't let it disrupt the interface.
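Concretely, "grouping the doc IDs" can mean keeping a single copy of the corpus and representing each grouping as a list of IDs. A minimal sketch (the documents, group names, and the token-count stand-in for the real pipeline are all invented for illustration):

```python
# One on-disk / in-memory copy of the corpus, keyed by doc ID.
corpus = {
    "doc1": "looking for someone to share coffee with",
    "doc2": "new to the city, hoping to meet people",
}

# Each grouping is just a list of IDs -- no files are recopied.
groupings = {
    "metro_a": ["doc1"],
    "metro_b": ["doc1", "doc2"],
}

def run_pipeline(doc_ids):
    # Stand-in for the real analysis: total token count per group.
    return sum(len(corpus[d].split()) for d in doc_ids)

results = {name: run_pipeline(ids) for name, ids in groupings.items()}
```

Regrouping is then a cheap edit to `groupings`, and the pipeline reruns over the same untouched corpus.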
For what you want to do, the tokenizer is sufficient --- it runs at 5,000 documents per second. You can then efficiently export to a numpy array of integers, with tokens.count_by, and base your counts on that, as numpy operations. Processing a few GB of text in this way should be fast enough that you don't need to do any caching. I develop spaCy on my MacBook Air, so it should run fine on your MBP.
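Once the tokens are integer IDs in a numpy array, the frequency counts themselves are one numpy call. A sketch under the assumption that you already have such an array (the IDs below are invented; in spaCy they might come from something like Doc.to_array or count_by):

```python
import numpy as np

# Pretend these integer IDs came out of the tokenizer.
token_ids = np.array([3, 7, 3, 3, 9, 7])

# np.unique with return_counts gives each distinct ID and its frequency.
ids, counts = np.unique(token_ids, return_counts=True)
freq = dict(zip(ids.tolist(), counts.tolist()))
# freq maps token ID -> occurrence count
```

Everything stays as vectorized numpy operations until the final dict, so this scales comfortably to corpus-sized arrays.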
As a general tip though, the way you're copying around your batches of data is definitely bad. It's really much better to structure the core of your program in a simple and clear way, so that caching is handled "transparently".
So, let's say you really did need to run a much more expensive analysis pipeline, so it was critical that each document was only analysed once.
You would still make sure that you had a simple function like:
def analyse(doc_id):
    <do stuff>
    return <stuff>
So that you can clearly express what you want to do:
def gather_statistics(batch):
    analyses = []
    for doc_id in batch:
        analyses.append(analyse(doc_id))
    <do stuff>
    return <stuff>
If the problem is that analyse(doc_id) takes too long, that's fine --- you can cache. But make sure that's something only analyse(doc_id) deals with. It shouldn't complicate the interface to the function.
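One way to keep the caching entirely inside analyse is a memoizing decorator from the standard library. A toy sketch (the body of analyse here is a hypothetical stand-in, not the real pipeline):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def analyse(doc_id):
    # The expensive pipeline would run here; this toy version just
    # returns the length of the ID as a stand-in "analysis" result.
    return len(str(doc_id))

def gather_statistics(batch):
    # Callers never see the cache: repeated doc IDs simply come back fast.
    return [analyse(doc_id) for doc_id in batch]
```

The interface to analyse(doc_id) is unchanged; only repeated calls with the same ID are served from memory instead of recomputed. (For results too large for RAM, the same shape works with a disk-backed cache inside analyse.)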
Thanks again HN for all of your help.