I’m astounded by the number and quality of responses for appropriate tools. Thank you HN! To shed a little more light on the project:
I’m compiling research for a sociology / gender studies project that takes a national snapshot of romantic posts from classifieds sites / sources across the country, and then uses that data to try and draw meaningful insights about how different races, classes, and genders of Americans engage romantically online.
I’ve already run some basic text-processing algorithms (tf-idf, certain term-frequency lists, etc.) on smaller txt files representing the content for a few major U.S. metros, and discovered some surprises that I think warrant a larger follow-up. So I have a few threads I want to investigate already, but now that I have more information, I also don’t want to be blind to interesting comparisons that can be drawn between data sets (that’s why I’m asking for a bit of a grab-bag of text-processing abilities).
My problem is that the techniques from the first phase (analyzing a few metros) didn’t scale to the larger data set: the entire data set is only 2GB of text, but it started maxing out my memory as I recopied the text files over and over into different groupings. Starting with a datastore would also have worked, but it just wasn’t necessary at the outset of the project.
My current setup:
Python’s Beautiful Soup + CasperJS for scripting (which is done)
Node, relying primarily on the excellent NLP package “natural,” for analysis
Bash to tie things together
My personal MBP as the environment
So, given the advice expressed in the thread (and despite my love of Shiny New Things), a combination of shell scripts and awk (a command-line language designed specifically for processing structured text files!), which I had heard of before but thought was a networking tool, will probably work best, backed up by a 1TB or similar external drive, which I could use anyway (and which would be more secure). I have the huge luxury, of course, that this is a one-time research-oriented project, not something whose performance I need to worry about.
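For the awk approach, a term-frequency pass over a corpus can be a one-liner. This is a hypothetical sketch (the filename posts.txt is made up, and it assumes whitespace-separated tokens):

```shell
# Count every whitespace-separated word in posts.txt,
# then print the most frequent ones first.
awk '{ for (i = 1; i <= NF; i++) counts[$i]++ }
     END { for (w in counts) print counts[w], w }' posts.txt | sort -rn | head
```

Because awk streams the file line by line, this stays in constant memory regardless of corpus size, which sidesteps the memory issue above.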
I will of course look into a lot of the solutions provided here regardless, as something (especially along the visualizations angle) could prove more useful, and it’s all fascinating to me.
You don't need to copy around your data into different groupings. Just group the doc IDs and rerun the analysis pipeline each time. If a part of the analysis pipeline is slow, it can be cached. But that's that module's business. Don't let it disrupt the interface.
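Concretely, "grouping the doc IDs" can mean keeping a single copy of the corpus and representing each grouping as a list of IDs. A minimal sketch (the documents, group names, and the token-count stand-in for the real pipeline are all invented for illustration):

```python
# One on-disk / in-memory copy of the corpus, keyed by doc ID.
corpus = {
    "doc1": "looking for someone to share coffee with",
    "doc2": "new to the city, hoping to meet people",
}

# Each grouping is just a list of IDs -- no files are recopied.
groupings = {
    "metro_a": ["doc1"],
    "metro_b": ["doc1", "doc2"],
}

def run_pipeline(doc_ids):
    # Stand-in for the real analysis: total token count per group.
    return sum(len(corpus[d].split()) for d in doc_ids)

results = {name: run_pipeline(ids) for name, ids in groupings.items()}
```

Regrouping is then a cheap edit to `groupings`, and the pipeline reruns over the same untouched corpus.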
For what you want to do, the tokenizer is sufficient --- it runs at 5,000 documents per second. You can then efficiently export to a numpy array of integers, with tokens.count_by, and base your counts on that, as numpy operations. Processing a few GB of text in this way should be fast enough that you don't need to do any caching. I develop spaCy on my MacBook Air, so it should run fine on your MBP.
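Once the tokens are integer IDs in a numpy array, the frequency counts themselves are one numpy call. A sketch under the assumption that you already have such an array (the IDs below are invented; in spaCy they might come from something like Doc.to_array or count_by):

```python
import numpy as np

# Pretend these integer IDs came out of the tokenizer.
token_ids = np.array([3, 7, 3, 3, 9, 7])

# np.unique with return_counts gives each distinct ID and its frequency.
ids, counts = np.unique(token_ids, return_counts=True)
freq = dict(zip(ids.tolist(), counts.tolist()))
# freq maps token ID -> occurrence count
```

Everything stays as vectorized numpy operations until the final dict, so this scales comfortably to corpus-sized arrays.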
As a general tip though, the way you're copying around your batches of data is definitely bad. It's really much better to structure the core of your program in a simple and clear way, so that caching is handled "transparently".
So, let's say you really did need to run a much more expensive analysis pipeline, so it was critical that each document was only analysed once.
You would still make sure that you had a simple function like:
def analyse(doc_id):
    <do stuff>
    return <stuff>
So that you can clearly express what you want to do:
def gather_statistics(batch):
    analyses = []
    for doc_id in batch:
        analyses.append(analyse(doc_id))
    <do stuff>
    return <stuff>
If the problem is that analyse(doc_id) takes too long, that's fine --- you can cache. But make sure that's something only analyse(doc_id) deals with. It shouldn't complicate the interface to the function.
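One way to keep the caching entirely inside analyse is a memoizing decorator from the standard library. A toy sketch (the body of analyse here is a hypothetical stand-in, not the real pipeline):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def analyse(doc_id):
    # The expensive pipeline would run here; this toy version just
    # returns the length of the ID as a stand-in "analysis" result.
    return len(str(doc_id))

def gather_statistics(batch):
    # Callers never see the cache: repeated doc IDs simply come back fast.
    return [analyse(doc_id) for doc_id in batch]
```

The interface to analyse(doc_id) is unchanged; only repeated calls with the same ID are served from memory instead of recomputed. (For results too large for RAM, the same shape works with a disk-backed cache inside analyse.)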
Thanks again HN for all of your help.