It is a testament to NLTK that this can be accomplished in less than 100 lines.
- In naive Bayes classification, model parameters can usually be estimated using relative frequencies in the training data.
- WordPunctTokenizer is a very simple tokenizer that makes anything matching \w+ and [^\w\s]+ a separate token.
- Extracting bigrams from a list of tokens is trivial (see the sketch after this list).
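To make the point concrete, here is a minimal sketch of all three steps without NLTK. The tiny training corpus and the helper names (`tokenize`, `bigrams`) are illustrative assumptions, not anything from the original text; the regex simply mirrors the \w+ / [^\w\s]+ behaviour described above, and the naive Bayes parameters are plain relative frequencies (no smoothing).

```python
import re
from collections import Counter, defaultdict

# Illustrative toy corpus (assumption): (text, class label) pairs.
train = [
    ("the service was great and the food was great", "pos"),
    ("the food was awful and the service was slow", "neg"),
]

# Tokenization: anything matching \w+ or [^\w\s]+ becomes a token,
# which is exactly the behaviour attributed to WordPunctTokenizer above.
token_re = re.compile(r"\w+|[^\w\s]+")

def tokenize(text):
    return token_re.findall(text)

def bigrams(tokens):
    # Bigram extraction is just pairing each token with its successor.
    return list(zip(tokens, tokens[1:]))

# Naive Bayes parameters as relative frequencies (unsmoothed):
# P(c) = count(c) / number of documents
# P(w | c) = count(w in class c) / total tokens in class c
class_counts = Counter()
word_counts = defaultdict(Counter)
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(tokenize(text))

priors = {c: n / sum(class_counts.values()) for c, n in class_counts.items()}
likelihoods = {
    c: {w: n / sum(counts.values()) for w, n in counts.items()}
    for c, counts in word_counts.items()
}

print(tokenize("Isn't this easy?"))  # ['Isn', "'", 't', 'this', 'easy', '?']
print(bigrams(["a", "b", "c"]))      # [('a', 'b'), ('b', 'c')]
print(priors["pos"], likelihoods["pos"]["great"])
```

None of this is meant as a replacement for NLTK; it is only to show that each individual step is a few lines of standard-library Python.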
Of course, using NLTK will be very helpful in many situations, but this is hardly a testament to NLTK.