
This is an interesting exercise in building a very specific word disambiguator ('apple' the company vs 'apple' the fruit).

It is a testament to NLTK that this can be accomplished in fewer than 100 lines.



Apart from stemming, perhaps, it's not hard to implement this in ~100 lines without NLTK:

- In naive Bayes classification, model parameters can usually be estimated using relative frequencies in the training data.

- WordPunctTokenizer is a very simple tokenizer that makes anything matching \w+ and [^\w\s]+ a separate token.

- Extracting bigrams from a list of tokens is trivial.
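
To make the three points above concrete, here is a rough sketch using only the standard library. It is not the linked article's code: the names (tokenize, bigrams, features, NaiveBayes) are made up for illustration, and lowercasing and add-one smoothing are my own assumptions, not something NLTK's tokenizer or the original post necessarily does.

```python
import math
import re
from collections import Counter, defaultdict

# The pattern described above: runs of word characters, or runs of
# characters that are neither word characters nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)

def bigrams(tokens):
    """Return adjacent token pairs as a list of tuples."""
    return list(zip(tokens, tokens[1:]))

def features(text):
    """Unigrams plus bigrams. Lowercasing is an assumption of this sketch."""
    tokens = tokenize(text.lower())
    return tokens + bigrams(tokens)

class NaiveBayes:
    """Multinomial naive Bayes estimated from relative frequencies."""

    def __init__(self):
        self.label_counts = Counter()               # documents per label
        self.feature_counts = defaultdict(Counter)  # feature counts per label
        self.vocab = set()                          # all features seen in training

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for feature in features(text):
                self.feature_counts[label][feature] += 1
                self.vocab.add(feature)

    def classify(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.label_counts.items():
            counts = self.feature_counts[label]
            # Add-one smoothing (an assumption) so unseen features
            # do not produce a zero probability.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for feature in features(text):
                score += math.log((counts[feature] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage, with a toy training set invented purely for this example:

```python
clf = NaiveBayes()
clf.train([
    ("Apple released a new iPhone", "company"),
    ("Apple stock rose today", "company"),
    ("I ate an apple pie", "fruit"),
    ("apple trees bear fruit", "fruit"),
])
print(clf.classify("Apple unveiled a new iPhone"))
print(clf.classify("fresh apple pie"))
```

That comes to roughly 50 lines, with stemming left out.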

Of course, using NLTK will be very helpful in many situations, but this is hardly a testament to NLTK.



