Top 100 words in News.YC titles

ivankirigin · on Jan 9, 2008

I presume you're implementing search?

pg · on Jan 10, 2008

Good guess.

chris_l · on Jan 10, 2008

Take a look at Solr (http://lucene.apache.org/solr/); it makes indexing very easy.

gojomo · on Jan 9, 2008

Dr. Seuss wrote 'Green Eggs and Ham' with only 50 words. Can these 100 be strung together (allowing repetition) into something remotely meaningful and grammatical?

imp · on Jan 10, 2008

It's hard without connecting words. I was able to come up with a few headlines though:

- Google launches first big company platform using python where people make money over hacker life.

- Lisp application time better vs. javascript, python, ruby.

- Startup entrepreneur launches better open source ruby tech blog.

Edit: I just now saw the other headlines by readers, but I don't feel like submitting these as actual stories.

arasakik · on Jan 9, 2008

Interesting. I wonder what the results would look like if the popularity (in points) of the submissions were incorporated.

DanielBMarkham · on Jan 10, 2008

So now that we have the most popular keywords, who can make a title with the most keywords in it?

http://news.ycombinator.com/item?id=96660

brk · on Jan 10, 2008

Mine: http://news.ycombinator.com/item?id=96753

german · on Jan 10, 2008

nice contest, here is mine :P

http://news.ycombinator.com/item?id=96663

jakewolf · on Jan 9, 2008

I like the end: "problem platform. next website need computer better."

ivankirigin · on Jan 9, 2008

Note to self: like best hacker make way good (10) ruby application time future

danielha · on Jan 9, 2008

Is that listed by frequency? If news.yc existed in 2000 I wonder what the list would look like.

pg · on Jan 9, 2008

Yes; I changed the page to clarify. Though in fact it would be hard to generate such a list without it being in order of frequency.

emfle · on Jan 11, 2008

Heh, this is algorithm bait.

The obvious algorithm is to sort all the words in order of decreasing frequency, then print the first 100. This algorithm is O(n log n) because of the sorting. It generates the list in order of frequency.
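
For illustration, a minimal Python sketch of that sort-based approach. The function name and the whitespace tokenization are my own assumptions, not how the actual list was produced:

```python
from collections import Counter

def top_words_sorted(titles, k=100):
    """Return the k most frequent words as (word, count) pairs, most frequent first."""
    # Assumption: words are lowercased and split on whitespace only.
    counts = Counter(word for title in titles for word in title.lower().split())
    # Sorting every distinct word is the O(n log n) step.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Calling top_words_sorted(titles) on a list of title strings gives the top 100 already in frequency order, which is the side effect mentioned above.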

There is an O(n) algorithm that first picks some random item as pivot, then does a partitioning like QuickSort where items with higher frequency than the pivot are moved to the front of the array and those with lower frequency to the back.

If the first partition has more than 100 items, then the algorithm only has to recurse into that part. If it has fewer (k), then it prints everything in the first partition, and recurses into the second to generate the remaining 100 - k items.

This is expected O(n) and will not generate the list in order of frequency.
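
A rough Python sketch of the partition-based selection, under some assumptions of mine: it takes (word, count) pairs, builds new lists instead of partitioning an array in place, and uses a three-way split so that ties with the pivot can't cause an endless loop. The name top_k_unordered is hypothetical.

```python
import random

def top_k_unordered(items, k):
    """Return the k (word, count) pairs with the highest counts, in no particular order."""
    items = list(items)
    result = []
    while k > 0 and items:
        if len(items) <= k:
            # Everything left is needed.
            result.extend(items)
            break
        pivot = random.choice(items)[1]
        higher = [it for it in items if it[1] > pivot]
        equal = [it for it in items if it[1] == pivot]
        lower = [it for it in items if it[1] < pivot]
        if len(higher) >= k:
            # Enough items above the pivot: only that part matters.
            items = higher
        elif len(higher) + len(equal) >= k:
            # The pivot's ties fill the remaining slots.
            result.extend(higher)
            result.extend(equal[:k - len(higher)])
            break
        else:
            # Keep everything at or above the pivot, then look below it.
            result.extend(higher)
            result.extend(equal)
            k -= len(higher) + len(equal)
            items = lower
    return result
```

Each pass touches only the part that is still undecided, which is where the expected O(n) comes from. The result holds the right items but in arbitrary order.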

robg · on Jan 9, 2008

A fun, informative feature could be that list for each user in their profile. Even better if it updates based on use - how the hivemind splinters into the specific bees. There's so much that could be done there - similar users, related posts, discovery. I'm surprised we have yet to see that basic functionality in the more nebulous news sites.

rokhayakebe · on Jan 9, 2008

This is the only crowd where "money" comes almost last.

brent · on Jan 10, 2008

Granted it's ambiguous, but "million" is closer to the top than "money". It's also easier to say :-).

brk · on Jan 10, 2008

Check back in a few years ;)

german · on Jan 10, 2008

There is YC, thought I'd find PG also.

lst · on Jan 10, 2008

(Apparently someone forgot to add 'remove #\|').