Why not just compare the content at the other end of the link with the content of existing links?
It wouldn't be that hard. Whenever a link is submitted, YC's server would visit the link, get the response, strip all HTML tags and whitespace from it, hash whatever is left, and store that hash value with the link. Each new submission is hashed the same way and checked against the stored hashes. If a link with the same hash already exists, it's a dupe; if not, allow it.
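Here's a rough sketch of that check in Python, just to show how little is involved. Everything named here (content_hash, check_submission, the seen_hashes dict) is made up for illustration, and SHA-256 is only one possible choice of hash. Fetching the page is left out (something like urllib.request.urlopen would do), and the regex tag stripping is crude compared to a real HTML parser.

```python
import hashlib
import re

# Hypothetical in-memory store: content hash -> first URL seen with that content.
# A real site would keep this in its database next to the link.
seen_hashes = {}

def content_hash(html: str) -> str:
    """Strip tags and whitespace, then hash whatever is left."""
    text = re.sub(r"<[^>]*>", "", html)  # crude tag removal, not a real HTML parser
    text = re.sub(r"\s+", "", text)      # drop all whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def check_submission(url: str, html: str):
    """Return the earlier URL if this content was already seen, else None."""
    h = content_hash(html)
    if h in seen_hashes:
        return seen_hashes[h]  # dupe: same content under another URL
    seen_hashes[h] = url       # new content: remember it and allow the link
    return None
```

Two URLs whose pages differ only in markup and spacing hash to the same value, so the second one comes back as a dupe of the first.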
This would be an extra check on top of the existing duplicate-URL string comparison, of course. It still wouldn't catch everything, but it should eliminate quite a few easy dupes.
If that turns out to have a low success rate, try hashing the page title or maybe the HTTP headers.
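The title variant could look something like this. Again a sketch only: title_hash is a made-up name, and collapsing whitespace and lowercasing the title before hashing are my own assumptions about what "the same title" should mean.

```python
import hashlib
import re

def title_hash(html: str):
    """Hash the page's <title> text, or return None if the page has no title."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if m is None:
        return None
    # Assumption: treat titles as equal if they differ only in spacing or case.
    title = " ".join(m.group(1).split()).lower()
    return hashlib.sha256(title.encode("utf-8")).hexdigest()
```

Titles are much less unique than full page content, so a match here is probably better treated as a "possible dupe" warning than as an automatic rejection.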