Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

It's a difficult problem to fix, you can set an Accept-Language header on crawl requests but his only works if the target website uses "Content Negotiation." Some sites ignore headers and determine language based on the IP address (Geo-IP) or the URL structure (e.g., /es/ vs /en/), basically a mess...




I don't get the problem you claim. You crawl something and get a document in whatever language the site delivers you. You know the language of that document with the lang=... attribute of the document. What results you show for a given language is under your control and not influenced by what the crawled site chose to serve to the crawler.

I'm working on the language improvements presently, but I need to clean out a lot of bad entries in my index. In essence what I am trying to say is many servers ignore "Accept-Language" so you have to rely on other means of detecting the language of the page reliably, e.g. inspecting the body content of the response. It's a non-trivial problem online.

So html lang=... is wrong, or doesn't exist?

> I am trying to say is many servers ignore "Accept-Language"

I wouldn't have expected that to be a hard rule, more like if there are multiple pages to return to have a factor, which one the user most likely wants.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: