
so you will actually respect:

    User-agent: google
    Allow: /
    
    User-agent: *
    Disallow: /
Well, yes, it's good behavior to actually respect it, but I've already seen robots.txt files like this, which makes it really painful to create a competing search engine.
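For reference, a minimal sketch of how a compliant crawler evaluates a robots.txt like that one, using Python's standard-library robotparser (the URL and bot names are just illustrative):

    from urllib.robotparser import RobotFileParser

    # The example robots.txt from above: allow "google", block everyone else.
    robots_txt = """\
    User-agent: google
    Allow: /

    User-agent: *
    Disallow: /
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Googlebot matches the "google" group and may fetch anything.
    print(parser.can_fetch("Googlebot", "https://example.com/some/page"))       # True

    # Any other crawler falls through to the "*" group and is blocked.
    print(parser.can_fetch("MyCompetingBot", "https://example.com/some/page"))  # False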


This has me ridiculously curious now. Is that common? Other than taking a random sampling of sites I go to, is there a good way to get numbers on how often this is used?

Edit: In my scanning, I have to confess that Wikipedia's robots.txt is the best. It's fairly heavily commented on why the rules are there. https://en.wikipedia.org/robots.txt


I analyzed the top 1 million robots.txt files looking for sites that allow Google and block everyone else here: https://www.benfrederickson.com/robots-txt-analysis/ - it's a relatively common pattern among major websites.
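Not necessarily the article's methodology, but a rough sketch of how you could flag that pattern for a single site with Python's robotparser (the domain list and the generic bot name are placeholders, and error handling is omitted):

    from urllib.robotparser import RobotFileParser

    def favors_google(domain):
        """True if robots.txt lets Googlebot fetch the root but blocks a generic bot."""
        parser = RobotFileParser(f"https://{domain}/robots.txt")
        parser.read()  # fetch and parse the live robots.txt
        root = f"https://{domain}/"
        return parser.can_fetch("Googlebot", root) and not parser.can_fetch("SomeOtherBot", root)

    # Placeholder for iterating over a top-1M domain list.
    for domain in ["example.com"]:
        print(domain, favors_google(domain))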


I ran a Yacy web crawler (P2P web search, https://yacy.net) a while ago. As far as I remember, I only saw Yandex disallowed in robots.txt a few times when I had trouble crawling a site. Mostly my Yacy crawler just got served an empty page instead of the "real" website.


Just do what browsers did with user agent strings: call your bot "botx (google crawler compatible)" and crawl everything that allows Googlebot, without any weight on your conscience.
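For what it's worth, that's just a matter of setting the User-Agent header; the URL below is a placeholder. Python's robotparser, for instance, matches a group name as a case-insensitive substring of the user agent string, so a name like this would fall under the "google" group of the example robots.txt above there, though other parsers may match only the product token.

    import urllib.request

    # The "compatible"-style user agent string suggested above.
    ua = "botx (google crawler compatible)"

    req = urllib.request.Request("https://example.com/", headers={"User-Agent": ua})
    with urllib.request.urlopen(req) as resp:
        html = resp.read()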



