
so you will actually respect:

    User-agent: google
    Allow: /
    
    User-agent: *
    Disallow: /
Well, yes, it's good behavior to actually respect it, but I've already seen robots.txt files like this, which makes it really painful to create a competing search engine.
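For reference, a minimal sketch of how a compliant crawler evaluates a robots.txt like that one, using Python's standard-library robotparser (the URL and bot names are just illustrative):

    from urllib.robotparser import RobotFileParser

    # The example robots.txt from above: allow "google", block everyone else.
    robots_txt = """\
    User-agent: google
    Allow: /

    User-agent: *
    Disallow: /
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Googlebot matches the "google" group and may fetch anything.
    print(parser.can_fetch("Googlebot", "https://example.com/some/page"))       # True

    # Any other crawler falls through to the "*" group and is blocked.
    print(parser.can_fetch("MyCompetingBot", "https://example.com/some/page"))  # False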


This has me ridiculously curious now. Is that common? Other than taking a random sampling of sites I go to, is there a good way to get numbers on how often this is used?

Edit: In my scanning, I have to confess that Wikipedia's robots.txt is the best. It's fairly heavily commented on why the rules are there. https://en.wikipedia.org/robots.txt


I analyzed the top 1 million robots.txt files looking for sites that allow Google and block everyone else here: https://www.benfrederickson.com/robots-txt-analysis/ - it's a relatively common pattern among major websites.
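Not necessarily the article's methodology, but a rough sketch of how you could flag that pattern for a single site with Python's robotparser (the domain list and the generic bot name are placeholders, and error handling is omitted):

    from urllib.robotparser import RobotFileParser

    def favors_google(domain):
        """True if robots.txt lets Googlebot fetch the root but blocks a generic bot."""
        parser = RobotFileParser(f"https://{domain}/robots.txt")
        parser.read()  # fetch and parse the live robots.txt
        root = f"https://{domain}/"
        return parser.can_fetch("Googlebot", root) and not parser.can_fetch("SomeOtherBot", root)

    # Placeholder for iterating over a top-1M domain list.
    for domain in ["example.com"]:
        print(domain, favors_google(domain))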


I ran a Yacy web crawler (P2P web search, https://yacy.net) a while ago. As far as I remember, I only saw Yandex disallowed in robots.txt a few times when I had trouble crawling a site. Mostly my Yacy crawler just got served an empty page instead of the "real" website.


Just do what browsers did with user agent strings: call your bot "botx (google crawler compatible)" and crawl everything that allows Googlebot, without any weight on your conscience.
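For what it's worth, that's just a matter of setting the User-Agent header; the URL below is a placeholder. Python's robotparser, for instance, matches a group name as a case-insensitive substring of the user agent string, so a name like this would fall under the "google" group of the example robots.txt above there, though other parsers may match only the product token.

    import urllib.request

    # The "compatible"-style user agent string suggested above.
    ua = "botx (google crawler compatible)"

    req = urllib.request.Request("https://example.com/", headers={"User-Agent": ua})
    with urllib.request.urlopen(req) as resp:
        html = resp.read()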



