I was looking at access patterns from various search engines and noticed something strange that is either a deliberate trick or a programmer 'oversight'. :)
I have some sites where Yandex is blocked. At first I hoped it would follow the default rule in that robots.txt, which is Deny. Apparently it ignores any default listed.
Then I added a Deny in robots.txt specifically for the Yandex user-agent, which seemed to work for a while.
Then suddenly something new...
On several visits the bot stopped right after checking robots.txt. All seemed OK.
But then, shortly after, the bot (same agent string, same US Yandex IP) came back, this time reading the site's RSS feed (which should also be covered by the block), and a couple of seconds later it started loading the pages/links discovered in the feed.
It sure seems like they believe that if a link has been "published" through the RSS feed, there is no need to validate it against robots.txt.
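For comparison, here is a rough sketch of what I would expect a well-behaved crawler to do: check every URL against robots.txt before fetching it, no matter whether it was found through a page link or an RSS feed. It uses Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, the "YandexBot" token and the allowed_urls helper are all assumptions made up for illustration, not my actual setup and not anything Yandex has published.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a block aimed at Yandex plus a default block.
ROBOTS_TXT = """\
User-agent: Yandex
Disallow: /

User-agent: *
Disallow: /
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return only the URLs that robots_txt permits user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# URLs a crawler might have collected: the feed itself and a link found in it.
discovered = [
    "https://example.com/feed.rss",
    "https://example.com/articles/1",
]

# Both URLs are disallowed, so this prints an empty list.
print(allowed_urls(ROBOTS_TXT, "YandexBot", discovered))
```

The point is simply that the source of a link should not matter: the feed URL and everything listed in it go through the same check.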
There sure are some strange interpretations of the robots.txt standard out there.
Strangely enough, these "programmer mistakes" always seem to work against site owners and in favor of crawlers.