FYI: New Yandex trick to ignore Robots.txt?

     

DeeCee

8:00 pm on Mar 23, 2012 (gmt 0)



Looking at access patterns from various search engines, I noticed something strange that is either a deliberate trick or a programmer 'oversight'. :)

I have some sites where Yandex is blocked. At first I hoped it would follow the default rule in robots.txt, which is Deny. Apparently it ignores any default listed.

Then I added a Deny in robots.txt specifically for the Yandex user-agent, which seemed to work for a while.

Then suddenly something new...

Multiple requests stopped right after checking robots.txt. All seemed OK.
But then shortly afterwards the bot (same agent string, same US Yandex IP) came back, now reading the site's RSS feed (which should also be covered by the block), and a couple of seconds later it started loading the pages/links discovered in the feed.

It sure seems like they believe that if a link has been "published" through the RSS feed, then there is no need to validate them against robots.txt..

There sure are some strange interpretations of the robots.txt standard out there.

Strangely enough these "programmer mistakes" always seem to be negative for site-owners and a benefit for crawlers.

nomis5

9:03 pm on Jul 31, 2012 (gmt 0)




Interesting post but why would you want to stop Yandex crawling your site?

phranque

11:40 pm on Jul 31, 2012 (gmt 0)




not addressing the why here, but how...

I added a Deny in robots.txt specifically for the Yandex user-agent

are you actually using a "Deny" directive in robots.txt?
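for what it's worth, the directive that robots.txt parsers look for is "Disallow", and a line they don't recognise is typically just skipped. here is a rough sketch using Python's standard urllib.robotparser; the two rule sets, the helper name is_allowed and the example.com URL are made up for illustration, and this shows how that one parser behaves, not necessarily how Yandex's crawler does.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rule sets: "Deny" is not a directive this parser knows.
WITH_DENY = """User-agent: Yandex
Deny: /
"""

WITH_DISALLOW = """User-agent: Yandex
Disallow: /
"""

URL = "http://example.com/some-page.html"  # placeholder URL

print(is_allowed(WITH_DENY, "Yandex", URL))      # True: the Deny line is ignored
print(is_allowed(WITH_DISALLOW, "Yandex", URL))  # False: the block takes effect
```

so if the file really says "Deny", a crawler behaving like this parser would see no rule at all for that user-agent.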

if a link has been "published" through the RSS feed, then there is no need to validate them against robots.txt

robots.txt addresses requested resources, not referrers.
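to illustrate: a well-behaved crawler would check every URL it is about to request, the feed itself and each link found in it, no matter where the link was discovered. a minimal sketch with Python's urllib.robotparser, assuming a made-up robots.txt, made-up example.com URLs and a hypothetical helper called fetchable; it does not describe what Yandex's bot actually runs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Yandex is blocked everywhere,
# every other bot only from /private/.
ROBOTS_TXT = """User-agent: Yandex
Disallow: /

User-agent: *
Disallow: /private/
"""


def fetchable(agent: str, urls: list[str]) -> list[str]:
    """Return only the URLs that `agent` is permitted to request."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return [url for url in urls if parser.can_fetch(agent, url)]


feed_url = "http://example.com/feed.rss"
feed_links = [
    "http://example.com/articles/1.html",
    "http://example.com/private/draft.html",
]

# The check is per requested URL; being listed in the feed changes nothing.
print(fetchable("Yandex", [feed_url] + feed_links))
print(fetchable("OtherBot", [feed_url] + feed_links))
```

with these rules the Yandex list comes back empty, while the other bot keeps the feed and the first article but not the /private/ page.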

yandex has pretty good documentation on their implementation of robots.txt - Using robots.txt - Yandex.Help: webmaster:
http://help.yandex.com/webmaster/?id=1113851

yandex also has a robots.txt analysis tool - Yandex.Webmaster - Robots.txt analysis:
http://webmaster.yandex.com/robots.xml