
European Search Engines Forum

FYI: New Yandex trick to ignore Robots.txt?
DeeCee
msg:4432721
8:00 pm on Mar 23, 2012 (gmt 0)

Looking at access patterns from various search engines, I noticed something strange that is either a deliberate trick or a programmer 'oversight'. :)

I have some sites where Yandex is blocked. At first I hoped it would follow the default in that robots.txt, which is Deny. Apparently it ignores any default listed.

Then I added a Deny in robots.txt specifically for the Yandex user-agent, which seemed to work for a while.

Then suddenly something new...

Multiple requests stopped right after the robots.txt check. All seemed OK.
But then, shortly after, the bot (same agent string, same US Yandex IP) came back, now reading the site's RSS feed (which should also be under the block), followed a couple of seconds later by requests for the pages/links discovered in the feed.

It sure seems like they believe that if a link has been "published" through the RSS feed, then there is no need to validate it against robots.txt.

There sure are some strange interpretations of the robots.txt standard out there.

Strangely enough, these "programmer mistakes" always seem to be negative for site owners and a benefit for crawlers.
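One way to see this sequence for yourself is to pull every Yandex request out of the raw access log and read it in order. A rough sketch, assuming an Apache/nginx combined-format log named access.log and matching on the documented "YandexBot" token (both of those details are assumptions; adjust to your own setup):

```python
import re
import sys

# One line of an Apache/nginx "combined" log:
# client ident user [time] "request" status size "referer" "user-agent"
LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def yandex_requests(log_path):
    """Yield (time, method, path, status) for every request whose UA mentions YandexBot."""
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for raw in log:
            match = LINE.match(raw)
            if match and "YandexBot" in match.group("agent"):
                yield (match.group("time"), match.group("method"),
                       match.group("path"), match.group("status"))

if __name__ == "__main__":
    log_file = sys.argv[1] if len(sys.argv) > 1 else "access.log"
    for time, method, path, status in yandex_requests(log_file):
        print(time, method, path, status)
```

Reading the output top to bottom makes it easy to line up the robots.txt fetch, the feed fetch, and any follow-up page requests.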


nomis5
msg:4480702
9:03 pm on Jul 31, 2012 (gmt 0)

Interesting post, but why would you want to stop Yandex crawling your site?

phranque
msg:4480739
11:40 pm on Jul 31, 2012 (gmt 0)

not addressing the why here, but how...

I added a Deny in robots.txt specifically for the Yandex user-agent

are you actually using a "Deny" directive in robots.txt?
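For reference, "Deny" is not part of the robots.txt vocabulary; the standard directive is "Disallow". A minimal sketch with Python's standard-library parser shows the practical difference (the rules and the example.com URL are made up for illustration, and Yandex's own parser may of course behave differently from Python's):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules using the non-standard "Deny" directive mentioned above.
BROKEN = """\
User-agent: Yandex
Deny: /
"""

# The same intent written with the standard "Disallow" directive.
FIXED = """\
User-agent: Yandex
Disallow: /
"""

def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse the given robots.txt text and ask whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

URL = "https://example.com/feed/rss.xml"

# The unknown "Deny" line is skipped, so no disallow rule is recorded for
# the Yandex group and the fetch is reported as allowed.
print(is_allowed(BROKEN, "YandexBot", URL))  # True

# With "Disallow: /" every path is blocked for the Yandex group.
print(is_allowed(FIXED, "YandexBot", URL))   # False
```

In other words, a group that only contains an unrecognised "Deny" line ends up with no usable rules at all.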

if a link has been "published" through the RSS feed, then there is no need to validate them against robots.txt

robots.txt addresses requested resources, not referrers.
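To make that concrete: a robots.txt check only ever looks at the crawler's user agent and the URL it is about to request; where the crawler found that URL never enters into it. A small sketch along the same lines as above (again with made-up example.com paths):

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Yandex
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# can_fetch() takes only the user agent and the URL being requested.
# Whether the URL came from an RSS feed, a sitemap, or an external link
# makes no difference to the answer.
for url in ("https://example.com/feed/rss.xml",
            "https://example.com/article-discovered-via-the-feed/"):
    print(url, parser.can_fetch("YandexBot", url))  # prints False for both URLs
```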

yandex has pretty good documentation on their implementation of robots.txt - Using robots.txt - Yandex.Help: Webmaster:
http://help.yandex.com/webmaster/?id=1113851

yandex also has a robots.txt analysis tool - Yandex.Webmaster - Robots.txt analysis:
http://webmaster.yandex.com/robots.xml
