Robots and Redirects

Does disallow block a 301 redirection?

chewy

10:58 pm on Mar 25, 2015 (gmt 0)

Let's say I have a site where I block a certain directory, such as:

User-agent: *
Disallow: /special/

What happens if I have a 301 redirect in place for an old file that points to a new file within that very same blocked directory? For example:

redirect permanent oldsite.zom/special/blue-stuff.html [newsite.zom...]

Will Google follow that 301 and start to spider and index that page alone? Or does it now have license to spider and index content within that directory?

Or will it simply be blocked, leaving the old listing in Google's SERPs?

Indications are that it will be blocked. Is that correct?

If that is true, are there any options? For instance, can I use the Allow directive for Googlebot to allow that single page within a directory that is blocked?

As always - thanks in advance!

phranque

12:12 am on Mar 26, 2015 (gmt 0)

Googlebot and other well-behaved spiders respect the Robots Exclusion Protocol and will not request any URLs that match Disallow patterns.
Therefore, if the old URL matches a Disallow pattern, Googlebot will never request it and will never see the 301 response.
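
To make that concrete, here is a minimal sketch that uses Python's standard urllib.robotparser as a stand-in for a well-behaved crawler. It is not Googlebot's actual matcher. The rules and the path come from the question above; the http:// scheme, the index.html URL, and the helper name is_allowed are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules quoted in the original question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /special/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The redirected URL lives under /special/, so a compliant bot skips it
# and never gets as far as receiving the 301.
print(is_allowed(ROBOTS_TXT, "Googlebot",
                 "http://oldsite.zom/special/blue-stuff.html"))  # False

# A URL outside the blocked directory (hypothetical) may be requested.
print(is_allowed(ROBOTS_TXT, "Googlebot",
                 "http://oldsite.zom/index.html"))  # True
```

The check happens before any HTTP request is made, which is why the server-side redirect never comes into play for the blocked URL.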

"can I use the Allow directive for Googlebot to allow that single page within a directory that is blocked?"

Google, Bing and most other bots support the Allow directive:

Robots.txt Specifications - Webmasters — Google Developers:
http://developers.google.com/webmasters/control-crawl-index/docs/robots_txt [developers.google.com]
How to Create a Robots.txt File - Bing Webmaster Tools:
http://www.bing.com/webmaster/help/how-to-create-a-robots-txt-file-cb7c31ec [bing.com]
Using robots.txt — Yandex.Help. Webmaster:
http://help.yandex.com/webmaster/controlling-robot/robots-txt.xml [help.yandex.com]
blekko:
http://blekko.com/about/blekkobot [blekko.com]
Baiduspider:
http://help.baidu.com/question?prod_en=master&class=498&id=1000550 [help.baidu.com]
Majestic-12 : DSearch : MJ12bot:
http://www.majestic12.co.uk/projects/dsearch/mj12bot.php [majestic12.co.uk]

However, not all do.
For example, DuckDuckGo uses WWW::RobotRules, which adheres to the original Robots Exclusion Protocol [robotstxt.org] and doesn't support the Allow directive:
http://metacpan.org/pod/WWW::RobotRules [metacpan.org]
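
As a rough sketch of the Allow approach, the example below opens up a single page inside a blocked directory and checks the result with Python's standard urllib.robotparser. That parser applies the first rule that matches, so the Allow line is placed before the Disallow line; that order should also be safe for crawlers such as Googlebot that pick the most specific (longest) matching rule. The host name, the other.html page, and the helper name can_crawl are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Allow one page, then block the rest of the directory.
ROBOTS_WITH_ALLOW = """\
User-agent: *
Allow: /special/blue-stuff.html
Disallow: /special/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent request url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_page = can_crawl(ROBOTS_WITH_ALLOW, "Googlebot",
                         "http://newsite.zom/special/blue-stuff.html")
other_page = can_crawl(ROBOTS_WITH_ALLOW, "Googlebot",
                       "http://newsite.zom/special/other.html")

print(allowed_page)  # True: the single page is explicitly allowed
print(other_page)    # False: everything else under /special/ stays blocked
```

A bot that implements only the original protocol would be expected to ignore the Allow line and treat both URLs as blocked, so this only helps with crawlers that support Allow.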

lucy24

4:03 am on Mar 26, 2015 (gmt 0)

When you, as a human, meet a redirect, your browser sends you along to the new URL without asking if that's what you want. That's the browser doing its job. But a robot, including a search-engine spider, has a different job. It requests a URL and then makes note of the response. If the response is "I've moved, so go over here" (as in a 301 redirect), the robot then has to make a separate decision about whether to request this new URL. One thing that factors into the decision is whether the robot is, in fact, allowed to request the second URL.
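
The decision described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not how any real search engine is implemented: the hosts old.example and new.example, the /private/ directory, the ExampleBot name, and the canned RESPONSES table are all hypothetical, and no network requests are made. Python's standard urllib.robotparser does the robots.txt matching.

```python
from urllib.parse import urljoin, urlsplit
from urllib.robotparser import RobotFileParser

REDIRECT_CODES = {301, 302, 307, 308}


def make_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set for one host."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl(url, rules_by_host, responses, agent="ExampleBot", max_hops=5):
    """Follow redirects, re-checking robots.txt before every request."""
    steps = []
    for _ in range(max_hops):
        rules = rules_by_host.get(urlsplit(url).netloc)
        if rules is not None and not rules.can_fetch(agent, url):
            steps.append((url, "blocked"))  # never requested
            return steps
        status, location = responses.get(url, (404, None))
        if status in REDIRECT_CODES and location:
            steps.append((url, "redirect"))
            url = urljoin(url, location)  # a new URL, a new decision
            continue
        steps.append((url, "fetched" if status == 200 else f"status {status}"))
        return steps
    steps.append((url, "gave up"))
    return steps


# Hypothetical setup: the old host no longer blocks anything,
# while the new host blocks /private/.
RULES = {
    "old.example": make_rules("User-agent: *\nDisallow:\n"),
    "new.example": make_rules("User-agent: *\nDisallow: /private/\n"),
}
RESPONSES = {
    "http://old.example/private/page.html":
        (301, "http://new.example/private/page.html"),
}

steps = crawl("http://old.example/private/page.html", RULES, RESPONSES)
for step in steps:
    print(step)
```

In this setup the bot receives the 301 from the old host, then declines to request the target because the new host's robots.txt disallows it.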

Illustration: A while back, I moved my personal site to a new domain name. Since this meant that a certain roboted-out directory no longer existed at the old site, I removed the robots.txt block. As soon as search engines made this discovery, they went wild with excitement and requested all the pages they knew about in this directory. But all they got was a 301 telling them to go over to the new site ... where the equivalent URL was duly roboted-out. Net result: A lot of redirects, but no follow-up requests.

chewy

3:11 pm on Mar 26, 2015 (gmt 0)

Thanks Phrank + Lucy!