For example, let's say someone's site links to www.domain.com/hello.php:
www.domain.com/hello.php
--------------------------
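<?php
// Redirect the visitor (or spider) to a page inside the disallowed directory.
header("Location: http://www.domain.com/disallowed/index.html");
exit; // stop the script so nothing else is sent after the redirect header
?>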
robots.txt
----------
User-agent: *
Disallow: /disallowed
Will the spider index the page from the disallowed directory?
My guess is that it will be indexed, because robots.txt only keeps the spider from requesting the page directly. In this case the spider didn't request the page by name; it was instead presented with it "through" a different (allowed) link.
Does anyone have any experience with this who can answer for sure?
However, many spiders, such as Google, Yahoo, and Ask Jeeves/Teoma, will list a URL-only result in their SERPs if they "know about" a URL but are disallowed by robots.txt from actually fetching the page. In Yahoo's case, they will use the link text they found with the link (if any) to create the listing.
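To make the robots.txt side of this concrete, here is a minimal Python sketch. It assumes a well-behaved spider checks every URL it is about to fetch, including a redirect target, against robots.txt. The rules are the ones quoted in the question; the helper name allowed() and the user-agent string "ExampleBot" are made up for illustration. Real spiders may behave differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from the robots.txt quoted in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /disallowed
"""

def allowed(url, agent="ExampleBot"):
    """Return True if the rules above permit `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

# The linked script itself is fetchable, so the spider sees the redirect...
print(allowed("http://www.domain.com/hello.php"))                # True
# ...but the redirect target falls under the Disallow rule.
print(allowed("http://www.domain.com/disallowed/index.html"))    # False
```

Under that assumption the spider learns the disallowed URL from the redirect but does not fetch it, which is the "knows about it but can't fetch it" situation that leads to URL-only listings.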
A partial solution is to allow the page to be spidered, but include a <meta name="robots" content="noindex"> tag on the page. However, I've seen Google ignore this occasionally as well, and include a URL-only listing anyway.
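The reason the page has to stay spiderable is that a spider can only obey the meta tag if it is allowed to fetch the page and read it. As a rough sketch of what that looks like from the spider's side, the Python below pulls the robots directives out of a page's HTML. The class name RobotsMetaParser and the function has_noindex() are hypothetical, and this is not a description of how any particular search engine parses pages.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())

def has_noindex(html):
    """Return True if the page asks spiders not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

page = '<html><head><meta name="robots" content="noindex"></head><body>Hi</body></html>'
print(has_noindex(page))  # True
```

If robots.txt disallows the page, the spider never gets far enough to see this tag, which is why the two methods don't combine.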
Jim