Forum Moderators: goodroi


Quick Question

When something is disallowed, do bots ignore it completely?


jlander

11:22 pm on Sep 6, 2005 (gmt 0)

10+ Year Member



If a specific page is disallowed, such as /page1.asp, will googlebot ignore it completely, as in never visit it, or will it look at it for any other reason besides adding it to its index?

Lord Majestic

1:27 am on Sep 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you disallow a page, it means that page won't be crawled. However, as the same URL gets discovered again and again over time, you will keep getting extra chances to have that page crawled if you change robots.txt.

This is of course an educated guess as to how Google's architecture works.
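For reference, the disallow rule under discussion would look like this in robots.txt, using the /page1.asp path from the original question:

```
# robots.txt at the site root -- tells all compliant robots not to fetch this URL
User-agent: *
Disallow: /page1.asp
```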

jlander

1:56 am on Sep 7, 2005 (gmt 0)

10+ Year Member



So, if a page is disallowed, the page won't be added to the index (that I know), but also the images won't be found, and the links won't be followed?

I'm just trying to understand what disallowing actually does...

jdMorgan

2:02 am on Sep 7, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



A properly-designed robot will not fetch the file, in accordance with A Standard for Robot Exclusion [robotstxt.org].

However, Google, Yahoo, and Ask -- and perhaps MSN -- have taken to listing some pages as a URL-only listing, or using link text from the page where the link was discovered, for pages that your robots.txt tells them not to fetch. Since the robots standard was written solely with bandwidth conservation in mind, they're essentially taking advantage of this to "spider the Deep Web." Some Webmasters think it's OK, some think it's part of the Grand Conspiracy. I think it's just a nuisance.

If you want a page "not mentioned," then allow it to be fetched in robots.txt, and then use the on-page HTML <meta name="robots" content="noindex"> tag to prevent it being listed.
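Put together, that "fetch but don't list" approach means no Disallow line matching the page in robots.txt, plus this tag in the page itself (the page name is just an example):

```
<!-- In the <head> of the page you want crawled but kept out of listings -->
<meta name="robots" content="noindex">
```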

I should note that while this used to work perfectly, I've had some trouble with Google still listing a few of those pages, and have resorted to periodically using their URL removal tool to de-list those pages.

Notwithstanding this (hopefully temporary) problem, it's a trade-off depending on what you hope to accomplish -- To reduce bandwidth consumed by spiders, or to keep low-value or non-optimal-landing pages out of the index.

If you're looking for security, then password-protecting the page is the way to go, or using user-agent- *and* IP-address-sensitive redirection or rewriting to keep the 'bots out of certain pages/directories. This latter approach is closely-related to cloaking, so be careful that no-one would interpret it as an attempt to deceive search engines or their users. You've also got to keep a sharp eye out for new spider User-agent names and new IP address ranges.
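As a rough sketch of the User-agent-sensitive rewriting described above (Apache mod_rewrite in .htaccess; the bot names and the /private/ directory are assumed examples, and a real setup would also check IP ranges, since User-agent strings are trivially faked):

```
RewriteEngine On
# Return 403 Forbidden to known spider User-agents for this directory
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
RewriteRule ^private/ - [F]
```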

Jim

Clark

2:43 am on Sep 12, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Pursuant to your excellent suggestion,

I'm using php and this function to redirect:

header('Location: ' . $location);

Is there a PHP command I can use, prior to that header command, to add this one:
<meta name="robots" content="noindex,nofollow">
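One wrinkle worth noting: a redirect response has no rendered body, so there is nothing for a meta tag to live in. The closest equivalent is sending the robots directive as an HTTP response header before the Location header. This is a sketch only -- build_noindex_redirect_headers() is a hypothetical helper name, and whether a given engine honors the X-Robots-Tag header is an assumption, not something the thread confirms:

```php
<?php
// Build the headers for a redirect that also asks robots not to index it.
// X-Robots-Tag is the HTTP-header form of <meta name="robots" ...>.
function build_noindex_redirect_headers($location)
{
    return array(
        'X-Robots-Tag: noindex, nofollow',
        'Location: ' . $location,
    );
}

// Usage: emit each header before any output, then stop the script.
// foreach (build_noindex_redirect_headers($location) as $h) { header($h); }
// exit;
```

Note that header() calls must come before any output (even whitespace before the opening <?php tag), or PHP will report that headers were already sent.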