Robots META Tag

All and None and Google and more

         

Oaf357

9:49 pm on May 10, 2003 (gmt 0)

10+ Year Member



Does Google listen to ALL and NONE in the robots META tag?

Also, would Google (or any SE for that matter) listen to noindex, follow?

jimbeetle

9:58 pm on May 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



AFAIK Google obeys ALL and NONE and, like other SEs, noindex, follow.

But...Google's adherence to the robots meta tag and robots.txt is a bit different than most SEs. If it knows that a page exists it will probably list it in some fashion, usually with just the URL and no snippet or description.

The only semi-successful way I've found to keep Google from listing a page (besides a ban) is to both disallow it in robots.txt and use noindex, nofollow in the robots meta tag. But sometimes even that doesn't work.

Oh, and instead of using ALL, just don't include the robots meta tag at all.
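
For what it's worth, here is a minimal Python sketch of how those content values are conventionally read: ALL is shorthand for index, follow, and NONE is shorthand for noindex, nofollow. The function name parse_robots_meta is made up for illustration, and this only reflects the published robots META convention, not anything confirmed about how Google's own parser works.

```python
def parse_robots_meta(content):
    """Reduce a robots META content string to (index, follow) booleans.

    Hypothetical helper for illustration only. It follows the usual
    convention: ALL = index, follow and NONE = noindex, nofollow.
    """
    # With no directives at all, the default is to index and follow,
    # which is why leaving the tag out behaves the same as ALL.
    index, follow = True, True
    for token in content.lower().split(","):
        token = token.strip()
        if token == "all":
            index, follow = True, True
        elif token == "none":
            index, follow = False, False
        elif token == "index":
            index = True
        elif token == "noindex":
            index = False
        elif token == "follow":
            follow = True
        elif token == "nofollow":
            follow = False
        # Unknown tokens are ignored in this sketch.
    return index, follow
```

So parse_robots_meta("noindex, follow") gives an index flag of False and a follow flag of True, while an empty string gives the same result as ALL.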

Jim

deejay

10:04 pm on May 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I agree with jimbeetle, except on one point:

> But...Google's adherence to robots meta tag and robots.txt is a bit different than most SEs. If it knows that a page exists it will probably list it in some fashion, usually with just the url and no snippet or description.

I have seen Google do this often, BUT when I tracked back through the logs... Google had not crawled the page in question, but was listing the URL because it had crawled a page that linked to it. The pages were not crawled because the sites were new and had very few incoming links to inspire G to crawl deep.

Invariably the next month those pages were crawled and either listed or not as their robots tags required.

jdMorgan

2:22 am on May 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google and Ask Jeeves/Teoma behave differently than most other search engines.

If they find any link to your page, they will list it. If it is Disallowed in robots.txt, they will not fetch it. However, they will still list it in their SERPs using the URL for the title. Because it has not been crawled, it will not show up in the SERPs for any search terms, except for those used as link text in the links they found. It will also show up for any search terms matching the keywords-in-URL for that page, and for domain searches.

JimB, the way to keep pages completely out of these two SEs is to Allow (by not Disallowing) them in robots.txt, and use the <meta name="robots" content="noindex"> tag or any valid variant. You have to allow the page to be fetched by the spider in order for it to read the robots meta tag.

Google and AJ/T apparently interpret the word "index" to mean "fetch" when applied to robots.txt, but interpret the same word to mean "include in index" when applied to the robots meta tag.

I can't figure out why, but 'tis not for me to ask. That's how it works, and there is a work-around, so it's off my priority list. It took me four months to get my e-mail contact forms de-listed, but that's how I did it. To see if it's working or not, search for your own domain name and/or any applicable keywords-in-URL.
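
The logic above can be sketched roughly as follows in Python. This is only an illustration of the reasoning, not a description of how any spider is actually built: the names RobotsMetaFinder and noindex_will_be_seen are invented here, and the fetch_allowed flag stands in for whatever your robots.txt says about the page.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.append(token)


def noindex_will_be_seen(fetch_allowed, html):
    """True only if the spider may fetch the page AND the page asks not to be indexed."""
    if not fetch_allowed:
        # A Disallowed page is never fetched, so its meta tag is never read.
        return False
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives or "none" in finder.directives
```

With a page that carries <meta name="robots" content="noindex">, the function returns True when fetching is allowed and False when the page is Disallowed, which is the whole point of the work-around.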

HTH,
Jim

rfgdxm1

2:43 am on May 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The basic purpose of these robots exclusions is to prevent search engines from wasting server resources and bandwidth where they are not wanted. Thus, if robots.txt excludes a bot, it is not allowed to spider. However, an SE can list just a URL without spidering it. There is nothing you can do about this, just as there is no way of stopping other sites from linking to your pages. Some SEs may be more polite and not even list the URL. However, to do this the SE would first have to actually go to the site. If an SE finds a link to another site, it may list it even before it attempts to spider it.
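
To make the "not allowed to spider" part concrete, here is a small sketch using Python's standard urllib.robotparser module. The robots.txt contents, the example.com URLs, and the may_fetch helper are all assumptions made up for the example. Note that the check only answers "may this bot fetch the URL?" and says nothing about whether an SE will list it.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt contents (made up for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /contact/
"""


def may_fetch(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = may_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/contact/form.html")
allowed = may_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/index.html")
```

Here blocked comes out False and allowed comes out True: the contact form may not be spidered, but nothing in robots.txt stops its bare URL from being listed if another page links to it.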

jdMorgan

3:20 am on May 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



rfgdxm1,

> There is nothing you can do about this.

I posted the solution above:

>> JimB, the way to keep pages completely out of these two SE's is to Allow (by not Disallowing) them in robots.txt, and use the <meta name="robots" content="noindex"> tag or any valid variant.

Jim

g1smd

10:15 am on May 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When I recently put up a new site, I needed some way for Google to find it quickly. I had a page on an old high PR site that had redundant information. I changed the content there, to simply have one paragraph introducing the content on the new site, and a link to that new site. It was there simply to get freshbot started. This worked, but with an unfortunate consequence. The "doorway" page got a better result than the main site, and for some terms was listed instead of the main site. A swift application of <meta name="robots" content="noindex,follow"> eventually sorted it out, so that the site is listed rather than the temporary doorway, but it took many weeks for this to happen.

jimbeetle

3:05 pm on May 11, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>>You have to allow the page to be fetched by the spider in order for it to read the <robots> tag.

That sure as heck makes sense, though I think that's a variation I had in place until a few months ago, and it did not work. But it only takes a second to take out the disallow and see what happens on the next deep crawl.

Thanks JD,

Jim