block with robots.txt or noindex tag

smokeybarnable

1:02 am on Sep 21, 2006 (gmt 0)

I blocked a certain page in my robots.txt and also put a noindex tag on the same page. Now I notice Google trying to crawl that page, and my robots.txt is blocking it like it should. But I am wondering if I should let Google spider it so it sees the noindex tag and knows not to index it.

jdMorgan

1:44 am on Sep 21, 2006 (gmt 0)

The major search engines have changed their behaviour over the last year, so that the robots.txt Disallow and the <meta name="robots" content="noindex"> results are essentially identical. In the past, you could use the meta tag to tell them "don't mention this URL in any way," but that no longer works, apparently. It fell victim to the "Deep Web" push, where search engines now dig into obscure corners of the Web. Pages having either a Disallow in robots.txt or a meta "noindex" tag can now appear as URL-only listings in the big three search engines.

That's too bad, as it has forced many to password-protect or cloak pages that the Webmaster simply does not want used as entry points to the site.

So, given that the robots must fetch a page to read the on-page meta tag, using a robots.txt Disallow will at least keep the wasted bandwidth down, as was its intent.

Jim

[edited by: jdMorgan at 1:45 am (utc) on Sep. 21, 2006]

lammert

12:12 pm on Sep 21, 2006 (gmt 0)

jdMorgan, I don't agree with your conclusion that "noindex" can leave a URL only listing in the index, at least not for Google.

Google has a specific "nosnippet" option for the robots meta tag for people who want a URL-only listing for a page. It would be strange if they treated noindex and nosnippet as equal.

What often happens is that people overdo their robots blocking. If you put a "noindex" in the meta tag AND disallow crawling by Googlebot as the OP proposed, the bot doesn't read the content of that page and doesn't see the "noindex" meta tag. A Disallow in robots.txt does not, however, prevent Google from adding a URL-only listing, or even a complete listing where the DMOZ description is used as the snippet.

So to remove your content from the Google index, you have to allow Googlebot to read the file. If you block it in robots.txt, the content can appear in the index. Quite a contradiction, but it is how the rules work.
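
The point above can be sketched in Python using only the standard library. This is a rough model of a well-behaved crawler, not how Googlebot is actually implemented; the names `RobotsMetaParser` and `crawler_sees_noindex` and the example.com URL are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def crawler_sees_noindex(robots_txt, url, html, user_agent="Googlebot"):
    """Hypothetical helper: True only if a compliant bot may fetch the page
    AND the fetched page carries a noindex directive."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # Disallowed: the page is never fetched, so the meta tag is never read.
        return False
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Assumed example inputs, not taken from any real site.
PAGE = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
URL = "http://example.com/private.html"
BLOCKED = "User-agent: *\nDisallow: /private.html\n"
OPEN = "User-agent: *\nDisallow:\n"

print(crawler_sees_noindex(BLOCKED, URL, PAGE))
print(crawler_sees_noindex(OPEN, URL, PAGE))
```

With the Disallow rule in place the helper reports that the noindex tag is never seen; with an empty Disallow the page is fetched and the tag is read.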

jdMorgan

12:40 pm on Sep 21, 2006 (gmt 0)

Lammert,

> If you put a "noindex" in the meta tag AND disallow crawling by Googlebot as the OP proposed, the bot doesn't read the content of that page and doesn't see the "noindex" meta tag.

Well yes, and I posted that same idea using different words above. But I have lots of URL-only listings in G right now, pointing to pages which are not Disallowed in robots.txt, and carry only the "noindex,nofollow" on-page meta tag, with no mention of nosnippet, noarchive, or noodp.

I have posted on WebmasterWorld several times in the past, explaining the hierarchy of robots.txt over on-page robots tags, and how to do it properly for various circumstances. But ever since G started talking about "The Deep Web," it has stopped working at G, and now at Y and M as well.

And I'm seeing this behaviour across multiple, diverse sites -- mine and others.

Jim

jdMorgan

1:37 pm on Sep 21, 2006 (gmt 0)

OK, after a quick review of the aforementioned sites, I need to print a partial retraction: Google and Yahoo have now apparently complied with my wishes, and have removed those pages not Disallowed by robots.txt, but with the <meta name="robots" content="noindex,nofollow"> on-page from their indices. However, MSN/Windows Live is still listing all of them.

Google and Yahoo did list those pages in the SERPs for a search on the domain name for several months, but no longer do so. Hopefully, this was a temporary glitch.

Jim

lammert

2:40 pm on Sep 21, 2006 (gmt 0)

> Google and Yahoo did list those pages in the SERPs for a search on the domain name for several months, but no longer do so. Hopefully, this was a temporary glitch.

Yes, I remember that glitch, it was also discussed in a thread [webmasterworld.com] here. During a short time many of my "noindex" pages were visible in Google's SERPs.

g1smd

7:30 pm on Sep 21, 2006 (gmt 0)

In general, the noindex meta tag results in the page being spidered, and nothing at all appearing in the SERPs.

Using the robots.txt exclusion results in the page not being spidered, but still appearing as a URL-only listing in the SERPs - especially if someone links to it.

Yahoo goes further. They construct a title for that previously URL-only listing by using the anchor text of one of the links that points to that page, but only if that anchor text is not "click here" or some other generally poor quality text.
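
As a rough summary of the general rules described in this thread, the outcomes can be written as a small decision function. This is only a sketch of the behaviour reported here, not documented search engine behaviour; the function name `expected_listing` and its return labels are invented for illustration.

```python
def expected_listing(disallowed_in_robots_txt, has_noindex_meta):
    """Hypothetical summary of the outcomes described in this thread.

    Returns one of "url-only", "none", or "full".
    """
    if disallowed_in_robots_txt:
        # The page is never spidered, so any noindex tag goes unread.
        # The URL can still be listed, especially if someone links to it.
        return "url-only"
    if has_noindex_meta:
        # The page is spidered, the tag is read, nothing appears in the SERPs.
        return "none"
    # Crawlable and indexable: a normal listing.
    return "full"


# The OP's setup (both Disallow and noindex) behaves like Disallow alone.
print(expected_listing(True, True))
print(expected_listing(False, True))
```

The first branch is the reason the OP's combined approach backfires: once the page is disallowed, the noindex tag has no effect.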