If we now put
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
on those pages, will they eventually get removed from the index?
If not, how to effect removal?
If yes, about how long would it take?
Thanks in advance for all (concrete ;-)) replies and suggestions.
[google.com...]
The quickest removal method is to use Google's automatic URL removal system [services.google.com] combined with either a robots.txt disallow or a robots meta noindex. I just had results in three days using that approach.
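For reference, a minimal sketch of the two blocking options (the /members/ path is just a placeholder for whatever URL pattern you need). In robots.txt:

User-agent: *
Disallow: /members/

Or on each page itself:

<META NAME="ROBOTS" CONTENT="NOINDEX">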
It is good to know for sure that the meta ROBOTS noindex tag will cause already-indexed URLs to get removed soon(er or later). Thanks again.
Thinking forward, here is some more background on our situation, followed by one more question:
We have a sizeable number of such pages (for a reason: users). Due to the structure of our CMS, they keep growing in large numbers daily. However, they all have very little content, so they are obviously not good for the SEs. They also get (automatically) linked to from other parts of the site, so the crawlers find them quickly.
Under this scenario, if we rely only on meta noindex, I suppose the bots will still request each new page at least once. On the other hand, if we block this class of URLs via robots.txt, the bots should not fetch them at all. Therefore, I believe the robots.txt exclusion will help us save some bandwidth and server load.
Am I thinking right here?
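For what it's worth, here is a rough way to sanity-check which URLs a well-behaved crawler would skip, using Python's standard robotparser module (the example.com URLs are hypothetical stand-ins for your CMS pages):

from urllib import robotparser

# Parse the site's live robots.txt (example.com stands in for your site)
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# One URL from the auto-generated class, one normal page (both hypothetical)
for url in ("https://www.example.com/members/12345",
            "https://www.example.com/articles/overview"):
    # can_fetch() is how a polite bot decides whether to request a URL at all
    print(url, "->", "fetch" if rp.can_fetch("Googlebot", url) else "skip")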
I would only use the robots.txt option where the issue was bandwidth and Google crawling too many pages.
I normally block the other engines in the robots.txt file, as I'm not aware that they take the same approach as G, but I allow Google to crawl wherever it wants and control it with the meta robots tag instead.
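Something like this, roughly (assuming the thin pages live under /members/; substitute your own pattern):

# Googlebot may crawl everything; the meta tag handles index control
User-agent: Googlebot
Disallow:

# All other crawlers are kept away from the thin pages
User-agent: *
Disallow: /members/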
HIH!
Lea
Assuming the page in question is indexed.
If you've already blocked Googlebot with robots.txt, then even if you add "robots=noindex" to the page itself, the page will remain in the Google index!
As tedster mentioned, you must allow Googlebot to crawl the page (i.e., not block it in robots.txt) so that Googlebot can see the "robots=noindex" in the page itself; only then can the removal actually occur.
It seems at times Google will still mistakenly index a "noindex" page for a short period of time. But you know you've done your job correctly when the source code of the page in the Google cache contains "robots=noindex". One would think you'd never be able to see this! I certainly have, many a time.
You've got the difference right between robots.txt and the robots meta tag, too.
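As a rough self-check (a sketch only; the URL is hypothetical and the regex is deliberately loose about attribute order), you can fetch the page yourself and confirm the noindex tag is really in the source Googlebot would see:

import re
from urllib import request

page_url = "https://www.example.com/members/12345"  # hypothetical page

html = request.urlopen(page_url).read().decode("utf-8", "replace")

# Loose match for <meta name="robots" ... noindex ...>
noindex = re.search(r'<meta\s+name=["\']robots["\'][^>]*noindex',
                    html, re.IGNORECASE)

print("noindex tag present:", bool(noindex))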
Thanks for confirming this, tedster.
Lea, thanks for all the info, but I don't quite understand this:
Bear in mind that a robots.txt entry will not stop Google indexing a 'blank' of the page
Somebody care to explain this please?
Assuming the page in question is indexed.
If you've already blocked Googlebot with robots.txt, then even if you add "robots=noindex" to the page itself, the page will remain in the Google index!
Yes bumpski, you've brought out the Catch-22-like situation quite effectively.
If we put a 'meta noindex' tag on an already-indexed page AND simultaneously block the bot with robots.txt, the indexed page will probably never get de-indexed.
If one wants to use robots.txt in a situation like this, I guess a good solution would be to place the meta noindex on the page(s) first and wait for them to be removed before blocking the bot via robots.txt.
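Concretely, the sequence would look something like this (again, /members/ is only a placeholder):

Step 1 - leave the URLs crawlable in robots.txt and put this on each page:

<META NAME="ROBOTS" CONTENT="NOINDEX">

Step 2 - once the URLs have dropped out of the index, block future crawling:

User-agent: *
Disallow: /members/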
Anyone care to confirm this please?
BTW, all this is very informative and interesting. Thanks for participating, everyone.