If we now put
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
on those pages, will they eventually get removed from the index?
If not, how to effect removal?
If yes, about how long would it take?
Thanks in advance for all (concrete ;-)) replies and suggestions.
[google.com...]
The quickest removal method is to use Google's automatic URL removal system [services.google.com] combined with either a robots.txt disallow or a robots meta noindex. I just had results in three days using that approach.
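For reference, a minimal sketch of the two blocking options (the /members/ path is just a placeholder for whatever URL pattern you need). In robots.txt:

User-agent: *
Disallow: /members/

Or on each page itself:

<META NAME="ROBOTS" CONTENT="NOINDEX">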
It is good to know for sure that the meta ROBOTS noindex tag will cause already-indexed URLs to get removed soon(er or later). Thanks again.
Thinking forward, here is some more background on our situation, followed by one more question:
We have a sizeable number of such pages (for a reason: users). Due to the structure of our CMS, they keep growing in large numbers daily. However, they all have very little content, so they are obviously not good for the SEs. They also get (automatically) linked to from other parts of the site, so the crawlers find them quickly.
Under this scenario, if we rely only on meta noindex, I suppose the bots will still request each new page at least once. On the other hand, if we block this class of URLs via robots.txt, the bots should not fetch them at all. Therefore, I believe the robots.txt exclusion will help us save some bandwidth and server load.
Am I thinking right here?
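For what it's worth, here is a rough way to sanity-check which URLs a well-behaved crawler would skip, using Python's standard robotparser module (the example.com URLs are hypothetical stand-ins for your CMS pages):

from urllib import robotparser

# Parse the site's live robots.txt (example.com stands in for your site)
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# One URL from the auto-generated class, one normal page (both hypothetical)
for url in ("https://www.example.com/members/12345",
            "https://www.example.com/articles/overview"):
    # can_fetch() is how a polite bot decides whether to request a URL at all
    print(url, "->", "fetch" if rp.can_fetch("Googlebot", url) else "skip")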
I would only use the robots.txt option where the issue was bandwidth and Google crawling too many pages.
I normally block the other engines in the robots.txt file, as I'm not aware that they take the same approach as G, but I allow Google to crawl wherever it wants and control it with the meta robots tag instead.
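Something like this, roughly (assuming the thin pages live under /members/; substitute your own pattern):

# Googlebot may crawl everything; the meta tag handles index control
User-agent: Googlebot
Disallow:

# All other crawlers are kept away from the thin pages
User-agent: *
Disallow: /members/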
HIH!
Lea
Assuming the page in question is indexed.
If you've already blocked Googlebot with robots.txt, then even if you add "robots=noindex" to the page itself, the page will remain in the Google index!
As tedster mentioned, you must allow Googlebot to crawl the page (i.e., not block it in robots.txt) so that Googlebot can see the "robots=noindex" in the page itself; only then can the removal actually occur.
It seems at times Google will still mistakenly index a "noindex" page for a short period of time. But you know you've done your job correctly when the source code of the page in the Google cache contains "robots=noindex". One would think you'd never be able to see this! I certainly have, many a time.
You've got the difference right between robots.txt and the robots meta tag, too.
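As a rough self-check (a sketch only; the URL is hypothetical and the regex is deliberately loose about attribute order), you can fetch the page yourself and confirm the noindex tag is really in the source Googlebot would see:

import re
from urllib import request

page_url = "https://www.example.com/members/12345"  # hypothetical page

html = request.urlopen(page_url).read().decode("utf-8", "replace")

# Loose match for <meta name="robots" ... noindex ...>
noindex = re.search(r'<meta\s+name=["\']robots["\'][^>]*noindex',
                    html, re.IGNORECASE)

print("noindex tag present:", bool(noindex))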
Thanks for confirming this, tedster.
Lea, thanks for all the info, but I don't quite understand this:
Bear in mind that a robots.txt entry will not stop Google indexing a 'blank' of the page
Somebody care to explain this please?
Assuming the page in question is indexed.
If you've already blocked Googlebot with robots.txt, then even if you add "robots=noindex" to the page itself, the page will remain in the Google index!
Yes bumpski, you've brought out the Catch-22-like situation quite effectively.
If we put a 'meta noindex' tag on an already-indexed page AND simultaneously block the bot with robots.txt, the indexed page will probably never get de-indexed.
If one wants to use robots.txt in a situation like this, I guess a good solution would be to place the meta noindex on the page(s) first and wait for them to be removed before blocking the bot via robots.txt.
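Concretely, the sequence would look something like this (again, /members/ is only a placeholder):

Step 1 - leave the URLs crawlable in robots.txt and put this on each page:

<META NAME="ROBOTS" CONTENT="NOINDEX">

Step 2 - once the URLs have dropped out of the index, block future crawling:

User-agent: *
Disallow: /members/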
Anyone care to confirm this please?
BTW, all this is very informative and interesting. Thanks for participating, everyone.