
Forum Moderators: goodroi


Will URLs blocked in robots.txt remove currently indexed pages?

     
5:53 am on Aug 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We messed up a little when creating our custom CMS. We have around 2,000 pages, but many of them have been indexed by Google twice via different URLs.

We have since fixed the problem and also updated our robots.txt to tell Google not to index certain URLs.

Will this addition to our robots.txt cause the currently indexed pages that we don't want indexed to be removed by Google automatically?

There are at least 500 pages that we need to get removed, and if we have to do them one by one using the URL removal tool in Webmaster Tools it will take FOREVER.

I figured adding the strings to our robots.txt would cause the currently indexed pages to be removed, but it's been 3 days and nothing yet. Google spiders our site 1,000s of times per day so I figured they would be removed by now ...
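(A minimal sketch of the kind of robots.txt rule being described here, with a hypothetical URL pattern since the real paths aren't posted in the thread:)

    User-agent: *
    # Hypothetical duplicate-URL pattern, not the site's real paths
    Disallow: /duplicate-path/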

11:26 am on Aug 31, 2009 (gmt 0)

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



You need Google to re-crawl those pages. Google may be visiting you each day, but looking at different parts of your site. I would not be too worried about this; just wait until it cleans itself up naturally.

12:26 pm on Sep 1, 2009 (gmt 0)

5+ Year Member



You are doing it the wrong way, as those pages have already been indexed. robots.txt just keeps the search engine bots out, so it is not valid for removing the pages. The correct way is to let the search engine bots access the pages and then use some tags to tell them: "Hi, I am the webmaster of this site. I don't want you to index these pages! Please remove them even though they have already been indexed."

IMO, there are two ways to achieve it.
In some special cases (for example, pages with similar content generated by session IDs), you may be able to use a rel="canonical" tag in the head of the page you want removed, to tell the search engines which authoritative page they should index when the two pages have similar content.

The second one is to use a noindex meta tag. It's simple!

The two ways tackle the problem from different angles. Just remember that you can't use robots.txt to remove the duplicate pages, as they have already been indexed.
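A minimal sketch of both tags, with a hypothetical URL standing in for the real one; each tag goes in the <head> of the duplicate page you want dropped:

    <!-- Option 1: point the duplicate at the authoritative URL (hypothetical href) -->
    <link rel="canonical" href="http://www.example.com/widgets">

    <!-- Option 2: ask the engines to drop the duplicate from the index entirely -->
    <meta name="robots" content="noindex">

Either way, the duplicate URLs must stay crawlable (not blocked in robots.txt) so the bots can actually see the tag.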

6:14 am on Sep 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the replies. I've read conflicting advice, Blan - many say that blocking the URLs in robots.txt WILL also get the already-indexed pages removed.

The problem with either of your solutions above is that it would be an extremely labor-intensive process, as we would have to do it manually for over 500 URLs. Surely there's an easier way ...

12:15 am on Oct 28, 2009 (gmt 0)

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month



Why can't you use a 301 redirect to the correct URL? That way you could perhaps use pattern matching.

In that case you should not block these URLs in robots.txt, and after some time Google will drop the incorrect URLs from its index.

12:20 am on Oct 28, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



301 redirect to the correct URLs using a pattern-matching directive such as RedirectMatch or RewriteRule (Apache mod_alias and mod_rewrite, respectively) or a script, or...
Return a 410-Gone status, again using either of those pattern-matching directives or a script.
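A rough sketch of both options in Apache config terms, with a hypothetical /old-section/ pattern standing in for the real duplicate URLs:

    # Option 1: 301 the duplicates to the correct URLs (mod_alias)
    RedirectMatch 301 ^/old-section/(.*)$ http://www.example.com/new-section/$1

    # Option 2: return a 410 Gone status for the duplicates instead (mod_rewrite)
    RewriteEngine On
    RewriteRule ^/?old-section/ - [G]

(Pattern and target are placeholders only; the real duplicate URL structure isn't posted in this thread.)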

Leave the solution in place for many months, if not years.

Jim

 
