Msg#: 3981036 posted 5:53 am on Aug 31, 2009 (gmt 0)
We messed up a little when creating our custom CMS. We have around 2,000 pages, but many of them have been indexed by Google twice via different URLs.
We have since fixed the problem and also updated our robots.txt to tell Google not to index certain URLs.
Will this addition to our robots.txt cause the currently indexed pages that we don't want indexed to be removed by Google automatically?
There are at least 500 pages that we need to get removed, and if we have to do them one by one using the URL removal tool in Webmaster Tools it will take FOREVER.
I figured adding the rules to our robots.txt would cause the currently indexed pages to be removed, but it's been 3 days and nothing yet. Google spiders our site 1,000s of times per day, so I figured they would be removed by now ...
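For reference, an addition like the one described would typically look like the following (hypothetical sketch; "/duplicate-path/" stands in for whatever prefix the unwanted URLs actually share on your site). Note that, as the replies below explain, this only blocks crawling; it does not remove already-indexed pages:

```
# robots.txt -- hypothetical example
User-agent: *
Disallow: /duplicate-path/
```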
Msg#: 3981036 posted 11:26 am on Aug 31, 2009 (gmt 0)
You need Google to re-access those pages. Google may be visiting you each day but looking at different parts of your site. I would not be too worried about this; just wait till it cleans itself up naturally.
You are doing it the wrong way, as those pages have already been indexed. robots.txt only prevents search engine bots from crawling, so it is not valid for removing a page. The correct way is for the pages to be accessed by search engine bots, and then to use tags to tell them: "Hi, I am the webmaster of this site. I don't want you to index these pages! Please remove them even though they have already been indexed."
IMO, there are two ways to achieve it. In some special cases (for example, pages with similar content generated by a session ID), you may be able to use the rel="canonical" tag in the head of the page you want removed, to tell search engines which authoritative page they should index, when the two pages have similar content.
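A minimal sketch of the canonical tag described above (the URL shown is a placeholder; substitute the authoritative URL of your own page):

```html
<!-- Placed in the <head> of each duplicate page, pointing at the
     preferred version of that page -->
<link rel="canonical" href="http://www.example.com/preferred-page" />
```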
The second one is to use a noindex meta tag. It's simple!
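The noindex meta tag is a single line in the page's head:

```html
<!-- In the <head> of every page you want dropped from the index -->
<meta name="robots" content="noindex">
```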
The two ways resolve the trouble in different aspects. Just remember that you can't use robots.txt to remove the duplicate pages once they have been indexed; in fact, blocking them in robots.txt prevents the bots from re-crawling the pages and seeing your noindex or canonical tags at all.
Msg#: 3981036 posted 12:20 am on Oct 28, 2009 (gmt 0)
301 redirect to the correct URLs using a pattern-matching directive such as RedirectMatch or RewriteRule (Apache mod_alias and mod_rewrite, respectively) or a script; or return a 410 Gone status, again using either of those pattern-matching directives or a script.
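The directives mentioned above could be sketched like this in an Apache configuration or .htaccess file. The "/old-path/" and "/page/" patterns are placeholders; substitute a pattern that matches your own duplicate URLs:

```apache
# mod_alias: 301 everything under /old-path/ to the canonical URLs
RedirectMatch permanent ^/old-path/(.*)$ http://www.example.com/page/$1

# -- or the mod_rewrite equivalent:
RewriteEngine On
RewriteRule ^old-path/(.*)$ http://www.example.com/page/$1 [R=301,L]

# -- or return 410 Gone for duplicates with no canonical equivalent:
RewriteRule ^old-path/ - [G]
```

Either approach tells the search engines the old URLs are permanently gone or moved, which is why the solution needs to stay in place long enough for every old URL to be re-crawled.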
Leave the solution in place for many months, if not years.