
Forum Moderators: goodroi


Will URLs blocked in robots.txt remove currently indexed pages?

     
5:53 am on Aug 31, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We messed up a little when creating our custom CMS. We have around 2,000 pages, but many of them have been indexed by Google twice via different URLs.

We have since fixed the problem and also updated our robots.txt to tell Google not to index certain URLs.

Will this addition to our robots.txt cause the currently indexed pages that we don't want indexed to be removed by Google automatically?

There are at least 500 pages that we need to get removed, and if we have to do them one by one using the URL removal tool in Webmaster Tools it will take FOREVER.

I figured adding the strings to our robots.txt would cause the currently indexed pages to be removed, but it's been 3 days and nothing yet. Google spiders our site 1,000s of times per day so I figured they would be removed by now ...
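(A minimal sketch of the kind of robots.txt rule being described here, with a hypothetical URL pattern since the real paths aren't posted in the thread:)

    User-agent: *
    # Hypothetical duplicate-URL pattern, not the site's real paths
    Disallow: /duplicate-path/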

11:26 am on Aug 31, 2009 (gmt 0)

WebmasterWorld Administrator goodroi is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



You need Google to re-crawl those pages. Google may be visiting you each day, but looking at different parts of your site. I would not be too worried about this; just wait until it cleans itself up naturally.

12:26 pm on Sep 1, 2009 (gmt 0)

5+ Year Member



You are doing it the wrong way, as those pages have already been indexed. robots.txt just keeps the search engine bots out, so it is not valid for removing the pages. The correct way is to let the search engine bots access the pages and then use some tags to tell them: "Hi, I am the webmaster of this site. I don't want you to index these pages! Please remove them even though they have already been indexed."

IMO, there are two ways to achieve it.
In some special cases (for example, pages with similar content generated by session IDs), you may be able to use a rel="canonical" tag in the head of the page you want removed, to tell the search engines which authoritative page they should index when the two pages have similar content.

The second one is to use a noindex meta tag. It's simple!

The two ways tackle the problem from different angles. Just remember that you can't use robots.txt to remove the duplicate pages, as they have already been indexed.
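A minimal sketch of both tags, with a hypothetical URL standing in for the real one; each tag goes in the <head> of the duplicate page you want dropped:

    <!-- Option 1: point the duplicate at the authoritative URL (hypothetical href) -->
    <link rel="canonical" href="http://www.example.com/widgets">

    <!-- Option 2: ask the engines to drop the duplicate from the index entirely -->
    <meta name="robots" content="noindex">

Either way, the duplicate URLs must stay crawlable (not blocked in robots.txt) so the bots can actually see the tag.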

6:14 am on Sep 2, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the replies. I've read conflicting advice, Blan - many say that blocking the URLs in robots.txt WILL also get the already-indexed pages removed.

The problem with either of your solutions above is that it would be an extremely labor-intensive process, as we would have to do it manually for over 500 URLs. Surely there's an easier way ...

12:15 am on Oct 28, 2009 (gmt 0)

WebmasterWorld Administrator 5+ Year Member Top Contributors Of The Month



Why can't you use a 301 redirect to the correct URL? That way you could perhaps use pattern matching.

In that case you should not block these URLs in robots.txt, and after some time Google will drop the incorrect URLs from its index.

12:20 am on Oct 28, 2009 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



301 redirect to the correct URLs using a pattern-matching directive such as RedirectMatch or RewriteRule (Apache mod_alias and mod_rewrite, respectively) or a script, or...
Return a 410-Gone status, again using either of those pattern-matching directives or a script.
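A rough sketch of both options in Apache config terms, with a hypothetical /old-section/ pattern standing in for the real duplicate URLs:

    # Option 1: 301 the duplicates to the correct URLs (mod_alias)
    RedirectMatch 301 ^/old-section/(.*)$ http://www.example.com/new-section/$1

    # Option 2: return a 410 Gone status for the duplicates instead (mod_rewrite)
    RewriteEngine On
    RewriteRule ^/?old-section/ - [G]

(Pattern and target are placeholders only; the real duplicate URL structure isn't posted in this thread.)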

Leave the solution in place for many months, if not years.

Jim

 
