Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Will URLs blocked in robots.txt remove currently indexed pages?
limitup




msg:3981038
 5:53 am on Aug 31, 2009 (gmt 0)

We messed up a little when creating our custom CMS. We have around 2,000 pages, but many of them have been indexed by Google twice via different URLs.

We have since fixed the problem and also updated our robots.txt to tell Google not to index certain URLs.

Will this addition to our robots.txt cause the currently indexed pages that we don't want indexed to be removed by Google automatically?

There are at least 500 pages that we need to get removed, and if we have to do them one by one using the URL removal tool in Webmaster Tools it will take FOREVER.

I figured adding the strings to our robots.txt would cause the currently indexed pages to be removed, but it's been 3 days and nothing yet. Google spiders our site 1,000s of times per day so I figured they would be removed by now ...

 

goodroi




msg:3981188
 11:26 am on Aug 31, 2009 (gmt 0)

You need Google to re-access those pages. Google may be visiting you each day but looking at different parts of your site. I would not be too worried about this and would just wait until it cleans itself up naturally.

Blan




msg:3981899
 12:26 pm on Sep 1, 2009 (gmt 0)

You are going about it the wrong way, because those pages have already been indexed. robots.txt only keeps search engine bots from crawling the URLs, so it is not a valid way to remove a page. The correct way is to let the bots access the pages and then use tags that tell them: "Hi, I am the webmaster of this site. I don't want these pages indexed! Please remove them even though they have already been indexed."

IMO, there are two ways to achieve this.
In some special cases (for example, pages with similar content generated by a session ID), you may be able to use a rel="canonical" tag in the head of the page you want removed to tell search engines which page is the authoritative one they should index.

The second is to use a noindex meta tag. It's simple!
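Neither tag has to be added by hand to each page if the CMS builds the head section from a template. Here is a minimal Python sketch of that idea. The /print/ prefix, the function name and the example.com URLs are hypothetical, chosen only for illustration; the real duplicate URL pattern depends on the CMS.

```python
from html import escape
from typing import Optional

# Hypothetical: path prefixes under which the duplicate copies are served.
DUPLICATE_PREFIXES = ("/print/",)

def head_tag_for(path: str, canonical_url: Optional[str] = None) -> str:
    """Return extra <head> markup for a page, or "" if none is needed."""
    if not path.startswith(DUPLICATE_PREFIXES):
        return ""  # a normal page: leave it indexable
    if canonical_url:
        # Point search engines at the authoritative copy of the page.
        href = escape(canonical_url, quote=True)
        return f'<link rel="canonical" href="{href}">'
    # No authoritative copy known: ask for the page to be dropped.
    return '<meta name="robots" content="noindex">'

print(head_tag_for("/print/blue-widgets", "http://www.example.com/articles/blue-widgets/"))
print(head_tag_for("/print/blue-widgets"))
```

The template would call something like this once per request, so all of the duplicate URLs are covered by a single rule.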

The two ways address the problem from different angles. Just remember that you can't use robots.txt to remove duplicate pages once they have been indexed.
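As a rough illustration of why the robots.txt block gets in the way, here is a small Python sketch using the standard library's urllib.robotparser. The Disallow rule, the /print/ path and the example.com URLs are made up for the example and are not taken from the original poster's site. A crawler that obeys the rule never fetches the blocked URL, so it never sees a noindex or canonical tag on that page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
"""

def can_crawl(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if a compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

for url in (
    "http://www.example.com/print/blue-widgets",      # duplicate copy
    "http://www.example.com/articles/blue-widgets/",  # correct copy
):
    status = "crawlable" if can_crawl(ROBOTS_TXT, url) else "blocked"
    print(url, "->", status)
```

The blocked URL is the one whose tags need to be read, which is why the Disallow rule has to come out before a noindex or canonical tag can take effect.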

limitup




msg:3982481
 6:14 am on Sep 2, 2009 (gmt 0)

Thanks for the replies. I've read conflicting advice Blan - many say that blocking the URLs in robots.txt WILL also get the already indexed pages removed.

The problem with either of your solutions above is that it would be an extremely labor-intensive process, as we would have to do that manually for more than 500 URLs. Surely there's an easier way ...

aakk9999




msg:4014609
 12:15 am on Oct 28, 2009 (gmt 0)

Why can't you use a 301 redirect to the correct URL? That way you could perhaps use pattern matching.

In that case you should not block these URLs in robots.txt, so that Google can crawl them and see the redirects. After some time Google will drop the incorrect URLs from its index.

jdMorgan




msg:4014614
 12:20 am on Oct 28, 2009 (gmt 0)

301 redirect to the correct URLs using a pattern-matching directive such as RedirectMatch or RewriteRule (Apache mod_alias and mod_rewrite, respectively) or a script, or...
Return a 410-Gone status, again using either of those pattern-matching directives or a script.

Leave the solution in place for many months, if not years.
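For the "or a script" route, the pattern-matching logic could look roughly like the Python sketch below. The /print/, /articles/ and /old-archive/ URL schemes and the resolve function are hypothetical, invented for the example; the real patterns depend on how the CMS generated the duplicate URLs. The point is only that one regular expression can cover hundreds of URLs.

```python
import re
from typing import Optional, Tuple

# Hypothetical URL schemes, for illustration only.
DUPLICATE = re.compile(r"^/print/(?P<slug>[a-z0-9-]+)/?$")  # unwanted copies
RETIRED = re.compile(r"^/old-archive/")                     # pages with no replacement

def resolve(path: str) -> Tuple[int, Optional[str]]:
    """Map a request path to (HTTP status, redirect target or None)."""
    match = DUPLICATE.match(path)
    if match:
        # Permanent redirect to the correct URL for the same content.
        return 301, f"/articles/{match.group('slug')}/"
    if RETIRED.match(path):
        # Tell crawlers the page is gone for good.
        return 410, None
    return 200, None

print(resolve("/print/blue-widgets"))
print(resolve("/old-archive/2007/index"))
print(resolve("/articles/blue-widgets/"))
```

Whatever serves the site would send the returned status code, plus a Location header for the 301 case.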

Jim


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved