
Sitemaps, Meta Data, and robots.txt Forum

    
Disallow: /*?virtuemart=* not respected by google?
SyntaxTerror
11:01 am on May 22, 2009 (gmt 0)

I am struggling with a duplicate content issue caused by a site parameter. I noticed this at a late stage; by that point we already had thousands of these pages indexed by Google.

On 11 May, using webmaster tools I added the following rule to robots.txt: Disallow: /*?virtuemart=*

The rule checked out as valid in the Webmaster Tools robots.txt checker tool. Since then I have seen an increase in blocked URLs listed in Webmaster Tools (though not even 10% of the total), as expected. However, Google keeps adding pages containing the 'banned' parameter to its index. Here's an overview:

11-May-2009
results: 1 - 10 of 13,000

22-May-2009
results: 1 - 10 of 13,700

What I fail to understand is this: why did Google add another 700 pages to its index _after_ I added the Disallow rule?

How do I get rid of all these duplicate pages? Our site only has 629 pages. I'm at a loss as to why this is happening.

PS: As of 18-May-2009 I have added an .htaccess rewrite which strips the parameter from the URL (even though the Disallow rule should keep Google from indexing, I'm just trying all I can here..)
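For readers wondering what such a rewrite looks like: the following is only a sketch of the kind of rule described above, assuming Apache mod_rewrite. The parameter name "virtuemart" is taken from the thread; everything else is illustrative, and note that this form drops the entire query string, not just the one parameter.

```apache
RewriteEngine On
# Match any request whose query string contains a virtuemart parameter
RewriteCond %{QUERY_STRING} (^|&)virtuemart= [NC]
# 301-redirect to the same path; the trailing '?' drops the query string
RewriteRule ^(.*)$ /$1? [R=301,L]
```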

[edited by: SyntaxTerror at 11:07 am (utc) on May 22, 2009]

[edited by: goodroi at 12:15 pm (utc) on May 22, 2009]
[edit reason] Please no urls [/edit]

 

goodroi
12:21 pm on May 22, 2009 (gmt 0)

Your robots.txt is correct, and Google will need time to remove those URLs. I do not think Google disregarded your robots.txt.

When Google states the total number of results, it is estimating. This number is known to be inaccurate: it will be close, but it is sometimes very hard for Google to know how many total URLs it has. The number was reported as 13,000 and later as 13,700; I think this is more a problem of Google not reporting a good estimate than of Google ignoring your rule.

If you review your log files you can determine if Google is following your robots.txt.
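The log check described above can be sketched in a few lines. This assumes Apache/Nginx combined log format; the sample lines and the helper name are invented for illustration, with the "virtuemart" parameter taken from the thread.

```python
import re

# Invented sample log lines in combined format (two Googlebot hits, one browser hit)
LOG_LINES = [
    '66.249.66.1 - - [22/May/2009:09:00:00 +0000] "GET /shop/page.html HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [22/May/2009:09:01:00 +0000] "GET /shop/page.html?virtuemart=abc HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [22/May/2009:09:02:00 +0000] "GET /shop/page.html?virtuemart=abc HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

def blocked_googlebot_hits(lines):
    """Return URLs Googlebot fetched despite the Disallow: /*?virtuemart= rule."""
    hits = []
    for line in lines:
        # Capture the requested URL and the user-agent (last quoted field)
        m = re.search(r'"GET (\S+) HTTP/[^"]*" .* "([^"]*)"$', line)
        if not m:
            continue
        url, agent = m.groups()
        if "Googlebot" in agent and "?virtuemart=" in url:
            hits.append(url)
    return hits

print(blocked_googlebot_hits(LOG_LINES))  # → ['/shop/page.html?virtuemart=abc']
```

If this returns an empty list for your real logs, Googlebot is honouring the rule and the rising SERP count is an estimation artifact, as described above.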

SyntaxTerror
1:20 pm on May 22, 2009 (gmt 0)

Hello goodroi. Firstly, let me apologize for allowing URLs to sneak in; I copy-pasted this from my research document ;)

I have gone through the log files of the last few days and you were correct: Google is not visiting any of the pages I blocked.

I had another measuring point in between the two above, which showed 13,300 pages, so I saw a rising trend, so to speak, and got alarmed.

Am I correct in assuming the pages already indexed will eventually drop out of the index naturally?

Thank you :)

g1smd
7:44 pm on May 22, 2009 (gmt 0)

Your robots.txt rule should have the final * deleted. It is not needed. The URL patterns in the rules are "matched from the left", so a wildcard is only needed "on the left" or "in the middle", never "on the right".
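The matching behaviour described above can be demonstrated with a small sketch, assuming Googlebot's documented wildcard semantics ('*' matches any run of characters, '?' is literal, matching is anchored on the left only). The helper name is illustrative.

```python
import re

def blocks(disallow_path, url_path):
    """Left-anchored match in the style of Googlebot's robots.txt extension:
    '*' matches any run of characters, everything else (including '?') is
    literal, and the pattern only needs to match a prefix of the URL."""
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in disallow_path)
    return re.match(regex, url_path) is not None

# The trailing '*' is redundant: since matching is anchored on the left only,
# both forms of the rule block exactly the same URLs.
for url in ["/shop.html?virtuemart=abc", "/shop.html", "/?virtuemart=1"]:
    assert blocks("/*?virtuemart=*", url) == blocks("/*?virtuemart=", url)

print(blocks("/*?virtuemart=", "/shop.html?virtuemart=abc"))  # → True
```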

Those entries will remain in Google SERPs as URL-only entries forever, because your disallow forbids them from being spidered.

If you want them to be gone you would be much better off either:
- using meta robots noindex tag on each page, or,
- having no disallow, and simply redirecting such requests (DO make sure that it really is a 301 redirect).
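The first option above is a single tag in each affected page's head. A minimal example, with the caveat that the page must not also be disallowed in robots.txt, or Googlebot never fetches it and never sees the tag:

```html
<head>
  <!-- Tell crawlers to drop this page from the index -->
  <meta name="robots" content="noindex">
</head>
```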

SyntaxTerror
11:41 am on May 23, 2009 (gmt 0)

OK, I have that: a redirect set up using .htaccess which returns a 301 status code. But my robots.txt file is now preventing Google from finding that 301 redirect.

If I remove the Disallow, Google will suddenly encounter a flood of 301 redirects. Do I risk getting a penalty for this? The redirect points to the same URL, minus the parameter.
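The redirect target described above (same URL, minus the parameter) can be computed like this. The function name is illustrative and the parameter name comes from the thread; unlike the blanket query-string drop an .htaccess rule often does, this keeps any other parameters.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_param(url, param="virtuemart"):
    """Return the same URL with the offending query parameter removed;
    other query parameters, if any, are preserved."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k != param]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_param("/shop.html?virtuemart=abc123"))      # → /shop.html
print(strip_param("/shop.html?page=2&virtuemart=abc"))  # → /shop.html?page=2
```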

g1smd
6:34 pm on May 23, 2009 (gmt 0)

No penalty. You're cleaning house.

While those listings still remain in the SERPs they will bring visitors, your redirect will deliver them to the right content, and their browser will display the correct URL for that content.

SyntaxTerror
11:53 am on May 24, 2009 (gmt 0)

Thanks g1smd. I've gone ahead and allowed Google to spider the pages, so the redirect can be picked up! I'm keeping an eye on what happens; anything noteworthy I will post here.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved