|Disallow: /*?virtuemart=* not respected by google?|
| 11:01 am on May 22, 2009 (gmt 0)|
I am struggling with a duplicate content issue caused by a site parameter. I noticed this at a late stage; by that point we already had thousands of these pages indexed by Google.
On 11 May, using webmaster tools I added the following rule to robots.txt: Disallow: /*?virtuemart=*
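In context, the relevant robots.txt section looks like this (the User-agent line is assumed; the Disallow line is exactly the rule I added):

```
User-agent: *
Disallow: /*?virtuemart=*
```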
This checked out as valid in the robots.txt checker tool in Webmaster Tools, and since then I have seen an increase in blocked URLs listed in Webmaster Tools (though not even 10% of the total), as expected. However, Google keeps adding pages containing the 'banned' parameter to its index. Here's an overview:
Results 1 - 10 of about 13,000
Results 1 - 10 of about 13,700
What I fail to understand is: why did Google add another 700 pages to its index _after_ I added the Disallow rule?
How do I get rid of all these duplicate pages? Our site only has 629 pages... I'm at a loss as to why this is happening.
PS: I have added an .htaccess rewrite as of 18 May 2009, which strips the URL of its parameter (even though the Disallow rule should keep Google from indexing, I'm just trying all I can here...)
[edited by: SyntaxTerror at 11:07 am (utc) on May 22, 2009]
[edited by: goodroi at 12:15 pm (utc) on May 22, 2009]
[edit reason] Please no urls [/edit]
| 12:21 pm on May 22, 2009 (gmt 0)|
Your robots.txt is correct and Google will need time to remove those URLs. I do not think Google ignored your robots.txt.
When Google states the total number of results, it is estimating. This number is known to be inaccurate; it will be close, but it is sometimes very hard for Google to know how many total URLs it has. The number was reported as 13,000 and later as 13,700. I think this is more a problem of Google not reporting a good estimate.
If you review your log files, you can determine whether Google is following your robots.txt.
| 1:20 pm on May 22, 2009 (gmt 0)|
Hello goodroi, firstly let me apologize for allowing URLs to sneak in; I copy-pasted this from my research-on-the-matter document ;)
I have opened the log files from the last few days and you were correct: Google is not visiting any of the pages I blocked.
I had another measurement point in between the other two, which came in at 13,300 pages, so I saw a rising trend, so to speak, and got alarmed.
Am I correct in assuming the pages already indexed will eventually drop out of the index naturally?
Thank you :)
| 7:44 pm on May 22, 2009 (gmt 0)|
Your robots.txt rule should have the final * deleted. It is not needed. The URLs in the rules are "matched from the left", so a wildcard is only needed "on the left" or "in the middle", never "on the right".
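In other words, because matching runs from the left, the trimmed rule would be simply:

```
Disallow: /*?virtuemart=
```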
Those entries will remain in Google SERPs as URL-only entries forever, because your disallow forbids them being spidered.
If you want them to be gone you would be much better off either:
- using meta robots noindex tag on each page, or,
- having no disallow, and simply redirecting such requests (DO make sure that it really is a 301 redirect).
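A minimal mod_rewrite sketch of the redirect option (the condition and rule are illustrative, not your actual .htaccess, and this version drops the whole query string, which matches your "strip the parameter" description):

```apache
RewriteEngine On
# Match any request whose query string carries a virtuemart parameter
RewriteCond %{QUERY_STRING} (^|&)virtuemart= [NC]
# Redirect to the same path with the query string cleared, as a permanent 301
RewriteRule ^(.*)$ /$1? [R=301,L]
```

The trailing `?` in the substitution is what clears the query string; without it, the original parameters would be re-appended to the target URL.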
| 11:41 am on May 23, 2009 (gmt 0)|
OK, I have that: a redirect set up using .htaccess which returns a 301 header. Though my robots.txt file is now preventing Google from finding that 301 redirect.
If I remove the disallow, Google will suddenly see a flood of 301 redirects. Do I risk getting a penalty for this? The redirect points to the same URL, minus the parameter.
| 6:34 pm on May 23, 2009 (gmt 0)|
No penalty. You're cleaning house.
While those listings still remain in the SERPs they will bring visitors, your redirect will deliver those visitors to the right content, and their browser will display the correct URL for that content.
| 11:53 am on May 24, 2009 (gmt 0)|
Thanks g1smd, I've gone ahead and allowed Google to spider the pages so the redirect can be picked up! I'm keeping an eye on what happens; anything noteworthy, I will post here.