Forum Moderators: goodroi
On 11 May, using Webmaster Tools, I added the following rule to robots.txt: Disallow: /*?virtuemart=*
This checked out as valid in the robots.txt checker tool in Webmaster Tools, and since then I have seen an increase in blocked URLs listed in Webmaster Tools (though not even 10% of the total), as expected. However, Google keeps adding pages with the 'banned' parameter to its index. Here's an overview:
results: 1 - 10 of about 13,000
results: 1 - 10 of about 13,700
What I fail to understand is, why did google add another 700 pages to its index _after_ I added the Disallow rule?
How do I get rid of all these duplicate pages? Our site only has 629 pages. I'm at a loss as to why this is happening.
ps: I have added a .htaccess rewrite as of 18-May-2009, which strips the URL of its parameter (even though the Disallow rule should keep Google from indexing, I'm just trying all I can here...)
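A rewrite of that kind might look roughly like the following in .htaccess. This is a sketch only, assuming Apache mod_rewrite; the parameter name virtuemart is taken from the robots.txt rule above, and the exact rule the poster used is not shown in the thread:

```apache
# Sketch: 301-redirect any request carrying a virtuemart query
# parameter to the same path with the query string stripped.
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)virtuemart= [NC]
# The trailing "?" on the target discards the original query string.
RewriteRule ^(.*)$ /$1? [R=301,L]
```

The [R=301,L] flags matter here: 301 makes it a permanent redirect (so search engines consolidate the URLs), and L stops further rule processing for that request.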
[edited by: SyntaxTerror at 11:07 am (utc) on May 22, 2009]
[edited by: goodroi at 12:15 pm (utc) on May 22, 2009]
[edit reason] Please no urls [/edit]
When Google states the total number of results, it is estimating. This number is known to be inaccurate; it will be close, but sometimes it is very hard for Google to know how many total URLs it has indexed. The number was reported as 13,000 and later as 13,700. I think this is more a problem of Google not reporting a good estimate.
If you review your log files you can determine if Google is following your robots.txt.
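Checking this in the logs can be as simple as grepping for Googlebot requests that still carry the blocked parameter. A minimal sketch, using a couple of hypothetical combined-format log lines for illustration; in practice you would point the grep at your real Apache access log:

```shell
# Hypothetical sample log lines, just to illustrate the check.
cat > /tmp/access.log <<'EOF'
66.249.66.1 - - [20/May/2009:10:00:00 +0000] "GET /shop HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"
10.0.0.5 - - [20/May/2009:10:01:00 +0000] "GET /shop?virtuemart=123 HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF

# If robots.txt is being honoured, Googlebot should show zero
# fetches of URLs with the parameter after the rule went live.
hits=$(grep 'Googlebot' /tmp/access.log | grep -c 'virtuemart=' || true)
echo "Googlebot fetches of blocked URLs: $hits"
```

Here the Googlebot request has no virtuemart parameter, so the count comes out at zero, matching what the original poster found in their own logs.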
I have opened the log files of the last few days and you were correct: Google is not visiting any of the pages I blocked.
I had another measuring point between the other two, which came out at 13,300 pages, so I saw a rising trend, so to speak, and got alarmed.
Am I correct in assuming the pages already indexed will eventually drop out of the index naturally?
Thank you :)
Those entries will remain in Google SERPs as URL-only entries forever, because your disallow forbids them from being spidered.
If you want them to be gone you would be much better off either:
- using a meta robots noindex tag on each page, or
- having no disallow, and simply redirecting such requests (DO make sure that it really is a 301 redirect).
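For the first option, the tag goes in the head of each page you want dropped. Note that Googlebot has to be able to crawl the page to see the tag, so this only works if the Disallow is removed. A minimal sketch:

```html
<head>
  <!-- Tells compliant crawlers to drop this page from their index -->
  <meta name="robots" content="noindex">
</head>
```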
If I remove the disallow, there will suddenly be a flood of 301 redirects for Google to follow. Do I risk getting a penalty for this? The redirect points to the same URL, minus the parameter.