| 9:58 pm on Jan 9, 2013 (gmt 0)|
I'm not sure I understand your question, but as a general rule I think you need to either delete (404 or 410) or noindex the pages before you submit them to Google's removal tool. After they are submitted to the tool, in my experience Googlebot usually crawls them within a few hours, and they are gone from the index within 24 hours.
| 10:33 pm on Jan 9, 2013 (gmt 0)|
We can easily block the /dir/ in robots.txt and delete it within a few hours; that's not the problem.
This /dir/ directory has 2,000 URLs we want re-crawled immediately after removing it — they are collateral damage from removing the duplicate content. So I want to know, first, whether Google will re-crawl them soon after we remove the "Disallow" from robots.txt, and secondly, whether these re-crawled URLs will pass the same amount of link juice onto others once they have been deleted and re-indexed?
| 10:39 pm on Jan 9, 2013 (gmt 0)|
I'm pretty sure the URL removal tool removes them for a minimum of 90 days, so I doubt I'd go that route ... I think I'd just get the redirects right from the duplicates to the pages you actually want to keep, and not 'over do' things or 'over manage' it.
You have duplicates ... They've seen that for years ... Generally, they'll just 'pick a version' to use if there's not a redirect or canonical link relationship to 'give them an indication' of which should be used as the canonical.
It may be a bit different since the duplication is across subdomains, rather than a single domain, but that's also the 'technical case' when a domain is available both with and without the www, so I think I'd just get it right with the 301 redirects and not 'get fancy' with it or 'over do' trying to fix it.
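As a sketch of the 301 approach described above (using a hypothetical example.com — substitute your real hostnames), an .htaccess rule that forces the non-www host onto the www version looks like this:

```apache
# Minimal sketch: redirect the bare domain to the www host with a 301
# so only one canonical version of each URL is crawlable and indexed.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The same pattern generalises to duplicate subdomains: match the unwanted host in the RewriteCond and redirect to the one you want kept.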
| 10:57 pm on Jan 9, 2013 (gmt 0)|
A Disallow: directive in the robots.txt is not necessarily going to get them removed from Google's index any time soon. It will simply mean that they will no longer crawl them. A meta robots noindex would get them removed but does not seem appropriate since there are so many.
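To make the distinction concrete, a Disallow: block for the directory in question (assuming the /dir/ path from the earlier post) would look like this:

```
# robots.txt — stops Googlebot from crawling /dir/, but does NOT
# by itself remove already-indexed URLs from the index
User-agent: *
Disallow: /dir/
```

By contrast, a per-page `<meta name="robots" content="noindex">` does request removal, but only once the page is re-crawled — which is why it's impractical across thousands of URLs.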
I would probably just 301 redirect the 8000 or so URLs since they were likely only linked to from your site (and I'm assuming all such links have been fixed).
Hopefully, you can do some pattern matching in mod_rewrite/.htaccess so that you don't need 8,000 page-by-page redirects.
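For instance, if the duplicate URLs share a common prefix, one rule can cover all of them. This is only a sketch — /dir/ and the /pages/ target are assumptions; adjust the pattern to the actual URL layout:

```apache
# Hypothetical pattern redirect: map every URL under /dir/ to the
# corresponding canonical URL in one rule instead of 8,000 entries.
RewriteEngine On
RewriteRule ^dir/(.*)$ http://www.example.com/pages/$1 [R=301,L]
```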
| 11:49 pm on Jan 9, 2013 (gmt 0)|
If the 90-day minimum removal stands, this may be a show stopper, because the non-dupe URLs have a canonical pointing to the correct place on other subdomains. Without Google re-crawling, indexing and honouring the canonical, the link juice would stop flowing, and that could affect ranking... The reason for opting for the removal tool was to resolve this issue quickly — as mentioned, most of our traffic went -950, and we have proof that it's because of this cross-subdomain dupe content...
We have dropped the 301 redirects because somehow Google has managed to get into this mess.
| 11:55 pm on Jan 9, 2013 (gmt 0)|
|A Disallow: directive in the robots.txt is not necessarily going to get them removed from Google's index any time soon. It will simply mean that they will no longer crawl them. A meta robots noindex would get them removed but does not seem appropriate since there are so many. |
When used in conjunction with the removal tool, it's immediate ... You either have to put a block in the robots.txt or a noindex on the pages to use the tool successfully.
|We have dropped the 301 redirects because somehow google has managed to get into this mess |
I'd double check on the 90-day minimum. I haven't looked into it in a while, but that's what it used to be.
I would guess there was some issue with the redirects previously, because, apart from producing a larger 'not selected' list, the URLs would not have been accessible except for the canonical version, so there should not have been duplication ... Maybe I'm missing something about the initial issue that caused the 301s to be removed?
| 12:13 am on Jan 10, 2013 (gmt 0)|
In hindsight we should have placed a canonical in addition to the 301 to make sure we were covered. This 90-day limit looks real, so it looks like we are going to have to sit tight while Google gets through these URLs... I am shocked Google has -950'd us based on dupe content, but it may have tipped us over the edge...
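The belt-and-braces setup described above (canonical plus 301) is just a `<link>` element in the head of each duplicate page. A hypothetical example, with the target URL made up for illustration:

```html
<!-- Each duplicate page declares its preferred URL on the main
     subdomain; Google treats this as a strong canonicalisation hint
     even if the 301 is later removed or misconfigured. -->
<link rel="canonical" href="http://www.example.com/page.html">
```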
| 12:19 am on Jan 10, 2013 (gmt 0)|
That is a bit interesting, because from what I understand about the -950, it was more 'word/structure' oriented, though applied differently to different pages/sites/topics ... I'd maybe dig through that a bit more while waiting for Google to get through the pages, to see if there's anything that sticks out as possibly being an issue besides the duplication.
| 12:24 am on Jan 10, 2013 (gmt 0)|
It's not the first time we have met the -950; obviously we have a level of over-optimization that Google can just about cope with, but with this added dupe variable it's had enough. One glaring bit of proof was the fact that subdomains that didn't have any cross-duplication remained active...
| 1:56 am on Jan 10, 2013 (gmt 0)|
|When used in conjunction with the removal tool, it's immediate ... You either have to put a block in the robots.txt or a noindex on the pages to use the tool successfully. |
I'm well aware that before using the URL removal tool, you need to block them with a Disallow: in the robots.txt or a meta robots noindex on the pages so that they won't get reindexed.
I was simply clarifying that blocking them with a robots.txt Disallow: directive alone will NOT get the URLs dropped from Google's index immediately. This would have to be done in conjunction with the URL removal tool if the goal is to remove them immediately.