|Joomla duplicate content issue|
I had the 1.5 version of Joomla in March 2013, and Google Webmaster Tools was showing 39 pages indexed. In April 2013 our webmaster upgraded Joomla to version 2.5 (we changed nothing on our website other than the upgrade to the latest stable version).
Within a week of installing that version, Google Webmaster Tools showed 600 pages indexed, with URLs like index.php?option=com_content&view=article&id=75&Itemid=261 or www.example.com/example/159-news/latest-news/66-the-example.html. These have no business being in Google's index (we had URL rewriting in place, etc., and never created those pages).
When we saw those pages indexed by Google (how did we see it? By typing site:example.com, and Google was willing to show us a few of those pages), we decided to go back to Joomla 1.5, because we realized right away that there was a bug with the upgrade. Unfortunately the damage was done: Google had those pages in its index due to a bug with our CMS upgrade.
Since then our goal has been to remove those URLs from Google's index, because we now have duplicate content for all our pages, which is killing our SEO. We disappeared from the rankings on all our keywords. To give you an example, on certain keywords we were on page 2 and within a few weeks of that issue went to page 70!
Right now we are trying to remove all the duplicate pages one by one in Google Webmaster Tools, but we still have 350 pages listed (instead of over 600, which means that we managed to remove some of them with the URL removal tool), and we have a very hard time figuring out what the addresses of the pages to remove are.
Google will not list them when we type the site:example.com command! The inurl command doesn't give a list of those pages either, so we really have to guess, and guessing over 300 page addresses to remove is impossible.
At the beginning, the URL Parameters section of GWT showed us a list of sample URLs, a few more showed up when we typed site:example.com, and we guessed a few others... but we are now stuck.
The question I have is: how can we remove the 300 pages that Google has in its index, which are duplicate content and are causing the penalty we have? So far we have added the Disallow /* (for numbers from 0 to 9) in the robots.txt, and we are using URL Parameters in GWT, where we told Google "No URLs" for all the Itemid parameters it found, but we have no clue if it is going to work...
Is there anything else we could do? And if we are on the right track, how long does it take for those pages to be removed from Google's index?
Thank you for your help and comments,
[edited by: ergophobe at 3:47 pm (utc) on Aug 30, 2013]
[edit reason] domain exemplified [/edit]
|So far we have added the Disallow |
DO NOT DO THIS.
If a page is already indexed, the last thing you ever want to do is block robots from re-crawling it. If they can't crawl it, they will never see a noindex directive.
At this point you have at least three separate issues.
#1 Duplicate pages that are being created by the CMS. It seems to be returning valid pages from invalid parameter values. For every upgrade, there will be one or more transitional patches to keep URLs from going haywire. You are in the right forum to ask about this.
#2 Duplicate URLs that already exist and might be requested by anyone, including search engines. These need to be redirected in htaccess to the appropriate unique pages. It isn't easy, but you can get htaccess to look at the values of parameters. For example if "options" is only valid for values up to 3, you'd have a form like
#3 Duplicate pages in Google's index. If the problem is with parameter values, rather than parameter names, you'll have to concentrate on redirecting. Once the search engines see that URLs b through g all redirect to URL a, the rest of the list will disappear from the index.
Three problems, three solutions.
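To make the parameter check in #2 a bit more concrete, here is a rough sketch of the decision a redirect rule would have to make. It is written in Python rather than htaccess, purely to show the logic. The parameter name "options" and its valid range come from the example in #2; the fallback URL and the function name are hypothetical, and a real site would need its own mapping from bad URLs to the correct unique pages.

```python
import re
from urllib.parse import urlsplit, parse_qs

# From the example in #2: "options" is only valid for values up to 3.
MAX_VALID_OPTIONS = 3
# Hypothetical target for requests that carry an invalid value.
FALLBACK_URL = "http://www.example.com/"


def redirect_target(url):
    """Return the URL to 301 to, or None if the request can be served as-is."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("options")
    if not values:
        return None  # parameter absent: nothing to check
    value = values[0]
    # Only plain ASCII digits within the valid range are accepted.
    if re.fullmatch(r"[0-9]+", value) and 1 <= int(value) <= MAX_VALID_OPTIONS:
        return None
    return FALLBACK_URL
```

With this rule, a request for page.php?options=2 is served normally, while options=7 or options=abc is sent to the fallback URL. The hard part, which the sketch does not solve, is knowing the right target for each invalid URL.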
[edited by: phranque at 9:50 am (utc) on Aug 30, 2013]
[edit reason] disabled graphic smileys [/edit]
|If a page is already indexed, the last thing you ever want to do is block robots from re-crawling it. If they can't crawl it, they will never see a noindex directive. |
Since I put in the Disallow, Googlebot has been blocked from those pages, but that is not an issue because there is no noindex directive on my pages (I cannot add one to each page because I don't have a list of pages to block. That is my problem!)
#1 Duplicate pages that are being created by the CMS.
This is not happening anymore, because I reinstalled 1.5 and will never use 2.5 until Joomla figures out a way to create static web addresses (hopefully 3.5).
#2 Same here: it is impossible to redirect the pages, because how can I know that a page with a number 1 in it needs to be redirected to a certain page B? Maybe the next page with a number 1 in it needs to be redirected to page C, so unless I misunderstood, this doesn't seem to be a possibility.
#3 Same here: I can't guarantee that pages B through G will need to redirect to URL A, as I don't have a list of the pages to redirect... Google has those pages in its index but is hiding them from me!
[edited by: ergophobe at 2:07 pm (utc) on Aug 30, 2013]
[edit reason] formatting [/edit]
If you don't know where to redirect to, how do you know what to block? You must have some rule that allows you to figure that out.
Which brings us back to Lucy24's point. This is fundamental:
robots.txt controls the crawl, not the index.
Disallow in robots.txt doesn't tell Google not to index a page, it tells it not to crawl it. If you have enough inbound links to that page, Google might choose to index it and would be fully within compliance with the robots.txt protocol in doing so. Disallow is more about not wasting server resources, not spending your crawl budget and so forth. It is only secondarily, and in a sense accidentally, about what's in the index.
noindex controls the index, not the crawl.
If you noindex a page, on the other hand, you are telling Google to keep it out of the index or remove it from the index, but Google is welcome to crawl it all they want.
So if you are trying to control duplicate content, robots.txt is NOT generally your tool of choice.
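The crawl-versus-index distinction can be checked with Python's standard robots.txt parser. This is only a sketch: the robots.txt content below is a hypothetical stand-in (Python's parser does simple prefix matching and does not understand Google's wildcard extensions), and the helper name is made up for illustration. The point is that a disallowed URL is never fetched, so any noindex on it is never read.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the parameterised index.php URLs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def can_see_noindex(url, agent="Googlebot"):
    """A crawler can only read a page's noindex if it is allowed to fetch the page."""
    return parser.can_fetch(agent, url)


blocked = "http://www.example.com/index.php?option=com_content&view=article&id=75&Itemid=261"
allowed = "http://www.example.com/about.html"

print(can_see_noindex(blocked))  # False: never crawled, so noindex is never seen
print(can_see_noindex(allowed))  # True: crawlable, so noindex would be honoured
```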
In this particular case, though, it sounds like your primary problem is with canonicalization. So rather than add a noindex to the pages that have extra parameters, you probably want to add a rel=canonical to those pages and point to the primary page.
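As a rough illustration of the rel=canonical idea, the sketch below computes a primary URL by dropping a parameter and builds the tag that would go in the page head. It assumes, purely for the example, that Itemid is the only parameter creating duplicates; on a real Joomla site the primary page is more likely the SEF URL, which this sketch cannot derive. The function names are hypothetical.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption: these parameters only create duplicates and can be dropped.
DUPLICATE_PARAMS = {"Itemid"}


def canonical_url(url):
    """Return the URL with the duplicate-causing parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in DUPLICATE_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url):
    """Build the <link> element to place in the head of the duplicate page."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)
```

Every variant of a page that differs only in Itemid would then point at the same primary URL, which is the signal Google needs to fold the duplicates together.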
|Google has those pages in its index but is hiding them from me! |
If they are hidden from you, how do you know they are in the index? In any case, the more important question is why do you need to know they are in the index? What you really need to figure out is what your URL structure is, what you need for redirects (if any) and canonical tags.
In other words, whether Google has spotted a dupe content issue already (thus reflected in the index) or not (thus not YET reflected in the index), you need to figure out the root causes and fix that.
Telling Google to ignore certain parameters in GWT is a good strategy. A noindex directive, rel=canonical and 301 redirects might be good strategies too. Robots.txt disallow probably isn't going to do what you want.
Then disallow really doesn't make sense there. If the page doesn't exist, then ideally it will serve up a 410 (Gone, meaning permanently removed, rather than a 404 Not Found) and that is a signal to pull it from the index. If you disallow, the SEs can never know the page no longer exists.
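Here is a minimal sketch of how a server might choose between 200, 410 and 404. It assumes the unwanted URLs can be recognised by a path segment that starts with digits and a hyphen, as in the 159-news/latest-news/66-the-example.html example from the first post. That pattern, the function name and the set of known paths are all assumptions for illustration, not a rule taken from Joomla.

```python
import re

# Assumption: unwanted URLs contain a path segment like "159-news" or "66-the-example.html".
NUMERIC_SEGMENT = re.compile(r"/\d+-[^/]+")


def status_for(path, known_paths):
    """Pick the HTTP status for a requested path."""
    if path in known_paths:
        return 200  # a real page: serve it
    if NUMERIC_SEGMENT.search(path):
        return 410  # a URL created by the bad upgrade: tell crawlers it is gone for good
    return 404  # anything else that does not exist


known = {"/example/the-example.html"}
print(status_for("/example/159-news/latest-news/66-the-example.html", known))
print(status_for("/example/the-example.html", known))
```

The crawler has to be able to request the URL to receive the 410, which is why the Disallow has to come out of robots.txt first.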
[edited by: ergophobe at 3:50 pm (utc) on Aug 30, 2013]
I went through this upgrade in April too and my problems were minimal. You need to solve them in 2.5 because 3.5 will present the same challenges. Going into the conversion, I had a list of like 50 items I needed to check after the conversion. The below type of URI was on the list. I used SEF Advance, which automatically redirects the below to the SEF URL. Look around; there are easy-to-implement solutions to this problem.
Of course Lucy is correct too. Once you're sure the server is going to return a 301 or 404 / 410 (whatever is appropriate), don't block access. Let Google know they're redirected, bad or gone.
Thanks BillyS - I don't use Joomla so I only know about SEF Advance from other posts here. But if you can solve it at the source in Joomla 2.5, that's best of all (and as you say, Lucy24's advice still stands).
Thank you for your replies. I agree with you that robots.txt is not the best option, and it is also the slowest (it has currently taken about 3 months for the results of the disallow to start showing in the descriptions of the pages).
I will set up a 410 in the .htaccess as it is the best option in my case, and I guess that the pages I want removed will be removed in a few days.
When you did the upgrade, I assume all the pages took on new URLs. You needed a redirect from the old to the new URL in order to preserve any traffic still requesting the old URL.
I tried one of the SEO/dupe content extensions. It is POWERFUL. However, it is also extremely complex and does a lot of URL rewriting. To be honest, I'm not sure if it is helping or hurting the 2 sites I installed it on.