I have a website that provides listings throughout the US. Back when I was first building the site about three years ago, I created separate category pages for categories that were very similar to each other.
So, for every city and zip code in the US we provide listings under five different categories, and three of those categories are very similar. Basically the three categories are the same, but the term used varies greatly depending on the region of the country you're from.
Up until recently we were getting a high volume of traffic from each category. But now I think these three similar categories are being treated as duplicate content, or at minimum are cannibalizing each other in the Google SERPs.
I believe this is the case because, besides the major loss of traffic, when I now search for any of the three category keywords, Google highlights the other two wherever they appear in the title or description of the results displayed.
My question is: what's the best strategy for removing two of the categories from the index and setting the third as the version to be indexed? I still want to keep the two de-indexed categories live on the site to maintain the user experience.
I'm thinking the best strategy is to set a canonical tag on the two categories being removed, pointing to the one category that remains in the index.
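For reference, here is a minimal sketch of what that tag would look like in the <head> of each page in the two retiring categories (the URL is a hypothetical placeholder for my real category paths):

```html
<!-- On every page in the two retiring categories, point to the
     equivalent page in the category being kept in the index -->
<link rel="canonical" href="https://www.example.com/new-york-ny/category-a/">
```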
Would I also update the robots tag to NOINDEX or 'NOINDEX, NOFOLLOW'?
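For clarity, the two variants I mean are:

```html
<!-- Variant 1: drop the page from the index, still follow its links -->
<meta name="robots" content="noindex">
<!-- Variant 2: drop the page from the index and do not follow its links -->
<meta name="robots" content="noindex, nofollow">
```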
Also, there are about 70,000 pages on the site per category, so this update would mean no-indexing/manipulating roughly 140,000 pages. With my site already down in the SERPs, it's vital that I don't screw this up!
Would you perform this on a small test section first? (I'm thinking yes, even as I ask this.)
I agree with your plan to implement the canonical link element on the two city/zip category pages, pointing to the third category that you choose to keep indexed.
I would not set up robots meta noindex. Once the canonical is implemented, pages whose canonical points to a different URL will be dropped from the index anyway. Combining noindex with a canonical also sends mixed signals: noindex asks Google to drop the page outright, while the canonical asks it to consolidate the page with another URL.
I would agree with your idea of testing this on one small section of the website first, but I would choose a selection of zip codes/cities that get crawled more often, and I would monitor my logs for the chosen URLs to see whether they have been re-crawled before drawing any conclusions. With such a big number of pages, if you implement this on pages that are not crawled often, you may have to wait a long time to see results, and if you do not check whether Google has actually re-crawled those pages, you may end up drawing the wrong conclusion.
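If it helps, here is a rough sketch of the kind of log check I mean, assuming a standard Apache/nginx combined log format; the file paths, and the idea of keeping the test URLs in a plain-text list, are placeholders for whatever your setup looks like:

```python
#!/usr/bin/env python3
"""Rough sketch: has Googlebot re-crawled a set of test URLs yet?

Assumes an Apache/nginx "combined" log format. LOG_PATH and URLS_PATH
are placeholders; point them at your own access log and at a text file
listing one URL path per line (e.g. /10001/category-a/).
"""
import re
from datetime import datetime

LOG_PATH = "access.log"        # placeholder: your server access log
URLS_PATH = "test_urls.txt"    # placeholder: the test section's URL paths

# Combined format: ip - - [time] "METHOD path HTTP/x" status bytes "ref" "ua"
LINE_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def main():
    with open(URLS_PATH) as f:
        targets = {line.strip() for line in f if line.strip()}

    last_seen = {}  # path -> most recent Googlebot hit
    with open(LOG_PATH) as log:
        for line in log:
            m = LINE_RE.match(line)
            if not m or "Googlebot" not in m.group("ua"):
                continue
            path = m.group("path")
            if path not in targets:
                continue
            ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            if path not in last_seen or ts > last_seen[path]:
                last_seen[path] = ts

    for path in sorted(targets):
        hit = last_seen.get(path)
        print(f"{path}\t{hit.isoformat() if hit else 'NOT RE-CRAWLED YET'}")

if __name__ == "__main__":
    main()
```

One caveat: matching on the user-agent string alone can be fooled by spoofed crawlers, so if the numbers look odd, verify the hitting IPs with a reverse DNS lookup before trusting them.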
It sounds like you have massive numbers of questionable-quality, near-duplicate pages. You can fix this one case (and I agree with what aakk9999 said), but you might want to take a step back and look at what unique value you provide, because you are probably going to have many more issues with Google traffic.