Ours is a fairly popular blog covering lifestyle and fashion. It was running on Movable Type (a CMS), and in November 2012 we moved it to WordPress with a brand new design.
I believe the duplicate content problem started then. It's a bit lengthy, so please take the time to go through it.
Movable Type paginates by adding a query string to the index page, for example mywebsite.com/index.php?page=2 or mywebsite.com/cars/index.php?page=35. We have close to 14,000 posts, so you can imagine how many paginated pages there are across the categories and the home page. WordPress paginates differently: www.mywebsite.com/page/2/ and www.mywebsite.com/cars/page/35/.
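To make the two schemes concrete, here is a rough Python sketch of how an old Movable Type style URL corresponds to the WordPress style path. The function name `old_to_new` is made up for illustration, and treating a missing or `page=1` value as the unpaginated page is an assumption, not something either CMS is confirmed to do.

```python
from urllib.parse import urlsplit, parse_qs

def old_to_new(url):
    """Return the WordPress-style path for a Movable Type paginated URL.

    Assumes the URL includes a scheme (http://...), otherwise urlsplit
    would treat the host name as part of the path.
    """
    parts = urlsplit(url)
    path = parts.path

    # Movable Type paginates on index.php; WordPress uses the directory itself.
    if path.endswith("index.php"):
        path = path[: -len("index.php")]
    if not path.endswith("/"):
        path += "/"

    # Take the first ?page= value, if there is one.
    page = parse_qs(parts.query).get("page", [None])[0]

    # Assumption: page 1 (or no page at all) is the unpaginated listing.
    if page and page.isdigit() and int(page) > 1:
        path += f"page/{int(page)}/"
    return path

print(old_to_new("http://mywebsite.com/index.php?page=2"))
print(old_to_new("http://mywebsite.com/cars/index.php?page=35"))
```

With the two example URLs above this gives /page/2/ and /cars/page/35/.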
Since we had been using Movable Type for close to 7 years, the old paginated URLs are stored in Google's cache. Google is now mixing the two schemes when it crawls, for example www.mywebsite.com/page/2?page=2 and www.mywebsite.com/page/35/?page=20 and so forth. The combinations and permutations run into the hundreds of thousands. In Google Webmaster Tools, the monitored URLs section for this website shows close to 110,000 URLs monitored by Google for the page parameter alone.
Apart from this, there are many incoming links from other websites, such as Polyvore and Pinterest, that point to the old paginated pages.
Google clearly treats this as duplicate content (I can tell from the drop in the SERPs), and because of the large number of rogue URLs it is not crawling my website as it used to.
To get out of this, I have set up a 301 redirect for all such queries, so for example mywebsite.com/page/35/?page=20 is redirected to mywebsite.com/page/35/. I have also added canonical references to all the pages.
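The logic of that redirect, sketched roughly in Python (the real rule lives in the server configuration, which is not shown here). The function names `redirect_target` and `canonical_tag` are hypothetical, and stripping only the `page` parameter while keeping any other query parameters is an assumption for the sketch.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def redirect_target(url, param="page"):
    """Return the URL to 301 to, or None if no redirect is needed."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)

    # Keep every query parameter except the leftover pagination one.
    kept = [(k, v) for k, v in pairs if k != param]
    if len(kept) == len(pairs):
        return None  # the parameter is absent, so serve the page as usual

    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

def canonical_tag(url):
    """Build the canonical link element for the query-free version of a URL."""
    parts = urlsplit(url)
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{clean}" />'

print(redirect_target("http://mywebsite.com/page/35/?page=20"))
print(canonical_tag("http://mywebsite.com/page/35/?page=20"))
```

For the example above, the redirect target is http://mywebsite.com/page/35/, and a request that carries no page parameter is left alone.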
It has been three months since I did this, yet the number of URLs monitored in GWT is the same and the SERPs are still in the dumps. The SERPs are a secondary issue; my primary concern is getting Google to remove these pages from its index and cache. I have tried to remove these links manually too, but the requests expire.
What is it that I should do? I want Google to recrawl all the indexed pages so it can see that a majority of them have been 301'd.