Msg#: 4578166 posted 10:53 am on May 27, 2013 (gmt 0)
Ours is a fairly popular blog covering lifestyle and fashion. It was running on Movable Type (a CMS), which we migrated to WordPress in November 2012 with a brand-new design.
I believe the duplicate content problem started then. It's a bit lengthy, so please take the time to go through it. Movable Type paginates by adding a query string to the index page, e.g. mywebsite.com/index.php?page=2 or mywebsite.com/cars/index.php?page=35. We have close to 14,000 posts, so you can imagine the number of paginated pages across the categories and the home page. WordPress, on the other hand, paginates differently: www.mywebsite.com/page/2/ and www.mywebsite.com/cars/page/35/.
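For what it's worth, the mapping between the two schemes is purely mechanical. A minimal sketch (assuming `page` is the only parameter in play and `index.php` is the only index filename — both assumptions on my part):

```python
from urllib.parse import parse_qs

def old_to_new(path, query):
    """Map a Movable Type paginated URL to its WordPress equivalent,
    e.g. /cars/index.php?page=35 -> /cars/page/35/"""
    page = parse_qs(query).get("page", ["1"])[0]
    base = path.rsplit("/index.php", 1)[0]  # drop the index filename
    if page == "1":
        return (base or "") + "/"           # page 1 is just the section root
    return f"{base}/page/{page}/"

print(old_to_new("/cars/index.php", "page=35"))  # -> /cars/page/35/
```

A table of rules like this is what the eventual 301 map has to implement, whatever tool you use to implement it.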
Since we had been using Movable Type for close to 7 years, the paginated URLs are stored in Google's cache. So now Google is mixing the two schemes and crawling pages like www.mywebsite.com/page/2/?page=2 and www.mywebsite.com/page/35/?page=20, and so forth. The combinations and permutations of these run into the hundreds of thousands. In Google Webmaster Tools, the URL Parameters section shows close to 110,000 URLs monitored by Google for the 'page' parameter alone.
Apart from this, there are many incoming links from other websites like Polyvore and Pinterest that point to the old paginated pages.
Clearly Google treats this as duplicate content (I can tell from the drop in SERPs), and because of the large number of rogue URLs it is not crawling my website as it used to.
To get out of this, I have set up a 301 redirect for all queries. For example, mywebsite.com/page/35/?page=20 is redirected to mywebsite.com/page/35/. Apart from this, I have also added canonical references to all the pages.
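In case it helps anyone following along, the rule is essentially "if a query string is present, 301 to the bare path". The real thing lives in the server config / WordPress, so treat this Python sketch as pseudocode for the rule, not the actual implementation:

```python
from urllib.parse import urlsplit, urlunsplit

def redirect_target(url):
    """Return the 301 target for a URL carrying a stray query string,
    or None if the URL is already clean."""
    parts = urlsplit(url)
    if not parts.query:
        return None  # nothing to strip, serve the page normally
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

So redirect_target("http://mywebsite.com/page/35/?page=20") gives back the clean http://mywebsite.com/page/35/, and a clean URL passes through untouched.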
It has been three months since I did this, yet the number of URLs monitored in GWT is the same and the SERPs are in the dumps as well. SERPs are the secondary issue; my primary concern is for Google to remove these pages from its index and cache. I have tried to remove these links manually too, but the requests expire.
What should I do? I want Google to crawl all the indexed pages once again so it can see that the majority of them have been 301'd.
Msg#: 4578166 posted 1:23 pm on May 27, 2013 (gmt 0)
To get out of this, I have set up a 301 redirect for all queries. For example, mywebsite.com/page/35/?page=20 is redirected to mywebsite.com/page/35/
That can be a problem. Your server needs to return a 404 for any URL that never existed. Sites definitely lose rankings over time when they redirect everything instead of returning a true 404 where it's appropriate. The first sign is often a WMT warning about "soft 404s".
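The distinction being drawn here can be sketched as a tiny routing decision: URLs that genuinely moved get the 301, URLs that never existed get a hard 404. The URL patterns below are my assumptions for illustration, not the poster's actual site structure:

```python
import re

# A legacy Movable Type paginated index, e.g. /cars/index.php?page=35
LEGACY = re.compile(r"^(?P<base>.*?)/?index\.php$")

def classify(path, params):
    """Decide between a 301 (the URL genuinely moved) and a hard 404
    (the URL never existed), instead of redirecting everything."""
    m = LEGACY.match(path)
    if m:
        # A real legacy URL: send it to the new WordPress scheme.
        page = params.get("page", "1")
        base = m.group("base")
        target = base + "/" if page == "1" else f"{base}/page/{page}/"
        return ("301", target)
    if params:
        # Made-up parameter combinations never existed -> say so honestly.
        return ("404", None)
    return ("200", None)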
Other than that, a 301 redirect from the old style paginated URL to the new style is exactly what you should do. Kudos for that.
Msg#: 4578166 posted 2:06 pm on May 27, 2013 (gmt 0)
Well, I did get a soft 404 warning a couple of days back. The problem was that I had not used absolute URLs for pagination. So if a user entered a query (before I set up the 301), say mywebsite.com/page/2/?query=webmasterworld, all the pagination URLs would point to their respective pages with that query attached. Some hackers had actually tagged rogue queries like "Save Us from Berlusconi", and some with random characters. I have solved that by modifying the plugin to output absolute URLs, but not before Google indexed and cached the pages with the rogue queries.
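For anyone hitting the same relative-URL trap: the fix is to build each pagination link from a fixed base rather than from the requested URL, so whatever query string a visitor (or a hacker) arrives with can never leak into the links. A hypothetical sketch, not the actual plugin code:

```python
def pagination_links(base, current, total, window=2):
    """Absolute pagination URLs built only from `base`; the visitor's
    query string is never consulted, so it cannot propagate."""
    first = max(1, current - window)
    last = min(total, current + window)
    return {n: base + "/" if n == 1 else f"{base}/page/{n}/"
            for n in range(first, last + 1)}
```

Because the function takes no request object at all, there is nothing for a rogue ?query= to ride along on.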
Msg#: 4578166 posted 1:18 am on Oct 23, 2013 (gmt 0)
Morpheus / Coleman123,
Were you able to recover from this issue? I've been dealing with a big G drop for the past few months and had similar findings. G is showing pages with old URLs that haven't existed in nearly two years because there is a 301 redirect to the new URL. I would have thought it would show the new URL, but it doesn't. And Polyvore, just like in your example above, is showing many old URLs.
I'm curious to know how you wound up resolving the issue if you were able to get past this.
Msg#: 4578166 posted 1:44 am on Oct 23, 2013 (gmt 0)
My issue was the result of multiple errors on my part. My programmer and I launched a new version of my website on May 1st, and we forgot to update the robots.txt file for about 24 hours. That in itself wasn't a massive problem, but the new version of the site was set up with updated URLs on about half the site. Among several other issues along the way, on the 3rd or 4th day we had to revert to the old version of the site because the new version was crashing. Quite the learning experience... Oh, by the way, this was all happening right before/during some major algorithm updates from Google, so I was slapped hard.
For your issue:
"G is showing pages with old URLs that haven't existed in nearly 2 years because there is a 301 redirect to the new URL. I would have thought that it would show the new URL, but it doesn't."
Are you saying you set up a 301 redirect two years ago and the URLs updated as planned, but suddenly the old URLs are appearing again?
If that is the case, how did you discover that the old URLs still exist? From traffic to your site, or from researching G's index after noticing a drop in traffic?
There could actually be several different reasons. If you can provide some more information, I could possibly be more helpful.