
Google crawling thousands of non-existent pages on my website

   
10:53 am on May 27, 2013 (gmt 0)

10+ Year Member



Ours is a fairly popular blog covering lifestyle and fashion. It was running on Movable Type (a CMS), which we shifted to WordPress in November 2012 with a brand-new design.

I believe the duplicate-content problem started then. It's a bit lengthy, so please take the time to go through it.
Movable Type paginates by adding a query string to the index page, e.g. mywebsite.com/index.php?page=2 and mywebsite.com/cars/index.php?page=35. We have close to 14,000 posts, so you can imagine the number of paginated pages across the categories and the home page. WordPress, on the other hand, paginates differently: www.mywebsite.com/page/2/ and www.mywebsite.com/cars/page/35/.

Since we had been using Movable Type for close to 7 years, the old paginated URLs are stored in Google's cache. Google is now mixing the two schemes and crawling URLs like www.mywebsite.com/page/2/?page=2 and www.mywebsite.com/page/35/?page=20, and so forth. The combinations and permutations of these run into the hundreds of thousands. In Google Webmaster Tools, the URL Parameters section shows close to 110,000 URLs monitored by Google for the "page" parameter alone.

Apart from this, there are many incoming links from other websites like Polyvore and Pinterest that point to the old paginated pages.

Clearly Google does treat this as duplicate content (I can tell from the drop in the SERPs), and because of the large number of rogue URLs it is not crawling my website as it used to.

To get out of this, I have set up a 301 redirect for all such queries; for example, mywebsite.com/page/35/?page=20 is redirected to mywebsite.com/page/35/. Apart from this, I have also added canonical references to all the pages.
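
A minimal sketch of what such a rule might look like in .htaccess, assuming Apache mod_rewrite and the URL patterns above (the trailing "?" in the substitution is what drops the query string):

# 301 any new-style paginated URL that still carries a stray ?page= query
# back to the clean URL, e.g. /cars/page/35/?page=20 -> /cars/page/35/
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)page=\d+
RewriteRule ^(.+/)?page/(\d+)/$ /$1page/$2/? [R=301,L]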

It has been three months since I did this, yet the number of monitored URLs in GWT is the same, and the SERPs are in the dumps as well. The SERPs are the secondary issue; my primary concern is for Google to remove these pages from its index and cache. I have tried to remove these URLs manually too, but the requests expire.

What should I do? I want Google to recrawl all the indexed pages so it can see that the majority of them have been 301'd.
1:23 pm on May 27, 2013 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



To get out of this, I have set up a 301 redirect for all such queries; for example, mywebsite.com/page/35/?page=20 is redirected to mywebsite.com/page/35/

That can be a problem. Your server needs to return a 404 for any URL that never existed. Sites definitely lose rankings over time when they redirect everything instead of returning a true 404 where it's appropriate. The first sign is often a WMT warning about "soft 404s".
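
To follow that advice, a rule like the sketch above could be switched from a 301 to a true 404 (again just a sketch, assuming Apache mod_rewrite; the R=404 flag serves the status directly and stops rewriting):

# Serve a real 404 for never-existed combinations like /page/35/?page=20
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)page=\d+
RewriteRule ^(.+/)?page/\d+/$ - [R=404,L]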

Other than that, a 301 redirect from the old-style paginated URLs to the new style is exactly what you should do. Kudos for that.
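
For reference, that old-to-new 301 might look roughly like this (a sketch under the same assumptions; in a standard WordPress .htaccess it would need to sit above WordPress's own catch-all rewrite rule):

# Map legacy Movable Type pagination to the WordPress scheme,
# e.g. /cars/index.php?page=35 -> /cars/page/35/
RewriteEngine On
RewriteCond %{QUERY_STRING} ^page=(\d+)$
RewriteRule ^(.+/)?index\.php$ /$1page/%1/? [R=301,L]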
2:06 pm on May 27, 2013 (gmt 0)

10+ Year Member



Well, I did get a soft 404 warning a couple of days back. The problem was that I had not used absolute URLs for pagination. So if a user entered a query (before I set up the 301), e.g. mywebsite.com/page/2/?query=webmasterworld, all the pagination URLs would point to their respective pages with that query appended. Some hackers had actually tagged rogue queries like "Save Us from Berlusconi" and some with random characters. I have solved that by modifying the plugin to output absolute URLs, but not before Google indexed and cached the pages with the rogue queries.
2:15 pm on May 27, 2013 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Nice catch. Still, you do need to fix the issue I pointed out so that "mixed" parameter URLs like mywebsite.com/page/35/?page=20 return a 404 status.
9:06 am on Jul 31, 2013 (gmt 0)



Morpheus,
Have you had any luck with correcting this issue and removing the bad URLs from the Google index?

I am going through a VERY similar problem, but I rely heavily on the SERPs, and just over half of my daily traffic is gone...

Please tell me it gets better!
1:18 am on Oct 23, 2013 (gmt 0)



Morpheus / Coleman123,

Were you able to recover from this issue? I've been dealing with a big G drop for the past few months and had similar findings. G is showing pages with old URLs that haven't existed in nearly 2 years because there is a 301 redirect to the new URL. I would have thought that it would show the new URL, but it doesn't. And Polyvore, just like your example above, is showing many old URLs.

I'm curious to know how you wound up resolving the issue if you were able to get past this.
1:44 am on Oct 23, 2013 (gmt 0)



abk717,

My issue was the result of multiple errors on my part. My programmer and I launched a new version of my website on May 1st, and we forgot to update the robots.txt file for about 24 hours. That in itself wasn't a massive problem, but the new version of the site was set up with updated URLs on about half the site. Among several other issues along the way, on the 3rd or 4th day we had to revert to the old version of the site because the new version was crashing. Quite the learning experience... Oh, by the way, this was all happening right before/during some major algorithm updates from Google, so I was slapped hard.

For your issue:

"G is showing pages with old URLs that haven't existed in nearly 2 years because there is a 301 redirect to the new URL. I would have thought that it would show the new URL, but it doesn't."

Are you saying you set up a 301 redirect two years ago and the URLs updated as planned, but suddenly the old URLs are appearing again?

If that is the case, how did you discover the old URLs still exist? From traffic to your site, or from researching G's index after noticing a drop in traffic?

It could actually be any of several different things. If you can provide some more information, I could possibly be more helpful.
11:39 am on Dec 28, 2013 (gmt 0)

10+ Year Member



I set up my .htaccess to deliver 404 errors for non-existent queries and included canonicals on all the pages. The recovery has happened, but my SERPs are nowhere close to what they were before.
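
For anyone landing here later, a minimal sketch of a rule like that (assuming Apache mod_rewrite and that the paginated URLs take no legitimate query strings at all; anything broader would need a whitelist of the query parameters the site actually uses):

# Any query string tacked onto a paginated URL gets a true 404
RewriteEngine On
RewriteCond %{QUERY_STRING} .
RewriteRule ^(.+/)?page/\d+/$ - [R=404,L]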