
Google SEO News and Discussion Forum

    
Google crawling thousands of non-existent pages on my website
morpheus83

10+ Year Member



 
Msg#: 4578166 posted 10:53 am on May 27, 2013 (gmt 0)

Ours is a fairly popular blog covering lifestyle and fashion. Our blog was running on Movable Type (a CMS), which we then moved to WordPress in November 2012 with a brand new design.

I believe the problem of duplicate content started then. It's a bit lengthy, so please take the time to go through it.
Movable Type paginates by adding a query to the index page. So, for example, it would be mywebsite.com/index.php?page=2 and mywebsite.com/cars/index.php?page=35. We have close to 14,000 posts, so you can imagine the number of paginated pages across the categories and the home page. WordPress, on the other hand, paginates in a different way: www.mywebsite.com/page/2/ and www.mywebsite.com/cars/page/35/.
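
To make the two pagination schemes concrete, here is a minimal Python sketch of the mapping from the old query-style URL to the new path-style URL. The function name `old_to_new_pagination` is hypothetical, and the rule that the old URLs always end in `/index.php` with a single numeric `page` parameter is an assumption taken from the two examples above, not a description of how either CMS is implemented.

```python
import re
from typing import Optional
from urllib.parse import urlsplit, parse_qs


def old_to_new_pagination(url: str) -> Optional[str]:
    """Map an old-style paginated URL (…/index.php?page=N) to the new
    path style (…/page/N/). Returns None if the URL is not old-style."""
    # Allow scheme-less input such as "mywebsite.com/index.php?page=2".
    parts = urlsplit(url if "//" in url else "//" + url)
    if not parts.path.endswith("/index.php"):
        return None
    query = parse_qs(parts.query, keep_blank_values=True)
    # Assumption: a genuine old URL carries exactly one parameter, page=N.
    if set(query) != {"page"} or len(query["page"]) != 1:
        return None
    page = query["page"][0]
    if not re.fullmatch(r"[0-9]+", page) or int(page) < 1:
        return None
    base = parts.path[: -len("index.php")]  # keeps the trailing slash
    return f"{parts.netloc}{base}page/{int(page)}/"


print(old_to_new_pagination("mywebsite.com/cars/index.php?page=35"))
```

A "mixed" URL such as www.mywebsite.com/page/35/?page=20 does not match the old pattern, so the sketch returns None for it rather than guessing a target.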

Since we had been using Movable Type for close to 7 years, the old paginated URLs are stored in Google's cache. So now Google is mixing both styles when it crawls. For example, it would crawl www.mywebsite.com/page/2?page=2 and www.mywebsite.com/page/35/?page=20 and so forth. The combinations and permutations of these run into hundreds of thousands. In Google Webmaster Tools for this website, the URLs monitored section shows close to 110,000 URLs monitored by Google for the variable page alone.

Apart from this, there are many incoming links from other websites like Polyvore and Pinterest, which have linked to the old paginated pages.

Clearly Google does treat this as duplicate content (I can tell from the drop in SERPs), and because of the large number of rogue URLs it is not crawling my website as it used to.

To get out of this, I have set up a 301 redirect for all queries. So, for example, mywebsite.com/page/35/?page=20 will be redirected to mywebsite.com/page/35/. Apart from this, I have also added canonical references to all the pages.

It has been three months since I did this, yet the number of URLs monitored in GWT is the same and the SERPs are in the dumps as well. The SERPs are the secondary issue; my primary concern is for Google to remove these pages from its index and cache. I have tried to remove these links manually too, but the requests expire.

What is it that I should do? I want Google to crawl all the indexed pages once again so that it learns the majority of them have been 301'd.

 

tedster

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 4578166 posted 1:23 pm on May 27, 2013 (gmt 0)

"To get out of this what I have done is set up a 301 redirect for all queries. So for eg - mywebsite.com/page/35/?page=20 will be redirected to mywebsite.com/page/35/"

That can be a problem. Your server needs to return a 404 for any URL that never existed. Sites definitely lose rankings over time when they redirect everything instead of returning a true 404 where it's appropriate. The first sign is often a WMT warning about "soft 404s".

Other than that, a 301 redirect from the old style paginated URL to the new style is exactly what you should do. Kudos for that.
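
As a rough illustration of the rule described in this post, the Python sketch below decides which status a URL should get: 301 for a genuine old-style paginated URL, 404 for any other URL carrying a query string, and 200 otherwise. The function name `expected_status` is hypothetical, the pattern for an old-style URL is an assumption based on the examples in this thread, and the sketch does not check whether a query-free page really exists on the site.

```python
import re
from urllib.parse import urlsplit, parse_qs


def expected_status(url: str) -> int:
    """Return the HTTP status the server should send for this URL."""
    # Allow scheme-less input such as "mywebsite.com/page/35/?page=20".
    parts = urlsplit(url if "//" in url else "//" + url)
    if not parts.query:
        # Assumption: a URL without a query is a real page on the new site.
        return 200
    query = parse_qs(parts.query, keep_blank_values=True)
    is_old_style = (
        parts.path.endswith("/index.php")
        and set(query) == {"page"}
        and len(query["page"]) == 1
        and re.fullmatch(r"[0-9]+", query["page"][0]) is not None
    )
    # Old-style pagination once existed, so it earns a redirect.
    # Anything else with a query never existed, so it gets a true 404.
    return 301 if is_old_style else 404


for u in (
    "mywebsite.com/cars/index.php?page=35",
    "mywebsite.com/page/35/?page=20",
    "www.mywebsite.com/page/2/",
):
    print(u, expected_status(u))
```

The point of separating the two cases is that only URLs that once served real content are redirected, so the server never answers a made-up URL with a redirect to a valid page.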

morpheus83

10+ Year Member



 
Msg#: 4578166 posted 2:06 pm on May 27, 2013 (gmt 0)

Well, I got a soft 404 warning a couple of days back. The problem was that I had not used absolute URLs for pagination. So if a user entered a query (before I set up the 301) such as mywebsite.com/page/2/query=webmasterworld, all the pagination URLs would point to their respective pages with that query attached. Some hackers had actually tagged on rogue queries like Save Us from Berlusconi, and some with random characters. I have solved that by modifying the plugin to output absolute URLs, but not before Google had indexed and cached the pages with the rogue queries.
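
The fix described here, building pagination links as absolute URLs so that nothing from the incoming request leaks into them, might look roughly like the Python sketch below. `SITE_ROOT`, the `http` scheme, and the function name `pagination_url` are assumptions for illustration only; the actual plugin change is not shown in this thread.

```python
# Assumption: the site's canonical root, including scheme and www host.
SITE_ROOT = "http://www.mywebsite.com"


def pagination_url(section: str, page: int) -> str:
    """Build an absolute pagination URL from fixed parts only.

    The link is assembled from the site root, the section name, and the
    page number. It never reads the current request URL, so a rogue
    query string on the incoming request cannot be copied into it.
    """
    if page < 1:
        raise ValueError("page must be 1 or greater")
    section = section.strip("/")
    prefix = f"/{section}" if section else ""
    return f"{SITE_ROOT}{prefix}/page/{page}/"


print(pagination_url("", 2))
print(pagination_url("cars", 35))
```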

tedster

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 4578166 posted 2:15 pm on May 27, 2013 (gmt 0)

Nice catch. Still, you do need to fix the issue I pointed out so that "mixed" parameter URLs like mywebsite.com/page/35/?page=20 return a 404 status.

Coleman123



 
Msg#: 4578166 posted 9:06 am on Jul 31, 2013 (gmt 0)

Morpheus,
Have you had any luck with correcting this issue and removing the bad URLs from the Google index?

I am going through a VERY similar problem but rely heavily on SERPS and just over half of my daily traffic is gone...

Please tell me it gets better!

abk717



 
Msg#: 4578166 posted 1:18 am on Oct 23, 2013 (gmt 0)

Morpheus / Coleman123,

Were you able to recover from this issue? I've been dealing with a big G drop for the past few months and had similar findings. G is showing pages with old URLs that haven't existed in nearly 2 years because there is a 301 redirect to the new URL. I would have thought that it would show the new URL, but it doesn't. And Polyvore, just like in your example above, is showing many old URLs.

I'm curious to know how you wound up resolving the issue if you were able to get past this.

Coleman123



 
Msg#: 4578166 posted 1:44 am on Oct 23, 2013 (gmt 0)

abk717,

My issue was a result of multiple errors on my part. My programmer and I launched a new version of my website on May 1st and we forgot to update the robots.txt file for about 24 hours. That in itself wasn't a massive problem, but the new version of the site was set up with updated URLs on about half the site. Among several other issues along the way, during the 3rd or 4th day we had to revert to the old version of the site because the new version was crashing. Quite the learning experience... Oh, by the way, this was all happening right before/during some major algorithm updates from Google, so I was slapped hard.

For your issue:

"G is showing pages with old URLs that haven't existed in nearly 2 years because there is a 301 redirect to the new URL. I would have thought that it would show the new URL, but it doesn't."

Are you saying you set up a 301 redirect two years ago and the URLs updated as planned, but suddenly the old URLs are appearing again?

If that is the case, how did you discover the old URLs still exist? From traffic to your site or researching G's index after noticing a drop in traffic?

It could actually be several different reasons. If you can provide some more information, I could possibly be more helpful.

morpheus83

10+ Year Member



 
Msg#: 4578166 posted 11:39 am on Dec 28, 2013 (gmt 0)

I set up my .htaccess to deliver 404 errors for non-existent queries and included a canonical on all the pages. The recovery has happened, but my SERPs are nowhere close to what they were before.
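
For readers wondering what "a canonical on all the pages" amounts to, here is a minimal Python sketch that builds the canonical link element for a page by dropping any query string and fragment from the requested URL. The function name `canonical_link_tag` and the `http` scheme in the example are assumptions; the poster's actual template or plugin code is not shown in this thread.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(requested_url: str) -> str:
    """Return a <link rel="canonical"> element for the requested page.

    The canonical URL keeps scheme, host, and path, and discards the
    query string and fragment.
    """
    parts = urlsplit(requested_url)
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return f'<link rel="canonical" href="{escape(clean, quote=True)}" />'


print(canonical_link_tag("http://www.mywebsite.com/cars/page/35/"))
```

The element belongs in the head of each real page, so that whichever variant of a URL was fetched, the page names its clean, query-free address.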

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved