Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Google crawling unnecessarily
cyberdyne




msg:4415628
 9:36 pm on Feb 8, 2012 (gmt 0)

I recently allowed Google to start crawling my guest book whereas previously it was blocked via robots.txt. I've also recently removed the .rss facility from my guest book (unrelated reason).

The problem is that Google is now incessantly crawling non-existent .rss pages in the format: /guesbook/rss.php?entry={number}
It is even crawling numbers as high as 99999 which never even existed in the first place.

How can I stop this from happening, please? Even though Google is being served 404s, the sheer number of requests is using considerable bandwidth.

Thanks in advance

 

tedster




msg:4415727
 5:37 am on Feb 9, 2012 (gmt 0)

You either block the URL pattern or you let Googlebot do its thing. The pattern you describe is not likely to go on very long, however. How much bandwidth can a bunch of 404 responses use up, after all?

cyberdyne




msg:4415743
 7:26 am on Feb 9, 2012 (gmt 0)

Fair enough, thanks

g1smd




msg:4415795
 10:50 am on Feb 9, 2012 (gmt 0)

Does the 404 page for a non-existent entry have a "next" link pointing to a higher number?

If it has, Google could go on forever.

The same problem affects online calendars. I once saw a site with a calendar of events spanning less than a decade but where Google had crawled several thousand years into the future and into the past.
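
To illustrate the calendar case, here is a minimal Python sketch of one way to keep a crawler from walking forever: only emit "previous" and "next" links inside the range that actually has events, and treat anything outside it as not found. The function names `shift_month` and `calendar_nav` and the example dates are hypothetical, not taken from any site discussed here.

```python
from datetime import date


def shift_month(d, delta):
    """Return the first day of the month `delta` months away from `d`."""
    index = d.year * 12 + (d.month - 1) + delta
    return date(index // 12, index % 12 + 1, 1)


def calendar_nav(current, first, last):
    """Return prev/next month links, or None if `current` is out of range.

    `first` and `last` are the first days of the earliest and latest
    months that contain events. A None result means the caller should
    answer with a 404 instead of rendering an empty calendar page.
    """
    if not (first <= current <= last):
        return None
    prev_month = shift_month(current, -1)
    next_month = shift_month(current, 1)
    return {
        # Omit the link entirely once it would leave the event range.
        "prev": prev_month if prev_month >= first else None,
        "next": next_month if next_month <= last else None,
    }
```

With a range of January 2010 to December 2012, the December 2012 page gets no "next" link, so there is nothing for a crawler to follow into 2013 and beyond.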

cyberdyne




msg:4415800
 11:06 am on Feb 9, 2012 (gmt 0)

No links at all, no. It auto-refreshes to the index after a few seconds but has no links.

Thanks for the thought though.

lucy24




msg:4416062
 9:15 pm on Feb 9, 2012 (gmt 0)

I once saw a site with a calendar of events spanning less than a decade but where Google had crawled several thousand years into the future and into the past.

Please! Not while I'm drinking my tea!

There's a scattering of unrelated posts that mention Google testing whether your site ever returns a 404. Apparently one way is to feed in an obviously bogus query.

Would be nice, wouldn't it, if the Crawl Errors distinguished between your own bona fide intentional links... and the ones they made up to test you.

manny123




msg:4416092
 11:05 pm on Feb 9, 2012 (gmt 0)

Any chance it's not actually Googlebot? Run a reverse DNS lookup or a whois at ARIN on the IP address just to make sure. The sequential numbering seems a bit suspect given that the links never existed.
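
The reverse DNS check can be scripted. Below is a rough Python sketch of the usual two-step test: look up the hostname for the IP, check that it sits under a Google crawler domain, then resolve that hostname forward and confirm it maps back to the same IP. The function name `is_verified_googlebot` and the injectable lookup parameters are my own; the suffix list is an assumption, so check it against Google's current crawler documentation before relying on it.

```python
import socket

# Assumed crawler hostname suffixes; verify against Google's documentation.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip,
                          reverse_lookup=socket.gethostbyaddr,
                          forward_lookup=socket.gethostbyname_ex):
    """Return True if `ip` passes a reverse-then-forward DNS check."""
    try:
        # Step 1: reverse lookup, IP address -> hostname.
        hostname = reverse_lookup(ip)[0].lower().rstrip(".")
        if not hostname.endswith(GOOGLE_SUFFIXES):
            return False
        # Step 2: forward lookup, hostname -> list of IP addresses.
        return ip in forward_lookup(hostname)[2]
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

The lookups are passed in as parameters only so the function can be exercised without touching the network; called with just an IP from your logs, it uses the real resolver.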

Pjman




msg:4416097
 11:32 pm on Feb 9, 2012 (gmt 0)

I had the same thing happen to me. I had a script that listed posts by numeric ID. Googlebot started throwing in random numbers and indexing them, even though there were no links leading it there. Google indexed 40,000 non-existent URLs. I think it was because it was getting a 200 response code when you punched any number in there. I feel it was a big part of why I was Pandalized.

I added a few lines to the code to return 404s if you put in a bogus number. The indexed page count dropped from 40,000 bogus URLs to the 3,000 posts that actually exist.

I'm still Pandalized, but I have only been through one Panda refresh since. I hear that you need two Panda runs to get out of the Panda box. Fingers crossed.
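
For anyone wanting to do the same, here is roughly what that kind of fix looks like as a small Python WSGI sketch. This is not the actual script described above: the `entry` parameter name, the `EXISTING_ENTRIES` table and the `app` function are made up for illustration. The point is only that an unknown or malformed ID gets a real 404 status instead of a 200.

```python
from urllib.parse import parse_qs

# Hypothetical stand-in for the real post database.
EXISTING_ENTRIES = {1: "First post", 2: "Second post"}


def app(environ, start_response):
    """Serve an entry by numeric ID, answering 404 for anything bogus."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    raw_id = query.get("entry", [""])[0]

    entry = None
    # Only plain ASCII digits count as an ID; everything else is bogus.
    if raw_id.isascii() and raw_id.isdigit():
        entry = EXISTING_ENTRIES.get(int(raw_id))

    headers = [("Content-Type", "text/plain; charset=utf-8")]
    if entry is None:
        # The important part: a real 404 status, not a 200 "soft" error page.
        start_response("404 Not Found", headers)
        return [b"No such entry\n"]

    start_response("200 OK", headers)
    return [entry.encode("utf-8")]
```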

g1smd




msg:4416115
 1:14 am on Feb 10, 2012 (gmt 0)

About this time last year I fixed a site with a similar problem and dropped their indexing from ~25 000 bogus URLs to ~75 real pages. It took exactly six months for the WMT reports to catch up with reality.

cyberdyne




msg:4416187
 9:26 am on Feb 10, 2012 (gmt 0)

It's definitely Google, yes, and the activity appears to have eased off since I added a line to my robots.txt forbidding crawling of the offending URLs. That doesn't mean it'll stop, of course, but for now at least it has slowed considerably.
Was having a similar issue with profile pages on a forum too but again that has eased by adding a specific line to robots.txt.
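
The exact robots.txt line isn't quoted here, so the following is only a guess at its shape, wrapped in a short Python check using the standard library's `urllib.robotparser`. The `Disallow` rule and the helper name `is_crawl_allowed` are assumptions; the path is the one from the opening post.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: block every URL that starts with the old RSS script path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /guesbook/rss.php
"""


def is_crawl_allowed(path, user_agent="Googlebot"):
    """Check a path against the rules above using simple prefix matching."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)
```

Under these rules `is_crawl_allowed("/guesbook/rss.php?entry=99999")` comes back false, while other guest book pages stay crawlable. Python's parser does plain prefix matching and does not implement Google's wildcard extensions, so treat it as a sanity check rather than a faithful model of Googlebot.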


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved