
Google crawling unnecessarily

     
9:36 pm on Feb 8, 2012 (gmt 0)

10+ Year Member



I recently allowed Google to start crawling my guest book, whereas previously it was blocked via robots.txt. I've also recently removed the .rss facility from the guest book (for an unrelated reason).

The problem is that Google is now incessantly crawling non-existent .rss pages in the format: /guesbook/rss.php?entry={number}
It is even requesting entry numbers as high as 99999, which never existed in the first place.

How can I stop this from happening, please? Even though Google is being served 404s, the sheer number of them is using considerable bandwidth.

Thanks in advance
5:37 am on Feb 9, 2012 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



You either block the URL pattern or you let Googlebot do its thing. The pattern you describe is not likely to go on for very long, however. How much bandwidth can a bunch of 404 responses use up, after all?
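
If you do decide to block it, a couple of lines in robots.txt would cover the pattern from your post (path assumed from your example):

  User-agent: Googlebot
  Disallow: /guesbook/rss.php

That stops the requests outright; a Disallow matches any URL beginning with that path, query string included.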
7:26 am on Feb 9, 2012 (gmt 0)

10+ Year Member



Fair enough, thanks
10:50 am on Feb 9, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Does the 404 page for a non-existent entry have a "next" link pointing to a higher number?

If it has, Google could go on forever.

The same problem affects online calendars. I once saw a site with a calendar of events spanning less than a decade but where Google had crawled several thousand years into the future and into the past.
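
One defence is to clamp the script to the range of real data, along these lines (a hypothetical PHP sketch; MIN_YEAR and MAX_YEAR would come from your own events table):

  <?php
  // Refuse to render calendar pages outside the real event range, and
  // don't emit "next"/"prev" links at the boundaries, so a crawler
  // can't page its way into year 9999.
  define('MIN_YEAR', 2005);   // illustrative values only
  define('MAX_YEAR', 2014);

  $year = isset($_GET['year']) ? (int) $_GET['year'] : (int) date('Y');

  if ($year < MIN_YEAR || $year > MAX_YEAR) {
      header('HTTP/1.1 404 Not Found');
      exit('No events for that year.');
  }
  // ...render the calendar; print a "next year" link only when $year < MAX_YEAR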
11:06 am on Feb 9, 2012 (gmt 0)

10+ Year Member



No links at all, no. It auto-refreshes to the index after a few seconds but has no links.

Thanks for the thought though.
9:15 pm on Feb 9, 2012 (gmt 0)

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month



I once saw a site with a calendar of events spanning less than a decade but where Google had crawled several thousand years into the future and into the past.

Please! Not while I'm drinking my tea!

There's a scattering of unrelated posts that mention Google testing whether your site ever returns a 404. Apparently one way is to feed in an obviously bogus query.

Would be nice, wouldn't it, if the Crawl Errors report distinguished between your own bona fide intentional links... and the ones they made up to test you.
11:05 pm on Feb 9, 2012 (gmt 0)

5+ Year Member



Any chance it's not actually Googlebot? Run a reverse DNS lookup, or a whois at ARIN, on the IP address just to make sure. The sequential numbering seems a bit suspect given that the links never existed.
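
If you have PHP handy, the full check is a reverse DNS lookup followed by a forward confirmation (a sketch; the IP would come from your access log):

  <?php
  // Verify a claimed Googlebot address: the reverse lookup must land on
  // a googlebot.com or google.com host, and the forward lookup of that
  // host must return the original IP.
  $ip   = '66.249.66.1';                      // example address from a log
  $host = gethostbyaddr($ip);                 // e.g. crawl-66-249-66-1.googlebot.com

  $isGoogle = $host !== false
      && preg_match('/\.(googlebot|google)\.com$/', $host)
      && gethostbyname($host) === $ip;        // forward-confirm

  echo $isGoogle ? "Genuine Googlebot\n" : "Not Google\n";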
11:32 pm on Feb 9, 2012 (gmt 0)



I had the same thing happen to me. I had a script that listed posts by numeric ID. Googlebot started throwing in random numbers and indexing them, even though there were no links leading to them. G indexed 40,000 non-existent URLs. I think it was because the script returned a 200 status code no matter what number you punched in. I feel it was a big part of why I was Pandalized.

I added a few lines to the code to return a 404 if you put in a bogus number. The indexed pages dropped from 40,000 bogus URLs to the 3,000 posts that actually exist.
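
For anyone wanting to do the same, the fix amounts to looking the ID up before rendering anything (a sketch; get_post_by_id() stands in for whatever your own script uses):

  <?php
  // Send a real 404 for entry numbers that don't exist, instead of a
  // 200 page that Google will happily index.
  $id   = isset($_GET['entry']) ? (int) $_GET['entry'] : 0;
  $post = get_post_by_id($id);   // hypothetical lookup; false when no such row

  if ($post === false) {
      header('HTTP/1.1 404 Not Found');
      exit('No such entry.');
  }
  // ...render the post as before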

I'm still Pandalized, but I've only been through one Panda refresh since. I hear that you need two Panda runs to get out of the Panda box. Fingers crossed.
1:14 am on Feb 10, 2012 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



About this time last year I fixed a site with a similar problem and dropped their indexing from ~25,000 bogus URLs to ~75 real pages. It took exactly six months for the WMT reports to catch up with reality.
9:26 am on Feb 10, 2012 (gmt 0)

10+ Year Member



It's definitely Google, yes, and the activity appears to have eased off since I added a line to my robots.txt forbidding crawling of the offending URLs. That doesn't mean it'll stop, of course, but for now at least it's slowed considerably.
I was having a similar issue with profile pages on a forum too, but again that has eased after adding a specific line to robots.txt.
 
