| 5:37 am on Feb 9, 2012 (gmt 0)|
You either block the URL pattern or you let googlebot do its thing. The pattern you describe is not likely to go on for very long, however. How much bandwidth can a bunch of 404 responses use up, after all?
| 7:26 am on Feb 9, 2012 (gmt 0)|
Fair enough, thanks
| 10:50 am on Feb 9, 2012 (gmt 0)|
Does the 404 page for a non-existent entry have a "next" link pointing to a higher number?
If it has, Google could go on forever.
The same problem affects online calendars. I once saw a site with a calendar of events spanning less than a decade, but where Google had crawled several thousand years into the future and into the past.
| 11:06 am on Feb 9, 2012 (gmt 0)|
No links at all, no. It auto-refreshes to the index after a few seconds but has no links.
Thanks for the thought though.
| 9:15 pm on Feb 9, 2012 (gmt 0)|
|I once saw a site with a calendar of events spanning less than a decade but where Google had crawled several thousand years into the future and into the past. |
Please! Not while I'm drinking my tea!
There's a scattering of unrelated posts that mention Google testing whether your site ever returns a 404. Apparently one way is to feed in an obviously bogus query.
Would be nice, wouldn't it, if the Crawl Errors distinguished between your own bona fide intentional links... and the ones they made up to test you.
| 11:05 pm on Feb 9, 2012 (gmt 0)|
Any chance it's not actually a Google bot? Run a reverse DNS lookup, or a whois at ARIN on the IP address, just to make sure. The sequential numbering seems a bit suspect given that the links never existed.
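For anyone who wants to script the check: Google's recommended test is reverse DNS on the IP, confirm the hostname is under googlebot.com or google.com, then forward-resolve the hostname and make sure it round-trips to the same IP. A minimal Python sketch (the hostname-check helper is mine, not an official API):

```python
import socket

# Domains Google's crawlers resolve under
GOOGLE_DOMAINS = (".googlebot.com", ".google.com")

def looks_like_google_host(hostname):
    """True if the PTR hostname falls under a Google crawl domain."""
    return hostname.rstrip(".").endswith(GOOGLE_DOMAINS)

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the domain, then forward-confirm.

    Makes live DNS queries, so treat this as a sketch rather than
    something to run inline on every request.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)        # reverse DNS (PTR)
        if not looks_like_google_host(hostname):
            return False
        _, _, addrs = socket.gethostbyname_ex(hostname)  # forward DNS
        return ip in addrs                               # must round-trip
    except (socket.herror, socket.gaierror):
        return False
```

The forward-confirmation step matters because anyone can point a PTR record at a hostname ending in googlebot.com; only Google can make the forward lookup resolve back to the crawler's IP.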
| 11:32 pm on Feb 9, 2012 (gmt 0)|
I had the same thing happen to me. I had a script that listed posts by numeric ID. Googlebot started throwing in random numbers and indexing them, even though there were no links leading there. G indexed 40,000 non-existent URLs. I think it was because the script returned a 200 status code no matter what number you punched in. I suspect it was a big part of why I was Pandalized.
I added a few lines to the code to return a 404 if you put in a bogus number. The indexed pages dropped from 40,000 to the roughly 3,000 posts that actually exist.
I'm still Pandalized, but I've only been through one Panda refresh since. I hear you need two Panda runs to get out of the Panda box. Fingers crossed.
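The fix described above amounts to validating the ID before committing to a 200. A minimal sketch of the idea in Python (the function and the in-memory ID set are hypothetical stand-ins for whatever the real script checks, typically a database lookup):

```python
def status_for_post(post_id, existing_ids):
    """Return the HTTP status code to serve for a numeric post ID.

    existing_ids stands in for whatever source of truth the app
    has for valid posts -- in practice a database query, not a set.
    """
    if post_id in existing_ids:
        return 200  # real post: serve the page
    return 404      # bogus number: tell crawlers it doesn't exist
```

The key point is that the 404 has to be a real 404 status code, not a "not found" page served with a 200, or Google will keep indexing the junk URLs.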
| 1:14 am on Feb 10, 2012 (gmt 0)|
About this time last year I fixed a site with a similar problem and dropped their indexing from ~25,000 bogus URLs to ~75 real pages. It took exactly six months for the WMT reports to catch up with reality.
| 9:26 am on Feb 10, 2012 (gmt 0)|
It's definitely Google, yes, and the activity appears to have eased off since I added a line to my robots.txt forbidding crawling of the offending URLs. That doesn't mean it'll stop, of course, but for now at least it's slowed considerably.
I was having a similar issue with profile pages on a forum too, but again that has eased since I added a specific line to robots.txt.
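For anyone wanting to do the same, the robots.txt lines would look something like this (the paths here are made-up examples, not the actual site's URLs):

```
User-agent: *
Disallow: /entry.php?id=
Disallow: /profile/
```

Worth remembering that Disallow matches by URL prefix, and that robots.txt stops crawling but doesn't guarantee already-indexed URLs drop out of the index.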