Oh yeah, I know about this!
In one section of my site, I have some URLs in this format site.com/Subject109999.
Database driven, of course. Now, that database tops out at, say, 101000. 101001 does not exist. In fact, a long time ago, we wrote our site so that if you did hit 101001, or any other number, it would just go to the home page: it had nothing to show.
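In hindsight, this is the crux: redirecting unknown IDs to the home page told googlebot every guessed URL was a live page. A minimal sketch of the alternative (the constant and handler name are my own illustration, not the actual site code):

```python
# Hypothetical sketch: answer out-of-range IDs with a hard 404 instead of
# sending them to the home page, so crawlers have nothing to index there.
# MAX_SUBJECT_ID is illustrative - set it to wherever your database tops out.

MAX_SUBJECT_ID = 1000

def respond_to_subject(subject_id: int):
    """Return (status, body) for a site.com/SubjectNNNNNN request."""
    if 1 <= subject_id <= MAX_SUBJECT_ID:
        return (200, f"render page for Subject{subject_id}")
    # A 404 (or 410 Gone) tells the crawler there is nothing here;
    # a redirect to the home page invites it to keep guessing.
    return (404, "Not Found")
```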
Well, in September 2009, googlebot started guessing on numbers, and it grabbed about 40,000 (yes, 40,000) different URLs in this sequence.
They don't exist; our database has nothing to serve for them. And no one would link to those 40,000 URLs. It was the result of googlebot drawing some assumptions, and going fishing.
The end result was, I got 40,000 bad URLs indexed in the SERPs on this one section of my site. Oh, and googlebot did them sequentially, so, they go from 1000 to 41,000, then stop. Probably that is as far as it got until I stopped it with a noindex.
The crappy thing is, they are tough to get out. Oh, I figured this out back in October, and immediately did the obligatory noindex. No problem. However, they are not coming out of the SERPs very fast.
So, I am deleting the URLs by hand using the URL Removal tool. Problem is, I cannot delete the whole directory, because I have to safeguard the first 1000 good URLs, so I focus on deleting all the ones above 1000. Thank god I found a Greasemonkey function that preps them 100 at a time for me. Oh, and I have also discovered that the URL Removal Tool craps out around 1000 per day. I do one batch of 100 every hour or so, and do as many batches as I can during the day. I still have about 26,000 to dump out, which will take me another 3-4 weeks.
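The 3-4 week figure is just arithmetic on those observed limits (a quick sketch; the daily cap is only what the tool seemed to tolerate, not a documented quota):

```python
# Back-of-the-envelope check on the cleanup timeline described above.
# Numbers are from the post; the ~1,000/day cap is observed, not official.

remaining = 26_000        # URLs still to remove
batch_size = 100          # what the Greasemonkey script preps at a time
daily_cap = 1_000         # where the URL Removal Tool seems to stall

batches_per_day = daily_cap // batch_size   # 10 batches a day
days_needed = remaining / daily_cap         # 26 days
weeks_needed = days_needed / 7              # just under 4 weeks
```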
Oh, and to add some comedy to the whole situation, google also sent me an email complaining about how I have too many URLs. Thanks.
I had another similar situation with 5000 duplicates of my home page from September too. My home page disappeared from the SERPs. Gone! I chased it with httpd.conf, and then noindex. Waited 3 weeks. They were coming out at a rate of about 100 a week. That meant about a year until they'd all be gone. I finally said, scr-w this, used the URL Removal Tool to dump the directory, they were all gone in 6 hours or so, and 2 days later, my home page was back with my 8 sitelinks.
Ah, and one more thing in this tale: I can search for a specific string from the 40,000+ pages, and google reports 1 result. i.e., the string search says the 40,000 don't exist. However, about once a week, it reveals more - it will give me a list of 10-50 that do exist. And sometimes it also reveals the real number: it will report results 1-10 of around 30,000, but still only display about 10-50. So, don't trust what google reports - they are hidden!
These 2 experiences have really adjusted my impression of numbers reported in google serps, and how googlebot works. Don't trust either. Assume nothing.
Even right now, I just checked, and it reports 1 result on those 26,000 remaining. I know it is still around 26,000 because I have another tool which checked each URL, one by one, to get the real count and see if it is indexed or not. Checking URL by URL, rather than by a string, reveals that there are 26,000 that still exist in the SERPs.
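The URL-by-URL approach boils down to something like this (names and the stand-in check are illustrative, not my actual tool; a real version would query Google for each exact URL rather than for a string from the page):

```python
# Sketch of per-URL index counting: one lookup per exact URL, instead of
# trusting the result count a string search reports. check() is a stub.

def count_indexed(urls, check):
    """Count URLs that an individual per-URL check says are indexed."""
    return sum(1 for url in urls if check(url))

# Demo with a pretend index; swap the lambda for a real per-URL lookup.
candidate_urls = [f"http://site.com/Subject{i}" for i in range(1001, 1101)]
pretend_index = set(candidate_urls[:26])  # pretend 26 of 100 are indexed

count = count_indexed(candidate_urls, lambda u: u in pretend_index)
```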
So that's my whole story, showing how I know googlebot is doing some fuzzy logic on the back end, surmising which URLs might exist based on patterns it thinks it has detected - even when those URLs don't exist and there are no links to them. So make sure your site is airtight and all exceptions are noindexed, and do not trust string searches - search for specific URLs.