MikeNoLastName - 11:46 pm on Jul 22, 2012 (gmt 0)
Didn't I hear somewhere recently that G is now scanning non-linking text on a page for possible new sites to index?
I've also just noticed that G started indexing some random files in our disallowed directories. They're in cgi-bin and show up in supplemental results when doing site:example.com. We've had /cgi-bin disallowed for all robots for many years, AND I made sure nothing there is included in the sitemap.xml, yet I can still see at least 3 files indexed with just the URL and no description: 2 .htm files and one called x.out. Can't put noindex in them because they are not actually complete .htm files. They're used as template includes for other pages, and x.out is just a plain old database!
I did a Fetch as Googlebot in WMT and they all came back as "Denied by robots.txt", but you can't submit them to the index in that case, so hopefully G will get a clue anyway.
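For anyone who wants to double-check the robots.txt side without WMT, here is a rough Python sketch using the standard library's robots.txt parser. The robots.txt contents and the `is_blocked` helper are my own assumptions for illustration (the real file isn't pasted in this post), so adjust the rules to match yours.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt contents; the actual file is not shown in this post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_blocked(url, user_agent="Googlebot", robots_txt=ROBOTS_TXT):
    """Return True if robots_txt forbids user_agent from fetching url.

    Hypothetical helper: it only evaluates the rules locally and says
    nothing about whether the URL is actually in the index.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # x.out is the file named above; index.htm is a made-up control URL.
    for url in ("http://example.com/cgi-bin/x.out",
                "http://example.com/index.htm"):
        print(url, "blocked" if is_blocked(url) else "allowed")
```

This only confirms the crawl rule is doing what you expect. As far as I understand it, a Disallow stops crawling but not necessarily indexing of a URL that G learned about elsewhere, which would fit with the URL-only listings with no descriptions.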