The data I gathered from that crawl is very interesting to me. Ok, I'm an seo wonk and anything related is interesting to me.
I've not collated the last batch of data yet. It takes a very long time to study that many files. The crawl is all done now and I ended up with 180-210k robots.txt files (not counted yet). Those are just the semi-valid ones. Of those I have collated data on, here are some highlights:
About half of the robots.txt files are in MS-DOS format (they should use Unix line endings).
About 60% of all requests for robots.txt ended up as redirects to an html page. This is not a good server configuration to have. SE's do have to deal with it, but it is really bad style.
About 6% of all robots.txt are not valid and many search engines will ignore them.
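For anyone who wants to check their own file for the line ending problem, here is a minimal sketch in Python. It is not the tool used for the crawl; the function name line_ending_style is made up for illustration, and it only looks at CRLF versus bare LF (old Mac-style bare CR is not counted).

```python
def line_ending_style(raw: bytes) -> str:
    """Classify a robots.txt body as 'dos', 'unix', 'mixed', or 'none'.

    Hypothetical helper for illustration only.
    """
    crlf = raw.count(b"\r\n")          # MS-DOS style line endings
    lf_only = raw.count(b"\n") - crlf  # Unix style: LF not preceded by CR
    if crlf and lf_only:
        return "mixed"
    if crlf:
        return "dos"
    if lf_only:
        return "unix"
    return "none"  # no LF at all (empty file, single line, or bare CR)
```

Read the file in binary mode (open(path, "rb")) before passing it in, otherwise Python may translate the line endings for you and hide the problem.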
Common fatal and near fatal errors: These are errors that would give a spider cause for concern about the validity of the file:
Multiple disallows per line. Only one disallow is acceptable per line - you can't combine disallows.
Using ALLOW. There is no ALLOW tag.
Wild gyrations in formatting. From structured formatting with spaces at the beginning of the line, to attempts at multi-line comments, there are so many variations that it makes me wonder what se's do with them.
Size. There were hundreds of robots.txt that were near or over a megabyte in size. I simply can't imagine a search engine using that file as valid. When you consider the overhead involved in parsing a file that size, some se boxes would literally run out of memory. I don't know if there is a "safe size", but a meg is in the questionable range. It also represents bad server/directory setup. e.g.: Ban whole directories, not 10k files IN the directory.
Doc format. Yes, we ran into 50+ robots.txt that were in Microsoft Word format. No kidding - loaded some of them up in Word, and there was a very pretty looking robots.txt.
HTTP redirects. Ran into many robots.txt that were valid, but they were parked under an HTTP redirect. Questionable if the se's would think of that as valid. (ex: foo.com/robots.txt redirected to foo.com/bar/robots.txt or foo.com/robots2.txt)
Bogus txt files: hit a huge server farm that was loading robots.txt with keyword lists...why? who knows.
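Several of the errors above can be caught with a quick lint pass. The following is a rough sketch only, not what any search engine actually does. The name lint_robots and the MAX_BYTES threshold are made up for illustration (there is no known "safe size"), and the Allow check simply follows the statement above that there is no ALLOW tag. The Word check looks for the OLE2 header that legacy .doc files start with.

```python
MAX_BYTES = 100_000  # hypothetical threshold; no "safe size" is known
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # start of a legacy .doc file


def lint_robots(raw: bytes) -> list[str]:
    """Return a list of human-readable problems found in a robots.txt body."""
    if raw.startswith(OLE2_MAGIC):
        return ["binary Word document, not plain text"]
    problems = []
    if len(raw) > MAX_BYTES:
        problems.append(f"file is {len(raw)} bytes")
    for num, raw_line in enumerate(raw.splitlines(), start=1):
        line = raw_line.decode("latin-1")  # never fails on arbitrary bytes
        body = line.split("#", 1)[0]       # drop trailing comments
        if not body.strip():
            continue                       # blank or comment-only line
        if body[0] in " \t":
            problems.append(f"line {num}: leading whitespace")
        field, sep, value = body.strip().partition(":")
        field = field.strip().lower()
        if not sep:
            problems.append(f"line {num}: no 'field: value' pair")
        elif field == "allow":
            problems.append(f"line {num}: Allow is not a recognized tag")
        elif field == "disallow" and len(value.split()) > 1:
            problems.append(f"line {num}: multiple paths on one Disallow line")
    return problems
```

A line such as "Disallow: /cgi-bin/ /tmp/" gets flagged as multiple paths; the fix is one Disallow line per path.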
We identified over 200 "server farms" or "domain farms" simply by the identical nature of their robots.txt. (keep that in mind cloakers). The largest we found was a robots.txt duplicated on over 800 domains.
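One way to spot farms like that is to hash every robots.txt and group the domains that share a hash. This is a sketch under my own assumptions, not necessarily how the crawl data was actually processed: it treats "identical" as byte-for-byte identical, and the name find_farms and the min_domains parameter are made up for illustration.

```python
import hashlib
from collections import defaultdict


def find_farms(files: dict[str, bytes], min_domains: int = 2) -> list[list[str]]:
    """Group domains whose robots.txt bodies are byte-for-byte identical.

    files maps a domain name to the raw robots.txt body fetched from it.
    Returns groups of at least min_domains domains, largest group first.
    """
    groups = defaultdict(list)
    for domain, body in files.items():
        digest = hashlib.sha1(body).hexdigest()  # fingerprint of the file
        groups[digest].append(domain)
    farms = [sorted(names) for names in groups.values() if len(names) >= min_domains]
    return sorted(farms, key=len, reverse=True)
```

In practice you would probably want to skip the trivial files first (empty, or just "User-agent: *" with an empty Disallow), since plenty of unrelated sites share those.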
Early Phase One: