Long Post Ahead - I'm going to paraphrase a couple of things here.
From Declan's news.com article (linked above):
Google chooses not to include pages that use such files in SafeSearch listings because its crawler can't explore the entire site and thus, the company says, can't be expected to judge the site's content.
It sounds like when Google's crawler ran into robots.txt, it threw the entire site out of SafeSearch (because it could not judge the "entire" site's content).
However, the report presents an even more confusing stance:
To assess the rate at which robots.txt configurations cause SafeSearch omissions, the author obtained robots.txt files from all web servers listed in the results linked above. Approximately 11% of pages listed above are hosted on web servers that use robots.txt to block at least one robot from accessing the URL listed. The author does not know what user-agent is associated with Google's SafeSearch indexing, and this analysis therefore considers a URL affected by robots.txt exclusion if the site blocks any robot from accessing that URL.
It goes on to say:
Update (4/10/03): Google staff report that pages are also excluded from SafeSearch when Google fails to retain a copy of the page in its cache, a result that may occur when Google failed to crawl the page due to low pagerank, unreachable servers, or "noindex" instructions in page meta tags. Google staff suggest that these additional problems cause omission of an additional 13% to 15% of the web pages linked above, so that a total of 24% to 26% of the pages linked above are omitted from SafeSearch not due to affirmative miscategorization by SafeSearch but due to failure of Google to retain a copy of the pages in its cache.
It sounds to me like these guys had a fundamental misunderstanding of how robots.txt actually works, which is surprising, considering the type of survey they were trying to do. Consider this:
First, if a robots.txt file instructs web bots not to visit a site, it is unclear how Google came to index that site in the first place. Google's documentation indicates that the company supports and abides by robots.txt, so if a site uses a robots.txt file to exclude systems like Google's, it is unclear why the site would nonetheless remain listed in Google. Additional research is necessary to clarify this point, and the author anticipates a subsequent article on this topic exclusively.
It just sounds to me like these guys are trying to prove something we already know, and their research wasn't very thorough. I've found the sites that had robots.txt, and I'll examine them to see what the real deal is.
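For anyone unclear on why the report's "blocks any robot" criterion is bogus: robots.txt rules are scoped per user-agent, so a site can ban one bot while leaving Googlebot completely unrestricted. Here's a minimal sketch using Python's standard urllib.robotparser (the robots.txt content and the "BadBot" name are made up for illustration):

```python
import urllib.robotparser

# Hypothetical robots.txt: bans one specific bot entirely,
# but only keeps everyone else out of /private/.
robots_txt = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# BadBot is blocked from the whole site...
print(rp.can_fetch("BadBot", "/page.html"))          # False

# ...but Googlebot falls under the "*" rules, so it can
# still crawl everything except /private/.
print(rp.can_fetch("Googlebot", "/page.html"))       # True
print(rp.can_fetch("Googlebot", "/private/x.html"))  # False
```

So counting a URL as "affected by robots.txt exclusion" whenever *any* robot is blocked will sweep in plenty of pages that Google's own crawler was perfectly free to fetch.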