GoogleGuy - 4:08 am on Apr 11, 2003 (gmt 0)
In practice, there are lots of reasons that Google might not have the content of a page. There could be a robots.txt file, or the server could have been down, or we might have seen references to that page but not crawled it, or there could have been redirects, meta tags, etc. Personally, I think it's actually one of the strengths of Google that you can do a search like "Colorado virtual library" and we can return something like the first result. It turns out that www.aclin.org forbids all spiders, but Google is still able to pull descriptions from the Open Directory, for example.
We saw a copy of Ben's report a couple of days ago and mentioned that issue as quickly as we could. Many of the high-profile examples (subsites of Apple, IBM, and so on) turn out to be the "didn't have the page" item rather than bad filtering in SafeSearch. Ben has already updated parts of his report and has been very nice in providing us with his data, so I expect we'll work together to find rough edges in SafeSearch and improve it.
One important point to take away from Ben's report is that no filter can be 100% accurate. That's logical, but good to remember. The report also serves as a to-do list of places that we should contact and ask if they really meant to put up that robots.txt file. :)
By the way, I'm a little rusty on robots.txt data, but I just tried colorado virtual library on a few other search engines, and found a couple that show content from crawling the page. Am I right that a couple of engines don't seem to be respecting robots.txt? How come people never write reports about that? :)
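For anyone who is also rusty on robots.txt, here is a minimal sketch of the "forbids all spiders" case, using Python's standard urllib.robotparser. The robots.txt text, the "ExampleBot" user agent, and the example.org URL are all made up for illustration; this is not the actual file from www.aclin.org, and it says nothing about how any search engine implements the check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that forbids all spiders (illustrative only).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and the URL are placeholders, not real crawler names.
    print(may_crawl(ROBOTS_TXT, "ExampleBot", "http://www.example.org/index.html"))
```

A crawler that respects the file would skip the fetch whenever may_crawl returns False, which is one way an engine ends up knowing a URL exists without having its content.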
I liked the report, but there is at least one point to be aware of: SafeSearch doesn't return pages where we don't have the content. Basically, if we weren't able to fetch a page, we can't judge whether that page is safe or not. Since users deliberately have to opt in to the filter, it's pretty fair to assume that if we don't know whether a page will be safe, we shouldn't return it. After all, the user told us that they would rather err on the side of safety by activating SafeSearch.
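The policy described above can be sketched roughly like this. It is only an illustration of the "no content, no result under SafeSearch" rule; the Result type, the looks_safe check, and filter_results are hypothetical names invented for this sketch, and the keyword test is a stand-in, not how SafeSearch actually classifies pages.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    url: str
    content: Optional[str]  # None means the page content was never fetched


def looks_safe(content: str) -> bool:
    # Placeholder classifier (an assumption for this sketch only).
    return "adult" not in content.lower()


def filter_results(results: List[Result], safe_search: bool) -> List[Result]:
    """Apply the opt-in rule: with the filter on, unknown pages are dropped."""
    if not safe_search:
        # Filter is off: URL-only results (no content) may still be shown.
        return list(results)
    # Filter is on: a page we could not fetch cannot be judged, so err on
    # the side of safety and leave it out.
    return [r for r in results if r.content is not None and looks_safe(r.content)]
```

With the filter on, a result whose content is None is excluded even though nothing about it was judged unsafe, which is why a site blocked by robots.txt can look "filtered" in a report.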