Google's adult filter is widely regarded as one of the most lax in the search engine business. Pages that are perfectly acceptable in Google are routinely blocked by AskJeeves, Inktomi, and Altavista. If anything, Google is too lax in allowing some pages to remain visible while the adult filter is turned on.
Just last week some fellow webmasters were pointing out potential adult content showing up under innocuous keywords.
From the report:
Google might also inform webmasters as to steps they can take to assist with the proper categorization of their content.
Simply not possible. It would only give those who would subvert the system clues as to how to do it. One should remember the historical incidents where standard searches on Altavista revealed adult content. That was done with full knowledge of the system.
Just two weeks ago I was on the phone with a service that provides paid inclusion to search engines. The issue at hand was that a few pages were being rejected for adult content. Those same pages walked right into the Google index without a problem. It is clear to most in the SEM business that Google's adult filter is not only the best available, it may even be too lax at times.
But it makes for great reports, and gives college kids with too much leisure time something to do.
Harvard Disclosure?
It should be noted that Harvard has several graduates who are very high profile in this business and would stand to benefit from such a report. AskJeeves has Harvard alumni on its staff in mission-critical positions. I do feel Harvard would be above such a conflict of interest, but not to note that fact in the report is an oversight.
WebmasterWorld Disclosure:
Matt Cutts, author of Google's SafeSearch, will be a featured speaker at WebmasterWorld's marketing conference [webmasterworld.com] in Boston in two weeks. Additionally, Paul Gardi, a Harvard alumnus and Senior Vice President of Search for AskJeeves/Teoma, will also be speaking.
A search engine must present a default index that is appropriate for a G or PG-13 audience. The only thing the Harvard report does is point out the difficulty in doing that from a machine-based system. The only way a search engine could do better than they are now doing is with a massive team of editors who viewed pages by hand.
We run into the same **** problem here while trying to write machine code that can analyze and filter messages for content. It is often too aggressive at performing the ****'ing task.
From Declan's news.com article (linked above):
Google chooses not to include pages that use such files in SafeSearch listings because its crawler can't explore the entire site and thus, the company says, can't be expected to judge the site's content.
It sounds like Google stumbled upon robots.txt, and threw out the entire site from SafeSearch (because it could not judge the "entire" site's content).
However, the report presents an even more confusing stance:
The report is here: [cyber.law.harvard.edu...]
From the actual report:
To assess the rate at which robots.txt configurations cause SafeSearch omissions, the author obtained robots.txt files from all web servers listed in the results linked above. Approximately 11% of pages listed above are hosted on web servers that use robots.txt to block at least one robot from accessing the URL listed. The author does not know what user-agent is associated with Google's SafeSearch indexing, and this analysis therefore considers a URL affected by robots.txt exclusion if the site blocks any robot from accessing that URL.
It goes on to say:
Update (4/10/03): Google staff report that pages are also excluded from SafeSearch when Google fails to retain a copy of the page in its cache, a result that may occur when Google failed to crawl the page due to low pagerank, unreachable servers, or "noindex" instructions in page meta tags. Google staff suggest that these additional problems cause omission of an additional 13% to 15% of the web pages linked above, causing a total of 24% to 26% of the pages linked above are omitted from SafeSearch not due to affirmative miscategorization by SafeSearch but due to failure of Google to retain a copy of the pages in its cache.
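For reference, the report's "any robot" rule (a URL counts as affected if the site blocks any robot from it) can be sketched roughly like this with Python's standard `urllib.robotparser`. The helper names, the sample robots.txt, and the example.com URLs are made up for illustration; the report doesn't say how the author actually ran the check.

```python
from urllib.robotparser import RobotFileParser


def listed_user_agents(robots_txt):
    """Collect every User-agent token named in a robots.txt body."""
    agents = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "user-agent" and value.strip():
            agents.append(value.strip())
    return agents


def blocked_for_any_robot(robots_txt, url):
    """True if at least one robot named in the file may not fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return any(not parser.can_fetch(agent, url)
               for agent in listed_user_agents(robots_txt))


# Hypothetical file: one named bot is kept out of /private/, everyone else is allowed.
SAMPLE = """\
User-agent: SomeBot
Disallow: /private/

User-agent: *
Disallow:
"""

print(blocked_for_any_robot(SAMPLE, "http://example.com/private/page.html"))  # True
print(blocked_for_any_robot(SAMPLE, "http://example.com/public/page.html"))   # False
```

Note how broad the rule is: the first URL counts as "affected" even though only one obscure bot is blocked and every other crawler is welcome.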
It sounds to me like these guys had a fundamental misunderstanding of how robots.txt actually works, which is surprising, considering the type of survey they were trying to do. Consider this:
First, if a robots.txt file instructs web bots not to visit a site, it is unclear how Google came to index that site in the first place. Google's documentation indicates that the company supports and abides by robots.txt, so if a site uses a robots.txt file to exclude systems like Google's, it is unclear why the site would nonetheless remain listed in Google. Additional research is necessary to clarify this point, and the author anticipates a subsequent article on this topic exclusively.
It just sounds to me like these guys are trying to prove something that we already know, and they really didn't do very thorough research. I've found the sites that had robots.txt, and I'll examine them to see what the real deal is.
I checked out the robots.txt files on all of the sites concerned. There are 40 sites that the report specified to be problematic due to robots.txt. The breakdown:
30 of the sites listed prevented spidering of the specific site content that was being tested, either referencing the page/directory directly in the robots.txt file, or doing a site-wide disallow in robots.txt. The pages, while not in SafeSearch, WERE LISTED IN THE MAIN INDEX. Also, usually another page that should not have been listed was listed under the offending page in the main index as an indented entry.
9 of the sites listed prevented spidering of other resources on the site, but didn't specifically prohibit the page that didn't show up under SafeSearch. We can account for these, due to Google's explanation above.
1 of the sites had a robots.txt that seems to be dynamically generated based on the time of the day. The comment in the file says "Daytime instructions for search engines. Do not visit [sitename] during the day!" That seems like an incredibly stupid idea to me (who knows when the spider comes around?), but anyways...
I can only be led to believe that for some reason, on those 30 sites where the content is being specifically prohibited from being accessed by spiders via robots.txt, and still showing up in the main index, robots.txt is being partially ignored.
[Google: If interested, I do have the list.]
Does Google always respect robots.txt? If not, why not?
In practice, there's lots of reasons that Google might not have the content of the page. There could be a robots.txt file, or the server could have been down, or we might have seen references to that page but not crawled it, or there could have been redirects, meta tags, etc. Personally, I think it's actually one of the strengths of Google that you can do a search like "Colorado virtual library" and we can return something like the first result. It turns out that www.aclin.org forbids all spiders, but Google is still able to pull descriptions from the Open Directory, for example.
We saw a copy of Ben's report a couple days ago and mentioned that issue as quickly as we could. Many of the high profile examples (subsites of Apple, IBM, and so on) turn out to be the "didn't have the page" item rather than bad filtering in SafeSearch. Ben has already updated parts of his report and has been very nice in providing us with his data, so I expect we'll work together to find rough edges in SafeSearch and improve it.
One important point to take away from Ben's report is that no filter can be 100% accurate. It's logical, but something good to remember. The report also serves as a to do list of places that we should contact and ask if they really meant to put up that robots.txt file. :)
By the way, I'm a little rusty on robots.txt data, but I just tried colorado virtual library on a few other search engines, and found a couple that show content from crawling the page? Am I right that a couple engines don't seem to be respecting robots.txt? How come people never write reports about that? :)
The robots.txt at the cited site (sorry) should exclude all 'bots from everything. Their webmaster is also "a little rusty," as the comment line indicates that he/she thinks that it allows robots to index "/". It does not.
It has a last-modified date of 7 Apr 2000, so the other SE's can't claim that it has changed since they spidered the site.
Jim
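For anyone else who is "a little rusty": the two forms that get confused are `Disallow: /` (block everything) and an empty `Disallow:` (block nothing). A quick check with Python's standard `urllib.robotparser` shows the difference. The two files below are generic examples, not the actual file from the site discussed above (it isn't reproduced in this thread), and `ExampleBot` and the example.com URL are made-up names.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent="ExampleBot"):
    """Return True if agent may fetch url under the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# "Disallow: /" matches every path, so all robots are shut out of everything.
BLOCK_EVERYTHING = "User-agent: *\nDisallow: /\n"
# An empty Disallow value matches nothing, so all robots may fetch everything.
ALLOW_EVERYTHING = "User-agent: *\nDisallow:\n"

url = "http://example.com/index.html"
print(allowed(BLOCK_EVERYTHING, url))  # False
print(allowed(ALLOW_EVERYTHING, url))  # True
```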
However, we might be able to find external evidence on the web that www.aclin.org is a good match for the query "colorado virtual library." Maybe we found an entry in the Open Directory Project, Yahoo, or another directory. Maybe we saw references to it; it could have really good PageRank, which means that it's a reputable site--there are lots of ways. Truthfully, this is just one of those tiny little things that we do that improves Google and most people never even notice.
So when you type colorado virtual library, we return the best result we can (www.aclin.org) without ever having crawled that page. For example, you'll notice that there isn't a link to see the cached page, because we never crawled it. We don't really know what's on that page, because we never crawled it. Yet we can return it as a valid result for a query.
Let's bring things back to SafeSearch. With SafeSearch on, we think that www.aclin.org is a good match for colorado virtual library, but we don't actually know the content of the page--we aren't allowed to crawl it. Because we can't be sure whether the page is safe or not, we have to be conservative, so we can't return it.
You could look at it two ways. You could criticize Google for "dropping" www.aclin.org in SafeSearch due to failure of Google to retain a copy of the pages in its cache. The other way is to be happy that with SafeSearch off, Google is smart enough to return a page we never crawled as a relevant match. I prefer the second way, but that's just me. :)
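The conservative rule described above (a page whose content is unknown is left out when SafeSearch is on, but can still be returned when it is off) might look something like the sketch below. This is purely illustrative and not Google's implementation; the `Result` type, the `is_adult` flag, and the example.com URLs are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    url: str
    # None means the page was never crawled, so its content is unknown.
    is_adult: Optional[bool]


def visible_results(results: List[Result], safesearch: bool) -> List[Result]:
    """Filter off: return everything. Filter on: keep only pages known to be safe."""
    if not safesearch:
        return list(results)
    return [r for r in results if r.is_adult is False]


results = [
    Result("http://www.aclin.org/", None),          # never crawled (robots.txt)
    Result("http://example.com/library", False),    # crawled, judged safe
    Result("http://example.com/adult", True),       # crawled, judged adult
]

print([r.url for r in visible_results(results, safesearch=False)])
print([r.url for r in visible_results(results, safesearch=True)])
```

With the filter off all three URLs come back; with it on, only the page known to be safe survives, which is why an uncrawled but perfectly innocent site drops out.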
NO. A search engine should default to results suitable for an adult audience. Any filtering should be an optional choice, or should be done by local software. The Internet is NOT a playground for children, and this shouldn't be assumed by default.
Thanks for the second pair of eyes, jdMorgan. Okay, so that robots.txt says "no spiders." But I see a couple major search engines actually showing crawled content from www.aclin.org? Maybe I was just up too late last night and I'm confused, but if some search engines aren't abiding by robots.txt that would concern me more than SafeSearch.
Why? Most people wouldn't want censored results, and many may not know how to override the defaults. And, all people see is a list of SERPs. Really, if "<snip - adult terms>" for a page title is something that doesn't interest someone, nobody is forcing them to click that link.
[edited by: NFFC at 6:10 am (utc) on April 11, 2003]
[edit reason] Trying to keep off the filter ;) [/edit]
I do. I don't want to see adult sites. And outside of some webmasters I don't know one single person who does - particularly accidentally. If ladies are shopping for items for children with appropriate search terms they are not happy if they accidentally stumble upon fetishes and photographs.
>>In practice, there's lots of reasons that Google might not have the content of the page.
GoogleGuy, even if Google doesn't have the contents of a page, if it's listed in the Google Directory in a clearly adult category - Mature Content- with the category right there on the page with the site's listing, it shouldn't be coming up for that search. It plainly says on the page that you have to be 18 years of age to view it and be open minded.
The judgement was made about the content by human review when it was included by the editor at ODP. That just seems to be a slip-up with the directory. IMO anything that's listed in the Mature Content category should automatically qualify for exclusion if filters are on.
It's been a very, very rare thing to see a slip like that. I just happen to see that one sitting out there all the time when I check the category - and it's been in the cache all along.
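The suggestion above (a human-assigned Mature Content listing should be enough to exclude a page under the filter, even when the page itself was never crawled) could be sketched as follows. The function name, the category path strings, and the `is_adult` flag are hypothetical; the thread only names the "Mature Content" category, and nothing here reflects how Google or the ODP actually store this data.

```python
from typing import Iterable, Optional

# Hypothetical marker; the thread only names the "Mature Content" category.
MATURE_MARKER = "Mature Content"


def excluded_when_filtered(categories: Iterable[str],
                           is_adult: Optional[bool]) -> bool:
    """Decide whether a page is dropped when the adult filter is on.

    categories: directory categories the page is listed under (may be empty).
    is_adult:   machine judgement of crawled content, or None if never crawled.
    """
    # A human editor's mature listing wins regardless of what was crawled.
    if any(MATURE_MARKER.lower() in c.lower() for c in categories):
        return True
    # Otherwise fall back to the conservative rule: unknown counts as unsafe.
    return is_adult is not False


print(excluded_when_filtered(["Adult/Mature Content/Example"], None))  # True
print(excluded_when_filtered(["Reference/Libraries"], False))          # False
```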
I would like to see Google err on the side of caution with regards to safe search. I've said it before, misspellings should kick the filter in regardless of preferences.
Oh, give me a freaking break. Like dad can't recognize an obvious porn site from looking at the SERPs? Seriously. How many times have you clicked on a link in a SERP with the title "wholesome family entertainment" and landed on a hard core porn site?
The Internet is NOT a playground for children....
You may wish that were so but I have to tell you that every child I know (and I know a few having three of my own) thinks the internet is a great playground. As a result they use it constantly for everything you would expect from homework assignments to sports news, gaming to music, fan clubs to TV schedules etc. etc. In fact they probably make more use of it than their parents.
Of course it is the responsibility of parents, schools etc., to ensure that they do not have access to material that is obviously inappropriate for their age.
At some stage they learn to circumvent filtering software and this is probably an indication that they are intellectually old enough to handle the consequences.
In my computing career I have always allowed simple statistics to determine what the default functionality should be. In this case it seems quite obvious to me that the Google default should be 'adult filter on'. In practice of course it doesn't matter most of the time because those providing access to children require something much more sophisticated than a toggle on a website.
I also have to agree with rfgdxm1, in that 99% of the time, you will be able to recognize such a page from the SERPs; and if in doubt DON'T click. ;)
rfgdxm-This is one of the few times I disagree with you. Maybe a default of R, but not X. For example, when searching for Britney Spears, you could get a return of 'Britney Spears nude', but you couldn't get a result of 'Britney Spears takes it up the :o with a huge ;)'
Things are probably better now but these were both top twenty results last year when I was making a CD label for a present for one of my three daughters. As a result we simply don't let her search on the internet. I don't think that's a good thing either.
Harvard law is wrong! The sacrifice of our forefathers to establish the freedoms reflected in the first amendment was not to provide free access to dirty pictures but to guarantee political, scientific and religious expression.
A bastion of liberalism such as Harvard Law could be expected to issue such a silly proclamation.
[have absolutely nothing against adult content, but sometimes an industry has to embrace new controls for its own good]
However I do think that Google should highlight the existence of SafeSearch more. If you search in Google images, you get a clear and simple message saying "SafeSearch is off", with a link to the settings page. Why not do the same with regular search results?
Actually, come to think of it...
As a result, adults would have a harder time finding adult websites (not just porn but sites about romance/sexuality/advice) and it won't stop the kids from finding them.
I think the basic function of search engine filters is to remove spam results. It's far better to rely on porn-blocking software like NetNanny, Cyberfilter, etc.
No matter how much you want to shelter your kids - it isn't going to work if they are determined.
It is not for search engines to decide what content is appropriate.
I am tired enough of society deciding this for me. I don't need google to do it as well.
I already have all my movies and everything else censored cause some kid might be watching. Parents need to teach their kids how to use the Internet safely - NOT DEGRADE MY INTERNET SO THEY CAN DO SO. I didn't get the pleasure of creating your rugrat - I don't want to have to take the responsibility as well.
Personally, I would not prefer to be net nannied, nor do I know anyone who would. I would not prefer that the un-crawled sites that GoogleGuy mentioned be excluded from the SERPs by default. I would not want legitimate searches on "dual-use" keywords be filtered by default. I hear that this "Google" thing is a half-way decent search engine... probably won't be returning many hardcore porn results for innocuous queries anyway.
Let's not fall into the "think of the children!" hysteria... you can justify all kinds of censorship and Disneyfication when descending that slippery slope.</rant>
One thing to consider is that our rankings (because of PageRank and the link structure of the web) often lean more toward information sites.
I think that's less true today than it was a few years ago, at least in commercial categories where SEO is a factor and linking patterns are often artificial.
FWIW, I just searched on the term "breast," and the first 10 search results were for medical information sites. But when I searched on the Latin words for male and female genitals, a significant number of the results were for highly commercial topics such as enlargement pumps and adult novelties. This admittedly quick experiment, combined with observations in other categories such as travel, suggests that--because of artificial linking patterns that are encouraged by PageRank--Google has an unintended bias toward e-commerce pages if the topic is one that has serious moneymaking potential.
I don't have kids and couldn't care less what other people's kids stumble upon, but if it will create peace then please filter by default.