| This 110 message thread spans 4 pages: 1 2 3 4 |
|Report critical of adult filters|
Like Excite, AltaVista, and nearly every other major search engine before it, Google has been criticized for flagging and excluding some pages as adult-oriented material. A report [cyber.law.harvard.edu] by Harvard Law School's Berkman Center for Internet & Society says that Google excludes too many pages.
Google's adult filter is widely regarded as one of the most lax in the search engine business. Pages that are perfectly acceptable in Google are routinely blocked by AskJeeves, Inktomi, and AltaVista. If anything, Google is too lax in allowing some pages to remain visible while the adult filter is turned on.
Just last week some fellow webmasters were pointing out potential adult content showing under innocuous keywords.
From the report:
|Google might also inform webmasters as to steps they can take to assist with the proper categorization of their content. |
Simply not possible. It would only give those who would subvert the system clues as to how to do it. One should remember the historical incidents where standard searches on AltaVista revealed adult content. That was done with full knowledge of the system.
Just two weeks ago I was on the phone with a service that provides paid inclusion to search engines. The issue at hand was that a few pages were being rejected for adult content. Those same pages walked right into the Google index without a problem. It is clear to most in the SEM business that Google's adult filter is not only the best available, it may even be too lax at times.
But it makes for great reports, and gives college kids with too much leisure time something to do.
It should be noted that Harvard has several graduates who are very high profile in this business and who would stand to benefit from such a report. AskJeeves has Harvard alumni on its staff in mission-critical positions. I do feel Harvard would be above such a conflict of interest, but not to note that fact in the report is an oversight.
Matt Cutts, author of Google's SafeSearch, will be a featured speaker at WebmasterWorld's marketing conference [webmasterworld.com] in Boston in two weeks. Additionally, Paul Gardi - a Harvard alumnus and Senior Vice President of Search for AskJeeves/Teoma - will also be speaking.
Value judgement. The Internet was designed as a communications network for academicians and scientists. It wasn't created as a playground for children. If parents wish to let their kids roam the Internet unsupervised, then the onus is on them to install censoring software, not to expect Google to do it for them.
I tend to agree. However, we both know that the majority of parents out there are more computer illiterate than their kids. Installing such software is not without risk to the system, and will often not work any better than what search engines produce. Additionally, those services work off of some of the same lists that search engines themselves use to flag questionable content.
A search engine must present a default index that is appropriate for a G or PG-13 audience. The only thing the Harvard report does is point out the difficulty of doing that with a machine-based system. The only way a search engine could do better than it is now doing is with a massive team of editors hand-reviewing pages.
We run into the same **** problem here while trying to write machine code that can analyze and filter messages for content. It is often too aggressive at performing the ****'ing task.
I don't understand the bit about the use of robots.txt files. Is there some way they can interact with SafeSearch? Obviously pages that are excluded as per the robots.txt file will not be in any listings, SafeSearch or otherwise.
Long Post Ahead - I'm going to paraphrase a couple of things here.
From Declan's news.com article (linked above):
|Google chooses not to include pages that use such files in SafeSearch listings because its crawler can't explore the entire site and thus, the company says, can't be expected to judge the site's content. |
It sounds like Google stumbled upon robots.txt, and threw out the entire site from SafeSearch (because it could not judge the "entire" site's content).
However, the report presents an even more confusing stance:
The report is here: [cyber.law.harvard.edu...]
From the actual report:
|To assess the rate at which robots.txt configurations cause SafeSearch omissions, the author obtained robots.txt files from all web servers listed in the results linked above. Approximately 11% of pages listed above are hosted on web servers that use robots.txt to block at least one robot from accessing the URL listed. The author does not know what user-agent is associated with Google's SafeSearch indexing, and this analysis therefore considers a URL affected by robots.txt exclusion if the site blocks any robot from accessing that URL. |
It goes on to say:
|Update (4/10/03): Google staff report that pages are also excluded from SafeSearch when Google fails to retain a copy of the page in its cache, a result that may occur when Google failed to crawl the page due to low pagerank, unreachable servers, or "noindex" instructions in page meta tags. Google staff suggest that these additional problems cause omission of an additional 13% to 15% of the web pages linked above, causing a total of 24% to 26% of the pages linked above are omitted from SafeSearch not due to affirmative miscategorization by SafeSearch but due to failure of Google to retain a copy of the pages in its cache. |
It sounds to me like these guys had a fundamental misunderstanding of how robots.txt actually works, which is surprising, considering the type of survey they were trying to do. Consider this:
|First, if a robots.txt file instructs web bots not to visit a site, it is unclear how Google came to index that site in the first place. Google's documentation indicates that the company supports and abides by robots.txt, so if a site uses a robots.txt file to exclude systems like Google's, it is unclear why the site would nonetheless remain listed in Google. Additional research is necessary to clarify this point, and the author anticipates a subsequent article on this topic exclusively. |
It just sounds to me like these guys are trying to prove something that we already know, and they really didn't do that great a job of research. I've found the sites that had robots.txt, and I'll examine them to see what the real deal is.
Okay, this is pretty cool.
I checked out the robots.txt files on all of the sites concerned. There are 40 sites that the report specified to be problematic due to robots.txt. The breakdown:
30 of the sites listed prevented spidering of the specific site content that was being tested, either by referencing the page/directory directly in the robots.txt file or by doing a site-wide disallow in robots.txt. The pages, while not in SafeSearch, WERE LISTED IN THE MAIN INDEX. Also, usually another page that should not have been listed was listed under the offending page in the main index as an indented entry.
9 of the sites listed prevented spidering of other resources on the site, but didn't specifically prohibit the page that didn't show up under SafeSearch. We can account for these, due to Google's explanation above.
1 of the sites had a robots.txt that seems to be dynamically generated based on the time of the day. The comment in the file says "Daytime instructions for search engines. Do not visit [sitename] during the day!" That seems like an incredibly stupid idea to me (who knows when the spider comes around?), but anyways...
I can only be led to believe that on those 30 sites where the content is specifically prohibited from spider access via robots.txt, yet still shows up in the main index, robots.txt is being partially ignored for some reason.
[Google: If interested, I do have the list.]
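For anyone who wants to repeat the exercise, a check like the one described in the breakdown above can be sketched with Python's standard-library robots.txt parser. The rules and URLs below are invented for illustration - they are not taken from the report's actual list of 40 sites:

```python
from urllib.robotparser import RobotFileParser

def blocked_by_robots(robots_lines, url, agent="Googlebot"):
    """Return True if the given robots.txt rules forbid `agent` from fetching `url`."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return not rp.can_fetch(agent, url)

# Case 1: a site-wide disallow (the pattern seen on most of the 30 sites)
site_wide = ["User-agent: *", "Disallow: /"]
print(blocked_by_robots(site_wide, "http://example.com/anything.html"))   # True

# Case 2: only a specific directory is blocked; the tested page is elsewhere
# (the pattern seen on the 9 sites Google's explanation accounts for)
partial = ["User-agent: *", "Disallow: /private/"]
print(blocked_by_robots(partial, "http://example.com/public/page.html"))  # False
print(blocked_by_robots(partial, "http://example.com/private/page.html")) # True
```

Running something like this over each site's robots.txt and tested URL is all the classification above amounts to.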
Does Google always respect robots.txt? If not, why not?
I liked the report, but there is at least one point to be aware of: SafeSearch doesn't return pages where we don't have the content. Basically, if we weren't able to fetch a page, we can't judge whether that page is safe or not. Since users have to deliberately opt in to the filter, it's pretty fair to assume that if we don't know whether a page will be safe, we shouldn't return it--after all, the user told us that they would rather err on the side of safety by activating SafeSearch.
In practice, there are lots of reasons that Google might not have the content of a page. There could be a robots.txt file, or the server could have been down, or we might have seen references to that page but not crawled it, or there could have been redirects, meta tags, etc. Personally, I think it's actually one of the strengths of Google that you can do a search like "Colorado virtual library" and we can return something like the first result. It turns out that www.aclin.org forbids all spiders, but Google is still able to pull descriptions from the Open Directory, for example.
We saw a copy of Ben's report a couple days ago and mentioned that issue as quickly as we could. Many of the high profile examples (subsites of Apple, IBM, and so on) turn out to be the "didn't have the page" item rather than bad filtering in SafeSearch. Ben has already updated parts of his report and has been very nice in providing us with his data, so I expect we'll work together to find rough edges in SafeSearch and improve it.
One important point to take away from Ben's report is that no filter can be 100% accurate. It's logical, but something good to remember. The report also serves as a to do list of places that we should contact and ask if they really meant to put up that robots.txt file. :)
By the way, I'm a little rusty on robots.txt data, but I just tried colorado virtual library on a few other search engines and found a couple that show content from crawling the page. Am I right that a couple of engines don't seem to be respecting robots.txt? How come people never write reports about that? :)
The robots.txt at the cited site (sorry) should exclude all 'bots from everything. Their webmaster is also "a little rusty," as the comment line indicates that he/she thinks that it allows robots to index "/". It does not.
It has a last-modified date of 7 Apr 2000, so the other SE's can't claim that it has changed since they spidered the site.
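The "rusty" webmaster above tripped over a classic robots.txt gotcha: "Disallow: /" blocks everything, while an empty "Disallow:" field blocks nothing. A quick demonstration with Python's standard-library parser (the rule lines here are a hypothetical reconstruction of the confusion, not the actual aclin.org file):

```python
from urllib.robotparser import RobotFileParser

def can_index(lines, url="http://example.com/"):
    """Return True if the robots.txt rules in `lines` let a robot fetch `url`."""
    rp = RobotFileParser()
    rp.parse(lines)
    return rp.can_fetch("*", url)

# What the webmaster apparently intended ("allow robots to index /"):
# an empty Disallow field permits everything.
print(can_index(["User-agent: *", "Disallow:"]))    # True

# What "Disallow: /" actually does: everything is off-limits.
print(can_index(["User-agent: *", "Disallow: /"]))  # False
```

So a comment line saying the file "allows" indexing is no guarantee the rules below it agree.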
bakedjake, just to clarify based on your post, let's use a concrete example. www.aclin.org, the Colorado Virtual Library, has a robots.txt that prevents spiders from crawling it. Google abides by that robots.txt file--we *do not* crawl www.aclin.org.
However, we might be able to find external evidence on the web that www.aclin.org is a good match for the query "colorado virtual library." Maybe we found an entry in the Open Directory Project, Yahoo, or another directory. Maybe we saw references to it; it could have really good PageRank, which means that it's a reputable site--there are lots of ways. Truthfully, this is just one of those tiny little things that we do that improves Google and most people never even notice.
So when you type colorado virtual library, we return the best result we can (www.aclin.org) without ever having crawled that page. For example, you'll notice that there isn't a link to see the cached page, because we never crawled it. We don't really know what's on that page, because we never crawled it. Yet we can return it as a valid result for a query.
Let's bring things back to SafeSearch. With SafeSearch on, we think that www.aclin.org is a good match for colorado virtual library, but we don't actually know the content of the page--we aren't allowed to crawl it. Because we can't be sure whether the page is safe or not, we have to be conservative, so we can't return it.
You could look at it two ways. You could criticize Google for "dropping" www.aclin.org in SafeSearch due to failure of Google to retain a copy of the pages in its cache. The other way is to be happy that with SafeSearch off, Google is smart enough to return a page we never crawled as a relevant match. I prefer the second way, but that's just me. :)
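The policy GoogleGuy describes boils down to a simple decision rule. The sketch below only restates the logic as explained in this thread - the Page class and its field names are invented for illustration, and this is not Google's actual code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    url: str
    crawled: bool                # did the spider ever fetch the content?
    judged_safe: Optional[bool]  # None when the content was never seen

def show_in_results(page: Page, safesearch_on: bool) -> bool:
    """Uncrawled pages can still rank (via links, ODP descriptions, etc.),
    but SafeSearch errs on the side of caution when content is unknown."""
    if not safesearch_on:
        return True              # no filter: return the best match, crawled or not
    if not page.crawled or page.judged_safe is None:
        return False             # can't judge it, so don't return it under SafeSearch
    return page.judged_safe

aclin = Page("http://www.aclin.org/", crawled=False, judged_safe=None)
print(show_in_results(aclin, safesearch_on=False))  # True  - relevant match, never crawled
print(show_in_results(aclin, safesearch_on=True))   # False - omitted, not miscategorized
```

Which is exactly the distinction between "dropped by SafeSearch" and "affirmatively miscategorized" that the report's update tries to quantify.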
>A search engine must present a default index that is appropriate for a G or pg13 audience.
NO. A search engine should default to results suitable for an adult audience. Any filtering should be an optional choice, or should be done by local software. The Internet is NOT a playground for children, and this shouldn't be assumed by default.
I agree, rfgdxm1--that's why it's off by default.
Thanks for the second pair of eyes, jdMorgan. Okay, so that robots.txt says "no spiders." But I see a couple major search engines actually showing crawled content from www.aclin.org? Maybe I was just up too late last night and I'm confused, but if some search engines aren't abiding by robots.txt that would concern me more than SafeSearch.
Misled by prefs cookies again. I thought you guys switched that a few months back, but it was my preference cookies that were set.
...I still think it should be on by default.
>...I still think it should be on by default.
Why? Most people wouldn't want censored results, and many may not know how to override the defaults. And, all people see is a list of SERPs. Really, if "<snip - adult terms>" for a page title is something that doesn't interest someone, nobody is forcing them to click that link.
[edited by: NFFC at 6:10 am (utc) on April 11, 2003]
[edit reason] Trying to keep off the filter ;) [/edit]
I see your point Brett. One thing to consider is that our rankings (because of PageRank and the link structure of the web) often lean more toward information sites. You usually have to look a little more deliberately to find porn on Google.
>>Why? Most people wouldn't want censored results
I do. I don't want to see adult sites. And outside of some webmasters I don't know one single person who does - particularly accidentally. If ladies are shopping for items for children with appropriate search terms they are not happy if they accidentally stumble upon fetishes and photographs.
>>In practice, there's lots of reasons that Google might not have the content of the page.
GoogleGuy, even if Google doesn't have the contents of a page, if it's listed in the Google Directory in a clearly adult category - Mature Content - with the category right there on the page with the site's listing, it shouldn't be coming up for that search. It plainly says on the page that you have to be 18 years of age and open-minded to view it.
The judgement was made about the content by human review when it was included by the editor at ODP. That just seems to be a slip-up with the directory. IMO anything that's listed in the Mature Content category should automatically qualify for exclusion if filters are on.
It's been a very, very rare thing to see a slip like that. I just happen to see that one sitting out there all the time when I check the category - and it's been in the cache all along.
I suppose in some areas you can't please all the people all the time. If a choice has to be made between pleasing Harvard Law School or a Dad surfing with his 5-year-old daughter, I would strongly recommend siding with Dad!
I would like to see Google err on the side of caution with regards to safe search. I've said it before, misspellings should kick the filter in regardless of preferences.
>I suppose in some areas you can't please all the people all the time. If a choice has to be made between pleasing Harvard Law School or a Dad surfing with his 5-year-old daughter, I would strongly recommend siding with Dad!
Oh, give me a freaking break. Like dad can't recognize an obvious porn site from looking at the SERPs? Seriously. How many times have you clicked on a link in a SERP with the title "wholesome family entertainment" and landed on a hard core porn site?
|The Internet is NOT a playground for children.... |
You may wish that were so but I have to tell you that every child I know (and I know a few having three of my own) thinks the internet is a great playground. As a result they use it constantly for everything you would expect from homework assignments to sports news, gaming to music, fan clubs to TV schedules etc. etc. In fact they probably make more use of it than their parents.
Of course it is the responsibility of parents, schools etc., to ensure that they do not have access to material that is obviously inappropriate for their age.
At some stage they learn to circumvent filtering software, and this is probably an indication that they are intellectually old enough to handle the consequences.
In my computing career I have always allowed simple statistics to determine what the default functionality should be. In this case it seems quite obvious to me that the Google default should be 'adult filter on'. In practice, of course, it doesn't matter most of the time, because those providing access to children require something much more sophisticated than a toggle on a website.
I think the filter should be off by default. Why? Not because I want adult sites in the SERPs, but because there will always be a fair number of legitimate (i.e. non-adult) sites that will wrongly be marked as adult by the algo. The error rate there is pretty high. So prefiltering by default is a bad idea, IMHO.
I also have to agree with rfgdxm1, in that 99% of the time, you will be able to recognize such a page from the SERPs; and if in doubt DON'T click. ;)
I'll just slip out the back door now before it gets ugly in here. :) I just wanted to mention the issue with empty pages and explain how that affects the report on SafeSearch. It's good reading, but bear that in mind if you read it.
Whoa, This thread may get longer. A mention of the Harvard Report just ran across the bottom on CNN.
rfgdxm-This is one of the few times I disagree with you. Maybe a default of R, but not X. For example, when searching for Britney Spears, you could get a return of 'Britney Spears nude', but you couldn't get a result of 'Britney Spears takes it up the :o with a huge ;)'
Things are probably better now, but these were both top-twenty results last year when I was making a CD label as a present for one of my three daughters. As a result we simply don't let her search on the internet. I don't think that's a good thing either.
Harvard law is wrong! The sacrifice of our forefathers to establish the freedoms reflected in the first amendment was not to provide free access to dirty pictures but to guarantee political, scientific and religious expression.
A bastion of liberalism, such as Harvard Law could be expected to issue such a silly proclamation.
I got a whole new idea...
a meta tag and/or robots.txt entry for all adult sites which says the content is adult. PR0 for life for any which don't have it. Think about moving it into legislation as well, as a requirement for all adult content pages.
[have absolutely nothing against adult content, but sometimes an industry has to embrace new controls for its own good]
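A self-labeling scheme like the one proposed above could be as simple as a page-level meta tag. The tag name and value below are purely a hypothetical sketch of what such a label might look like - not an existing standard that any engine currently honors:

```html
<!-- Hypothetical self-label: an adult page declares itself so filters
     can exclude it without ever needing to analyze the content. -->
<meta name="content-rating" content="adult">
```

A filter could then trust the label where present and fall back to machine classification only for unlabeled pages.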
My preference would be for Google to set the default to NO filtering. But it's hard to care, since you set it once and that's it.
However I do think that Google should highlight the existence of SafeSearch more. If you search in Google images, you get a clear and simple message saying "SafeSearch is off", with a link to the settings page. Why not do the same with regular search results?
I agree, Jomaxx, especially since I'm among those who would NEVER want an inherently error-prone mathematical filter on my searches by default. It's like putting glasses on me and my kids that fog up whenever something that MIGHT be a porno magazine cover comes into view (they're not hidden from public view in the country where I live), or whenever we pass scantily clad beautiful women as we walk around.
Actually, come to think of it...
Search engine filters are useless for blocking adult sites. Kids are computer-savvy enough to turn the adult filter OFF if they really want to find porn. However, adults often aren't as Net-savvy as kids, and many probably aren't aware of the adult filter. Some of them will be wondering where all the adult sites have gone.
As a result, adults would have a harder time finding adult websites (not just porn but sites about romance/sexuality/advice) and it won't stop the kids from finding them.
I think the basic function of search engine filters is to remove spam results. It's far better to rely on porn-blocking software like NetNanny, Cyberfilter, etc.
If kids want to find adult pages - they are going to do it.
No matter how much you want to shelter your kids - it isn't going to work if they are determined.
It is not for search engines to decide what content is appropriate.
I am tired enough of society deciding this for me. I don't need google to do it as well.
I already have all my movies and everything else censored cause some kid might be watching. Parents need to teach their kids how to use the Internet safely - NOT DEGRADE MY INTERNET SO THEY CAN DO SO. I didn't get the pleasure of creating your rugrat - I don't want to have to take the responsibility as well.
<rant>I don't think we ought to Disney-fy the internet because some parents let their kids roam, unmonitored and unfiltered, around the web. No one wants to take responsibility for raising their kids anymore. Like it's the world's job to care about YOUR kid. And everything has to be dumbed down by default as a result. As if it's really so hard for a five year old to turn SE content filtering off anyway...
Personally, I would not prefer to be net nannied, nor do I know anyone who would. I would not prefer that the un-crawled sites that GoogleGuy mentioned be excluded from the SERPs by default. I would not want legitimate searches on "dual-use" keywords be filtered by default. I hear that this "Google" thing is a half-way decent search engine... probably won't be returning many hardcore porn results for innocuous queries anyway.
Let's not fall into the "think of the children!" hysteria... you can justify all kinds of censorship and Disneyfication when descending that slippery slope.</rant>
|One thing to consider is that our rankings (because of PageRank and the link structure of the web) often lean more toward information sites. |
I think that's less true today than it was a few years ago, at least in commercial categories where SEO is a factor and linking patterns are often artificial.
FWIW, I just searched on the term "breast," and the first 10 search results were for medical information sites. But when I searched on the Latin words for male and female genitals, a significant number of the results were for highly commercial topics such as enlargement pumps and adult novelties. This admittedly quick experiment, combined with observations in other categories such as travel, suggests that--because of artificial linking patterns that are encouraged by PageRank--Google has an unintended bias toward e-commerce pages if the topic is one that has serious moneymaking potential.
My preference is for no filtering to take place so that I get back all available information and then break it down myself. However it wouldn't kill me to simply have to turn Safesearch off by modifying my settings. Is it really that hard for people to read the directions and click a radiobox in their preferences?
I don't have kids and couldn't care less what other people's kids stumble upon, but if it will create peace then please filter by default.