Forum Moderators: Robert Charlton & goodroi
AOL Research apparently inadvertently posted a gzipped tarball of 500,000 websites sampled including every keyword they are ranking for. This was first discovered at the url:
[research.aol.com...]
[edited by: engine at 5:06 pm (utc) on Aug. 7, 2006]
AOL on Monday said it released a small portion of keyword search information for about 658,000 anonymous AOL users in a move that ignited a firestorm of criticism on the Internet amid calls for tighter protection of the privacy of users' Web searches. The Internet division of media conglomerate Time Warner Inc. released search information on about 20 million searches done from its AOL software over a three-month period.
The data was released about 10 days ago on its own publicly accessible research Web site, but it escaped notice until this weekend.
AOL draws fire after releasing search data [today.reuters.co.uk]
We survey many of the measures used to describe and evaluate the efficiency and effectiveness of large-scale search services. These measures, herein visualized versus verbalized, reveal a domain rich in complexity and scale. We cover six principal facets of search: the query space, users' query sessions, user behavior, operational requirements, the content space, and user demographics. While this paper focuses on measures, the measurements themselves raise questions and suggest avenues of further investigation.
If I already have reached position 6, is it worth 50 hours of my time to inch my way up to position 5 through link building, content tweaking, etc? Now that question can finally be answered.
Then again, that assumes somebody will find the data file, do the analysis, and tell me the results, which seems unlikely. Oh well.
It’s normalised as the fraction of all searches that produce a click on the organic results, and averaged over the 36 million searches. Note that AOL places 10 results on the first page, and 15 on each following page, so that the very visible “bottom of page” boost occurs at rank = 10, 25, 40, 55, 70, and so on; rank=1 gets a 22.59% click-through, and 46.58% of searches result in no organic click-through at all.
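For anyone who wants to check the arithmetic, here is a rough Python sketch of the two ideas in that description: where the "bottom of page" ranks fall with 10 results on page one and 15 on each later page, and one possible reading of the normalisation (clicks at a rank divided by all searches, including those with no click). The function names and the toy input are made up for illustration and are not taken from the AOL file or from the original analysis.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def bottom_of_page_ranks(limit: int, first_page: int = 10,
                         later_pages: int = 15) -> List[int]:
    """Ranks of the last result on each page, up to and including `limit`."""
    ranks = []
    rank = first_page
    while rank <= limit:
        ranks.append(rank)
        rank += later_pages
    return ranks


def click_share_by_rank(clicked_ranks: Iterable[Optional[int]]) -> Dict[int, float]:
    """Share of ALL searches that ended in a click at each rank.

    Input has one entry per search: the clicked rank, or None for no click.
    Dividing by all searches is an assumption about the normalisation.
    """
    ranks = list(clicked_ranks)
    total = len(ranks)
    if total == 0:
        return {}
    counts = Counter(r for r in ranks if r is not None)
    return {rank: n / total for rank, n in sorted(counts.items())}


page_bottoms = bottom_of_page_ranks(70)
print(page_bottoms)  # [10, 25, 40, 55, 70]

# Toy data, invented for this example: eight searches, three without a click.
toy = [1, 1, None, 2, None, 1, 10, None]
shares = click_share_by_rank(toy)
no_click_share = sum(r is None for r in toy) / len(toy)
print(shares, no_click_share)
```

With this reading, the per-rank shares plus the no-click share add up to 1, which is how a 22.59% figure for rank 1 and a 46.58% no-click figure could sit in the same table.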
It’s normalised as the fraction of all searches that produce a click on the organic results
I've got it in front of me and I don't think this is the case. Here is the readme file that comes with the tarball.
-=-=-=-=-=-
This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.
The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.
The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query - the query issued by the user, case shifted with
most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank - if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL - if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.
Each line in the data represents one of two types of events:
1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.
In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above).
In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" of results for some query, this appears as a subsequent identical query with a later time stamp.
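To make the two event types in the readme concrete, here is a minimal Python sketch of how one line might be read. The readme does not name the field separator, so the tab delimiter is an assumption, as are the timestamp layout and every value in the sample rows (they are invented, not taken from the file). The `Event` and `parse_line` names are hypothetical.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int]   # None for a query-only event
    click_url: Optional[str]   # None for a query-only event

    @property
    def is_click(self) -> bool:
        return self.item_rank is not None


def parse_line(line: str, sep: str = "\t") -> Event:
    """Parse one record. The tab separator is an assumption."""
    fields = [f.strip() for f in line.rstrip("\r\n").split(sep)]
    # Pad so query-only lines (3 fields) and click lines (5 fields) look alike.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, rank, url = fields[:5]
    return Event(anon_id, query, query_time,
                 int(rank) if rank else None,
                 url or None)


# Made-up sample rows: one query with no click, then two clicks on one query.
sample = [
    "1001\texample query\t2006-03-01 10:00:00",
    "1001\texample query\t2006-03-01 10:00:05\t1\thttp://www.example.com",
    "1001\texample query\t2006-03-01 10:00:05\t3\thttp://www.example.org",
]
events = [parse_line(s) for s in sample]
clicks = [e for e in events if e.is_click]
print(len(events), len(clicks))
```

Because a second click on the same result list produces a second line, counting lines overstates the number of distinct searches. Any per-search statistic would need to group lines by AnonID, Query and QueryTime first.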
However, this is a publicity stunt. There's no way AOL accidentally packaged the exact data that Google refused to give to the DoJ and published it to their website. Back then, remember, everyone was clamoring for the SEs to release the same info to the public.
AOL thought they would use that sentiment (months late of course) to build press for the "new, open AOL". However, it quickly backfired and now their PR people have turned it into a mistake by researchers.
See the slippery slope of the Gov't asking for data? Is user 17556639 doing research for a book, did he maybe see a CSI episode and follow up link after link, or is he really planning to kill his wife?
I would have to agree that this seems like it was a planned release and, now that it got the wrong response, there's a need for damage control. But how much damage control can be done now that the data is everywhere?
With AOL Research, we are excited to introduce an open-research community
... AOL Research features an open web site where researchers can present ideas, ...[research.aol.com...]
So this information could well have been posted by a researcher in the wider web community, and not necessarily a TimeWarner employee -- a researcher who was unaware of any potential fallout. I tend to believe AOL on this one and do not suspect any intentional publicity stunt. Still, it's a good question to ask -- who had original access to that data in order to post it?
<It's almost impossible for me to access research.aol.com
right now. Their server load must be immense.>
There's no way AOL accidentally packaged the exact data that Google refused to give to the DoJ and published it to their website.
Is that why this is in the Google Search News forum?
Maybe I'm dreaming or something, but I don't remember ever wondering if I'm dreaming while actually dreaming. At the moment, all I'm thinking is woohoo!
Still, it's a good question to ask -- who had original access to that data in order to post it?
Not necessarily pertinent to the current situation, but in the past, SEs have been contacted by universities to use their access logs for research purposes. There is always the possibility that such logs can be obtained from the university and posted to public places.
This is mirrored all over the net, I'm afraid. It's even on bit torrent.
If you need to get up to speed on bittorrent first, check Wikipedia for more information. Please do not use this thread as a focus point for distributing the file itself.
I know most of my younger friends are all somewhat star struck and have searched up their own name at one time or another, usually using different variations to see what comes up.
Let's just hope they weren't paranoid too and typed in their SIN to see if they were susceptible to identity theft?
I've been removing posts all day asking how to get a copy of the file. Please note what blaze said above:
This is mirrored all over the net, I'm afraid. It's even on bit torrent. If you need to get up to speed on bittorrent first, check Wikipedia for more information. Please do not use this thread as a focus point for distributing the file itself.
The file is all over the place - it's easier to find than a vi@gr@ SERP. The real meaty topic here is the fact that the data got spilled, not distributing it.
Thanks,
TJ
I have a feeling that this fiasco is only starting. Imagine the repercussions once AOL users realise that this data is public. I mean I just did a sample search on one anonymous ID and 2153 results were returned. That is truly shocking.
Anyone whose website has had hits recorded in this file can cross-reference them to find the real user's IP, and if it's a membership site - their user information.
e.g. Brett could find out exactly who user X is, and see their records. I can see lots of hits to webmasterworld.com in the files - and I bet many of them are from members.
I do not believe Brett would do such a thing, but not everyone is as ethical as Brett.
Apologies for the double post - missed the edit window :(