AOL "Angry & Upset" After Releasing Search Data

Forum Moderators: Robert Charlton & goodroi


nuclei

2:53 am on Aug 7, 2006 (gmt 0)

Ouch! The title says it all.

AOL Research apparently inadvertently posted a gzipped tarball of 500,000 websites sampled, including every keyword they are ranking for. This was first discovered at the URL:

[research.aol.com...]

[edited by: engine at 5:06 pm (utc) on Aug. 7, 2006]

bakedjake

4:32 pm on Aug 7, 2006 (gmt 0)

I just saw this story on CNBC. Interesting that it's getting national news attention.

engine

4:40 pm on Aug 7, 2006 (gmt 0)

AOL on Monday said it released a small portion of keyword search information for about 658,000 anonymous AOL users in a move that ignited a firestorm of criticism on the Internet amid calls for tighter protection of the privacy of users' Web searches.
The Internet division of media conglomerate Time Warner Inc. released search information on about 20 million searches done from its AOL software over a three-month period.
The data was released about 10 days ago on its own publicly accessible research Web site, but it escaped notice until this weekend.

AOL draws fire after releasing search data [today.reuters.co.uk]

bhartzer

4:46 pm on Aug 7, 2006 (gmt 0)

This is one time when I wish the cached versions were slower than they actually are. The original research paper, though, called "A Picture of Search", provides some interesting data and is still available.

We survey many of the measures used to describe and evaluate the efficiency and effectiveness of large-scale search services. These measures, herein visualized versus verbalized, reveal a domain rich in complexity and scale. We cover six principle facets of search: the query space, users' query sessions, user behavior, operational requirements, the content space, and user demographics. While this paper focuses on measures, the measurements themselves raise questions and suggest avenues of further investigation.

blaze

5:00 pm on Aug 7, 2006 (gmt 0)

This is mirrored all over the net, I'm afraid. It's even on bit torrent.

cerebrum

5:49 pm on Aug 7, 2006 (gmt 0)

Golden data for search and affiliate spammers. Good job AOL. Thank god, I am not an AOL customer.

The data is mirrored all over the internet on torrents and file sharing websites.

AtBatt

6:01 pm on Aug 7, 2006 (gmt 0)

Hey, bit torrent me over some of that spam data. This doesn't surprise me in the least given AOL's long-time practice(s). I am just surprised AOL hasn't been shut down yet.

zomega42

6:13 pm on Aug 7, 2006 (gmt 0)

Yes, this will be used for nefarious purposes. But that doesn't mean even us whitehats can't learn something -- I for one will be very interested to see how the distribution of clicks depends on the serp rank (evidently the data includes the ranking of the link that was clicked).

If I already have reached position 6, is it worth 50 hours of my time to inch my way up to position 5 through link building, content tweaking, etc? Now that question can finally be answered.

Then again, that assumes somebody will find the data file, do the analysis, and tell me the results, which seems unlikely. Oh well.

nuclei

6:34 pm on Aug 7, 2006 (gmt 0)

from a post by spline on another forum:

It's normalised as the fraction of all searches that produce a click on the organic results, and averaged over the 36 million searches. Note that AOL places 10 results on the first page, and 15 on each following page, so that the very visible "bottom of page" boost occurs at rank = 10, 25, 40, 55, 70, and so on; rank=1 gets a 22.59% click-through, and 46.58% of searches result in no organic click-through at all.
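A minimal sketch of the normalisation described in that quote, for anyone who wants to check the arithmetic on their own numbers. It assumes each log line counts as one "search" (a simplification, since repeat clicks and next-page requests show up as extra lines), and the function name and toy input are made up for illustration.

```python
from collections import Counter


def click_share_by_rank(ranks):
    """Return (share per rank, no-click share) over all events.

    `ranks` holds one entry per log line: the clicked rank as an int,
    or None when the line records a query with no click.
    Assumption: every line is treated as one search.
    """
    ranks = list(ranks)
    total = len(ranks)
    if total == 0:
        return {}, 0.0
    counts = Counter(r for r in ranks if r is not None)
    # Divide by ALL events, not just clicks, so shares plus the
    # no-click share add up to 1.
    shares = {rank: n / total for rank, n in sorted(counts.items())}
    no_click = (total - sum(counts.values())) / total
    return shares, no_click


# Toy input, not real data: eight events, three of them without a click.
toy_ranks = [1, 1, None, 2, None, None, 1, 10]
shares, no_click = click_share_by_rank(toy_ranks)
```

Dividing by clicks only, instead of by all events, would give a different curve, so it matters which denominator a published figure used.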

maherphil

7:05 pm on Aug 7, 2006 (gmt 0)

It's normalised as the fraction of all searches that produce a click on the organic results

I've got it in front of me and I don't think this is the case. Here is the readme file that comes with the tarball.
-=-=-=-=-=-

This collection consists of ~20M web queries collected from ~650k users over three months.
The data is sorted by anonymous user ID and sequentially arranged.

The goal of this collection is to provide real query log data that is based on real users. It could be used for personalization, query reformulation or other types of search research.

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query - the query issued by the user, case shifted with
most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank - if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL - if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.

Each line in the data represents one of two types of events:
1. A query that was NOT followed by the user clicking on a result item.
2. A click through on an item in the result list returned from a query.
In the first case (query only) there is data in only the first three columns/fields -- namely AnonID, Query, and QueryTime (see above).
In the second case (click through), there is data in all five columns. For click through events, the query that preceded the click through is included. Note that if a user clicked on more than one result in the list returned from a single query, there will be TWO lines in the data to represent the two events. Also note that if the user requested the next "page" of results for some query, this appears as a subsequent identical query with a later time stamp.
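The two event types the readme describes could be told apart with something like the sketch below. The readme does not state the field separator or timestamp format, so the tab delimiter, the header line, and the sample rows are all assumptions made up for illustration.

```python
import csv
import io
from typing import Iterator, NamedTuple, Optional


class Event(NamedTuple):
    anon_id: str
    query: str
    query_time: str
    item_rank: Optional[int]   # None for query-only events
    click_url: Optional[str]   # None for query-only events


def parse_events(lines, delimiter="\t") -> Iterator[Event]:
    """Yield one Event per data line; the delimiter is an assumption."""
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] == "AnonID":
            continue  # skip blank lines and an assumed header row
        row = row + [""] * (5 - len(row))  # pad query-only rows to 5 fields
        anon_id, query, query_time, rank, url = row[:5]
        yield Event(anon_id, query, query_time,
                    int(rank) if rank else None,
                    url or None)


# Made-up rows: one query-only event, then a click on the same query.
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\texample query\t2006-03-01 10:00:00\t\t\n"
    "1\texample query\t2006-03-01 10:00:05\t1\thttp://www.example.com\n"
)
events = list(parse_events(io.StringIO(SAMPLE)))
clicks = [e for e in events if e.item_rank is not None]
```

Keeping query-only lines as events with `None` in the last two fields makes it easy to count searches that produced no click.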

your_store

7:50 pm on Aug 7, 2006 (gmt 0)

The data is beautiful. I've never seen such a large dataset before.

However, this is a publicity stunt. There's no way AOL accidentally packaged the exact data that Google refused to give to the DoJ and published it to their website. Back then, remember, everyone was clamoring for the SEs to release the same info to the public.

AOL thought they would use that sentiment (months late of course) to build press for the "new, open AOL". However, it quickly backfired and now their PR people have turned it into a mistake by researchers.

walkman

8:05 pm on Aug 7, 2006 (gmt 0)

Apparently, one is planning on killing his wife:
"Bloggers, picking up on the date, noted the search history for user 17556639 which showed that the individual conducted a number of searches for 'how to kill your wife', and was also keen to find images of 'dead people', 'car crashes' and decapitation."
[ft.com...]

See the slippery slope of the Govt asking for data? Is user 17556639 doing research for a book, maybe he saw a CSI episode and followed up link after link, or is he really planning to kill his wife?

nuclei

8:29 pm on Aug 7, 2006 (gmt 0)

maherphil: The readme also says there are approx 20 million queries, when there are really ~36 million.

I would have to agree that this seems like it was a planned release, and now that it got the wrong response, there's a need for damage control. But how much damage control can be done now that the data is everywhere?

blaze

8:40 pm on Aug 7, 2006 (gmt 0)

Yeah, it's a good point. I build virtual goods for virtual games and regularly google pictures for things like nuclear bombs, missile launchers, gatling guns, machine guns, etc.

I'm sure the NSA has a field day with me.

tedster

8:58 pm on Aug 7, 2006 (gmt 0)

I think it's important in assessing this firestorm to understand where this data was posted. It's an open research forum (note that the original page is a wiki page) -- described on the domain's home page this way:

With AOL Research, we are excited to introduce an open-research community
... AOL Research features an open web site where researchers can present ideas, ...
[research.aol.com...]
So this information could well have been posted by a researcher in the wider web community, and not necessarily a TimeWarner employee -- a researcher who was unaware of any potential fallout. I tend to believe AOL on this one and do not suspect any intentional publicity stunt. Still, it's a good question to ask -- who had original access to that data in order to post it?
<It's almost impossible for me to access research.aol.com
right now. Their server load must be immense.>

oneguy

9:38 pm on Aug 7, 2006 (gmt 0)

There's no way AOL accidentally packaged the exact data that Google refused to give to the DoJ and published it to their website.

Is that why this is in the Google Search News forum?

Maybe I'm dreaming or something, but I don't remember ever wondering if I'm dreaming while actually dreaming. At the moment, all I'm thinking is woohoo!

gregbo

9:48 pm on Aug 7, 2006 (gmt 0)

Still, it's a good question to ask -- who had original access to that data in order to post it?

Not necessarily pertinent to the current situation, but in the past, SEs have been contacted by universities to use their access logs for research purposes. There is always the possibility that such logs can be obtained from the university and posted to public places.

jimbeetle

10:17 pm on Aug 7, 2006 (gmt 0)

Everything that the AOL spokesperson, Andrew Weinstein, is saying sounds like it's a screw up (his words) on AOL's part. They aren't blaming a third party.

Right Reading

10:51 pm on Aug 7, 2006 (gmt 0)

One reason that this is such a screw-up is that of course it is not too difficult in many instances to determine the identities of the searchers by looking at their pattern of searches.

tedster

10:59 pm on Aug 7, 2006 (gmt 0)

Admin note: I've been removing posts all day asking how to get a copy of the file. Please note what blaze said above:

This is mirrored all over the net, I'm afraid. It's even on bit torrent.

If you need to get up to speed on bittorrent first, check Wikipedia for more information. Please do not use this thread as a focus point for distributing the file itself.

D_Blackwell

12:13 am on Aug 8, 2006 (gmt 0)

This is going to be handy for getting a look at real-world queries that competitors do well for. As yet, not seeing many surprises exactly, but still nice to have. It's a huge amount of data. Very interesting to watch how people modify their queries.

UserFriendly

12:34 am on Aug 8, 2006 (gmt 0)

So anyone who's done regular searches on their own name or address or domain is right up a certain creek without a paddle.

Nice thinking, AOL.

rollinj

5:58 am on Aug 8, 2006 (gmt 0)

Come to think of it.. who hasn't done a search on their own name?

I know most of my younger friends are all somewhat star struck and have searched up their own name at one time or another, usually using different variations to see what comes up.

Let's just hope they weren't paranoid too and typed in their SIN to see if they were susceptible to identity theft?

vincevincevince

8:30 am on Aug 8, 2006 (gmt 0)

Hands up if you're glad you don't use AOL to search?

trillianjedi

9:19 am on Aug 8, 2006 (gmt 0)

Admin note # 2 : what Tedster said:-

I've been removing posts all day asking how to get a copy of the file. Please note what blaze said above:
This is mirrored all over the net, I'm afraid. It's even on bit torrent.
If you need to get up to speed on bittorrent first, check Wikipedia for more information. Please do not use this thread as a focus point for distributing the file itself.

The file is all over the place - it's easier to find than a vi@gr@ SERP. The real meaty topic here is the fact that the data got spilled, not distributing it.

Thanks,

frakilk

9:27 am on Aug 8, 2006 (gmt 0)

Doing some searches on the data it is possible to find out the top keyphrases used to reach a particular domain. This affects not only AOL users but also websites that appear in AOL's SERPs.

Simply unbelievable AOL, lost for words here.
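The per-domain lookup described in this post amounts to a simple count, roughly as sketched below. The function name, the (query, click URL) pair format, and the example.com rows are hypothetical; nothing here comes from the actual file.

```python
from collections import Counter
from urllib.parse import urlparse


def top_queries_for_domain(events, domain, n=10):
    """Count queries whose click landed on `domain` or a subdomain.

    `events` is an iterable of (query, click_url) pairs; click_url is
    None or empty for query-only events. This input shape is an
    assumption for the sketch.
    """
    counts = Counter()
    for query, click_url in events:
        if not click_url:
            continue
        # netloc is empty when the URL has no scheme; fall back to the raw value.
        host = (urlparse(click_url).netloc or click_url).lower()
        if host == domain or host.endswith("." + domain):
            counts[query] += 1
    return counts.most_common(n)


# Made-up rows for illustration only.
toy_events = [
    ("widgets", "http://www.example.com"),
    ("blue widgets", "http://www.example.com"),
    ("widgets", "http://example.com"),
    ("widgets", None),
    ("gadgets", "http://www.example.org"),
]
top = top_queries_for_domain(toy_events, "example.com")
```

The same few lines work for any site that appears in the click column, which is why the exposure reaches site owners as well as searchers.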

trillianjedi

9:39 am on Aug 8, 2006 (gmt 0)

lost for words here.

LOL. Pun intended?

frakilk

9:47 am on Aug 8, 2006 (gmt 0)

Heh, I'd love to claim that it was, trillianjedi, but alas it was not :-)

I have a feeling that this fiasco is only starting. Imagine the repercussions once AOL users realise that this data is public. I mean I just did a sample search on one anonymous ID and 2153 results were returned. That is truly shocking.

vincevincevince

10:46 am on Aug 8, 2006 (gmt 0)

On the other hand...
...this opens the door to even better vanity searches

Not looking for what people say about you, but what people want to find out about you...

vincevincevince

11:30 am on Aug 8, 2006 (gmt 0)

Another aspect which will really hit AOL in the subsequent lawsuit:-

Anyone whose website has had hits recorded in this file can cross-reference them to find the real user's IP, and if it's a membership site - their user information.

e.g. Brett could find out exactly who user X is, and see their records. I can see lots of hits to webmasterworld.com in the files - and I bet many of them are from members.

I do not believe Brett would do such a thing, but not everyone is as ethical as Brett.
Apologies for the double post - missed the edit window :(
