AOL "Angry & Upset" After Releasing Search Data

nuclei

2:53 am on Aug 7, 2006 (gmt 0)

Ouch! The title says it all.

AOL Research apparently inadvertently posted a gzipped tarball of 500,000 websites sampled, including every keyword they are ranking for. This was first discovered at the URL:

[research.aol.com...]

[edited by: engine at 5:06 pm (utc) on Aug. 7, 2006]

oneguy

1:10 pm on Aug 8, 2006 (gmt 0)

Is that why this is in the Google Search News forum?

Duh, nevermind. I think the excitement got to me.

Anyone whose website has had hits recorded in this file can cross-reference them to find the real user's IP, and if it's a membership site, their user information.

I don't happen to do this, but a friend of mine who is quite smart and has plenty to protect said that he accidentally types usernames / passwords into search boxes. Not really a big deal unless you look at all of his activity as a whole. Seems the dots would be easy to connect.

fabricator

4:29 am on Aug 9, 2006 (gmt 0)

I can see lots of users going through AOHell.

The other thing is that the media have picked up on a lot of complaints about AOL's service, mainly the extreme problems canceling it, such as AOL staff hanging up on people who are simply trying to cancel the account of a dead spouse.

Releasing a dataset of that size is a huge stuff up; I can see how anyone could accidentally upload a 400+ meg file and link it to a page on the website, complete with instructions.

What's next, AOL, releasing your customers' addresses and phone numbers?

mzanzig

8:57 am on Aug 9, 2006 (gmt 0)

OK - let's value the data that was presented by AOL.

- The data is by AOL users, which is just a subset of Internet users. This user group may be less experienced than average Internet users.

- The data is very much US focused. Non-US searches or sites may not be reflected well in this data set.

- The data is still a random sample, i.e. it does not cover all searches performed during the sampling period. Thus, rare search terms might not be representative in this sample. The sheer size of the data sample is interesting and (I bet) still representative for broader search terms or bigger sites.

- The data is limited to a certain timeframe (March-May), so a certain seasonal influence may appear (fewer searches for Christmas gifts, Halloween, or ski vacations).

What else do we have to keep in mind when discussing this data set?

blaze

12:29 pm on Aug 9, 2006 (gmt 0)

Who generated the SERPs for this data? Is this an AOL engine or a Google engine?

cpnmm

1:19 pm on Aug 9, 2006 (gmt 0)

What sort of software would make this reverse lookup of search information impossible? Would something like Anonymizer do this, or would you just leave a trail of fake IPs that could just as easily identify you?

maherphil

2:48 pm on Aug 9, 2006 (gmt 0)

No software. Let me explain.

So here is a record of data.

2281868 how destroy demons that live in apt above 2006-03-01 5 http://www.example.com

The last column is the SERP the user clicked from the 'how destroy demons that live in apt above' query.

Since the owner of www.example.com has log files, she can find the IP address of who typed that particular query on what day and then contact the ISP to find who had that IP to find out exactly who it is.
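To make the record layout concrete, here is a minimal Python sketch that splits one such line into fields. It assumes the columns are tab-separated and that the number before the URL is the position of the clicked result; the post above only confirms that the last column is the clicked SERP. The names ClickRecord and parse_record are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClickRecord:
    user_id: str          # numeric ID in the first column
    query: str            # the search text as typed
    date: str             # kept as a string, e.g. "2006-03-01"
    rank: Optional[int]   # ASSUMED: position of the clicked result
    url: Optional[str]    # clicked URL (last column), if any


def parse_record(line: str) -> ClickRecord:
    """Split one line into fields, assuming tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    user_id, query, date = fields[0], fields[1], fields[2]
    # ASSUMED: rows without a click leave the last two columns empty or absent.
    rank = int(fields[3]) if len(fields) > 3 and fields[3] else None
    url = fields[4] if len(fields) > 4 and fields[4] else None
    return ClickRecord(user_id, query, date, rank, url)


sample = (
    "2281868\thow destroy demons that live in apt above"
    "\t2006-03-01\t5\thttp://www.example.com"
)
record = parse_record(sample)
print(record.query, "->", record.url)
```

Nothing in the record itself names the searcher; the exposure comes from joining the query, date and clicked URL against a site's own logs.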

<Sorry, no specific domains.
See Forum Charter [webmasterworld.com]>

[edited by: tedster at 2:50 pm (utc) on Aug. 9, 2006]

MrMacphisto

3:02 pm on Aug 9, 2006 (gmt 0)

Since the owner of www.example.com has log files, she can find the IP address of who typed that particular query on what day and then contact the ISP to find who had that IP to find out exactly who it is.

That's not my IP, I'm just borrowing it from a distant eastern european relative.

[edited by: MrMacphisto at 3:08 pm (utc) on Aug. 9, 2006]

MrMacphisto

3:10 pm on Aug 9, 2006 (gmt 0)

Who generated the SERPs for this data? Is this an AOL engine or a Google engine?

Has anyone been able to compare their SERP data with the AOL data yet?
(I'm still searching for my archives)

KenB

3:22 pm on Aug 9, 2006 (gmt 0)

I find it interesting that nobody has really discussed how the government (particularly law enforcement) could make use of this data. CNet News has an interesting article on this and compiled some very disturbing search histories on some particular users.

Maybe this evidence wouldn't be good enough for use in court, but I bet it could cause some very interesting investigations to be opened.

The government also doesn't need to continue their lawsuit against Google to get at Google's search data as this more than covers what they wanted.

As others have pointed out, what this data proves is that "anonymous data" isn't so anonymous and that there is no way Google should be forced to release its search history to the government.

TypicalSurfer

3:26 pm on Aug 9, 2006 (gmt 0)

It's not the gov't you need to worry about, it's the SEs/marketers who will do the real damage.

Right Reading

3:34 pm on Aug 9, 2006 (gmt 0)

The New York Times has already put a name to one of the searchers, who said, "My goodness, it's my whole personal life. I had no idea somebody was looking over my shoulder."

jjansen

3:34 pm on Aug 9, 2006 (gmt 0)

As a researcher who has employed search engine transaction logs in research projects for nearly a decade, I believe the concerns about the AOL data release are out of proportion to reality.

It is VERY difficult using just query terms to identify a particular searcher, which is why researchers have been struggling with personalization for nearly two decades.

There is no other way to get real world interaction data from a significant sample of Web users unless the search engine companies provide it to academic researchers.

Are there potential privacy concerns with such data releases? Yes. Are there potentially great benefits with such data releases? Yes.

digitalghost

3:44 pm on Aug 9, 2006 (gmt 0)

>>It is VERY difficult using just query terms to identify a particular searcher

See Queries For User 4417749 [nytimes.com]. Difficult is very subjective.

Didn't take long to identify that woman, and with 3 months' worth of data. What can be done with a year's worth of queries? Two years?

woop01

3:56 pm on Aug 9, 2006 (gmt 0)

Nope, not too difficult at all for some users.

I identified a poster on another forum I frequent pretty easily by noticing searches for that forum, where his kids go to school, where he went to high school, the neighborhood he lives in, sports related searches, and a cruise he went on. It was pretty shocking to read through the search history and realize "that's so-and-so".

And to stress the huge violation of privacy this is, I also know where he banks, who he has his mortgage through, what types of porn he likes, and when he likes to look for it.

tedster

4:07 pm on Aug 9, 2006 (gmt 0)

In the NYT story linked by digitalghost above, John Battelle [pubcon.com] summed up my feelings pretty well:

John Battelle, author of the 2005 book "The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture," said the AOL misstep, while unfortunate, could have a silver lining if people begin to understand what is at stake. In his book, he says search engines are mining the priceless "database of intentions" formed by the world's search requests.
"It's only by these kinds of screw-ups and unintended behind-the-curtain views that we can push this dialogue along," Battelle said. "As unhappy as I am to see this data on people leaked, I'm heartened that we will have this conversation as a culture, which is long overdue."

Now I don't seem to be wearing a tin-foil hat to my family and friends anymore.

gregbo

9:27 pm on Aug 9, 2006 (gmt 0)

Didn't take long to identify that woman, and with 3 months' worth of data. What can be done with a year's worth of queries? Two years?

It is very difficult in general to identify people from their (anonymized) clickstreams. OTOH, it is very easy to identify someone who freely discloses information about themselves.

People such as Lauren Weinstein from the PFIR have been pushing the SEs to delete query logs after a few months. However, such data can be valuable in determining whether click fraud, impression fraud, etc. has occurred.

sun818

9:32 pm on Aug 9, 2006 (gmt 0)

The raw data itself in its text form is about 2 gigs. You definitely can't work with the entire data set in Microsoft Access due to file size limitations. I think you'd have to work with a database server (e.g. MySQL, SQL Server, Oracle, etc.). If you could work with the entire data set, it'd be interesting to see:

1) how many people click on first result, second result, etc.
2) how many words are in a query
3) what your competitor's search terms are
4) what kind of person visits your site or your competitor's site.
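Items 1 and 2 on that list are simple tallies once the rows are loaded. A rough Python sketch, using a few made-up rows in place of the real file and assuming each row has already been reduced to a (query, clicked position) pair, with None meaning no click; the summarize function is hypothetical:

```python
from collections import Counter


def summarize(rows):
    """rows: iterable of (query, rank) pairs; rank is None when nothing was clicked."""
    clicks_by_rank = Counter()
    words_per_query = Counter()
    total = 0
    for query, rank in rows:
        total += 1
        # Word count here is a plain whitespace split of the query text.
        words_per_query[len(query.split())] += 1
        if rank is not None:
            clicks_by_rank[rank] += 1
    # Share of ALL rows (clicked or not) that ended in a click at each position.
    share_by_rank = {r: 100.0 * n / total for r, n in sorted(clicks_by_rank.items())}
    return share_by_rank, dict(words_per_query)


# Made-up sample rows, not taken from the AOL file.
rows = [
    ("cheap flights", 1),
    ("cheap flights", 2),
    ("how to fix a leaky tap", None),
    ("weather", 1),
]
share_by_rank, words_per_query = summarize(rows)
print(share_by_rank)
print(words_per_query)
```

Dividing by all rows rather than only clicked rows is a choice; either way the percentages stay under 100%. For the full 2 gig file the same counts would more likely be done as GROUP BY queries on the database server.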

Oliver Henniges

9:48 pm on Aug 9, 2006 (gmt 0)

"My goodness, it's my whole personal life. I had no idea somebody was looking over my shoulder."

That famous novel with the prophecies on such issues was released in 1948. I have always wondered how few people have obviously read it, and how even fewer seem to have understood its brilliance. (OT: We have never been at war with the communist Eurasian bloc, have we?)

I'd recon the very last sentence.
It's the reason why I always use my full name as a nick.

Don't even THINK of doing evil.

woop01

9:55 pm on Aug 9, 2006 (gmt 0)

Yeah, you do need a large-scale server to handle it. I only put about 80% of the data into my server but here's the answer to your first question...

46.38% of searches show no clicks at all.

By Position

122.73%
26.40%
34.53%
43.24%
52.61%
62.14%
71.81%
81.60%
91.51%
101.59%
110.35%
120.30%
130.28%
140.26%
150.25%
160.21%
170.19%
180.18%
190.17%
200.16%

By Page

1st Page (1-10) - 48.16%
2nd Page (11-25) - 3.1%
3rd Page (26-40) - .93%
4th Page (41-50) - .44%

slade7

10:19 pm on Aug 9, 2006 (gmt 0)

I got about 13 million rows of it, mainly to see if MySQL could swallow it, and it seems to work pretty handily on a local Windows box.

Some pretty disturbing stuff in there, but then we are talking about AOL users.

sun818

10:27 pm on Aug 9, 2006 (gmt 0)

By Position
122.73%
26.40%
34.53%

Can you explain what you mean by this?

I am thinking the percentages should be under 100% as a count of all rows whether you include or exclude unclicked rows.

netchicken1

10:47 pm on Aug 9, 2006 (gmt 0)

Now that people are being identified from the data, does this open AOL to lawsuits for publishing private information on the net?

If so, it could be scarily expensive with even a small % of over 60 thousand people

mitomac

11:34 pm on Aug 9, 2006 (gmt 0)

It is VERY difficult using just query terms to identify a particular searcher

It's all there and easy to find.

For example:

NNN-NN-NNNN

zgrep -h '[[:digit:]]\{3\}-[[:digit:]]\{2\}-[[:digit:]]\{4\}' *.gz

I feel bad for these people.

woop01

12:23 am on Aug 10, 2006 (gmt 0)

Sun, sorry about that, I didn't notice the tab got left out...

1 - 22.73%
2 - 6.40%
3 - 4.53%
4 - 3.24%
5 - 2.61%
6 - 2.14%
7 - 1.81%
8 - 1.60%
9 - 1.51%
10 - 1.59%
11 - 0.35%
12 - 0.30%
13 - 0.28%
14 - 0.26%
15 - 0.25%
16 - 0.21%
17 - 0.19%
18 - 0.18%
19 - 0.17%
20 - 0.16%
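As a sanity check, the per-position figures can be added up and compared with the "By Page" totals posted earlier. A small Python sketch using only the numbers above; Decimal is used so the cents add up exactly, and share_for_range is a made-up helper name:

```python
from decimal import Decimal

# Click share by result position, copied from the list above (percent of searches).
by_position = {
    1: "22.73", 2: "6.40", 3: "4.53", 4: "3.24", 5: "2.61",
    6: "2.14", 7: "1.81", 8: "1.60", 9: "1.51", 10: "1.59",
    11: "0.35", 12: "0.30", 13: "0.28", 14: "0.26", 15: "0.25",
    16: "0.21", 17: "0.19", 18: "0.18", 19: "0.17", 20: "0.16",
}


def share_for_range(first, last):
    """Sum the click share for positions first..last inclusive."""
    return sum((Decimal(by_position[p]) for p in range(first, last + 1)), Decimal("0"))


first_page = share_for_range(1, 10)
positions_11_to_20 = share_for_range(11, 20)
print("positions 1-10: ", first_page)
print("positions 11-20:", positions_11_to_20)
```

Positions 1-10 come to 48.16%, the same as the first page figure. Positions 11-20 come to 2.35%, which fits under the 3.1% quoted for the 11-25 bucket since positions 21-25 are not listed here.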

followgreg

1:15 am on Aug 10, 2006 (gmt 0)

If you want my opinion the whole thing is a public relations strategy.

That's the best way to convince people that AOL is still a popular search website.

Personally when I see referrals from AOL in the logs, I open a bottle of champagne and celebrate... I didn't get a chance to get drunk for a year or so though.

Right Reading

1:40 am on Aug 10, 2006 (gmt 0)

One interesting, if not surprising, thing is how ill-formed the bulk of the queries are. Most people appear not to have a clue about searching and just start typing whatever pops into their head. This lends a large element of chance to the whole venture. Presumably their click-throughs are equally, uh, spontaneous. No wonder fewer than 1 percent make it to the third page of results.

[edited by: Right_Reading at 2:09 am (utc) on Aug. 10, 2006]

mzanzig

5:04 am on Aug 10, 2006 (gmt 0)

Greg:

If you want my opinion the whole thing is a public relations strategy. That's the best way to convince people that AOL is still a popular search website.

Nope. I heartily disagree.

If you were an AOL user, opened up the NYT, and read that article, would you think: "Wow. These guys at AOL are awesome, they collect all the data. Maybe I'm going to make it to the NYT one day." -or- "Uh, what about all this crap on security and privacy that AOL has been talking about for years? These guys were even charging a premium for the promise of safety. I'm going to end my membership!"

My bet is that the 2nd thing is more likely to happen. People do not like to see their data somewhere on the net. They will use AOL and AOL Search with less confidence than before, if at all.

To AOL, this is the ultimate nightmare. Forget about the job cuts, forget about the carve-out, *this* is the worst thing that could happen. Their product is affected, not just the ownership or business operations.

There would have been 1 million ways to get a more positive PR message across.

Well, just my $0.02

herb

4:24 pm on Aug 10, 2006 (gmt 0)

Google to Keep Storing Search Requests

Google Inc. CEO Eric Schmidt said Wednesday the privacy concerns raised by that breach won't change his company's practice of storing the inquiries made by its users.
Mountain View-based Google owns a 5 percent stake in AOL, which also accounted for about $330 million of the search engine's revenue during the first half of this year. AOL also depends on Google's algorithms for its search results.

MSNBC [msnbc.msn.com]

davewray

4:35 pm on Aug 10, 2006 (gmt 0)

This is a veritable goldmine! Wow, gold for marketers and SEOs alike. Here are a few things that can be gleaned (with some work)...

1. Type in a popular search term. Check to see which sites got clicked on. For example, why did SERP 3 get clicked on three times as much as SERP 1? Was it because SERP 1 was irrelevant...or was SERP three written FAR better than SERP one? Marketers can glean (after much research) what language/words work best to attract attention/clicks.

2. Search patterns of users.

3. New negative keywords for campaigns (Adwords) that you hadn't thought of before.

4. As already summarized....real statistical evidence of how many (on average) click on the first result as opposed to the 2nd/3rd/so on result. Proof that the result at the top of the second page gets more play than the result at the bottom of the first page (long been suspected)...

I'm sure there are dozens more uses for this data! Anyone else have any other ideas what this data would be useful for from a marketer/SEO's perspective?

Dave.

davewray

7:47 pm on Aug 10, 2006 (gmt 0)

Another use is by performing the "random" search of users you can get some insight into some relatively untouched, lucrative niches....ones you may have never even thought about...
