Forum Moderators: Robert Charlton & goodroi
AOL Research apparently inadvertently posted a gzipped tarball of 500,000 websites sampled, including every keyword they are ranking for. This was first discovered at the URL:
[research.aol.com...]
[edited by: engine at 5:06 pm (utc) on Aug. 7, 2006]
Is that why this is in the Google Search News forum?
Duh, never mind. I think the excitement got to me.
Anyone whose website has had hits recorded in this file can cross-reference them to find the real user's IP, and if it's a membership site, their user information.
I don't happen to do this, but a friend of mine who is quite smart and has plenty to protect said that he accidentally types usernames / passwords into search boxes. Not really a big deal unless you look at all of his activity as a whole. Seems the dots would be easy to connect.
The other thing is that the media have picked up on a lot of complaints about AOL's service, mainly the extreme problems canceling it, such as AOL staff hanging up on people who are simply trying to cancel the account of a dead spouse.
Releasing a dataset of that size is a huge stuff-up. I can see how anyone could accidentally upload a 400+ meg file and link it to a page on the website, complete with instructions.
What's next, AOL, releasing your customers' addresses and phone numbers?
- The data is by AOL users, which is just a subset of Internet users. This user group may be less experienced than average Internet users.
- The data is very much US focused. Non-US searches or sites may not be reflected well in this data set.
- The data is still a random sample, i.e. it does not cover all searches performed during the sampling period. Thus, rare search terms might not be representative in this sample. The sheer size of the data sample is interesting and (I bet) still representative for broader search terms or bigger sites.
- The data is limited to a certain timeframe (March-May), so a certain seasonal influence may appear (fewer searches for Christmas gifts, Halloween, or ski vacations).
What else do we have to keep in mind when discussing this data set?
So here is a record of data.
2281868 how destroy demons that live in apt above 2006-03-01 5 http://www.example.com
The last column is the URL of the result the user clicked on the SERP for the 'how destroy demons that live in apt above' query.
Since the owner of www.example.com has log files, she can find the IP address of whoever typed that particular query on that day, then contact the ISP to find out exactly who had that IP.
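For anyone wanting to poke at the file, here is a minimal sketch of reading one such record. It assumes the fields are tab-separated and come in the order user ID, query, date, rank of the clicked result, clicked URL, and that rows without a click simply omit the last two fields. The names `QueryRecord` and `parse_record` are made up for illustration.

```python
from typing import NamedTuple, Optional


class QueryRecord(NamedTuple):
    anon_id: int               # anonymized user number
    query: str                 # what the user typed
    query_time: str            # kept as text; the sample above only shows a date
    item_rank: Optional[int]   # position of the clicked result, if any
    click_url: Optional[str]   # clicked URL, if any


def parse_record(line: str) -> QueryRecord:
    """Split one tab-separated line into its five fields (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: rows with no click carry fewer fields, so pad to five.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, rank, url = fields[:5]
    return QueryRecord(
        anon_id=int(anon_id),
        query=query,
        query_time=query_time,
        item_rank=int(rank) if rank else None,
        click_url=url or None,
    )


record = parse_record(
    "2281868\thow destroy demons that live in apt above\t2006-03-01\t5\thttp://www.example.com"
)
print(record.item_rank, record.click_url)
```

Here the user clicked the fifth result, which is the number in the fourth column.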
<Sorry, no specific domains.
See Forum Charter [webmasterworld.com]>
[edited by: tedster at 2:50 pm (utc) on Aug. 9, 2006]
Since the owner of www.example.com has log files, she can find the IP address of whoever typed that particular query on that day, then contact the ISP to find out exactly who had that IP.
That's not my IP, I'm just borrowing it from a distant Eastern European relative.
[edited by: MrMacphisto at 3:08 pm (utc) on Aug. 9, 2006]
Maybe this evidence wouldn't be good enough for use in court, but I bet it could cause some very interesting investigations to be opened.
The government also doesn't need to continue their lawsuit against Google to get at Google's search data as this more than covers what they wanted.
As others have pointed out, what this data proves is that "anonymous data" isn't so anonymous and that there is no way Google should be forced to release its search history to the government.
It is VERY difficult using just query terms to identify a particular searcher, which is why researchers have been struggling with personalization for nearly two decades.
There is no other way to get real world interaction data from a significant sample of Web users unless the search engine companies provide it to academic researchers.
Are there potential privacy concerns with such data releases? Yes. Are there potentially great benefits with such data releases? Yes.
See Queries For User 4417749 [nytimes.com]. Difficult is very subjective.
Didn't take long to identify that woman, and with 3 months' worth of data. What can be done with a year's worth of queries? Two years?
I identified a poster on another forum I frequent pretty easily by noticing searches for that forum, where his kids go to school, where he went to high school, the neighborhood he lives in, sports-related searches, and a cruise he went on. It was pretty shocking to read through the search history and realize "that's so-and-so".
And to stress the huge violation of privacy this is, I also know where he banks, who he has his mortgage through, what types of porn he likes, and when he likes to look for it.
John Battelle, author of the 2005 book "The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture," said the AOL misstep, while unfortunate, could have a silver lining if people begin to understand what is at stake. In his book, he says search engines are mining the priceless "database of intentions" formed by the world's search requests. "It's only by these kinds of screw-ups and unintended behind-the-curtain views that we can push this dialogue along," Battelle said. "As unhappy as I am to see this data on people leaked, I'm heartened that we will have this conversation as a culture, which is long overdue."
Now I don't seem to be wearing a tin-foil hat to my family and friends anymore.
Didn't take long to identify that woman, and with 3 months' worth of data. What can be done with a year's worth of queries? Two years?
It is very difficult in general to identify people from their (anonymized) clickstreams. OTOH, it is very easy to identify someone who freely discloses information about themselves.
People such as Lauren Weinstein from the PFIR have been pushing the SEs to delete query logs after a few months. However, such data can be valuable in determining whether click fraud, impression fraud, etc. has occurred.
1) how many people click on first result, second result, etc.
2) how many words are in a query
3) what your competitor's search terms are
4) what kind of person visits your site or your competitor's site.
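Points 1 and 2 in the list above are simple counting jobs once the records are parsed. A rough sketch follows, assuming each record has already been reduced to a (query, clicked rank) pair. The four rows are invented purely to show the shape of the calculation and are not taken from the AOL file.

```python
from collections import Counter

# Hypothetical rows as (query, clicked_rank); None means the search had no click.
rows = [
    ("cheap flights", 1),
    ("cheap flights to boston", 2),
    ("weather", None),
    ("used cars", 1),
]


def clicks_by_rank(rows):
    """Share of all searches that ended in a click on each result position."""
    counts = Counter(rank for _, rank in rows if rank is not None)
    total = len(rows)
    return {rank: n / total for rank, n in sorted(counts.items())}


def words_per_query(rows):
    """How many queries have one word, two words, and so on."""
    return Counter(len(query.split()) for query, _ in rows)


rank_share = clicks_by_rank(rows)
length_counts = words_per_query(rows)
print(rank_share)
print(length_counts)
```

Dividing by all searches, not only those with a click, keeps the no-click searches in the denominator.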
"My goodness, it's my whole personal life. I had no idea somebody was looking over my shoulder."
That famous novel with the prophecies on such issues was written in 1948. I have always wondered how few people have obviously read it, and how even fewer seem to have understood its brilliance. (OT: We have never been at war with the communist Eurasian bloc, have we?)
I'd reckon the very last sentence.
It's the reason why I always use my full name as a nick.
Don't even THINK of doing evil.
46.38% of searches show no clicks at all.
By Position
1 - 22.73%
2 - 6.40%
3 - 4.53%
4 - 3.24%
5 - 2.61%
6 - 2.14%
7 - 1.81%
8 - 1.60%
9 - 1.51%
10 - 1.59%
11 - 0.35%
12 - 0.30%
13 - 0.28%
14 - 0.26%
15 - 0.25%
16 - 0.21%
17 - 0.19%
18 - 0.18%
19 - 0.17%
20 - 0.16%
By Page
1st Page (1-10) - 48.16%
2nd Page (11-25) - 3.1%
3rd Page (26-40) - 0.93%
4th Page (41-50) - 0.44%
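As a quick sanity check on those figures, here is a small sketch that reads each By Position line as a rank followed by its percentage (so position 1 is 22.73%) and adds up positions 1-10. The dictionary is just the posted numbers retyped, and the variable names are made up.

```python
# Per-position click share in percent, read from the By Position list above.
position_share = {
    1: 22.73, 2: 6.40, 3: 4.53, 4: 3.24, 5: 2.61,
    6: 2.14, 7: 1.81, 8: 1.60, 9: 1.51, 10: 1.59,
    11: 0.35, 12: 0.30, 13: 0.28, 14: 0.26, 15: 0.25,
    16: 0.21, 17: 0.19, 18: 0.18, 19: 0.17, 20: 0.16,
}

# Sum of positions 1-10, to compare with the posted "1st Page (1-10)" figure.
first_page = sum(share for pos, share in position_share.items() if pos <= 10)

# Share of first-page clicks that go to the top three results.
top_three = sum(position_share[pos] for pos in (1, 2, 3))
top_three_of_first_page = top_three / first_page

print(f"positions 1-10: {first_page:.2f}%")
print(f"top three as a fraction of first-page clicks: {top_three_of_first_page:.2f}")
```

The positions 1-10 total comes to 48.16%, the same as the posted 1st Page figure, which suggests the rank/percentage reading is the right one.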
That's the best way to convince people that AOL is still a popular search website.
Personally, when I see referrals from AOL in the logs, I open a bottle of champagne and celebrate... I haven't had a chance to get drunk for a year or so, though.
[edited by: Right_Reading at 2:09 am (utc) on Aug. 10, 2006]
If you want my opinion, the whole thing is a public relations strategy. That's the best way to convince people that AOL is still a popular search website.
Nope. I heartily disagree.
If you were an AOL user and opened up the NYT and read that article, would you think: "Wow. These guys at AOL are awesome, they collect all the data. Maybe I'm going to make it to the NYT one day." -or- "Uh, what about all this crap on security and privacy that AOL has been talking about for years? These guys were even charging a premium for the promise of safety. I'm going to end my membership!"
My bet is that the 2nd thing is more likely to happen. People do not like to see their data somewhere on the net. They will use AOL and AOL Search with less confidence than before, if at all.
To AOL, this is the ultimate nightmare. Forget about the job cuts, forget about the carve-out, *this* is the worst thing that could happen. Their product is affected, not just the ownership or business operations.
There would have been 1 million ways to get a more positive PR message across.
Well, just my $0.02
Google Inc. CEO Eric Schmidt said Wednesday the privacy concerns raised by that breach won't change his company's practice of storing the inquiries made by its users. Mountain View-based Google owns a 5 percent stake in AOL, which also accounted for about $330 million of the search engine's revenue during the first half of this year. AOL also depends on Google's algorithms for its search results.
MSNBC [msnbc.msn.com]
1. Type in a popular search term. Check to see which sites got clicked on. For example, why did result 3 get clicked on three times as much as result 1? Was it because result 1 was irrelevant, or was the listing for result 3 written FAR better than the one for result 1? Marketers can glean (after much research) what language/words work best to attract attention/clicks.
2. Search patterns of users.
3. New negative keywords for campaigns (AdWords) that you hadn't thought of before.
4. As already summarized: real statistical evidence of how many (on average) click on the first result as opposed to the 2nd, 3rd, and so on. Proof that the result at the top of the second page gets more play than the result at the bottom of the first page (long been suspected)...
I'm sure there are dozens more uses for this data! Anyone else have any other ideas what this data would be useful for from a marketer/SEO's perspective?
Dave.