[Infowarrior] - AOL Releases Search Logs from 500,000 Users

Richard Forno rforno at infowarrior.org
Mon Aug 7 09:49:40 EDT 2006


(it's floating around on torrent and other mirrors, I've seen, too....rf)

AOL Releases Search Logs from 500,000 Users
http://www.ugcs.caltech.edu/~dangelo/aol-search-query-logs/

Update 2: The md5 of the file AOL posted (and now removed) is
31cd27ce12c3a3f2df62a38050ce4c0a. I'm posting it so you can make sure you
have a valid copy, but so far none of the copies I've seen are fake.
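
For anyone who wants to check a mirror against that checksum, here is a
minimal Python sketch. The md5 value is the one quoted above; the function
names and the file name in the usage note are placeholders of mine, not
anything AOL published.

```python
import hashlib

# MD5 quoted in the post for the file AOL published.
EXPECTED_MD5 = "31cd27ce12c3a3f2df62a38050ce4c0a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_copy(path, expected=EXPECTED_MD5):
    """True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

Calling is_valid_copy("downloaded-mirror-file") with the path of a mirrored
copy (hypothetical name) returns True only when the digest matches. A match
shows the copy is the same file others are describing; MD5 is not
collision-resistant, so it is only a sanity check against bad mirrors.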

Update: AOL appears to have taken it down. There are some mirrors of the
data in the comments of the digg story, linked below. I estimate about 1000
people have the file, so it's definitely going to be circulated. The main
AOL research page is still up, with some other data collections. The Google
cache of the download page is still up, but you can't get the data from it.
Here's discussion at other sites:

    * siliconbeat
    * techcrunch
    * digg
    * reddit
    * zoli's blog 

AOL just released the logs of all searches done by 500,000 of their users
over the course of three months earlier this year. That means that if you
happened to be randomly chosen as one of these users, everything you
searched for from March to May (2006) is now public information on the
internet.

This was not a leak - it was intentional. In their desperation to gain
recognition from the research community, AOL decided they would compromise
their integrity to provide a data set that might become widely cited in
research papers: "Please reference the following publication when using this
collection..." is the message before the download.

This is a blatant violation of users' privacy. The data is "anonymized",
which to AOL means that each screenname was replaced with a unique number.
"It is still a research question how much information needs to be anonymized
to protect users," says Abdur from AOL. Here are some examples of what you
can find in the data:

User 491577 searches for "florida cna pca lakeland tampa", "emt school
training florida", "low calorie meals", "infant seat", and "fisher price
roller blades". Among user 39509's hundreds of searches are: "ford 352",
"oklahoma disciplined pastors", "oklahoma disciplined doctors", "home
loans", and some other personally identifying and illegal stuff I'm going to
leave out of here. Among user 545605's searches are "shore hills park mays
landing nj", "frank william sindoni md", "ceramic ashtrays", "transfer money
to china", and "capital gains on sale of house". Compared to some of the
data, these examples are on the safe side. I'm leaving out the worst of it -
searches for names of specific people, addresses, telephone numbers, illegal
drugs, and more. There is no question that law enforcement, employers, or
friends could figure out who some of these people are.
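
To illustrate why this kind of "anonymization" protects so little, here is
a small Python sketch of the scheme as described above: each screenname is
swapped for a unique number and the queries are left untouched. The
screennames and function names are made up for illustration; the queries
are taken from the examples above. It is a sketch of the idea, not AOL's
actual procedure.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(log):
    """Replace each screenname with a unique number; queries stay as-is."""
    ids = {}
    next_id = count(1)
    released = []
    for screen_name, query in log:
        if screen_name not in ids:
            ids[screen_name] = next(next_id)
        released.append((ids[screen_name], query))
    return released


def profiles(released):
    """Group the released rows by user number, as any reader of the data can."""
    grouped = defaultdict(list)
    for user_id, query in released:
        grouped[user_id].append(query)
    return dict(grouped)


# Hypothetical screennames paired with queries quoted in the post.
raw_log = [
    ("screenname_a", "emt school training florida"),
    ("screenname_b", "ford 352"),
    ("screenname_a", "low calorie meals"),
    ("screenname_a", "infant seat"),
]

released = pseudonymize(raw_log)
by_user = profiles(released)
```

The name is gone, but every query by the same person still carries the same
number, so by_user[1] holds that person's whole search history in one place.
If the queries themselves contain a name, a town, or an address, the number
does nothing to hide who typed them.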

I hope others can find more examples in the data, which is up for download.
The data set is very large when uncompressed, which makes it pretty hard to
work with, but someone should set up a web interface so people can browse it
(or even 10% of it) without having to download the 400 MB file. If you make
a mirror or a better interface to the data, or find other examples, let me
know and I'll put a link up here.
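
As a starting point for a browsable 10% subset, here is a rough Python
sketch that streams a log file and keeps all rows for about one user in
ten. It assumes each line is tab-separated with the numeric user ID in the
first column; that layout, the function name, and the file paths are my
assumptions, so adjust them to whatever the files actually contain.

```python
def sample_users(in_path, out_path, keep_one_in=10):
    """Copy every row belonging to roughly 1 in N users; return rows kept.

    Assumes tab-separated lines with a numeric user ID in the first column.
    The file is read line by line, so it never has to fit in memory.
    """
    kept = 0
    with open(in_path, encoding="utf-8", errors="replace") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            user_id = line.split("\t", 1)[0]
            if not user_id.isdecimal():
                continue  # skip a header row or a malformed line
            # Selecting on the ID keeps each chosen user's history whole.
            if int(user_id) % keep_one_in == 0:
                dst.write(line)
                kept += 1
    return kept
```

Sampling by user number rather than by row keeps each selected user's
searches together, which is what makes the subset worth browsing. How close
the result is to 10% depends on how the numbers are distributed.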

This is the same data that the DOJ wanted from Google back in March. The
ruling in that case allowed Google to keep all query logs secret. Now any
government can just go download the data from AOL.

It's unclear if this is the type of data AOL released to the government back
when Google refused to comply. If nothing else, this should be a good
example of why search history needs strong privacy protection.

Thanks to Greg Linden for pointing this out.



