Greetings. I've written and spoken many times about the sensitivity of search engine query data. We all know about Google's stance in DOJ vs. Google early this year, where Google wisely attempted (for several reasons) to prevent release of such data to a government fishing expedition related to "child protection" legislation. We also know that Gonzales, et al. are merrily pushing mandated data retention laws -- again mainly in the name of child protection -- that would leave Internet users vulnerable to all manner of unreasonable surveillance of their Internet activities. All of this is already enough to be sounding alarm bells regarding the lack of reasonable legislated protections for such data.
The AOL action in releasing the search records of a reported 500K AOL users -- assuming it took place as outlined below -- is probably the most egregious violation of users' search privacy in the history of the Internet, despite the half-hearted attempt at crude anonymization. The unbelievable lack of responsibility or good judgment shown by AOL in this case should be enough to cause any remaining AOL subscribers (or users of their free services) to strongly consider ceasing any further contact with AOL.
Furthermore, we need to accept the fact that search query data is incredibly sensitive and often contains extremely personal information that does not lose its potential for abuse via simplistic forms of anonymization. Nor can we necessarily depend indefinitely on some individual search engines' (e.g., Google) honest and praiseworthy desires to protect such data in the face of intense competition and intrusive government actions.
Search query data can contain the sum total of our work, interests, associations, desires, dreams, fantasies, and even darkest fears.
We must demand that this data be protected.
Subsequent information has revealed that more than 600K users' search data was included in the AOL release.
I have altered the URL reference (3) from the forwarded message below. Anyone who tried to forward that original message to an AOL user may have been in for a surprise.
At least in my experiments just now, AOL rejected that message because URL reference (3) contained a numeric IP address rather than a domain name.
Ironic, isn't it? AOL "protects" users by blocking messages with IP addresses in URLs (can such addresses be suspect? Yeah, but they can easily be legit, too) -- yet they happily release the most private aspects of users' search activities.
It's like a Fellini movie over there, but much less amusing.
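For the curious, the kind of check that would trip on such a URL is easy to sketch. This is only a guess at the general technique, assuming a filter that flags any URL whose host is a literal IP address; how AOL's filter actually works is not known to me, and the function name `url_has_numeric_host` is made up for illustration.

```python
import ipaddress
from urllib.parse import urlsplit


def url_has_numeric_host(url):
    """Return True if the URL's host is a literal IPv4 or IPv6 address.

    Hypothetical helper: an assumption about what such a filter might
    test for, not a description of AOL's actual mail filtering.
    """
    host = urlsplit(url).hostname
    if host is None:
        # No host component at all (e.g. a relative path).
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not parseable as an IP address, so treat it as a domain name.
        return False
    return True


if __name__ == "__main__":
    # 192.0.2.10 is a reserved documentation address, used as a placeholder.
    for candidate in ("http://192.0.2.10/data.html", "http://sethf.com"):
        print(candidate, url_has_numeric_host(candidate))
```

A test this blunt cannot tell a legitimate numeric URL from a suspect one, which is the point made above.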
Begin forwarded message:
From: Seth Finkelstein
AOL Releases Search Logs from 500,000 Users
AOL just released the logs of all searches done by 500,000 of their users over the course of three months earlier this year. That means that if you happened to be randomly chosen as one of these users, everything you searched for from March to May (2006) is now public information on the internet.
This was not a leak - it was intentional. In their desperation to gain recognition from the research community, AOL decided they would compromise their integrity to provide a data set that might become often-cited in research papers: "Please reference the following publication when using this collection..." is the message before the download.
This is a blatant violation of users' privacy. The data is "anonymized", which to AOL means that each screenname was replaced with a unique number. "It is still a research question how much information needs to be anonymized to protect users," says Abdur from AOL. Here are some examples of what you can find in the data:
User 491577 searches for "florida cna pca lakeland tampa", "emt school training florida", "low calorie meals", "infant seat", and "fisher price roller blades". Among user 39509's hundreds of searches are: "ford 352", "oklahoma disciplined pastors", "oklahoma disciplined doctors", "home loans", and some other personally identifying and illegal stuff I'm going to leave out of here. Among user 545605's searches are "shore hills park mays landing nj", "frank william sindoni md", "ceramic ashtrays", "transfer money to china", and "capital gains on sale of house". Compared to some of the data, these examples are on the safe side. I'm leaving out the worst of it - searches for names of specific people, addresses, telephone numbers, illegal drugs, and more. There is no question that law enforcement, employers, or friends could figure out who some of these people are.
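To see why swapping each screenname for a number does so little, here is a minimal sketch of that style of anonymization. It assumes the log is a list of (screenname, query) pairs. The function names, screennames, and queries are all invented for illustration and do not come from the AOL file.

```python
from collections import defaultdict
from itertools import count


def pseudonymize(log):
    """Replace each screenname with a stable unique number.

    Hypothetical re-creation of the approach described above; the real
    AOL procedure is not documented here.
    """
    ids = {}
    next_id = count(1)
    records = []
    for screenname, query in log:
        if screenname not in ids:
            ids[screenname] = next(next_id)
        records.append((ids[screenname], query))
    return records


def profiles(records):
    """Group queries by user number, as any reader of the data can do."""
    grouped = defaultdict(list)
    for user_id, query in records:
        grouped[user_id].append(query)
    return dict(grouped)


if __name__ == "__main__":
    # Invented sample data, not taken from the released file.
    log = [
        ("user_a", "dentist near elm street"),
        ("user_b", "used truck parts"),
        ("user_a", "jane q example phone number"),
    ]
    for user_id, queries in profiles(pseudonymize(log)).items():
        print(user_id, queries)
```

Because the same person always maps to the same number, every query they typed stays linked together, and the queries themselves carry the names, places, and addresses that identify them.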
I hope others can find more examples in the data, which is up for download over here. The data set is very large when uncompressed, which makes it pretty hard to work with, but someone should set up a web interface so people can browse it (or even 10% of it) without having to download the 400 MB file. If you make a mirror or better interface to the data, or find other examples, let me know and I'll put a link up here.
This is the same kind of data that the DOJ wanted from Google back in March. The ruling in that case allowed Google to keep all query logs secret. Now any government can just go download the data from AOL.
It's unclear if this is the type of data AOL released to the government back when Google refused to comply. If nothing else, this should be a good example of why search history needs strong privacy protection.
Thanks to Greg Linden for pointing this out here.
Update 2: The md5 of the file AOL posted (and now removed) is 31cd27ce12c3a3f2df62a38050ce4c0a. I'm posting it so you can make sure you have a valid copy, but so far none of the copies I've seen are fake.
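If you want to check a copy against that hash, a short script will do it. This is a minimal sketch using Python's standard `hashlib`; the helper names and the file path are placeholders, since the actual filename of the download is not given here. Keep in mind that an MD5 match guards against a corrupted or truncated copy, but MD5 is not collision-resistant, so it is weak evidence against deliberate tampering.

```python
import hashlib

# MD5 quoted in the update above.
POSTED_MD5 = "31cd27ce12c3a3f2df62a38050ce4c0a"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks.

    Chunked reading keeps memory use flat for a file of several
    hundred megabytes.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_posted_md5(path, expected=POSTED_MD5):
    """Compare a local copy against the posted digest (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    import sys

    # Usage: python check_md5.py <path-to-your-copy>
    if len(sys.argv) == 2:
        print("match" if matches_posted_md5(sys.argv[1]) else "MISMATCH")
```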
Update: Seems like AOL took it down. There are some mirrors of the data in the comments of the digg story, linked below. I estimate about 1000 people have the file, so it's definitely going to be circulated around. The main AOL research page is still up, with some other data collections. The Google cache of the download page is still up, but you can't get the data. Here's discussion at other sites:
Seth Finkelstein Consulting Programmer http://sethf.com