June 16, 2011

Why Data Anonymization is Often Preferable to Data Deletion

Earlier today in some other venues, I sent out a brief pointer to a new report from the Ontario, Canada Information & Privacy Commissioner -- a study explaining how, contrary to some recent memes that have received a lot of attention, it is possible to anonymize data in ways that are resistant to "re-identification" abuses.

Several readers have asked me why I consider this report to be important, recommended reading. Essentially, they're asking: "Why not just delete the data and be done with it?"

While it may seem neat and clean to just quickly delete data that has the theoretical potential to be misused, that really is far too simplistic an approach.

A primary way that we learn is by studying our own past. This applies in many aspects of life -- with the sort of data under discussion being only one example.

Web activity log data can be crucial to the forensic analysis of system errors, failures, illicit access events (and attempts) -- all of which themselves may have significant privacy-related implications. If we don't have enough detailed information to study, particularly in terms of event sequencing and interactions over time, solving such problems and protecting against future such events can be extremely difficult, in some cases perhaps impossible.

In the health field, longitudinal (long-term) studies need ways to analyze data in myriad forms and combinations, but obviously, we also want to protect patient privacy appropriately.

Search quality -- finding the things that we want on the Web -- is a rapidly evolving science and art, which would be hobbled in major ways if it were not possible to study the kinds of searches and search patterns in which users engage. Such a "data starved" state of affairs would be to the detriment of search service users in short order.

And these are just a few examples of why quickly disposing of data is in many cases impractical, undesirable, or both.

Fundamentally, to approach this area reasonably we need to consider retained data "life cycles" in context.

There are some situations -- such as an anonymous tip line -- where to operate legitimately no data regarding caller identities typically should be maintained at all.

But in most cases involving conventional Web services, the need to maintain completely intact data (e.g. server log records) progressively decreases with the passage of time. This suggests that an appropriate approach is a defined process of gradually anonymizing various data elements via suitable techniques and algorithms, while still maintaining enough structurally detailed, intact, and "hashed" data fields to permit continuing analysis and study for as long as possible.

This is in fact the way that many firms' data life-cycle retention policies do operate.
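As a purely hypothetical illustration -- not drawn from any particular firm's actual policy -- here is a minimal Python sketch of what one stage-by-stage anonymization pass over a simple log record might look like. The field names, the 30/180-day thresholds, and the salted-hash scheme are all assumptions made for the sake of the example:

    import hashlib
    from datetime import datetime, timezone

    # Hypothetical salt; a real deployment would use a secret, periodically rotated value.
    SALT = b"example-salt"

    def pseudonymize(value: str) -> str:
        """Replace an identifier with a salted hash, so records can still be
        correlated with one another but not easily traced back to the original."""
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

    def anonymize_log_record(record: dict, record_time: datetime) -> dict:
        """Progressively strip or hash fields of a Web server log record as it ages.
        record_time is assumed to be timezone-aware (UTC); the thresholds and
        field names here are purely illustrative."""
        age_days = (datetime.now(timezone.utc) - record_time).days
        out = dict(record)

        if age_days >= 30:
            # Coarsen the client IP (zero the last octet) and hash the user ID,
            # preserving enough structure for sequencing and pattern analysis.
            octets = out["client_ip"].split(".")
            out["client_ip"] = ".".join(octets[:3] + ["0"])
            out["user_id"] = pseudonymize(out["user_id"])

        if age_days >= 180:
            # Drop the IP entirely and reduce the timestamp to day granularity,
            # retaining only hashed identifiers and coarse timing for long-term study.
            out.pop("client_ip", None)
            out["timestamp"] = record_time.date().isoformat()

        return out

In a scheme along these lines, recent records remain fully intact for forensic work, while older records retain only the hashed and coarsened fields needed for longer-term analysis.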

Having appropriate policies in place to deal with these issues is crucial of course, and needs to extend to longer-term backup and archival aspects as well.

Note, however, that various governments around the world, including increasingly the U.S., have a rather "schizoid" view of these issues -- simultaneously pressing companies to delete various categories of user data, while also demanding that much other data be retained (in fully identifiable, non-anonymized forms) for delivery on demand to government agencies, to support retrospective surveillance and analysis by law enforcement and intelligence operations.

So we see that, as usual, these are complicated matters, indeed.

But it does seem clear that there are many situations where appropriate, effective data anonymization is not only extremely useful for services and users alike, but also obviously superior to simplistic calls for the rapid deletion of data in its entirety.

With the increasing evidence that reports of anonymization's "death" have been (as Mark Twain would have said) "greatly exaggerated," we can continue to move forward toward the best technical and policy approaches for handling retained data -- approaches that maximize that data's potential for improving our lives, while simultaneously minimizing the risks of abuse.

--Lauren--
