September 16, 2009

Google Buys reCAPTCHA, Creating a Potential Privacy Issue

Greetings. Google has announced its acquisition of Carnegie Mellon University's "reCAPTCHA" system. You've no doubt seen reCAPTCHA in action -- it is very widely used by a vast array of sites. CMU's reCAPTCHA is a specific implementation of the more generalized CAPTCHA concept, which attempts to validate user input as coming from a human, not a (typically spam-related) robot.

The reCAPTCHA system presents pairs of words optically scanned from books, and asks the user to identify them. In the process, it also uses the resulting data to help "decode" those scanned words into their correct machine-readable textual representations as part of larger book scanning efforts.

This obviously makes reCAPTCHA a perfect match for Google, which faces the challenge of processing vast numbers of books in its Google Books project, some of which have fairly high OCR (Optical Character Recognition) error rates due to the difficulty of machine recognition of odd fonts, faded printing, and so on.

However, there is a potential privacy problem with reCAPTCHA (or any centralized CAPTCHA system, for that matter) that Google will need to face.

Early this year, while setting up an Internet-based forum, I considered using reCAPTCHA as part of the validation procedures. Since centralized CAPTCHA servers will typically collect IP addresses and potentially other data from users at the time of page display, and again when users interact with the CAPTCHA systems (for registration, message sending, etc.), these servers receive a running log of information regarding the users of the sites that incorporate those CAPTCHAs into their pages.
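
To make the "running log" concern concrete, here is a minimal sketch in Python of the kind of record a centralized CAPTCHA server could accumulate. Everything in it is an assumption for illustration only: the `CaptchaRequest` and `CentralCaptchaLog` names, the idea that the embedding site is learned from something like a Referer header, and the example addresses and site names. It does not describe how reCAPTCHA itself stores or uses data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptchaRequest:
    """One request as a hypothetical central CAPTCHA server might record it."""
    ip_address: str      # address the request came from
    embedding_site: str  # site whose page included the CAPTCHA (assumed known, e.g. via Referer)
    event: str           # "display" when the page loads, "submit" when the user answers


class CentralCaptchaLog:
    """Hypothetical in-memory log kept by one server shared by many sites."""

    def __init__(self):
        self._by_ip = defaultdict(list)

    def record(self, request):
        # Every display and every submission adds an entry, whichever site embedded it.
        self._by_ip[request.ip_address].append(request)

    def sites_seen_for(self, ip_address):
        # Distinct embedding sites seen for one address, in first-seen order.
        sites = []
        for request in self._by_ip.get(ip_address, []):
            if request.embedding_site not in sites:
                sites.append(request.embedding_site)
        return sites


# Illustrative traffic: one user touching two unrelated sites that share the CAPTCHA.
log = CentralCaptchaLog()
log.record(CaptchaRequest("203.0.113.7", "forum.example", "display"))
log.record(CaptchaRequest("203.0.113.7", "forum.example", "submit"))
log.record(CaptchaRequest("203.0.113.7", "shop.example", "display"))
log.record(CaptchaRequest("198.51.100.2", "forum.example", "display"))

print(log.sites_seen_for("203.0.113.7"))
```

The point of the sketch is simply that neither site here ever sees the other's traffic, yet the shared server can see both, which is why a privacy policy aimed at ordinary users of those sites matters.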

So I was very surprised to discover that I could not find any reCAPTCHA privacy policy explaining to ordinary Web users displaying those pages, or interacting with the reCAPTCHA system, how that collected data would be handled from a privacy and data protection standpoint.

I queried CMU about this, and the reCAPTCHA support team replied that they did have an extensive privacy policy, but that it only appeared when reCAPTCHA API keys were created -- that is, when a Web site administrator wanting to incorporate reCAPTCHA into a site applied for reCAPTCHA access. There was nothing to tell conventional users how their IP addresses or other data would be handled by reCAPTCHA as a result of their viewing or interacting with a Web page incorporating reCAPTCHA functionality -- that is, no privacy policy to be found at all for those users at that time. Partly for this reason, I chose not to use reCAPTCHA for my forum.

With reCAPTCHA moving under the Google umbrella, it will be crucial that Google clearly explain, in a visible and specific privacy policy, how they will collect, correlate, and otherwise use IP address and other data associated with reCAPTCHA display and use.

Fundamentally, this situation is similar to that with ad display systems, where the very act of viewing a page that includes external ads may pass IP address info (and sometimes other data) to third parties. However, while Web users can usually choose to block external ads in various ways if they wish (something I do not recommend or promote -- see "Blocking Web Ads -- And Paying the Piper"), blocking CAPTCHAs would usually mean losing access to the associated sites in significant ways.

As an enthusiastic supporter of Google Books (see "The Joy of Libraries, a Fireman's Flame, and the Google Books Settlement"), I fully appreciate the value that reCAPTCHA will bring to Google, and ultimately to all users of Google Books.

But I also believe that it's very important for the privacy issues associated with reCAPTCHA to be properly handled by Google, hopefully in a manner significantly better than Carnegie Mellon's own approach earlier this year.


Posted by Lauren at September 16, 2009 01:43 PM
Twitter: @laurenweinstein
Google+: Lauren Weinstein