Yesterday, in “A Terrible Decision by the Internet Archive May Lead to Widespread Blocking,” I discussed in detail why the Internet Archive’s decision to ignore Robots Exclusion Standard (RES) directives (in robots.txt files on websites) is terrible for the Internet community and users. I had expected a deluge of hate email in response. But I’ve received no negative reactions at all, only a range of useful questions and comments, which perhaps underscores how widely the importance of the RES is recognized.
As I did yesterday, I’ll emphasize again here that the Archive has done a lot of good over many years, and that it’s been an extremely valuable resource in more ways than I have time to list right now. Nor am I asserting that the Archive itself has evil motives for its decision. However, I strongly feel that their decision allies them with the dark players of the Net, and gives such scumbags comfort and encouragement.
One polite public message that I received was apparently authored by Internet Archive founder Brewster Kahle (since the message came in via my blog, I have not been able to immediately authenticate it, but the IP address seemed reasonable). He noted that the Archive accepts requests via email to have pages excluded.
This is of course useful, but entirely inadequate.
Most obviously, this technique fails miserably at scale. The whole point of the RES is to provide a publicly inspectable, unified and comprehensively defined method to inform other sites (individually, en masse, or in various combinations) of your site access determinations.
The “send an email note to this address” technique just can’t fly at Internet scale, even if one assumes that such emails will actually be read by anyone at any given site. (Remember when “postmaster@” addresses would reliably reach human beings? Yeah, a long, long time ago.)
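To make the contrast concrete, here’s a minimal sketch of the kind of automated check that the RES makes possible, using Python’s standard urllib.robotparser module. The robots.txt contents, the crawler name ExampleArchiveBot, and the example.com URLs are all made up for illustration; this isn’t a description of how any particular crawler actually works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site might publish. One named crawler is
# shut out entirely; every other crawler is kept out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would first download
    # them from the site's /robots.txt (set_url() followed by read()).
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleArchiveBot", "https://example.com/index.html"),
        ("SomeOtherBot", "https://example.com/index.html"),
        ("SomeOtherBot", "https://example.com/private/notes.html"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The site operator publishes the file once, and any well-behaved crawler can run this sort of check on its own before fetching a page, with no email exchange and no human in the loop on either side.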
There’s also been some fascinating discussion regarding the legal status of the RES. While it apparently hasn’t been directly tested in court, at least here in the USA, judges have nonetheless recognized the importance of the RES in various decisions.
In 2006, Google was sued (“Field vs. Google” — Nevada) for copyright infringement for spidering and caching a website. The court found for Google, noting that the site included a robots.txt file that permitted such access by Google.
The case of Century 21 vs. Zoocasa (2011 — British Columbia) is also illuminating. In this case, the judge found against Zoocasa, noting that they had disregarded robots.txt directives that prohibited their copying content from the Century 21 site.
So it appears that even today, ignoring RES robots.txt files could mean skating on very thin ice from a legal standpoint.
The best course all around would be for the Internet Archive to reverse their decision, and pledge to honor RES directives, as honorable players in the Internet ecosystem are expected to do. It would be a painful shame if the wonderful legacy of the Internet Archive were to be so seriously tarnished going forward by a single (but very serious) bad judgment call.