Yesterday, in A Terrible Decision by the Internet Archive May Lead to Widespread Blocking, I discussed in detail why the Internet Archive’s decision to ignore Robots Exclusion Standard (RES) directives (in robots.txt files on websites) is terrible for the Internet community and users. I had expected a deluge of hate email in response. But I’ve received no negative reactions at all — rather a range of useful questions and comments — perhaps emphasizing the fact that the importance of the RES is widely recognized.
As I did yesterday, I’ll emphasize again here that the Archive has done a lot of good over many years, that it’s been an extremely valuable resource in more ways than I have time to list right now. Nor am I asserting that the Archive itself has evil motives for its decision. However, I strongly feel that their decision allies them with the dark players of the Net, and gives such scumbags comfort and encouragement.
One polite public message that I received was apparently authored by Internet Archive founder Brewster Kahle (since the message came in via my blog, I have not been able to immediately authenticate it, but the IP address seemed reasonable). He noted that the Archive accepts requests via email to have pages excluded.
This is of course useful, but entirely inadequate.
Most obviously, this technique fails miserably at scale. The whole point of the RES is to provide a publicly inspectable, unified and comprehensively defined method to inform other sites (individually, en masse, or in various combinations) of your site access determinations.
The “send an email note to this address” technique just can’t fly at Internet scale, even assuming that those emails will ever actually be read at any given site. (Remember when “postmaster@” addresses would reliably reach human beings? Yeah, a long, long time ago.)
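To make the contrast with email requests concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the bot names (`SomeOtherBot`, `ExampleArchiveBot`), the `example.com` URLs, and the `may_fetch` helper are all made up for illustration; this is not how the Archive's crawler or any particular crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleArchiveBot
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("SomeOtherBot", "https://example.com/public/page.html"),
        ("SomeOtherBot", "https://example.com/private/page.html"),
        ("ExampleArchiveBot", "https://example.com/public/page.html"),
    ]
    for agent, url in checks:
        print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

The point of the sketch is that one publicly readable file answers the access question for every crawler at once: the wildcard `User-agent: *` entry covers any bot that has no entry of its own, and no per-site correspondence is required.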
There’s also been some fascinating discussion regarding the existing legal status of the RES. While it apparently hasn’t been specifically tested in a legal sense here in the USA at least, judges have still been recognizing the importance of RES in various court decisions.
In 2006, Google was sued (“Field vs. Google” — Nevada) for copyright infringement for spidering and caching a website. The court found for Google, noting that the site included a robots.txt file that permitted such access by Google.
The case of Century 21 vs. Zoocasa (2011 — British Columbia) is also illuminating. In this case, the judge found against Zoocasa, noting that they had disregarded robots.txt directives that prohibited their copying content from the Century 21 site.
So it appears that even today, ignoring RES robots.txt files could mean skating on very thin ice from a legal standpoint.
The best course all around would be for the Internet Archive to reverse their decision, and pledge to honor RES directives, as honorable players in the Internet ecosystem are expected to do. It would be a painful shame if the wonderful legacy of the Internet Archive were to be so seriously tarnished going forward by a single (but very serious) bad judgment call.
4 thoughts on “More Regarding a Terrible Decision by the Internet Archive”
The Archive might do well to author and publish a code block for webmasters to include in their robots.txt files, much like the one that saved Google.
Fold around it a tutorial about RES and doing it right, etc.
Thus their mission is served, and site owners are in control, when RES someday becomes mandatory in law (y’know it will come to that.)
The thing is that site owners should not have to treat the Internet Archive differently than any other site. If a robots.txt file specifies broad spidering prohibitions, it should not be necessary to include special directives just for the Archive. Special cases simply won’t scale. I do agree with you that the probability is quite high that adherence to RES (or something very much like it) will become mandatory by law around the world over time.
It might be a little late for me to say this because I just found out about this some time ago.
Never crossed my mind that I’d have to block Archive bots the hard way. But this is what happens when a good organization tries to enforce its own rules in someone else’s backyard (web sites).
Yes, the Archive website provides a valuable tool for anyone who’s looking for “historical data” from any web sites it has archived. But didn’t they ever think that there will always be someone out there who prefers not to be archived, by any means, for any reason?
From their blog post, one of the reasons they chose to do this is that when a domain is not renewed and the new owner (or simply a parked domain page) blocks them, they lose all of that domain’s data.
It’s a good reason. But how about indexing the pages by following the robots.txt rules in place on the day they archive the pages, and if the new domain owner in the future doesn’t like it, that owner has to contact archive.org directly? I believe that’s one of the best solutions for this issue. But instead they chose to go the hard route of ignoring rules placed by the website owners, and if the owners don’t like it they have to contact the Archive directly. That’s not good. Totally not good. That’s not what a good organization should do.
Sorry about the long post. I just couldn’t believe that an organization that tries to do good deeds would do something unbelievable like this.
Quoting my own comment text from my original blog post thread on this topic:
There are several different ways to look at this aspect. For example, the domain this blog is on was one of the first 40 dot-com domains ever issued, and I’ve been using it continuously since then for just over 30 years. So nobody other than me has ever published material under this domain.
But cases of domain transfers get complex very quickly. While it’s true that there are bad players who attempt to hide good data, there are also good players who can be unfairly chastised by earlier bad data that has nothing whatsoever to do with them. For example, why should someone who buys a vacant domain in good faith be forever branded by some slimeball who had nasty material on the domain many years earlier? By analogy, when you buy a house, you’re not forced to put up a sign that says “Criminals lived here more than ten years ago!”
No easy answers to this class of domain issues.