UPDATE (23 April 2017): More Regarding a Terrible Decision by the Internet Archive
– – –
We can stipulate at the outset that the venerable Internet Archive and its associated systems like the Wayback Machine have done a lot of good for many years, for example by providing chronological archives of websites that have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet by announcing that they will no longer honor the access control requests of any websites.
For any given site, the decision to participate or not with the web scanning systems at the Internet Archive (or associated with any other “spidering” system) is indicated by use of the well established and very broadly affirmed “Robots Exclusion Standard” (RES) — a methodology that uses files named “robots.txt” to inform visiting scanning systems which parts of a given website should or should not be subject to spidering and/or archiving by automated scanners.
RES operates on the honor system. It requests that spidering systems follow its directives, which may be simple or detailed, depending on the situation — with those detailed directives defined comprehensively in the standard itself.
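To make the honor system concrete, here is a minimal sketch of the check a well-behaved crawler performs before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the `example.com` URLs, and the agent name `ia_archiver` are illustrative assumptions, not anything taken from a real site's configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler entirely and
# keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ia_archiver", "SomeOtherBot"):
        for path in ("/index.html", "/private/report.html"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check is enforced by the site itself. A crawler that never runs it, or that ignores the result, meets no technical barrier, which is exactly why the standard depends on good faith.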
While RES generally has no force of law, it has enormous legal implications. The existence of RES — that is, a recognized means for public sites to indicate access preferences — has been important for many years to help hold off efforts in various quarters to charge search engines and/or other classes of users for access that is free to everyone else. The straightforward argument that sites already have a way — via the RES — to indicate their access preferences has held a lot of rabid lawyers at bay.
And there are lots of completely legitimate reasons for sites to use RES to control spidering access, especially for (but by no means restricted to) sites with limited resources. These include technical issues (such as load considerations relating to resource-intensive databases and a range of other related situations), legal issues such as court orders, and a long list of other technical and policy concerns that most of us rarely think about, but that can be of existential importance to many sites.
Since adherence to the RES has usually been considered to be voluntary, an argument can be made (and we can pretty safely assume that the Archive’s reasoning falls into this category one way or another) that since “bad” players might choose to ignore the standard, this puts “good” players who abide by the standard at a disadvantage.
But this is a traditional, bogus argument that we hear whenever previously ethical entities feel the urge to start behaving unethically: “Hell, if the bad guys are breaking the law with impunity, why can’t we as well? After all, our motives are much better than theirs!”
This is the storied road to hell, paved with “good intentions.” Once the floodgates of such twisted illogic open wide, other players will decide that they must emulate the Internet Archive’s dismal reasoning to remain competitive.
There’s much more.
While RES is typically viewed as not having legal force today, that could be changed, perhaps with relative ease in many circumstances. There are no obvious First Amendment considerations in play, so it would seem quite feasible to roll “Adherence to properly published RES directives” into existing cybercrime-related site access authorization definitions.
Nor are individual sites entirely helpless against the Internet Archive’s apparent embracing of the dark side in this regard.
Unless the Archive intends to try to go completely into a “ghost” mode, their spidering agents will still be detectable at the HTTP/HTTPS protocol levels, and could be blocked (most easily in their entirety) with relatively simple web server configuration directives. If the Archive attempted to cloak their agent names, individual sites could block the Archive by referencing the Archive’s known source IP addresses instead.
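The blocking logic described above can be sketched roughly as follows. This is only an illustration of the decision a web server or request filter would make, written in Python rather than in any particular server's configuration syntax. The agent substrings are assumptions about how a crawler might identify itself, and the network ranges are reserved documentation ranges standing in for real source addresses, which a site operator would have to determine for themselves.

```python
import ipaddress

# Hypothetical agent-name fragments to refuse (assumed for illustration).
BLOCKED_AGENT_SUBSTRINGS = ("ia_archiver", "archive.org_bot")

# Placeholder networks from the reserved documentation ranges; these are
# NOT real crawler addresses.
BLOCKED_NETWORKS = tuple(
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "2001:db8::/32")
)


def should_block(user_agent: str, remote_ip: str) -> bool:
    """Return True if a request should be refused by agent name or source IP."""
    agent = user_agent.lower()
    # First line of defense: the self-declared User-Agent header.
    if any(fragment in agent for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    # Fallback if the agent name is cloaked: match the source address.
    addr = ipaddress.ip_address(remote_ip)
    return any(
        addr.version == net.version and addr in net for net in BLOCKED_NETWORKS
    )


if __name__ == "__main__":
    print(should_block("Mozilla/5.0 (compatible; archive.org_bot)", "198.51.100.7"))
    print(should_block("Mozilla/5.0", "192.0.2.44"))
    print(should_block("Mozilla/5.0", "198.51.100.7"))
```

Note that the address check works on whole network ranges, so any unrelated traffic originating from a blocked range is refused along with the crawler.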
It doesn’t take a lot of imagination to see how all of this could quickly turn into an escalating nightmare of “Whac-A-Mole” and expanding blocks, many of which would likely negatively impact unrelated sites as collateral damage.
Even before the Internet Archive’s decision, this class of access and archiving issues had been smoldering for quite some time. Perhaps the Internet Archive’s pouring of rocket fuel onto those embers may ultimately lead to a legally enforced Robots Exclusion Standard — with both the positive and negative ramifications that would then be involved. There are likely to be other associated legal battles as well.
But in the shorter term at least, the Internet Archive’s decision is likely to leave a lot of innocent sites and innocent users quite badly burned.