A Terrible Decision by the Internet Archive May Lead to Widespread Blocking

UPDATE (23 April 2017):  More Regarding a Terrible Decision by the Internet Archive

– – –

We can stipulate at the outset that the venerable Internet Archive and its associated systems like the Wayback Machine have done a lot of good for many years — for example, by providing chronological archives of websites that have chosen to participate in their efforts. But now it appears that the Internet Archive has joined the dark side of the Internet by announcing that they will no longer honor the access control requests of any websites.

For any given site, the decision whether or not to participate with the web scanning systems at the Internet Archive (or with any other “spidering” system) is indicated by use of the well-established and very broadly affirmed “Robots Exclusion Standard” (RES) — a methodology that uses files named “robots.txt” to inform visiting scanning systems which parts of a given website should or should not be subject to spidering and/or archiving by automated scanners.

RES operates on the honor system. It requests that spidering systems follow its directives, which may be simple or detailed, depending on the situation — with those detailed directives defined comprehensively in the standard itself.
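
As a purely illustrative sketch (the agent name here is a hypothetical placeholder, not any real crawler), a simple robots.txt might contain directives along these lines:

User-agent: ExampleBot
Disallow: /database/
Disallow: /private/

User-agent: *
Disallow:

This asks a scanner calling itself ExampleBot to stay out of two specified areas, while the empty Disallow in the second group leaves the entire site unrestricted for all other compliant scanners.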

While RES generally has no force of law, it has enormous legal implications. The existence of RES — that is, a recognized means for public sites to indicate access preferences — has been important for many years to help hold off efforts in various quarters to charge search engines and/or other classes of users for access that is free to everyone else. The straightforward argument that sites already have a way — via the RES — to indicate their access preferences has held a lot of rabid lawyers at bay.

And there are lots of completely legitimate reasons for sites to use RES to control spidering access, especially for (but by no means restricted to) sites with limited resources. These include technical issues (such as load considerations relating to resource-intensive databases and a range of other related situations), legal issues such as court orders, and a long list of other technical and policy concerns that most of us rarely think about, but that can be of existential importance to many sites.

Since adherence to the RES has usually been considered voluntary, an argument can be made (and we can pretty safely assume that the Archive’s reasoning falls into this category one way or another) that because “bad” players might choose to ignore the standard, “good” players who abide by it are put at a disadvantage.

But this is a traditional, bogus argument that we hear whenever previously ethical entities feel the urge to start behaving unethically: “Hell, if the bad guys are breaking the law with impunity, why can’t we as well? After all, our motives are much better than theirs!”

Therein lie the storied paths of “good intentions” that lead to hell: once the floodgates of such twisted illogic open wide, other players will decide that they too must emulate the Internet Archive’s dismal reasoning to remain competitive.

There’s much more.

While RES is typically viewed as not having legal force today, that could be changed, perhaps with relative ease in many circumstances. There are no obvious First Amendment considerations in play, so it would seem quite feasible to roll “Adherence to properly published RES directives” into existing cybercrime-related site access authorization definitions.

Nor are individual sites entirely helpless against the Internet Archive’s apparent embracing of the dark side in this regard.

Unless the Archive intends to try to go completely into a “ghost” mode, their spidering agents will still be detectable at the http/https protocol levels, and could be blocked (most easily in their entirety) with relatively simple web server configuration directives. If the Archive attempted to cloak their agent names, individual sites could instead block the Archive’s known source IP addresses.
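
Purely as an illustrative sketch (assuming an Apache server with mod_rewrite enabled; other web servers have equivalent mechanisms), directives along these lines would refuse any request from a crawler identifying itself with a given agent name, such as “ia_archiver” discussed further below:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC]
RewriteRule .* - [F]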

It doesn’t take a lot of imagination to see how all of this could quickly turn into an escalating nightmare of “Whac-A-Mole” and expanding blocks, many of which would likely negatively impact unrelated sites as collateral damage.

Even before the Internet Archive’s decision, this class of access and archiving issues had been smoldering for quite some time. Perhaps the Internet Archive’s pouring of rocket fuel onto those embers will ultimately lead to a legally enforced Robots Exclusion Standard — with both the positive and negative ramifications that would then be involved. There are likely to be other associated legal battles as well.

But in the shorter term at least, the Internet Archive’s decision is likely to leave a lot of innocent sites and innocent users quite badly burned.

–Lauren–

10 thoughts on “A Terrible Decision by the Internet Archive May Lead to Widespread Blocking”

  1. Thank you for the thoughtful post. I hope the future will not be so dire.

    For 20 years, the way many webmasters, or individuals on hosted platforms, have asked to have pages excluded from the Wayback Machine has been by sending email to info@archive.org.

  2. While there is such a thing as going to the opposite extreme, I do think the current policies were a bit absurd, namely the retroactive application of robots.txt to archived copies that were not under the control of the current domain name owner. Think about it: you essentially have a situation where the CURRENT owner of a domain would be able to dictate control over past content, most likely content they had no ownership of, or link to, at all.

    1. There are several different ways to look at this aspect. For example, the domain this blog is on was one of the first 40 dot-com domains ever issued — I’ve been using it continuously since then for just over 30 years. So nobody other than me has ever published material under this domain.

      But cases of domain transfers get complex very quickly. While it’s true that there are bad players who attempt to hide good data, there are also good players who can be unfairly chastised by earlier bad data that has nothing whatsoever to do with them. For example, why should someone who buys a vacant domain in good faith be forever branded by some slimeball who had nasty material on the domain many years earlier? By analogy, when you buy a house, you’re not forced to put up a sign that says “Criminals lived here more than ten years ago!”

      No easy answers to this class of domain issues.

  3. Because of this, I’ve taken the step of completely denying access to their ASN on my server’s firewall, since they are intent on ignoring my wishes as the owner of my website. For anyone interested, here is how you can obtain the IP ranges in their ASN:

    whois -h whois.radb.net -T route AS7941 -i origin | grep route | awk '{ print $NF }'

    You’ll have to use this on Linux or in your Mac terminal, but any time they add IP ranges to their ASN, you can simply run this command again and obtain their new IPs.
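
    Just as a rough sketch (this assumes the ipset and iptables tools are installed, and uses "archiveblock" purely as an arbitrary set name), the resulting list can then be loaded into a firewall set and dropped:

    ipset create archiveblock hash:net
    whois -h whois.radb.net -T route AS7941 -i origin | grep route | awk '{ print $NF }' | xargs -n 1 ipset add archiveblock
    iptables -I INPUT -m set --match-set archiveblock src -j DROP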

    1. It is still my hope that the Internet Archive will reverse their antisocial decision, thereby making actions such as you detail above unnecessary.

  4. From this: https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives
    I think you misrepresent the Internet Archive’s reason for doing this: it’s not “some crawlers already do it, so we must too” but more like “robots.txt files are used for SEO, and the resulting robots policies for search engine crawlers cause problems when applied to archiving crawlers” (not actual quotes). And I think you don’t really counter that argument in your blog post.

    IMHO a solution could be to change the RES format so that search engine crawlers and archive crawlers always have to be addressed separately, e.g. with crawler categories (at least “search engine crawler”, “archive crawler” and “other”) that must each be configured on their own, so that SEO can’t accidentally hinder archiving. The standard is from 1994 and could well use a revision by a committee.

    1. While pretty much any standard could do with some improvement, the RES has proven to be quite versatile and comprehensive, and can easily be used without introducing the category of confusion that you assert.

      If a site has a robots.txt RES directive such as:

      User-agent: ia_archiver
      Disallow: /

      … it should be respected by the Internet Archive. Period. And increasingly, courts appear to be taking this view of RES directives.

      1. They changed their bot’s name to archive.org_bot without telling site owners and are now completely ignoring robots.txt blocking of ia_archiver. For example, the history of seobook.com which has always blocked ia_archiver is up for all to see on archive.org. It is clear that they are a malicious actor. Blocking is sadly already necessary.

    2. Constantin Berhard:
      The Internet Archive is not a legal deposit; content creators are not obligated to supply them with copies of their works. Wonderful that the Archive has a mission …, but that does not, and should not, negate the rights of others.
      A sensible discussion of the obvious problems would serve everyone’s best interest.

Comments are closed.