Giving Creators and Websites Control Over Generative AI

Seemingly overnight, the Internet is awash with controversies over Generative Artificial Intelligence (GAI) systems, and their potential positive and negative impacts on the Net and the world at large.

It also seems very clear that unless we (for once!) get ahead of the potential problems with this new technology that seem to be rushing toward us like a freight train, there could be some very tough times ahead for creators, websites, and ordinary Internet users around the world.

I’m not writing a tutorial here on GAI, but very briefly: it’s not the kind of “backend” AI system with which most of us are more familiar, the kind used for research and modeling, for ordering search results and suggestions, and even for the generally useful, very brief “answers” we see in (for example) Google Knowledge Panels, featured snippets, or short Google Assistant answers (and the similar features of other firms’ products).

GAI is very different, because it creates (and this is a greatly simplified explanation) what appears to be (at least in theory) completely *new* content, based on its algorithms and the data on which it has been trained.

GAI can be applied to text, audio, imagery, video — pretty much everything we’ve come to associate with the Net. And already, serious problems are emerging — not necessarily unexpected at this early stage, but ones that we must start dealing with now or risk a maelstrom later.

GAI chatbots have been found to spew racist and other hateful garbage. The long-form answers and essays that are the stock-in-trade of many GAI systems can be beautifully written and appear knowledgeable and authoritative, yet still be riddled with utterly incorrect information. This is troublesome enough with purely technical articles, some of which have had to be withdrawn as a result, but it can get downright scary when the subject is, as in one recent case, men’s health issues.

There are more problems. GAI can easily create “fake” pornography targeting individuals. It can be used to simulate people’s voices for a range of nefarious purposes — or even potentially just to simulate the voices of professional voice actors without their permission.

Eventually, the kind of scenario imagined in the 1981 film “Looker” — where actors once scanned could be completely emulated by (what we’d now call) GAI systems — could actually come to pass. We’re getting quite close to this already in the film industry and the world of so-called deepfakes — the latter potentially carrying enormous risks for disinformation and political abuse.

All of this tends to point us mainly in one direction: How GAI is trained.

In many cases, the answer is that websites are crawled and their data used for GAI purposes, without the explicit permission of the creators of that data or the sites hosting it.

Since the beginning of Search on the Internet, there has been something of a largely unwritten agreement. To wit: Search engines spider and index sites to provide lists of search results to users, and in return those search engines refer users back to those original sites where they can get more information and find other associated content of interest.

GAI in Search runs the risk of disrupting this model in major ways. When a search engine presents what appear to be largely original long-form essays and detailed answers to user search queries, the probability of users ever visiting the sites that (often unknowingly) provided the GAI training data is likely to drop precipitously, even when links back to those sites are present. Why would users bother visiting those sites if the GAI has already completely answered their questions?

Complicating this even further is that the outputs of some GAI systems appear to frequently include largely or even completely intact (or slightly reworded) stretches of text, elements of imagery, and other data that the GAI presents as if they were wholly original.

Creators and websites should be able to choose if and how they wish their data to be incorporated into GAI systems. 

Accomplishing this will be a complex undertaking, likely involving both technical and legislative aspects in order to be even reasonably effective, and will almost certainly always be a moving target as GAI systems advance.

But a logical starting point could be expansion of the existing Internet Robots Exclusion Protocol (REP, which includes robots.txt, robots meta tags, and so on) currently used to express website preferences regarding search indexing and associated functions. While the REP is not universally adhered to today, the crawlers run by major search engines usually do follow these directives.

Indeed, even defining GAI-related directives for REP will be enormously challenging, but this could get the ball rolling at least.
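
To make the idea a bit more concrete, here is a minimal sketch in Python of how a site might express such a preference using nothing more than the per-crawler rules that robots.txt already supports. The crawler names "ExampleGAITrainer" and "ExampleSearchBot" and the example.com URLs are purely hypothetical, invented for illustration. An actual GAI extension to the REP might look quite different, and none of this works unless crawlers choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that shuts out one (hypothetical) GAI training crawler
# entirely, while leaving ordinary crawlers subject only to the usual
# /private/ restriction. Both crawler names are made up for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleGAITrainer
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ARTICLE = "https://example.com/articles/essay.html"
PRIVATE = "https://example.com/private/notes.html"

# The ordinary search crawler falls under the "*" rules.
search_allowed = may_fetch(ROBOTS_TXT, "ExampleSearchBot", ARTICLE)
# The hypothetical GAI training crawler is blocked from the whole site.
training_allowed = may_fetch(ROBOTS_TXT, "ExampleGAITrainer", ARTICLE)
# The "*" rules still keep everyone out of /private/.
private_allowed = may_fetch(ROBOTS_TXT, "ExampleSearchBot", PRIVATE)

print(search_allowed, training_allowed, private_allowed)
```

Note the limitation this exposes: the site can only say yes or no to a named crawler. It cannot say "index me for search, but do not train on me" to a single crawler that does both, which is exactly the kind of finer-grained directive an expanded REP would need to define.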

We need to immediately start formulating the methods that will control what training data Generative Artificial Intelligence systems are permitted to use, and the manner in which they use it. Failure to begin considering these issues risks enormous backlash against these systems going forward, which could render many of their potential benefits moot, to the detriment of everyone.