Data Dignity: Developers Must Solve the AI Attribution Problem
"Data provided by humans can be seen as a form of labor that powers artificial intelligence." (The Economist)
Along with the somewhat irrational fear of AI taking over the world, there is the more concrete issue of AI ripping you off. This post looks at the problems of attribution, and the ways the dev community might be involved in fixing them.
There is now a confluence of three issues that may force us to change how we work with the web:
- The belief that our personal data and created media have been abused by large corporations;
- The new GPT-style AIs that can trawl vast amounts of user data from the web; and
- The fact that we currently don’t have a consistent way of digitally memorializing who wrote what.
Your social graph has been fuelling companies for some time, but the new problem is that text, images and soon video can be “remixed” by AIs without any hope that the originators will be credited. Compare this with how the music industry dealt with a handful of ’80s hip-hop artists who “borrowed” existing tracks: they were occasionally pilloried or even sued. But those misdemeanors happened at human speed, in low numbers, and under a spotlight. Midjourney could use an image it found on ArtStation to produce new images hundreds of times an hour, without the artist ever knowing. Today, the likes of Dr. Dre are recognized as part of the establishment, and he pays any royalties due.
What Is Data Dignity?
“Data Dignity” is a movement that was forged before generative AI came about, and is firmly connected to noted tech commentator Jaron Lanier. The theory is that the economy should be altered to compensate people when data about them is used, or when data they created is remixed. It is driven by the belief that the “free” online economy has been a disaster in terms of recognition and remuneration, and it is quite clear that generative AI will make this situation much worse.
Our concern in this article is not with the freedom of information to travel, it is with the information about that information (the metadata) — and having it not become lost baggage. So what I’m advocating is, to retain the travel theme, a baggage tag for information. And some trusty baggage handlers.
Take any document on the web, and a quote within that document. There is no simple, automated way of finding the author of that document, or of ensuring the quote is attributed correctly to the author if it later appears elsewhere. Hence, when a GPT AI mashes two disparate paragraphs together, the provenance is completely lost.
By comparison, Twitter is structured to memorialize the author of a tweet, and even the author of the tweet a tweeter retweets. The metadata associated with a tweet (everything associated with its creation, other than the words) is more than just the author; it includes time, location, language and a unique ID. So given that Twitter works, why did we let the web go wrong?
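To make the comparison concrete, here is a minimal sketch of the kind of record a platform like Twitter keeps alongside each post. The field names are illustrative only, not the actual Twitter API schema:

```python
# A sketch of per-post metadata a platform might memorialize.
# Field names here are illustrative, not Twitter's real schema.
tweet = {
    "id": "1640000000000000000",     # unique ID
    "author": "@example_user",       # attribution survives retweets
    "retweet_of": None,              # original author, if this is a retweet
    "created_at": "2023-03-27T10:15:00Z",
    "lang": "en",
    "geo": None,                     # location, if the user shared it
    "text": "Data dignity matters.",
}

# Attribution is always recoverable from the record itself
print(tweet["author"])
```

The point is structural: because attribution lives inside the record rather than in the free text, it cannot be silently dropped when the content is quoted or remixed.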
During the 2000s, the Semantic Web was proposed as an improved version of the World Wide Web. The goal was to create a web where intelligent agents would be able to understand the content of webpages using injected metadata to provide useful services to humans or to interact with other intelligent agents.
Unfortunately, the Semantic Web project was always academic in nature. The proposed languages for adding metadata to web pages were difficult to use. The inference engines of the early 2000s were slow. As we’ll see below, metadata is all too often a weapon of Search Engine Optimization (SEO), not a sword of truth. And metadata itself is not static: it can age and needs to be maintained. Timing was another problem: JSON’s birth and vertiginous rise came a little too late for the Semantic Web, so its older (and much uglier) step-sister XML was used instead. But we should accept that the aim of the project was good.
How to Fix Things
There are three basic paths to inject meaning back into the web, and help fix the attribution problem:
- Let AI create and maintain metadata.
- Use better tools and agreements to re-inject useful and consistent metadata back into the web.
- Stick the metadata transparently into the cloud.
None of these are mutually exclusive. We know that, for the moment, ChatGPT doesn’t like to use the web to enhance its training data, though I expect it would easily be able to find the author of a given document if it followed its own recommendations here:
It seems that Google can already do this, even though it doesn’t trumpet the solution:
Note that the author is clearly selected in bold, like a search result.
If we just let a few large companies form tons of metadata about everything so that their LLMs can train properly, we could then task the same companies with using AI to track attribution. This may seem like a reasonable thing to do, but without oversight we will never be sure exactly how much metadata is stored while achieving this.
Metadata in the Document
Web pages have a built-in ability to store metadata in tag form for free, without distorting the information they present. In fact, the point of HTML is to use metadata to enhance information. Inside the document you are reading now, you might find the following tag:
<meta name="author" content="David Eastman" class="yoast-seo-meta-tag">
This metadata points to (in this case) myself as the author. This is why AI doesn’t have to be too smart to know who wrote this article. But we are immediately reminded that the reason why we currently add metadata is nearly always to serve SEO.
What about looking for the same information in Jaron’s article? The problem is that there exists no common model or “cow path”. So the metadata we find is not quite what we expect:
<meta name="author" content="Condé Nast">
Fortunately, in the same document we can also find:
<meta property="article:author" content="Jaron Lanier">
So there are shifting sands even between platforms that offer the same service. And these results are from very responsible and stable publications. The solution here is for developers to communicate a bit more, agree on common standards across similar industries, and put the solidity of the web in front of other considerations where possible. This is of course, easy to say.
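The extraction side of this is not the hard part. As a sketch, a scraper can pick up both of the tag shapes shown above (`name="author"` and `property="article:author"`) using only Python’s standard library; accommodating yet more platform variants would mean extending the accepted key list, which is exactly the inconsistency the prose describes:

```python
from html.parser import HTMLParser

class AuthorMetaParser(HTMLParser):
    """Collects author values from <meta name="author"> and
    <meta property="article:author"> tags, covering both of the
    conventions seen in the article above."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Different platforms key the same fact differently
        key = attrs.get("name") or attrs.get("property")
        if key in ("author", "article:author") and "content" in attrs:
            self.authors.append(attrs["content"])

html = (
    '<head>'
    '<meta name="author" content="Condé Nast">'
    '<meta property="article:author" content="Jaron Lanier">'
    '</head>'
)
parser = AuthorMetaParser()
parser.feed(html)
print(parser.authors)  # ['Condé Nast', 'Jaron Lanier']
```

Note that the parser has to accept two different keys for one concept, and still cannot tell that “Condé Nast” is a publisher rather than a person. That disambiguation is precisely what a common standard would buy us.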
An API Solution
A simple REST solution could be implemented in most content platforms. You can already query this site for authors in HTML form; for example, “https://thenewstack.io/author/david-eastman/” will yield my articles, although you would need to know that formulation of my name, and accept that a few early articles won’t appear. What would be more useful is to extract the author (along with other metadata) for any given article in a RESTful fashion.
The above just uses the natural REST query interface, although it could be achieved with a neater formulation: all we want is “return the author of the post called how-to-software”.
So if you are designing your REST API for any content platform, make sure the user can get metadata back on any pages they have access to.
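As a sketch of what that might look like, here is a minimal handler for a hypothetical `GET /api/posts/<slug>/metadata` endpoint. The URL shape, field names and sample data are all illustrative assumptions, not an existing API:

```python
import json

# Hypothetical endpoint: GET /api/posts/<slug>/metadata
# The route, field names and sample data below are illustrative only.
def post_metadata(slug, posts):
    """Return attribution metadata for a post, or None if unknown."""
    post = posts.get(slug)
    if post is None:
        return None  # a real handler would respond 404
    return {
        "slug": slug,
        "author": post["author"],
        "published": post["published"],
        "language": post["language"],
    }

# An in-memory stand-in for the platform's content store
posts = {
    "how-to-software": {
        "author": "David Eastman",
        "published": "2023-06-01",
        "language": "en",
    }
}

print(json.dumps(post_metadata("how-to-software", posts), indent=2))
```

The design choice worth noting is that metadata gets its own resource, separate from the rendered page, so AI crawlers and humans alike can retrieve attribution without scraping HTML.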
Store the Information Elsewhere
Metadata could also be collected and placed in a neutral repository. This would allow third parties to work on metadata while giving the appropriate public access to it.
For example, there are a lot of companies that will use AI to do content moderation — which will probably try to add metadata context to existing dodgy media. One example startup proposing AI video moderation is unitary.ai. Ironically, this is the converse problem — instead of making sure media retains attribution, this is adding metadata to media that might otherwise want to stay in the shadows. If the metadata was in a neutral location, users wouldn’t have to accept the platform’s last word on all moderation issues.
Similarly, regulated industries trying to use generative AI will probably interact with compliance middleware to avoid compromising recognized compliance standards in user responses. It seems reasonable that the rules and generated metadata would be kept in an open and accessible fashion. Clearly, the dev challenges here are to design architecture and standards to solve these problems.
So the future for fair generative AI does depend on the development community’s willingness to provide plenty of ways for attribution to be kept, otherwise AI will spend most of its time chatting with the legal system.