Who owns translator data?

Data is invariably described nowadays as the new oil. We know that everyone from supermarkets and social-media platforms to newspapers and foreign governments wants to collect, collate, sell and use our data for all sorts of different purposes. And we understand that the data we produce has value – whether we’re airing our thoughts on social media or choosing between tubes of toothpaste in the corner shop.

Yet at the same time, we often fail to appreciate the value of data when we’re on the other end of the equation, looking at the fruits of data that has already been harvested. Take machine translation, for example. When we put text into the likes of Google Translate or DeepL, we tend to attribute authorship to the machine, as if it had a mind of its own, and we leave it at that. But the truth is that this output is actually data – data originally produced by human translators, tweaked by reviewers, fed into an algorithm, monetised and recycled in a new form.

The same is true of generative AI chatbots that produce output based on human datasets. To some, the way these tools work is tantamount to plagiarism or copyright theft. After all, they take texts written by humans and reuse them without giving credit or paying royalties. In the world of arts and literature, many have already spoken out and several lawsuits have been launched. In the translation sector, the issue has opened up a Pandora’s box of tricky questions around who actually owns the legal and moral rights to use and profit from decades of human translation work.

How these debates pan out will have huge implications for the future of the industry – so let’s take a closer look at how MT works, how human linguists fuel it, and what we can do to keep things fair.

Manteau, pelage ou couche?

Primitive forms of machine translation worked on a simple identify-and-swap basis. Dictionaries were matched together and one word in English would be replaced with its equivalent in French, for example. This was very limiting, however, because word X in any given language might have multiple different translations in its new language. The word for coat, for example, might be different depending on whether we’re talking about a winter coat, an animal’s coat or a coat of paint.
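The identify-and-swap approach can be sketched in a few lines of Python. This is only an illustration: the tiny dictionary `EN_FR` and the function `swap_words` are made up for this example and do not come from any real MT system. Because each English word maps to exactly one French word, "coat" always comes out as "manteau", even in a sentence about paint.

```python
# Toy English-to-French dictionary (illustrative only, not a real lexicon).
EN_FR = {
    "the": "le",
    "winter": "hiver",
    "coat": "manteau",  # could equally be "pelage" (fur) or "couche" (paint)
    "of": "de",
    "paint": "peinture",
}


def swap_words(sentence: str) -> str:
    """Replace each word with its single dictionary equivalent, keeping word order."""
    return " ".join(EN_FR.get(word, word) for word in sentence.lower().split())


print(swap_words("the winter coat"))    # le hiver manteau
print(swap_words("the coat of paint"))  # le manteau de peinture
```

Neither output is good French: the word order is wrong in the first, and the second picks the garment sense of "coat" where "couche" was needed.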

Today’s MT tools are much more advanced – they use neural technology and machine learning to work out which word is statistically most likely to be correct, and to structure sentences in a more natural way. By considering related words in the broader text – like wall, paintbrush, winter, stoat and so on – they can better divine exactly which word to use in each case.

This is possible because MT tools are trained on datasets – huge corpora of parallel texts made up of millions upon millions of words and how they have been translated in the past. Often it is unclear exactly where these datasets have come from, but they can be acquired from brokers or via web crawlers which scrape text from the internet. Training MT tools on these bilingual corpora allows them to identify patterns and learn what words tend to go together.
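To make the idea of learning "what words tend to go together" more concrete, here is a deliberately simplified sketch. It is not how neural MT works internally: it merely counts which context words appeared alongside each past translation of "coat" in a tiny invented corpus, then picks the translation whose contexts best match a new sentence. The corpus, `train` and `translate_coat` are all hypothetical names and data created for this illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up "parallel corpus": an English sentence paired with the French
# word a human translator chose for "coat" in that sentence.
CORPUS = [
    ("she wore a warm coat in winter", "manteau"),
    ("hang your winter coat by the door", "manteau"),
    ("the stoat grows a white coat", "pelage"),
    ("the stoat has a thick coat of fur", "pelage"),
    ("apply a second coat to the wall", "couche"),
    ("one coat with a paintbrush covers the wall", "couche"),
]


def train(corpus):
    """Count how often each context word appears alongside each translation."""
    counts = defaultdict(Counter)
    for sentence, translation in corpus:
        for word in sentence.split():
            if word != "coat":
                counts[word][translation] += 1
    return counts


def translate_coat(sentence, counts):
    """Pick the translation whose past contexts overlap most with this sentence."""
    scores = Counter()
    for word in sentence.lower().split():
        scores.update(counts.get(word, {}))
    if not scores:
        return "manteau"  # arbitrary fallback when no context word is recognised
    return scores.most_common(1)[0][0]


counts = train(CORPUS)
print(translate_coat("my winter coat is warm", counts))       # manteau
print(translate_coat("the wall needs another coat", counts))  # couche
print(translate_coat("the stoat sheds its coat", counts))     # pelage
```

Every number in `counts` comes from a choice a human translator once made. Remove the corpus and the function has nothing to go on except its fallback.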

What this means is that machine translation is impossible without human translation, and that renewed human input will always be necessary given that language is a natural and organic phenomenon. It is not something fixed and captured in the existing corpora – it evolves and changes on a near-daily basis. Without humans continuing to translate neologisms and new slang words like chillax, staycation, skimpflation, fam, deepfake and literally thousands of others, the machines will never know how to translate them.

What becomes clear, then, is that machine translation tools do not actually produce anything out of thin air – they use algorithms and prediction models to cobble together existing translations and produce fresh text for whatever input is fed in. But with so many people involved in making this possible, who actually gets to take credit for the output? The machine itself? Its developers? Or the translators who did all that mental legwork in the first place? How do we decide who has the right to use, sell and profit from the information fed into and out of machine translation tools?

‘Systematic theft on a mass scale?’

Although these issues have been current within translation for a while now, they have recently gained traction within broader media debates following a number of notable advancements within AI. When ChatGPT was launched at the end of 2022, many of us marvelled at what it could do. But before long, some began to ask questions about the ethical and legal aspects of training chatbots on existing content.

For example, the US Authors Guild launched a class-action lawsuit against OpenAI, accusing it of “systematic theft on a mass scale” for using their work. Other artists also made notable headline-generating interventions, such as Nick Cave, who branded generative AI an exercise in “replication as travesty”, and Sting, who cautioned that we need to be wary of how we use the tools. Illustrators, meanwhile, have launched the hashtag #NotoAIArt, pointing out that image-generation bots replicate their style and designs without giving them any credit or paying royalties.

In response to some of these concerns, OpenAI announced in August 2023 that it would enable websites to block its web crawler from scraping their content. Many publications, including the Guardian and the New York Times, in addition to big e-commerce platforms such as Amazon, have since chosen to avail of this option. However, this only concerns future content, and does not allow for the removal of materials from existing datasets, which continues to be something of a legal and ethical black hole.
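In practice, this kind of block is expressed in a site's robots.txt file, where OpenAI's crawler identifies itself with the user-agent token GPTBot. The sketch below uses Python's standard `urllib.robotparser` module to show how such a rule reads to a crawler that chooses to honour it. The robots.txt contents, the `example.com` address, the `SomeOtherBot` name and the `is_allowed` helper are illustrative assumptions, not any real publisher's configuration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt a publisher might serve (contents are illustrative).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("GPTBot", "https://example.com/articles/1"))        # False
print(is_allowed("SomeOtherBot", "https://example.com/articles/1"))  # True
```

Note that robots.txt is a request rather than an enforcement mechanism: it only tells a well-behaved crawler what not to fetch from now on, and says nothing about text that has already been collected.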

Translation and copyright

Many of the artists, publishers and illustrators affected likely have legitimate claims against how AI companies have been using their data, and some of the cases being tried now may shape the future framework for generative AI. Translators, however, are in a stickier position.

Part of the problem here is the sheer number of parties involved – in many jurisdictions, translators own copyright to their work only as derivative texts, meaning the original author retains a say. In practice, moreover, translators mostly cede their authorship rights to the agency that hires them, allowing the agency to reuse translation data to offer client discounts. So by the time technology gets involved, there are already three parties with an ownership stake in the content, not to mention any proofreaders or client reviewers who may also have helped shape the text.

Adding to this complexity is the lack of transparency along the chain. It is not possible to take a piece of MT output and reverse engineer it back to the translations it has drawn from or used. It is impossible to link back to an individual agency, never mind an individual translator. This makes it difficult for linguists or agencies to prove their content has been used without authorisation, and it complicates the idea of any kind of royalty scheme that might seek to compensate linguists for their data.

Towards a fairer future

Yes, data is indeed the new oil, and even within the translation industry it is fuelling new possibilities and shaking up revenue streams for many, from linguists and agencies to big tech companies entering the market. As we have seen, the question of who owns translation data is a thorny one with no clear answer, so perhaps we should instead be posing the question in a different way – how can we use this data fairly?

Today, many translators feel like they are getting a raw deal. Like turkeys voting for Christmas, they know their data has helped to fuel and refine MT programmes. Programmes which are fantastic and incredibly useful, and which could be an invaluable boon to the entire industry – but which, at the same time, are currently reducing translator earnings and, in some cases, muscling linguists out of work altogether.

We are standing on the frontier of a new world in which AI and automation will play increasingly fundamental roles in our lives. As we cross the threshold and move further into this new reality, it is important we remember that unlike oil, data is not a raw material to be mined from rock and earth. It is the product of hard-working and creative humans – and no matter what we do with it, we need to be fair towards those who made it possible.