Legal cases question IP in large language model training

A recent warning from OpenAI about the potential ramifications of a stringent copyright crackdown on artificial intelligence (AI) development has sparked a complex legal debate about the balance between AI advancement and intellectual property (IP) rights.

At the heart of these cases is whether businesses that make money from licensing or selling web content should be compensated when a large language model (LLM) uses that content for training. Content creators have told the courts that their business models are being undermined, and that an LLM trained on their IP could generate content that is hard to distinguish from the work produced by the IP owner.

A lawsuit filed on 27 December by The New York Times claims Microsoft and OpenAI used articles publicly available on The New York Times’ website to create artificial intelligence products that compete with and threaten the newspaper’s ability to provide its web news service. “Defendants’ generative artificial intelligence tools rely on large language models that were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides and more,” The New York Times stated in the filing.

The newspaper said that although Microsoft and OpenAI engaged in wide-scale copying from many sources, they gave content from The New York Times particular emphasis when building their LLMs. “Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the filing from the newspaper stated.

Meanwhile, in the UK, Stability AI has failed in a bid to have certain claims that it infringed the IP rights of Getty Images thrown out before the case goes to trial.

Discussing the two lawsuits and how LLMs are trained, Paul Joseph, IP partner at Linklaters, said: “From what I’ve read, generally there is at least an element of reading stuff, making copies of stuff, and then running crawlers or AI systems over them to learn. The making of copies along the way is part of the training process.” However, the act of making copies of the content, according to Joseph, is restricted by copyright laws.
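To illustrate the copying Joseph describes, the following is a minimal sketch of a hypothetical corpus-building step, not any vendor's actual pipeline: simply fetching a page writes a verbatim copy to disk before the text is ever tokenised for training. The URL, file names and tokenisation shown are illustrative assumptions.

```python
# Minimal sketch of the copying step in a crawl-and-train pipeline.
# The URL, paths and naive tokenisation here are illustrative assumptions,
# not a description of any real provider's system.
import urllib.request
from pathlib import Path

CORPUS_DIR = Path("corpus")  # hypothetical local store of crawled pages
CORPUS_DIR.mkdir(exist_ok=True)

def fetch_page(url: str, name: str) -> Path:
    """Download a page: this step is itself a reproduction of the work."""
    with urllib.request.urlopen(url) as resp:
        html = resp.read()
    out = CORPUS_DIR / f"{name}.html"
    out.write_bytes(html)  # a verbatim copy of the content now exists on disk
    return out

def tokenise(path: Path) -> list[str]:
    """Naive whitespace tokenisation of the stored copy into training input."""
    text = path.read_text(errors="ignore")
    return text.split()

if __name__ == "__main__":
    page = fetch_page("https://example.com/", "example")  # placeholder URL
    tokens = tokenise(page)
    print(f"Stored copy at {page}; {len(tokens)} tokens ready for training")
```

The point the sketch makes is the legal one: the copy is not incidental, it is a prerequisite of the training process.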

For an LLM provider or an enterprise user of a commercial LLM that is trained this way, he said: “Unless you fall into one of a few copyright exceptions, then it will be an infringement, and it’s not easy to get this sort of commercial training exercise into any exceptions.”

While the legal arguments may differ in the US compared with Europe or the UK, Joseph said: “It is fair to assume that amongst all the training activities of the different companies providing LLMs, at least some of those activities probably infringe IP rights.”

He said that anyone who makes their money from being a content creator should be concerned by the ability of these products to mimic their IP. For instance, Getty makes money from licensing the rights to use the images in its vast image library, which are often incorporated in company brochures and slide decks. If an image-generating model can create content similar to the images in such a library, this would undermine that company’s business model.

Joseph said lessons can be learnt from the early days of streaming music, with the likes of Napster offering free downloads. “People got music by going on Napster and other sites,” he said. “It was all quite Wild West. No one really knew what was lawful and what wasn’t.”

With the introduction of Spotify came a licensed model. “You knew that music on Spotify was safe and you wouldn’t get a virus when you used it,” said Joseph. “But critically, Spotify made the interaction with the customer so much better than the pirate sites that people were willing to pay a subscription every month to have access to this new music service. The AI world may well go through the same thing.”

As for the current situation, he said enterprise users of commercial LLMs need to be cognisant of the worst-case scenario, which is that any AI-generated content they use may infringe someone’s copyright.

“At best, there is the uncertainty around how the different systems have been trained,” said Joseph. “We’re now in the early throes of litigation, settlement conversations and licensing conversations. Then we’ll come out at the other end with a more coherent and balanced system.”
