Your online content is fodder for “freeware” training models • The Register

Microsoft AI CEO Mustafa Suleiman said this week that machine learning companies can scrape most of the content publicly available online and use it to train neural networks because it’s essentially “freeware.”

Soon after, the Center for Investigative Reporting Sued OpenAI And Microsoft, the company’s largest investor, filed suit, alleging that the company “used and compensated non-profit news organizations for their content without permission.”

It follows in the footsteps of eight newspapers. He sued OpenAI and Microsoft. The company filed a lawsuit in April over alleged content misappropriation, four months after The New York Times also filed a lawsuit.

And the two authors He sued OpenAI and Microsoft. In January, he sued OpenAI and GitHub for training an AI model using the author’s work without permission, and in 2022, multiple unidentified developers sued OpenAI and GitHub for using publicly available programming code to train generative models in violation of software license terms.

Questioned interview Suleiman, who debated CNBC’s Andrew Ross Sorkin at the Aspen Ideas Festival about whether AI companies are effectively stealing the world’s intellectual property, acknowledged that there is a debate but sought to distinguish between content that people post online and content that is backed by corporate copyright holders.

“When it comes to content that’s already on the open web, I think the social contract around that content has been fair use since the 1990s,” he said. “Anyone can copy it, recreate it, reproduce it. It was freeware, so to speak. That’s the understanding.”

Suleiman acknowledged that there is another category of content: content published by companies staffed by lawyers.

“There’s another category where websites or publishers or news organizations can specifically say, ‘Please don’t scrape or crawl for any purpose other than indexing,’ so that other people can find that content,” he explained. “But it’s a gray area. And I think that will be resolved in court.”

That’s an understatement. While Suleiman’s comments are sure to anger content creators, he’s not entirely wrong. It’s not clear where the legal lines are when it comes to training AI models and the output of those models.

Most people who post content online as individuals are violating rights in some way by agreeing to the terms of use offered by the major social media platforms. Reddit’s decision to license its users’ posts to OpenAI would not have been made if the company believed its users had legitimate rights to their memes and manifestos.

The fact that OpenAI and other AI modeling companies have struck content deals with major publishers shows that a strong brand, deep pockets and legal team can bring big tech businesses to the negotiating table.

In other words, anyone who creates content and posts it online is creating freeware, unless they can hire or attract lawyers willing to challenge Microsoft and companies like it.

in paper In a paper published last month on SSRN, Frank Pasquale, professor of law at Cornell Tech and Cornell Law School in the US, and Hao-Cheng Sun, associate professor of law at the University of Hong Kong, explore the legal uncertainty surrounding the use of copyrighted data to train AI and whether courts would find such use fair. They conclude that AI needs to be addressed at the policy level, as current law is not suited to answer the questions that need to be addressed now.

“Given significant uncertainty about the lawfulness of AI providers’ use of copyrighted works, lawmakers need to articulate a bold new vision for rebalancing rights and responsibilities, just as they did with the development of the internet (the Digital Millennium Copyright Act of 1998),” they argue.

The authors suggest that continued free collection of creative works will threaten not only writers, composers, journalists, actors and other creative professionals, but also generative AI itself, eventually leading to a shortage of training data: they predict that people will stop putting their work online if it is used to power AI models that reduce the marginal cost of content creation to zero, removing the possibility of rewarding creators.

That’s the future Suleiman sees. “The economics of information are going to change fundamentally because we can bring the cost of producing knowledge down to zero marginal cost,” he said.

All of this freeware that you helped create is available for a small monthly subscription fee.®

Your online content is fodder for “freeware” training models • The Register

Editor's Pick

Bubba’s Hot Chicken, a 50-year-old recipe, expands to new locations

Katy Perry shows off hot dress with mega train featuring lyrics from new single: Pics

AI investment goes from “hot name” to necessity

Top Writers

Opinion

You Might Also Like

European Commission finds Meta’s ‘pay or agree’ model fails to meet EU competition rules

Sega’s Crazy Taxi reboot is a “massively multiplayer driving game”

Japan’s SmartHR Raises $140M Series E as Strong Demand for HR Tech Drives ARR to $100M

Shadaloo Fighting Pass returns to Street Fighter 6