Copyrighted works by Zadie Smith, Stephen King, and Rachel Cusk are being used to train artificial intelligence (AI).


The works of numerous authors, including Zadie Smith, Stephen King, Rachel Cusk, Margaret Atwood, Haruki Murakami, and Jonathan Franzen, have been used to train models developed by companies such as Meta and Bloomberg.




Zadie Smith, Stephen King, Rachel Cusk, and Elena Ferrante are among a multitude of authors whose works have been used without authorization to train artificial intelligence tools, according to a report from The Atlantic.


More than 170,000 titles were fed into models developed by companies including Meta and Bloomberg, according to an analysis of the "Books3" dataset, which these firms used to build their AI tools.


Books3 played a role in training Meta's LLaMA, one of a number of large language models, the best known of which is OpenAI's ChatGPT. The dataset was also used to train Bloomberg's BloombergGPT and EleutherAI's GPT-J, and it is presumed to have been incorporated into other AI models as well.


Within the Books3 collection, approximately one-third of the titles are fiction and two-thirds are nonfiction, with the majority published within the past two decades. Alongside the writings of Smith, King, Cusk, and Ferrante, the dataset includes copyrighted works such as 33 books by Margaret Atwood, at least nine by Haruki Murakami, nine by bell hooks, seven by Jonathan Franzen, five by Jennifer Egan, and five by David Grann.


The works of authors such as George Saunders, Junot Díaz, Michael Pollan, Rebecca Solnit, and Jon Krakauer are also included. Notably, the dataset contains 102 pulp novels by L. Ron Hubbard, the founder of Scientology, and 90 books by pastor John MacArthur.


The selection encompasses publications from both major and independent publishers. This includes over 30,000 titles from Penguin Random House, 14,000 from HarperCollins, 7,000 from Macmillan, 1,800 from Oxford University Press, and 600 from Verso.


This revelation follows a lawsuit filed last month by three authors, Sarah Silverman, Richard Kadrey, and Christopher Golden, who alleged that their copyrighted works were used to train Meta's LLaMA. Subsequent analysis confirmed that the writings of all three plaintiffs are indeed included in the Books3 dataset.


OpenAI, the company behind the AI chatbot ChatGPT, has also faced accusations of training its models on copyrighted material. The origins of OpenAI's training data were hinted at in a 2020 paper released by the company, which referred to two "internet-based books corpora." One of these, named Books2, is estimated to contain nearly 300,000 titles. A lawsuit filed in June asserts that the only plausible sources of content on that scale are "shadow libraries" such as Library Genesis (LibGen) and Z-Library, platforms where books can be downloaded in bulk via torrents.


Shawn Presser, the independent AI developer who originally created Books3, expressed sympathy for authors' concerns. He built the dataset so that anyone could develop generative AI tools, and he worries about the risks of major corporations gaining control over the technology.


Although a Meta spokesperson declined to comment to The Atlantic on the company's use of Books3, a Bloomberg representative confirmed that the company had used the dataset, stating: "We will not incorporate the Books3 dataset into the data sources used to train future versions of BloombergGPT."


