AI is supposed to be the miracle technology of our age, but I still can’t get past the fact that it seems to need an extensive amount of “training” on the intellectual property and copyrighted works of humans. It’s like Pac-Man eating up all the books, music, and art in its path. And now add to that illustrious group: YouTube videos. You see, a nonprofit called EleutherAI believes that the development of AI, though expensive, should not be controlled solely by Big Tech. So in 2020 they released a dataset called The Pile, which they describe as “a large-scale corpus for training language models, composed of 22 smaller sources,” and it’s free to download. Basically, EleutherAI has created an AI training dataset for the masses to use. So naturally, large companies like Apple use it too. It is through the Pile that Apple and others have fed their own AI models YouTube videos, and the video creators are not happy:
The Pile was not intended for Big Tech, but here we are: AI models at Apple, Salesforce, Anthropic, and other major technology players were trained on tens of thousands of YouTube videos without the creators’ consent, and potentially in violation of YouTube’s terms, according to a new report appearing in both Proof News and Wired. The companies trained their models in part using “the Pile,” a collection put together by the nonprofit EleutherAI to offer a useful dataset to individuals and companies that don’t have the resources to compete with Big Tech, though it has since been used by those bigger companies as well.
Creators are seeing red, but it’s a legal gray area: The Pile includes books, Wikipedia articles, and much more, including YouTube captions collected via YouTube’s captions API and scraped from 173,536 YouTube videos across more than 48,000 channels. Among them are videos from big YouTubers like MrBeast, PewDiePie, and popular tech commentator Marques Brownlee. On X, Brownlee called out Apple’s use of the dataset but acknowledged that assigning blame is complex when Apple did not collect the data itself. He wrote: “Apple has sourced data for their AI from several companies. One of them scraped tons of data/transcripts from YouTube videos, including mine. Apple technically avoids ‘fault’ here because they’re not the ones scraping. But this is going to be an evolving problem for a long time.”
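For a sense of how low the barrier is: here’s a minimal sketch, using the open-source yt-dlp library, of pulling a public video’s caption track without downloading the video itself. To be clear, this is a generic illustration, not EleutherAI’s actual pipeline (the reporting doesn’t publish that), and the video URL is a placeholder.

```python
# Minimal sketch: fetching a public video's caption track with yt-dlp.
# This illustrates how easily captions can be scraped in bulk; it is NOT
# the exact tooling used to build the Pile.
# Requires: pip install yt-dlp
import yt_dlp

VIDEO_URL = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"  # placeholder video

ydl_opts = {
    "skip_download": True,      # captions only, no video file
    "writesubtitles": True,     # creator-uploaded subtitle tracks
    "writeautomaticsub": True,  # fall back to YouTube's auto-generated captions
    "subtitleslangs": ["en"],   # English tracks only
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([VIDEO_URL])   # writes e.g. "<title> [<id>].en.vtt" to disk
```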
Gotta love geek humor: Coincidentally, one of the videos used in the dataset was an Ars Technica short film whose joke was that it had already been written by AI. Proof News’ article also mentions that models were trained on videos of a parrot, so AI models are parroting a parrot parroting human speech, as well as parroting other AIs parroting humans. As AI-generated content continues to proliferate on the internet, it will be increasingly challenging to put together AI training datasets that don’t include content already produced by AI.
Is it fair use? The Pile is often used and referenced in AI circles, and tech companies have been known to train on it in the past. It has been cited in multiple lawsuits brought by intellectual property owners against AI and tech companies. Defendants in those lawsuits, including OpenAI, say that this kind of scraping is fair use. The lawsuits have not yet been resolved in court.
The Pile is a ‘robust data collection’ of intellectual property: However, Proof News did some digging to identify specifics about the use of YouTube captions, and went so far as to create a tool you can use to search the Pile for individual videos or channels. The work exposes just how extensive the data collection is and calls attention to how little control intellectual property owners have over how their work is used once it’s on the open web. It’s important to note, however, that this data was not necessarily used to train models that produce competing content for end users.
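Proof News hasn’t described its tool in technical detail, but a rough sketch of the same idea is simple enough. This assumes a locally downloaded Pile shard in the dataset’s published jsonl format, where each line is a JSON document with "text" and "meta" fields; the shard filename, subset name, and search query below are all assumptions for illustration.

```python
# Rough sketch of searching a local Pile shard for a channel or phrase,
# similar in spirit to Proof News' lookup tool (their implementation
# isn't public here).
import json

def search_pile(path: str, query: str, subset: str = "YoutubeSubtitles"):
    """Yield transcript previews from one Pile shard whose text mentions `query`."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)  # {"text": ..., "meta": {"pile_set_name": ...}}
            if doc.get("meta", {}).get("pile_set_name") != subset:
                continue  # skip non-YouTube components (books, Wikipedia, etc.)
            if query.lower() in doc["text"].lower():
                yield doc["text"][:200]  # preview of the matching transcript

# Usage (assumes you've downloaded a shard such as 00.jsonl):
for preview in search_pile("00.jsonl", "Marques Brownlee"):
    print(preview, "\n---")
```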
Whelp, this is social media all over again, right? The technology is moving faster than the public can fully understand its consequences, and certainly faster than laws can regulate it. There are so many layers to this; my brain broke thinking about all the laws that need to be written. Were the videos used to train AI to make competitive material, or purely for “educational” purposes? Does that distinction even matter when it comes to intellectual property? A lot of this will come down to YouTube’s terms of service, and how courts interpret the protections for creator content outlined there. But my biggest question is this: let’s say all the legal issues are ironed out and creators’ works are protected (I know, ha)… then what was all this (alleged) theft for? I’m genuinely hoping for an answer beyond “they fed AI the material just because they could.”
Photos via YouTube/Marques Brownlee
It is outrageous, and they’re taking advantage of the murkiness and their sheer size to get away with it. The horse has left the barn: their tools have been trained, and there will be no reasonable way to assign a value to the theft. It’s so much further along than the “old” days. If you used a photo that wasn’t yours on your website, for instance, that was a clear copyright violation with known penalties. Derivative use (where you’re morphing one work into another) also has precedent, but it’s complicated (e.g., is it satire?). But this… this is very dangerous, and content creators, authors, songwriters, and artists are all at risk.
This is a big issue, true, and legislation needs to catch up and protect intellectual property, but my biggest issue with AI is that we are solidifying our biases in it. Because people don’t understand it, they assume it’s neutral since it’s a machine. It learns from the dataset it’s fed: if you have, say, an AI helping you make hiring decisions, and 70% of successful applicants’ CVs were male, it’s going to prefer male applicants going forward (or white ones…). A really good book on this (and other ways missing data is harming women) is “Invisible Women” (and it’s even worse for minorities).
Exactly. Now we need to teach generations of people how to tell whether what they’re seeing is real or an AI-created image. Stat.
“Neutral because it’s a machine” is the biggest flaw. The machines were built, and their algorithms designed, by humans. Humans control what data is used, the order it’s used in, and the weight it’s given (or whether it’s dismissed)… the AI is only creating from what it’s been fed.
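A tiny, fully synthetic sketch makes these commenters’ point concrete. This is toy data and a toy model, not drawn from any real hiring system: a classifier trained on records where men were hired more often learns gender as a predictive signal, even when skill is distributed identically across groups.

```python
# Toy illustration (entirely synthetic data): a model trained on skewed
# hiring outcomes learns gender as a signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
gender = rng.integers(0, 2, n)   # 0 = female, 1 = male
skill = rng.normal(0, 1, n)      # identically distributed across both groups

# Biased historical labels: same skill, but men were hired more often
hired = (skill + 0.8 * gender + rng.normal(0, 0.5, n)) > 0.5

model = LogisticRegression().fit(np.column_stack([gender, skill]), hired)
print("learned weights [gender, skill]:", model.coef_[0])
# The gender coefficient comes out strongly positive: the model has
# faithfully reproduced the bias baked into its training data.
```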
Kismet
Enjoyed this. Smart. Tech moves so fast they haven’t created laws to govern it yet. So this would be 2.0. Cybercrime and data-tech analyst jobs will potentially be on the rise. A new breed of law is evolving and moving fast to try to play catch-up when it’s already decades behind. So those jobs will be in demand, too.
Side note. I went to college with Chad Hurley. Nice guy; has done a lot of good as an alumnus. 😀
When they say “where is everyone?”, well, this is why. Everyone is tired of contributing. There was an early-’90s blog where, as soon as the author gained a little traction (within a week of creation), multiple YouTube channels popped up using his posts! When he got tired and stopped posting, the channels stopped posting!
We saw it with blogs, TikTok, and now YouTube. Small creators are tired of helping millionaires grow rich.
It’s infuriating that we’re spending our collective resources on this kind of thing, instead of existential problems that affect all of us.