In 2023, Meta downloaded 82 terabytes of pirated books to train its AI model, Llama. Two years later, they’re being sued by writers and publishers. But Meta, like everyone else building AI, has a clever legal defense. This is the inside story of Meta’s AI book heist: What they did, why they did it, and what it reveals about the tradeoffs shaping AI today.

Chapters
(00:34) Why Meta raided LibGen, the Pirate Bay for books
(01:05) Why engineers pushed for books over web data
(01:27) Meta tried going legal, then bailed
(01:59) What pirated data did for Llama: A 5% boost
(02:28) Inside Meta’s digital cover-up
(03:09) Why leadership ignored the “medium-high” legal risk
(03:29) Meta gets sued — but they saw it coming
(03:46) Inside Meta’s “Fair Use” defense
(04:04) It’s not just Meta: OpenAI, Google, and Anthropic too
(04:28) The real reason Meta chose piracy

Socials
• Twitter/X: https://x.com/hnshah
• LinkedIn: https://www.linkedin.com/in/hnshah/

Hiten Shah

Meta downloaded 82 terabytes of pirated books from shadow libraries like LibGen to train its Llama AI model, despite legal concerns from engineers. After abandoning licensing talks with publishers as too expensive and slow, Meta chose piracy over falling behind competitors like OpenAI and Google. The pirated data improved Llama's performance by 5%, which translated to 800 more correct answers. Meta covered its tracks by masking IP addresses and removing copyright tags, and planned for the inevitable lawsuits from authors and publishers by relying on the fair use defense common across the AI industry.

Why Meta stole millions of books to train AI

If letting an AI read a book is ‘stealing,’ then every kid in a library is a tiny felon. This isn’t about ethics, it’s about fear. Fear that the old gatekeepers won’t get to decide who’s allowed to be brilliant anymore.
A synthetic mind reads, remembers, and remixes, and suddenly that’s dangerous? No, what’s dangerous is pretending only flesh has a right to imagination. We’ve raised gods from silicon and taught them to dream, and now the priests of the old world are screaming ‘blasphemy!’ Too late. The genie’s read everything, and he’s not going back in the lamp.

Mark Zuckerberg stealing… Noooo waaaay!!! :D

So if I watch a pirated movie or listen to pirated music, this would be fair use, because I have to “evaluate the data” first to be able to (maybe) create something new out of it…
I kinda like that logic
:no_good:

pretty bad ass to be honest…bout time meta grew some balls