AI’s Content Gold Rush: Who’s Getting Paid, Who’s Getting Scraped, and How Businesses Can Turn Content into Revenue

The AI boom is creating a new content economy – one where savvy content owners are striking multi-million-dollar licensing deals, while others are being automatically scraped by bots to train AI models for free.
The biggest names in AI, including OpenAI, Google, Anthropic and Perplexity, are at the center of an argument about ethical content scraping.
As AI companies race to train their large language models (LLMs) on original human-generated creations, content-rich organizations – publishers, research firms, media houses, stock imagery providers and content archives – find themselves dragged into an unregulated gold rush.
Some archives and publishers have turned this demand into an opportunity, negotiating lucrative contracts with AI firms that recognize the value of high-quality, human-generated content. Others, however, are being left behind, their work automatically scraped and fed into AI models without compensation or acknowledgment. This source material is then used to generate derivative works that are commercialized by the AI firms, stealing potential revenue from original content owners.
But whether or not you’re currently able to monetize your content, it’s imperative that you protect it before it becomes fair game for both well-behaved and unethical scraper bots.
So, how can businesses protect their content from AI-driven extraction before it’s too late?
The divide between those who profit and those who lose
AI firms have made it clear that they need high-quality original content to train their LLMs in both broad and niche subjects. This could be a historical archive or world news, forums full of user-generated questions and answers, images and videos in specific styles, or a library of specialist journals. There is also growing evidence that training models repeatedly on AI-generated content degrades their output, a failure mode known as model collapse, which makes original human-made material more valuable still.
The scandal is that once LLMs have been trained on this material, the products they power can generate derivative works that the AI company is free to sell.
Recognizing this, major publishers have leveraged their intellectual property to secure substantial, and in some cases exclusive, licensing agreements. Companies like Condé Nast, home to Vogue and The New Yorker, have signed multi-year deals with OpenAI, while News Corp has secured an agreement, reported to be worth more than $250 million over five years, that gives OpenAI access to its archive.
Major licensing deals between content owners and AI companies
| Content Owner | AI Company | Deal Date | Estimated Value |
| Associated Press (AP) | OpenAI | Jul 2023 | Not disclosed |
| Shutterstock | OpenAI, Meta | Jul 2023 | Not disclosed |
| Axel Springer | OpenAI | Dec 2023 | “Tens of millions of euros” (aiwatch.dog) |
| | | Feb 2024 | Approximately $60 million/year (aiwatch.dog) |
| Automattic | OpenAI, Midjourney | Feb 2024 | Not disclosed |
| Wiley | Unknown | Mar 2024 | $23 million (aiwatch.dog) |
| Le Monde, Prisa Media | OpenAI | Mar 2024 | Not disclosed |
| Financial Times | OpenAI | Apr 2024 | Between $5 million and $10 million/year (aiobserver.co) |
| Stack Overflow | OpenAI | May 2024 | Not disclosed |
| Dotdash Meredith | OpenAI | May 2024 | At least $16 million (aiobserver.co) |
| Taylor & Francis | Microsoft | May 2024 | Almost £8 million ($10 million) in the first year (aiwatch.dog) |
| | OpenAI | May 2024 | Not disclosed |
| NewsCorp | OpenAI | May 2024 | Over $250 million over five years (aiwatch.dog) |
| Vox Media | OpenAI | May 2024 | Not disclosed |
| The Atlantic | OpenAI | May 2024 | Not disclosed |
| Time | OpenAI | Jun 2024 | Not disclosed |
| Financial Times, Axel Springer, The Atlantic, Fortune, Universal Music Group | ProRata.ai | Aug 2024 | Not disclosed |
| Condé Nast | OpenAI | Aug 2024 | Not disclosed |
| Wiley | Unknown | Aug 2024 | $44 million in licensing deals (aiwatch.dog) |
| Informa plc | OpenAI | Aug 2024 | $10 million |
| Reuters | Meta | Oct 2024 | Not disclosed |
| HarperCollins | Microsoft | Nov 2024 | Not disclosed |
| Associated Press (AP) | | Jan 2025 | Not disclosed |
| Getty Images, Shutterstock | Getty Images, Shutterstock | Jan 2025 | $3.7 billion merger |
Protect your content, protect your revenue
These agreements signal an important shift in the way digital content is valued and raise a question about how it can be protected. Not every company has the leverage of an established media giant to fight unwanted scrapers in court and in the press, or to force licensing agreements. Many businesses, from niche research firms to independent publishers, are being left out of such negotiations and their content is freely scraped and repurposed without consent.
We recently wrote about the ineffectiveness of robots.txt as a way of preventing scraper bots from crawling your web estate. Many scraper bots also exhibit unethical and downright malicious behaviors, lying about their purpose, changing their user agent, and using residential IP addresses as proxies, all while increasing traffic load on your site with repeated and high-density crawling.
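To make the robots.txt point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the example URL are hypothetical, and GPTBot is used only as an example of a crawler token. It shows that the file is advisory: it only works when the crawler identifies itself honestly and chooses to ask.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks one AI crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/research/report-2024"  # hypothetical page

# A well-behaved crawler checks the rules under its real name and backs off.
honest = parser.can_fetch("GPTBot", url)

# A crawler that swaps its user agent is matched by the wildcard rule instead.
# Nothing here enforces anything: a bot can also skip this check altogether.
disguised = parser.can_fetch("Mozilla/5.0", url)

print(honest)     # False
print(disguised)  # True
```

The same request is refused or permitted purely on the strength of a self-declared name, which is why robots.txt cannot be treated as an access control.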
For businesses hit by these crawlers, the implications are serious. When AI scrapers ingest vast amounts of human-generated content without permission, they effectively bypass paywalls, subscription models, and advertising-based revenue streams. They add cost to your infrastructure bill and take content to be repurposed in their own products.
The result is that businesses and creators invest in making valuable content while AI companies monetize it. At the same time, AI-generated summaries or reimaginings of this scraped content reduce the need for users to visit the original sources, further diminishing traffic and revenue.
Beyond financial loss, there is also the issue of intellectual property erosion. Proprietary research, expert insights, in-depth analysis, and art are among the most valuable digital assets a business can produce or license. If AI-generated content, trained on scraped data, can deliver a reimagined version of these insights at scale, businesses risk losing their competitive edge.
AI scraping continues, even as licensing deals are struck
Despite these high-profile licensing agreements, unauthorized content scraping remains widespread.
An additional twist in the story is that some licensing deals are believed to be exclusive, or at least preferential, raising the question of whether content owners are responsible for enforcing this exclusivity by preventing scraper bots sent by rival firms from accessing the content.
Whatever the case, AI firms continue to extract data from websites without permission, using bots to systematically harvest news articles, research papers, blogs, images and video at scale.
How businesses can protect and profit from original content
As content theft through scraping continues, businesses must take proactive measures to protect their property before it’s too late. The first step is recognizing that many AI companies will only pay for content when they are forced to. If access is unrestricted, unlimited scraping remains the default method.
Companies should implement detection and prevention measures to identify scraper bots in real time. Unlike human visitors, scrapers follow identifiable patterns: they move faster, request large volumes of pages, and crawl paths systematically.
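The patterns described above can be sketched as a simple sliding-window check. This is an illustration only, not how any particular bot-detection product works: the class name, the thresholds and the choice of IP address as the client identifier are all assumptions, and a check this simple will catch only unsophisticated crawlers.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # assumed look-back window
MAX_REQUESTS = 120       # assumed ceiling for a human visitor per window
MAX_DISTINCT_PATHS = 60  # assumed ceiling for distinct pages per window


class ScraperDetector:
    """Flags clients whose request volume or page coverage looks automated."""

    def __init__(self):
        # client id -> deque of (timestamp, path), oldest first
        self._hits = defaultdict(deque)

    def record(self, client_id, path, timestamp):
        """Record one request; return True if the client now looks automated."""
        hits = self._hits[client_id]
        hits.append((timestamp, path))
        # Drop requests that have aged out of the window.
        while hits and hits[0][0] <= timestamp - WINDOW_SECONDS:
            hits.popleft()
        distinct_paths = {p for _, p in hits}
        return len(hits) > MAX_REQUESTS or len(distinct_paths) > MAX_DISTINCT_PATHS


detector = ScraperDetector()

# Simulated crawler: 200 different pages at ten requests per second.
crawler = [detector.record("203.0.113.7", f"/articles/{i}", i * 0.1)
           for i in range(200)]

# Simulated reader: five pages, one every 30 seconds.
reader = [detector.record("198.51.100.4", f"/articles/{i}", i * 30.0)
          for i in range(5)]

print(crawler[-1])  # True: far more distinct pages than the assumed ceiling
print(any(reader))  # False
```

Counting distinct paths as well as raw volume is a deliberate choice here: a reader may refresh one page many times, whereas a crawler tends to touch each page once and move on.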
Sophisticated scraper bots looking to disguise their activity may rate-limit their own requests and rotate user agents and IP addresses, or even route traffic through residential proxies.
Advanced bot detection solutions can help businesses differentiate legitimate visitor traffic and even licensed scraper bots from unauthorized content harvesting.
Beyond detection, businesses can use anti-bot solutions to enforce tighter access controls. CAPTCHA challenges can significantly slow scrapers down, and detected bots can be sent to alternative content such as a licensing information page or even ‘fake’ content paths.
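One way to picture how these responses fit together is a small policy function that maps a detection verdict to an action. This is a sketch under stated assumptions: the `Verdict` fields, the thresholds, the `partner-crawler` identifier and the idea of a single bot score are all hypothetical, and the hard work of producing the score and verifying a bot's identity is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"                    # e.g. serve a CAPTCHA
    REDIRECT_LICENSING = "redirect_licensing"  # send to a licensing page


LICENSED_BOTS = {"partner-crawler"}  # hypothetical bots covered by a deal
CHALLENGE_THRESHOLD = 0.5            # assumed score at which to challenge
REDIRECT_THRESHOLD = 0.9             # assumed score at which to redirect


@dataclass
class Verdict:
    bot_score: float                    # 0.0 (human) to 1.0 (certainly a bot)
    verified_bot: Optional[str] = None  # identity confirmed upstream, if any


def choose_action(verdict: Verdict) -> Action:
    """Pick a response for one request based on the detection verdict."""
    # Bots covered by a licensing agreement are let through.
    if verdict.verified_bot in LICENSED_BOTS:
        return Action.ALLOW
    # Near-certain unlicensed bots are pointed at the licensing page.
    if verdict.bot_score >= REDIRECT_THRESHOLD:
        return Action.REDIRECT_LICENSING
    # Ambiguous traffic gets a challenge rather than an outright block.
    if verdict.bot_score >= CHALLENGE_THRESHOLD:
        return Action.CHALLENGE
    return Action.ALLOW


print(choose_action(Verdict(bot_score=0.1)))
print(choose_action(Verdict(bot_score=0.7)))
print(choose_action(Verdict(bot_score=0.95)))
print(choose_action(Verdict(bot_score=0.95, verified_bot="partner-crawler")))
```

Keeping the licensed-bot check first means a paying partner is never challenged, while everything else is handled in proportion to how confident the detection layer is.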
For organizations that produce particularly high-value content, licensing should be a top priority. Instead of allowing AI firms to scrape their data unchecked, businesses can explore ways to structure formal agreements that ensure proper compensation. Just as large publishers have begun to monetize their archives, research firms, thought leadership platforms, and specialized content providers should position themselves as essential data sources rather than passive contributors.
The future of AI and content ownership
The increasing reliance on AI-generated content only reinforces the value of human creativity and expertise. As businesses grapple with the implications of AI scraping, the need to protect intellectual property becomes more urgent than ever. Those that take proactive steps – whether through bot mitigation, licensing negotiations, or stronger content protection measures – will be the ones that retain control over their digital assets and precious revenue streams in the long run.
AI companies have already shown a willingness to pay for content, but they won’t do so unless they have no other choice. Businesses must ask themselves: are they securing their place in this new content economy, or are they unknowingly fueling AI models for free?
For companies that depend on digital content, research, and proprietary insights, bot mitigation isn’t just a cybersecurity concern; it’s a business imperative. If you’re not actively monitoring how your content is being used, there’s a high chance it’s already being scraped.