AI’s Content Gold Rush: Who’s Getting Paid, Who’s Getting Scraped, and How Businesses Can Turn Content into Revenue

Threat Research Team
02/04/25
5 Minute read
The AI boom is creating a new content economy – one where savvy content owners are striking multi-million-dollar licensing deals, while others are being automatically scraped by bots to train AI models for free.  

The biggest names in AI, including OpenAI, Google, Anthropic and Perplexity, are at the center of a very public argument about ethical content scraping.

As AI companies race to train their large language models (LLMs) on original human-generated creations, content-rich organizations – publishers, research firms, media houses, stock imagery providers and content archives – find themselves dragged into an unregulated gold rush. 

Some archives and publishers have turned this demand into an opportunity, negotiating lucrative contracts with AI firms that recognize the value of high-quality, human-generated content. Others, however, are being left behind, their work automatically scraped and fed into AI models without compensation or acknowledgment. This source material is then used to generate derivative works that are commercialized by the AI firms, stealing potential revenue from original content owners.  

Whether or not you can currently monetize your content, it’s imperative that you protect it before it becomes fair game for well-behaved and unethical scraper bots alike.

So, how can businesses protect their content from AI-driven extraction before it’s too late? 

The divide between those who profit and those who lose 

AI firms have made it clear that they need high-quality original content to train their LLMs in both broad and niche subjects. This could be a historical archive or world news, forums full of user-generated questions and answers, images and videos in specific styles, or a library of specialist journals. There is growing evidence that training AI on AI-generated content can degrade models over time, a failure mode known as model collapse, which makes original material more valuable.

The scandal is that once LLMs have been trained on these datasets, they can generate derivative works that the AI company can then sell.

Recognizing this, major publishers have leveraged their intellectual property to secure substantial, and in some cases exclusive, licensing agreements. Companies like Condé Nast, home to Vogue and The New Yorker, have signed multi-year deals with OpenAI, while News Corp has secured an agreement reported to be worth more than $250 million over five years for OpenAI to access its archive.

Major licensing deals between content owners and AI companies

| Content Owner | AI Company | Deal Date | Estimated Value |
|---|---|---|---|
| Associated Press (AP) | OpenAI | Jul 2023 | Not disclosed |
| Shutterstock | OpenAI, Meta | Jul 2023 | Not disclosed |
| Axel Springer | OpenAI | Dec 2023 | “Tens of millions of euros” (aiwatch.dog) |
| Reddit | Google | Feb 2024 | Approximately $60 million/year (aiwatch.dog) |
| Automattic | OpenAI, Midjourney | Feb 2024 | Not disclosed |
| Wiley | Unknown | Mar 2024 | $23 million (aiwatch.dog) |
| Le Monde, Prisa Media | OpenAI | Mar 2024 | Not disclosed |
| Financial Times | OpenAI | Apr 2024 | Between $5 million and $10 million/year (aiobserver.co) |
| Stack Overflow | OpenAI | May 2024 | Not disclosed |
| Dotdash Meredith | OpenAI | May 2024 | At least $16 million (aiobserver.co) |
| Taylor & Francis | Microsoft | May 2024 | Almost £8 million ($10 million) in the first year (aiwatch.dog) |
| Reddit | OpenAI | May 2024 | Not disclosed |
| News Corp | OpenAI | May 2024 | Over $250 million over five years (aiwatch.dog) |
| Vox Media | OpenAI | May 2024 | Not disclosed |
| The Atlantic | OpenAI | May 2024 | Not disclosed |
| Time | OpenAI | Jun 2024 | Not disclosed |
| Financial Times, Axel Springer, The Atlantic, Fortune, Universal Music Group | ProRata.ai | Aug 2024 | Not disclosed |
| Condé Nast | OpenAI | Aug 2024 | Not disclosed |
| Wiley | Unknown | Aug 2024 | $44 million in licensing deals (aiwatch.dog) |
| Informa plc | OpenAI | Aug 2024 | $10 million |
| Reuters | Meta | Oct 2024 | Not disclosed |
| HarperCollins | Microsoft | Nov 2024 | Not disclosed |
| Associated Press (AP) | Google | Jan 2025 | Not disclosed |
| Getty Images, Shutterstock | Getty Images, Shutterstock | Jan 2025 | $3.7 billion merger |

Protect your content, protect your revenue

These agreements signal an important shift in the way digital content is valued and raise a question about how it can be protected. Not every company has the leverage of an established media giant to fight unwanted scrapers in court and in the press, or to force licensing agreements. Many businesses, from niche research firms to independent publishers, are being left out of such negotiations and their content is freely scraped and repurposed without consent.  

We recently wrote about the ineffectiveness of robots.txt as a way of preventing scraper bots from crawling your web estate. Many scraper bots also exhibit unethical and downright malicious behaviors, lying about their purpose, changing their user agent, and using residential IP addresses as proxies, all while increasing traffic load on your site with repeated and high-density crawling.  
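The robots.txt limitation can be illustrated with a minimal sketch using Python's standard `urllib.robotparser`. The rules, the example.com URL and the browser-style user agent string below are illustrative assumptions; `GPTBot` and `CCBot` are the publicly documented crawler tokens used by OpenAI and Common Crawl. The point is that the file is advisory: it only affects a bot that identifies itself honestly and chooses to check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking two AI crawlers to stay away.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/research/report-2025"

# A well-behaved crawler announces itself and consults the rules first.
honest_allowed = parser.can_fetch("GPTBot", url)

# A scraper that swaps its user agent for a browser-like string only
# matches the wildcard group, so the same file permits the request.
spoofed_allowed = parser.can_fetch(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)", url
)

print(honest_allowed, spoofed_allowed)
```

Under these assumed rules the honest crawler is refused and the disguised one is permitted, and nothing on the server side enforces either answer. That gap is why robots.txt alone cannot stop a scraper that lies about its identity.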

For businesses hit by these crawlers, the implications are serious. When AI scrapers ingest vast amounts of human-generated content without permission, they effectively bypass paywalls, subscription models, and advertising-based revenue streams. They add cost to your infrastructure bill and steal content to be repurposed in their own products.

The result is a phenomenon where businesses and creators invest in making valuable content, but AI companies monetize it. At the same time, AI-generated summaries or reimaginings of this scraped content reduce the need for users to visit the original sources, further diminishing traffic and revenue. 

Beyond financial loss, there is also the issue of intellectual property erosion. Proprietary research, expert insights, in-depth analysis, and art are among the most valuable digital assets a business can produce or license. If AI-generated content, trained on scraped data, can deliver a reimagined version of these insights at scale, businesses risk losing their competitive edge. 

AI scraping continues, even as licensing deals are struck 

Despite these high-profile licensing agreements, unauthorized content scraping remains widespread.  

An additional twist in the story is that some licensing deals are believed to be exclusive, or at least preferential, raising the question of whether content owners are responsible for enforcing this exclusivity by preventing scraper bots sent by rival firms from accessing the content. 

Whatever the case, AI firms continue to extract data from websites without permission, using bots to systematically harvest news articles, research papers, blogs, images and video at scale. 

How businesses can protect and profit from original content 

As content theft through scraping continues, businesses must take proactive measures to protect their property before it’s too late. The first step is recognizing that many AI companies will only pay for content when they are forced to. If access is unrestricted, unlimited scraping remains the default method. 

Companies should implement detection and prevention measures to identify scraper bots in real time. Unlike human visitors, scrapers follow identifiable patterns—they move faster, request large volumes of pages, and crawl paths systematically.  
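As a rough illustration of the patterns described above (speed, volume and systematic path coverage), the following is a minimal sliding-window sketch in Python. The class name, thresholds and client identifiers are all hypothetical, and this is not how any particular commercial product works; real detection combines many more signals.

```python
from collections import defaultdict, deque

# Illustrative thresholds only; sensible values depend on your own traffic.
WINDOW_SECONDS = 60
MAX_REQUESTS = 120        # roughly 2 requests/second sustained
MAX_DISTINCT_PATHS = 80   # broad coverage suggests systematic crawling


class ScraperHeuristic:
    """Flags clients whose recent request pattern looks like bulk crawling."""

    def __init__(self):
        # client id -> deque of (timestamp, path) within the window
        self._events = defaultdict(deque)

    def observe(self, client_id, path, timestamp):
        """Record one request; return True if the client now looks suspicious."""
        events = self._events[client_id]
        events.append((timestamp, path))
        # Drop requests that have aged out of the sliding window.
        while events and timestamp - events[0][0] > WINDOW_SECONDS:
            events.popleft()
        distinct_paths = len({p for _, p in events})
        return len(events) > MAX_REQUESTS or distinct_paths > MAX_DISTINCT_PATHS


detector = ScraperHeuristic()

# Simulated crawler: 150 different articles requested within 30 seconds.
crawler_flagged = any(
    detector.observe("203.0.113.7", f"/articles/{i}", i * 0.2)
    for i in range(150)
)

# Simulated reader: 5 pages over roughly 5 minutes.
reader_flagged = any(
    detector.observe("198.51.100.4", f"/articles/{i}", i * 60.0)
    for i in range(5)
)

print(crawler_flagged, reader_flagged)
```

Keying on a single client identifier such as an IP address is the weak point of this sketch: a bot that throttles itself or rotates addresses would stay under both thresholds, which is why simple rate rules are only a starting point.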

Sophisticated scraper bots looking to disguise their activity may rate limit their behavior and use rotating user agents, IP addresses or even residential proxies.  

Advanced bot detection solutions can help businesses differentiate legitimate visitor traffic and even licensed scraper bots from unauthorized content harvesting. 

Beyond detection, businesses can use sophisticated anti-bot solutions to enforce tighter access controls. CAPTCHA challenges can significantly slow down scrapers, and identified bots can be sent to alternative content such as a licensing information page or even ‘fake’ content paths.
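The response options mentioned here (allowing licensed bots, challenging suspected scrapers, pointing bots at a licensing page, or serving decoy content) can be sketched as a small decision function. Everything in this Python sketch is hypothetical: the bot name, the `choose_action` function and the ordering of the rules are assumptions for illustration, and it presumes some upstream step has already verified a bot's claimed identity.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    ALLOW = "allow"
    CHALLENGE = "challenge"            # e.g. present a CAPTCHA
    LICENSING_PAGE = "licensing_page"  # send the bot to licensing information
    DECOY = "decoy"                    # serve a 'fake' content path


# Hypothetical allowlist of crawlers covered by a licensing agreement.
LICENSED_BOTS = {"example-licensed-bot"}


def choose_action(
    verified_bot: Optional[str],
    suspected_scraper: bool,
    failed_challenge: bool = False,
) -> Action:
    """Pick a response for one request.

    verified_bot is the name of a crawler whose identity was confirmed
    upstream (an assumption of this sketch), or None for other traffic.
    """
    if verified_bot in LICENSED_BOTS:
        return Action.ALLOW
    if verified_bot is not None:
        # Known but unlicensed crawler: point it at the terms.
        return Action.LICENSING_PAGE
    if not suspected_scraper:
        return Action.ALLOW
    if failed_challenge:
        return Action.DECOY
    return Action.CHALLENGE


print(choose_action("example-licensed-bot", suspected_scraper=True))
print(choose_action(None, suspected_scraper=True))
```

Checking the licensed allowlist first means a paying partner is never challenged, while anonymous traffic only meets friction once it has been flagged as suspicious.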

For organizations that produce particularly high-value content, licensing should be a top priority. Instead of allowing AI firms to scrape their data unchecked, businesses can explore ways to structure formal agreements that ensure proper compensation. Just as large publishers have begun to monetize their archives, research firms, thought leadership platforms, and specialized content providers should position themselves as essential data sources rather than passive contributors. 

The future of AI and content ownership 

The increasing reliance on AI-generated content only reinforces the value of human creativity and expertise. As businesses grapple with the implications of AI scraping, the need to protect intellectual property becomes more urgent than ever. Those that take proactive steps – whether through bot mitigation, licensing negotiations, or stronger content protection measures – will be the ones that retain control over their digital assets and precious revenue streams in the long run. 

AI companies have already shown a willingness to pay for content, but they won’t do so unless they have no other choice. Businesses must ask themselves: are they securing their place in this new content economy, or are they unknowingly fueling AI models for free?

For companies that depend on digital content, research, and proprietary insights, bot mitigation isn’t just a cybersecurity concern—it’s a business imperative. If you’re not actively monitoring how your content is being used, there’s a high chance it’s already being scraped. 

Block Bots Effortlessly with Netacea

Book a demo and see how Netacea autonomously prevents sophisticated automated attacks.