AI’s Content Gold Rush: Who’s Getting Paid, Who’s Getting Scraped, and How Businesses Can Turn Content into Revenue

Threat Research Team

02/04/25

5 Minute read

The AI boom is creating a new content economy – one where savvy content owners are striking multi-million-dollar licensing deals, while others are being automatically scraped by bots to train AI models for free.

The biggest names in AI, including OpenAI, Google, Anthropic and Perplexity, are at the center of an argument about ethical content scraping.

As AI companies race to train their large language models (LLMs) on original human-generated creations, content-rich organizations – publishers, research firms, media houses, stock imagery providers and content archives – find themselves dragged into an unregulated gold rush.

Some archives and publishers have turned this demand into an opportunity, negotiating lucrative contracts with AI firms that recognize the value of high-quality, human-generated content. Others, however, are being left behind, their work automatically scraped and fed into AI models without compensation or acknowledgment. This source material is then used to generate derivative works that are commercialized by the AI firms, stealing potential revenue from original content owners.

But whether or not you can currently monetize your content, it’s imperative that you protect it before it becomes fair game for both well-behaved and unethical scraper bots.

So, how can businesses protect their content from AI-driven extraction before it’s too late?

The divide between those who profit and those who lose

AI firms have made it clear that they need high-quality original content to train their LLMs in both broad and niche subjects. This could be a historical archive or world news, forums full of user-generated questions and answers, images and videos in specific styles, or a library of specialist journals. It’s also becoming clear that training AI on AI-generated content risks degrading the models to the point of collapse, which makes original material more valuable.

The scandal is that once LLMs are trained on this material, the models can create derivative works that the AI company can then sell.

Recognizing this, major publishers have leveraged their intellectual property to secure substantial, and in some cases exclusive, licensing agreements. Companies like Condé Nast, home to Vogue and The New Yorker, have signed multi-year deals with OpenAI, while News Corp has secured an agreement reportedly worth over $250 million over five years for OpenAI to access its archive.

Major licensing deals between content owners and AI companies

Content Owner	AI Company	Deal Date	Estimated Value
Associated Press (AP)	OpenAI	Jul 2023	Not disclosed
Shutterstock	OpenAI, Meta	Jul 2023	Not disclosed
Axel Springer	OpenAI	Dec 2023	“Tens of millions of euros” (aiwatch.dog)
Reddit	Google	Feb 2024	Approximately $60 million/year (aiwatch.dog)
Automattic	OpenAI, Midjourney	Feb 2024	Not disclosed
Wiley	Unknown	Mar 2024	$23 million (aiwatch.dog)
Le Monde, Prisa Media	OpenAI	Mar 2024	Not disclosed
Financial Times	OpenAI	Apr 2024	Between $5 million and $10 million/year (aiobserver.co)
Stack Overflow	OpenAI	May 2024	Not disclosed
Dotdash Meredith	OpenAI	May 2024	At least $16 million (aiobserver.co)
Taylor & Francis	Microsoft	May 2024	Almost £8 million ($10 million) in the first year (aiwatch.dog)
Reddit	OpenAI	May 2024	Not disclosed
News Corp	OpenAI	May 2024	Over $250 million over five years (aiwatch.dog)
Vox Media	OpenAI	May 2024	Not disclosed
The Atlantic	OpenAI	May 2024	Not disclosed
Time	OpenAI	Jun 2024	Not disclosed
Financial Times, Axel Springer, The Atlantic, Fortune, Universal Music Group	ProRata.ai	Aug 2024	Not disclosed
Condé Nast	OpenAI	Aug 2024	Not disclosed
Wiley	Unknown	Aug 2024	$44 million in licensing deals (aiwatch.dog)
Informa plc	OpenAI	Aug 2024	$10 million
Reuters	Meta	Oct 2024	Not disclosed
HarperCollins	Microsoft	Nov 2024	Not disclosed
Associated Press (AP)	Google	Jan 2025	Not disclosed
Getty Images, Shutterstock	Getty Images, Shutterstock	Jan 2025	$3.7 billion merger

Protect your content, protect your revenue

These agreements signal an important shift in the way digital content is valued and raise a question about how it can be protected. Not every company has the leverage of an established media giant to fight unwanted scrapers in court and in the press, or to force licensing agreements. Many businesses, from niche research firms to independent publishers, are being left out of such negotiations and their content is freely scraped and repurposed without consent.

We recently wrote about the ineffectiveness of robots.txt as a way of preventing scraper bots from crawling your web estate. Many scraper bots also exhibit unethical and downright malicious behaviors, lying about their purpose, changing their user agent, and using residential IP addresses as proxies, all while increasing traffic load on your site with repeated and high-density crawling.
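To illustrate why robots.txt is only advisory, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt rules and the example.com URL are assumptions made up for illustration, and GPTBot is used simply as a sample crawler token. The point is that the check runs on the crawler's side and is keyed on whatever user agent the crawler chooses to declare.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not taken from any real site).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://example.com/articles/report"

# A crawler that honestly declares itself is asked to stay away...
honest_allowed = parser.can_fetch("GPTBot", URL)

# ...but the same crawler presenting a browser-like user agent is not.
disguised_allowed = parser.can_fetch("Mozilla/5.0", URL)

print(honest_allowed, disguised_allowed)
```

Nothing in this exchange is enforced by the server: a bot that never performs the check, or that changes its declared user agent, fetches the page regardless.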

For businesses hit by these crawlers, the implications are serious. When AI scrapers ingest vast amounts of human-generated content without permission, they effectively bypass paywalls, subscription models, and advertising-based revenue streams. They add cost to your infrastructure bill and steal content to be repurposed in their own products.

The result is a phenomenon where businesses and creators invest in making valuable content, but AI companies monetize it. At the same time, AI-generated summaries or reimaginings of this scraped content reduce the need for users to visit the original sources, further diminishing traffic and revenue.

Beyond financial loss, there is also the issue of intellectual property erosion. Proprietary research, expert insights, in-depth analysis, and art are among the most valuable digital assets a business can produce or license. If AI-generated content, trained on scraped data, can deliver a reimagined version of these insights at scale, businesses risk losing their competitive edge.

AI scraping continues, even as licensing deals are struck

Despite these high-profile licensing agreements, unauthorized content scraping remains widespread.

An additional twist in the story is that some licensing deals are believed to be exclusive, or at least preferential, raising the question of whether content owners are responsible for enforcing this exclusivity by preventing scraper bots sent by rival firms from accessing the content.

Whatever the case, AI firms continue to extract data from websites without permission, using bots to systematically harvest news articles, research papers, blogs, images and video at scale.

How businesses can protect and profit from original content

As content theft through scraping continues, businesses must take proactive measures to protect their property before it’s too late. The first step is recognizing that many AI companies will only pay for content when they are forced to. If access is unrestricted, unlimited scraping remains the default method.

Companies should implement detection and prevention measures to identify scraper bots in real time. Unlike human visitors, scrapers follow identifiable patterns—they move faster, request large volumes of pages, and crawl paths systematically.
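One of these patterns, request speed, can be sketched as a sliding-window counter. This is a simplified illustration rather than a description of any particular product: the class name `RateFlagger`, the thresholds, and the idea of keying on a single client identifier are all assumptions, and a real system would combine many more signals.

```python
from collections import defaultdict, deque


class RateFlagger:
    """Flag clients that exceed a request budget within a sliding time window."""

    def __init__(self, max_requests=30, window_seconds=10.0):
        # Both thresholds are illustrative assumptions, not recommended values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> recent request timestamps

    def record(self, client_id, timestamp):
        """Record one request; return True if the client is now over budget."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


flagger = RateFlagger(max_requests=5, window_seconds=1.0)

# A fast client: ten requests, 50 ms apart.
fast = [flagger.record("client-a", i * 0.05) for i in range(10)]

# A slow client: ten requests, two seconds apart.
slow = [flagger.record("client-b", i * 2.0) for i in range(10)]
```

With these settings the fast client is flagged from its sixth request onward, while the slow client is never flagged. A scraper that deliberately paces itself would slip under such a threshold, which is why rate alone is not sufficient.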

Sophisticated scraper bots looking to disguise their activity may throttle their own request rate, rotate user agents and IP addresses, or even route traffic through residential proxies.

Advanced bot detection solutions can help businesses differentiate legitimate visitor traffic and even licensed scraper bots from unauthorized content harvesting.
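A very reduced version of the licensed-versus-unauthorized distinction is to check a crawler's declared identity against the network it actually connects from. The sketch below assumes a hypothetical licensee, `ExampleLicensedBot`, that has published the address range it crawls from; the token, the range (a reserved documentation range), and the label names are all invented for illustration.

```python
import ipaddress

# Hypothetical allowlist: user-agent token -> networks the licensee crawls from.
LICENSED_CRAWLERS = {
    "examplelicensedbot": [ipaddress.ip_network("192.0.2.0/24")],
}


def classify(user_agent, remote_ip):
    """Return 'licensed_bot', 'spoofed_bot', or 'unknown' for one request."""
    ua = user_agent.lower()
    ip = ipaddress.ip_address(remote_ip)
    for token, networks in LICENSED_CRAWLERS.items():
        if token in ua:
            # The claimed identity only counts if the source address matches.
            if any(ip in network for network in networks):
                return "licensed_bot"
            return "spoofed_bot"
    # No licensed token claimed: further behavioral checks would be needed.
    return "unknown"


print(classify("ExampleLicensedBot/1.0", "192.0.2.10"))
print(classify("ExampleLicensedBot/1.0", "203.0.113.5"))
print(classify("Mozilla/5.0", "203.0.113.5"))
```

This only catches bots that borrow a licensed crawler's name. Bots that pose as ordinary browsers fall into the "unknown" bucket and have to be identified by their behavior.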

Beyond detection, businesses can use anti-bot solutions to enforce tighter access controls. CAPTCHA challenges can significantly slow scrapers down, and identified bots can be sent to alternative content, such as a licensing information page or even ‘fake’ content paths.
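The response options described here can be expressed as a small policy function. This is a sketch under stated assumptions: the classification labels (`human`, `licensed_bot`, `unlicensed_bot`, `spoofed_bot`) are hypothetical outputs of some upstream detection step, and the `Action` names are invented for illustration.

```python
from enum import Enum


class Action(Enum):
    SERVE = "serve"                            # return the requested content
    CHALLENGE = "challenge"                    # e.g. present a CAPTCHA
    REDIRECT_TO_LICENSING = "redirect_to_licensing"


def choose_action(classification):
    """Map a (hypothetical) traffic classification to a response."""
    if classification in ("human", "licensed_bot"):
        return Action.SERVE
    if classification in ("unlicensed_bot", "spoofed_bot"):
        # Known scrapers are pointed at licensing terms instead of content.
        return Action.REDIRECT_TO_LICENSING
    # Anything uncertain is slowed down with a challenge rather than blocked.
    return Action.CHALLENGE


for label in ("human", "licensed_bot", "spoofed_bot", "suspicious"):
    print(label, "->", choose_action(label).value)
```

Keeping the policy separate from detection means the business decision (challenge, redirect, or serve) can change without touching how traffic is classified.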

For organizations that produce particularly high-value content, licensing should be a top priority. Instead of allowing AI firms to scrape their data unchecked, businesses can explore ways to structure formal agreements that ensure proper compensation. Just as large publishers have begun to monetize their archives, research firms, thought leadership platforms, and specialized content providers should position themselves as essential data sources rather than passive contributors.

The future of AI and content ownership

The increasing reliance on AI-generated content only reinforces the value of human creativity and expertise. As businesses grapple with the implications of AI scraping, the need to protect intellectual property becomes more urgent than ever. Those that take proactive steps – whether through bot mitigation, licensing negotiations, or stronger content protection measures – will be the ones that retain control over their digital assets and precious revenue streams in the long run.

AI companies have already shown a willingness to pay for content, but they won’t do so unless they have no other choice. Businesses must ask themselves: are they securing their place in this new content economy, or are they unknowingly fuelling AI models for free?

For companies that depend on digital content, research, and proprietary insights, bot mitigation isn’t just a cybersecurity concern—it’s a business imperative. If you’re not actively monitoring how your content is being used, there’s a high chance it’s already being scraped.

Block Bots Effortlessly with Netacea

Book a demo and see how Netacea autonomously prevents sophisticated automated attacks.