
AI’s Content Gold Rush: Who’s Getting Paid, Who’s Getting Scraped, and How Businesses Can Turn Content into Revenue

Threat Research Team
02/04/25
5 Minute read

    The AI boom is creating a new content economy – one where savvy content owners are striking multi-million-dollar licensing deals, while others are being automatically scraped by bots to train AI models for free.  

    It’s impossible not to have noticed the biggest names in AI, including OpenAI, Google, Anthropic, Perplexity and more, at the center of an argument about ethical content scraping. 

    As AI companies race to train their large language models (LLMs) on original human-generated creations, content-rich organizations – publishers, research firms, media houses, stock imagery providers and content archives – find themselves dragged into an unregulated gold rush. 

    Some archives and publishers have turned this demand into an opportunity, negotiating lucrative contracts with AI firms that recognize the value of high-quality, human-generated content. Others, however, are being left behind, their work automatically scraped and fed into AI models without compensation or acknowledgment. This source material is then used to generate derivative works that are commercialized by the AI firms, stealing potential revenue from original content owners.  

    But whether or not you can currently monetize your content, it’s imperative that you protect it before it becomes fair game for well-behaved and unethical scraper bots alike.

    So, how can businesses protect their content from AI-driven extraction before it’s too late? 

    The divide between those who profit and those who lose 

    AI firms have made it clear that they need high-quality original content to train their LLMs in both broad and niche subjects. This could be a historical archive, world news coverage, forums full of user-generated questions and answers, images and videos in specific styles, or a library of specialist journals. It’s also becoming clear that training AI on AI-generated content can degrade models to the point of collapse, which makes original human-made material more valuable.

    The controversy is that once LLMs have been trained on these datasets, the products they power can generate derivative works that the AI company can then sell.

    Recognizing this, major publishers have leveraged their intellectual property to secure substantial, and in some cases exclusive, licensing agreements. Companies like Condé Nast, home to Vogue and The New Yorker, have signed multi-year deals with OpenAI, while News Corp has secured an agreement reportedly worth over $250 million over five years for OpenAI to access its archive.

    Major licensing deals between content owners and AI companies

    | Content Owner | AI Company | Deal Date | Estimated Value |
    |---|---|---|---|
    | Associated Press (AP) | OpenAI | Jul 2023 | Not disclosed |
    | Shutterstock | OpenAI, Meta | Jul 2023 | Not disclosed |
    | Axel Springer | OpenAI | Dec 2023 | “Tens of millions of euros” (aiwatch.dog) |
    | Reddit | Google | Feb 2024 | Approximately $60 million/year (aiwatch.dog) |
    | Automattic | OpenAI, Midjourney | Feb 2024 | Not disclosed |
    | Wiley | Unknown | Mar 2024 | $23 million (aiwatch.dog) |
    | Le Monde, Prisa Media | OpenAI | Mar 2024 | Not disclosed |
    | Financial Times | OpenAI | Apr 2024 | Between $5 million and $10 million/year (aiobserver.co) |
    | Stack Overflow | OpenAI | May 2024 | Not disclosed |
    | Dotdash Meredith | OpenAI | May 2024 | At least $16 million (aiobserver.co) |
    | Taylor & Francis | Microsoft | May 2024 | Almost £8 million ($10 million) in the first year (aiwatch.dog) |
    | Reddit | OpenAI | May 2024 | Not disclosed |
    | News Corp | OpenAI | May 2024 | Over $250 million over five years (aiwatch.dog) |
    | Vox Media | OpenAI | May 2024 | Not disclosed |
    | The Atlantic | OpenAI | May 2024 | Not disclosed |
    | Time | OpenAI | Jun 2024 | Not disclosed |
    | Financial Times, Axel Springer, The Atlantic, Fortune, Universal Music Group | ProRata.ai | Aug 2024 | Not disclosed |
    | Condé Nast | OpenAI | Aug 2024 | Not disclosed |
    | Wiley | Unknown | Aug 2024 | $44 million in licensing deals (aiwatch.dog) |
    | Informa plc | OpenAI | Aug 2024 | $10 million |
    | Reuters | Meta | Oct 2024 | Not disclosed |
    | HarperCollins | Microsoft | Nov 2024 | Not disclosed |
    | Associated Press (AP) | Google | Jan 2025 | Not disclosed |
    | Getty Images, Shutterstock | Getty Images, Shutterstock | Jan 2025 | $3.7 billion merger |

    Protect your content, protect your revenue

    These agreements signal an important shift in the way digital content is valued and raise a question about how it can be protected. Not every company has the leverage of an established media giant to fight unwanted scrapers in court and in the press, or to force licensing agreements. Many businesses, from niche research firms to independent publishers, are being left out of such negotiations and their content is freely scraped and repurposed without consent.  

    We recently wrote about the ineffectiveness of robots.txt as a way of preventing scraper bots from crawling your web estate. Many scraper bots also exhibit unethical and downright malicious behaviors, lying about their purpose, changing their user agent, and using residential IP addresses as proxies, all while increasing traffic load on your site with repeated and high-density crawling.  
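
To illustrate why robots.txt is advisory rather than enforced, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the URL and the user agent strings are hypothetical, chosen only for illustration. A crawler that identifies itself honestly is told to stay away, while the same crawler presenting a browser-like user agent falls under the permissive default rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking one AI crawler to stay away (illustrative only).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical article URL on a placeholder domain.
ARTICLE_URL = "https://example.com/research/report"


def is_allowed(user_agent: str, url: str) -> bool:
    """Return what robots.txt says for this user agent; nothing enforces the answer."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that announces itself is disallowed by the first rule group.
honest = is_allowed("GPTBot", ARTICLE_URL)

# The same crawler posing as a browser only matches the wildcard group.
disguised = is_allowed("Mozilla/5.0 (Windows NT 10.0; Win64; x64)", ARTICLE_URL)

print(honest, disguised)
```

The check runs on the crawler's side, so a bot that skips it, or that lies about its user agent, faces no technical barrier from robots.txt alone.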

    For businesses hit by these crawlers, the implications are serious. When AI scrapers ingest vast amounts of human-generated content without permission, they effectively bypass paywalls, subscription models, and advertising-based revenue streams. They add cost to your infrastructure bill and steal content to be repurposed in their own products.

    The result is a phenomenon where businesses and creators invest in making valuable content, but AI companies monetize it. At the same time, AI-generated summaries or reimaginings of this scraped content reduce the need for users to visit the original sources, further diminishing traffic and revenue. 

    Beyond financial loss, there is also the issue of intellectual property erosion. Proprietary research, expert insights, in-depth analysis, and art are among the most valuable digital assets a business can produce or license. If AI-generated content, trained on scraped data, can deliver a reimagined version of these insights at scale, businesses risk losing their competitive edge. 

    AI scraping continues, even as licensing deals are struck 

    Despite these high-profile licensing agreements, unauthorized content scraping remains widespread.  

    An additional twist in the story is that some licensing deals are believed to be exclusive, or at least preferential, raising the question of whether content owners are responsible for enforcing this exclusivity by preventing scraper bots sent by rival firms from accessing the content. 

    Whatever the case, AI firms continue to extract data from websites without permission, using bots to systematically harvest news articles, research papers, blogs, images and video at scale. 

    How businesses can protect and profit from original content 

    As content theft through scraping continues, businesses must take proactive measures to protect their property before it’s too late. The first step is recognizing that many AI companies will only pay for content when they are forced to. If access is unrestricted, unlimited scraping remains the default method. 

    Companies should implement detection and prevention measures to identify scraper bots in real time. Unlike human visitors, scrapers follow identifiable patterns—they move faster, request large volumes of pages, and crawl paths systematically.  
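
As a rough sketch of how those patterns might be spotted, the Python below keeps a sliding window of recent requests per client and flags clients that exceed a request-volume or distinct-page threshold. The class name, thresholds and client identifiers are assumptions made for illustration, not values from any particular product, and real detection would combine many more signals.

```python
from collections import defaultdict, deque

# Assumed thresholds for illustration; real values depend on the site's traffic.
WINDOW_SECONDS = 60
MAX_REQUESTS = 120
MAX_DISTINCT_PATHS = 80


class ScraperHeuristic:
    """Flags clients whose recent request volume or page breadth looks automated."""

    def __init__(self) -> None:
        # client id -> deque of (timestamp, path) inside the current window
        self.hits = defaultdict(deque)

    def observe(self, client_id: str, path: str, timestamp: float) -> bool:
        """Record one request and return True if the client now looks like a scraper."""
        window = self.hits[client_id]
        window.append((timestamp, path))
        # Drop requests that have aged out of the sliding window.
        while window and timestamp - window[0][0] > WINDOW_SECONDS:
            window.popleft()
        distinct_paths = len({p for _, p in window})
        return len(window) > MAX_REQUESTS or distinct_paths > MAX_DISTINCT_PATHS


detector = ScraperHeuristic()

# Simulated bot: 200 different articles requested 0.1 seconds apart.
bot_flagged = any(
    [detector.observe("bot", f"/articles/{i}", i * 0.1) for i in range(200)]
)

# Simulated reader: five pages, ten seconds apart.
human_flagged = any(
    [detector.observe("reader", f"/articles/{i}", i * 10.0) for i in range(5)]
)

print(bot_flagged, human_flagged)
```

A heuristic this simple is easy to evade by slowing down or spreading requests across many addresses, which is why it should be treated as one input among several.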

    Sophisticated scraper bots looking to disguise their activity may rate limit their behavior and use rotating user agents, IP addresses or even residential proxies.  

    Advanced bot detection solutions can help businesses differentiate legitimate visitor traffic and even licensed scraper bots from unauthorized content harvesting. 

    Beyond detection, businesses can use anti-bot solutions to enforce tighter access controls, such as CAPTCHA challenges that significantly slow scrapers down, or to send bots to alternative content such as a licensing information page or even ‘fake’ content paths.
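
One way to picture these responses is as a small policy function that maps a detection verdict to an action. The sketch below is hypothetical: the `Verdict` categories, the `Response` type and the `/licensing` path are invented for illustration and do not describe any specific vendor's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    """Hypothetical outcomes of an upstream bot-detection step."""
    HUMAN = "human"
    LICENSED_BOT = "licensed_bot"
    SUSPECTED_BOT = "suspected_bot"
    UNLICENSED_BOT = "unlicensed_bot"


@dataclass(frozen=True)
class Response:
    action: str                      # "serve", "challenge" or "redirect"
    location: Optional[str] = None   # only set for redirects


# Hypothetical page explaining how to license the content.
LICENSING_PAGE = "/licensing"


def choose_response(verdict: Verdict) -> Response:
    """Pick a response for a request based on the detection verdict."""
    if verdict in (Verdict.HUMAN, Verdict.LICENSED_BOT):
        return Response("serve")        # legitimate visitors and licensed crawlers
    if verdict is Verdict.SUSPECTED_BOT:
        return Response("challenge")    # e.g. a CAPTCHA to slow automation down
    return Response("redirect", LICENSING_PAGE)  # confirmed unlicensed scraper


for verdict in Verdict:
    print(verdict.value, choose_response(verdict))
```

Keeping the policy separate from detection means the same verdicts can later drive different actions, for example serving licensed crawlers from a dedicated feed.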

    For organizations that produce particularly high-value content, licensing should be a top priority. Instead of allowing AI firms to scrape their data unchecked, businesses can explore ways to structure formal agreements that ensure proper compensation. Just as large publishers have begun to monetize their archives, research firms, thought leadership platforms, and specialized content providers should position themselves as essential data sources rather than passive contributors. 

    The future of AI and content ownership 

    The increasing reliance on AI-generated content only reinforces the value of human creativity and expertise. As businesses grapple with the implications of AI scraping, the need to protect intellectual property becomes more urgent than ever. Those that take proactive steps – whether through bot mitigation, licensing negotiations, or stronger content protection measures – will be the ones that retain control over their digital assets and precious revenue streams in the long run. 

    AI companies have already shown a willingness to pay for content, but they won’t do so unless they have no other choice. Businesses must ask themselves: are they securing their place in this new content economy, or are they unknowingly fuelling AI models for free? 

    For companies that depend on digital content, research, and proprietary insights, bot mitigation isn’t just a cybersecurity concern—it’s a business imperative. If you’re not actively monitoring how your content is being used, there’s a high chance it’s already being scraped. 

    Block Bots Effortlessly with Netacea

    Book a demo and see how Netacea autonomously prevents sophisticated automated attacks.