Combating Content Theft: Maximize Revenue by Securing Your Content
Content scraping is on the rise. While it can benefit your business in some cases, it can also lead to lost revenue, degraded website performance, and content theft.
Web scraping is a hot topic in tech news. The trend is linked to the rise of AI tools, specifically LLMs (large language models), which rely on vast amounts of content to generate their outputs. Their developers scrape content from across the web to train these models.
This is a controversial subject with moral, technical, and legal implications. Bots scrape websites for various reasons, not all negative or unwanted.
How Is Content Theft Affecting Your Business?
Before deciding how to react, you must understand which bots are scraping your site and why. Bots can harm your business by degrading customer experience or eroding revenue. However, their presence might also signal bigger revenue opportunities.
In this post, we’ll explore the dual nature of web scraping, the rise of AI scraping tools, and strategies to protect your content and revenue.
What Is Web Scraping?
According to the BLADE Framework, web scraping “is the use of bots to gather content or data from websites. Website administrators welcome some scraper bots that are beneficial. However, some scraper bots, such as content and pricing scrapers, can have malicious intentions and harm businesses and customers.”
As with all bots, scrapers are only as good or bad as the intent of the humans who control them. A blanket ban on all scraper bots from your site could do more harm than good.
What Are Some Positive Uses for Web Scraping?
- SEO: Search engines scrape content to index your site, helping you rank in search results and attract organic traffic.
- Content syndication and aggregation: Aggregators use scraping to republish your content with links back to your site, spreading it to new audiences.
- Performance and security scanning: Third parties can help ensure your site is stable and operating as it should via scraping.
- Monetization: Many website owners have struck lucrative deals with scraper bot operators, such as OpenAI or Google, to allow scraping of their content.
What Are Some Harmful Uses for Web Scraping?
- Price undercutting: Competitors can scrape and undercut your prices to steal away price-sensitive customers.
- Performance issues and outages: Scraping can overload web servers, especially during busy times or when multiple bots hit your site at once, leading to increased costs or downtime.
- Gateway to other attacks: Scraping is the first stage of many attacks, including website cloning for fraud, product scalping, arbitrage betting, and more.
- Content theft: Content is extremely valuable as a revenue stream, drawing visitors to websites where they view advertising or pay subscription fees. When that content is siphoned off to unintended audiences or purposes, the businesses that own the rights to it lose out.
When Does Web Scraping Become Content Theft?
Web scraping becomes content theft when it involves unauthorized extraction and use of content that the creator has invested time, effort, and resources into producing.
If the content owner expects to control its use, scraping without permission can be considered theft. This applies to content protected by commercial terms, intellectual property rights, or specific access conditions, especially when it is replicated, exploited for commercial benefit, or used for AI training. The key factor is the content owner’s intent and whether the scraping undermines their ability to control or profit from their content.
In contrast, if the content is openly available for public use and its owner permits broad access without restrictions, scraping might not constitute theft. Ultimately, content theft occurs when scraping violates the owner’s terms, bypasses barriers meant to limit access, or repurposes content for unauthorized uses.
Legal Implications of Web Scraping and Content Theft
Legal challenges over content scraping have had mixed results in the courts. In LinkedIn vs. HiQ, in which HiQ scraped LinkedIn profile data, a six-year court battle ended in LinkedIn’s favor. However, this was largely because HiQ had created fake LinkedIn accounts, breaching the site’s terms of service, to scrape member-only information.
In the case between Meta and Bright Data, the latter came out on top because it had scraped only publicly available, ungated information from Facebook.
Legal action demands significant time and resources, making it impractical as an immediate remedy. Content owners must therefore proactively enforce their rights by defining acceptable use and implementing technological defenses.
The Rise of Gen AI Is Fueling an Increase in Content Theft
While OpenAI’s ChatGPT is the most established in mainstream society, many more generative AI tools are available. All of these rely on huge datasets to generate content, whether text, images, or raw data.
To keep their training data, and thus their outputs, current and accurate, generative AI tools must constantly scrape the internet for fresh content.
This creates several issues. Firstly, AI-generated content is inundating the web, rapidly overtaking human-made content. AI thought leader Nina Schick predicts that by 2025, AI will have generated 90% of all web content. If models are then retrained on that synthetic output, AI risks cannibalizing itself, degrading the quality of future outputs.
Secondly, the insatiable hunger of Gen AI tools for fresh content has substantially ramped up scraping activity across the web. Some businesses estimate that serving traffic to AI-controlled bots costs them as much as $1,000 a day.
Why It’s Important to Protect Your Content
Now more than ever, human-created content is at a premium. It takes time, money, and resources to create, and Gen AI tools need it to deliver quality outputs. Meanwhile, other parties use automated content theft to repurpose and monetize others’ content without permission, diminishing the value of the original work and harming the businesses and individuals who hold the intellectual property rights.
But content owners can also leverage this demand to create new revenue streams. Recognizing this, OpenAI has struck content licensing deals with publishers like the Associated Press, Reddit, Time, and News Corp. These deals give OpenAI permission to scrape and use their content to augment ChatGPT.
Is Robots.txt Enough to Stop Web Scraper Bots?
Webmasters have traditionally relied on robots.txt to tell bots which areas of their sites they may crawl. While legitimate players like OpenAI and Meta will respect robots.txt, less reputable operators are under no obligation to obey it. Think of it like a “keep off the grass” sign: unlikely to stop those determined to walk on the grass.
OpenAI is the biggest player currently, but there are over 2,000 generative AI tools on the market, and the total is growing constantly. Even if every one of them obeyed robots.txt, keeping the file up to date would be impractical.
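For bots that do honor the protocol, opting out is straightforward. The snippet below is an illustrative robots.txt using the publicly documented user agent tokens of a few well-known AI crawlers; compliance remains entirely voluntary on the crawler’s side.

```
# Opt out of OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Opt out of Google's AI training uses (normal search indexing
# via Googlebot is unaffected by this token)
User-agent: Google-Extended
Disallow: /

# Opt out of Common Crawl, whose dataset is widely used for LLM training
User-agent: CCBot
Disallow: /
```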
Sophistication of Scraping Tools Is Increasing
Netacea’s Threat Intel Center continually researches new developments in the bot ecosystem across thousands of communities and reverse-engineers bots to see how they work.
Scraper bots are now an easily accessible commodity. People can buy them cheaply and use them effectively without specialist skills or knowledge, or even rent them “as a service.” Many come pre-configured to scrape specific sites and bypass known bot defenses, including popular CDNs and many client-based bot management tools.
How to Identify and Monetize Demand for Your Content
Businesses need to take a more granular and detailed approach to identifying the bots hitting their sites. While legitimate bots from OpenAI or Google identify themselves with unique user agents and published IP addresses, many others spoof their user agents and rotate IPs to obscure their identity and intent.
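To illustrate why spoofed identities can still be caught, here is a minimal Python sketch of the reverse-then-forward DNS check that Google documents for verifying Googlebot. This is a simplified, standalone example rather than a description of Netacea’s detection pipeline, and the sample IP is just one published Googlebot address.

```python
import socket

# Hostname suffixes Google documents for its crawlers. Assumption: a
# genuine Googlebot request should reverse-resolve to one of these.
GOOGLE_CRAWLER_DOMAINS = (".googlebot.com", ".google.com")

def is_verified_googlebot(ip_address: str) -> bool:
    """Check a claimed Googlebot with a reverse-then-forward DNS lookup.

    A spoofed User-Agent header is trivial to set; the reverse DNS
    record of the source IP is much harder to fake.
    """
    try:
        # Step 1: reverse DNS - does the IP resolve to a Google hostname?
        hostname, _, _ = socket.gethostbyaddr(ip_address)
        if not hostname.endswith(GOOGLE_CRAWLER_DOMAINS):
            return False
        # Step 2: forward DNS - does that hostname resolve back to the IP?
        _, _, forward_ips = socket.gethostbyname_ex(hostname)
        return ip_address in forward_ips
    except OSError:  # covers failed reverse or forward lookups
        return False

if __name__ == "__main__":
    # 66.249.66.1 sits in a published Googlebot range. A request whose
    # user agent claims "Googlebot" but whose IP fails this check is
    # likely a scraper hiding behind a spoofed identity.
    print(is_verified_googlebot("66.249.66.1"))
```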
At Netacea, we use server-side data collection, invisibly correlating thousands of data points on each incoming request to discern its true intent. We cross-reference user agents and IP addresses against external data and internal intelligence feeds, and we analyze behavior patterns over time to categorize visitors accordingly.
With this data, our customers can approach the owners of the bots we’ve identified and offer commercial terms for continued scraping, potentially opening a new revenue stream. Alternatively, they can block these bots or limit their access to protect content and preserve infrastructure capacity during peak times.
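As a rough sketch of how such decisions might be enforced once traffic has been classified, the Python below applies a per-category policy of allow, block, or rate limit. The category names, limits, and helper function are illustrative assumptions, not Netacea’s API.

```python
import time
from collections import defaultdict, deque

# Hypothetical policy table: each bot category maps to an action.
# Categories and limits here are illustrative only.
POLICIES = {
    "search_engine":   {"action": "allow"},
    "licensed_ai":     {"action": "allow"},  # scraping under agreed commercial terms
    "unknown_scraper": {"action": "rate_limit", "max_per_minute": 30},
    "hostile_scraper": {"action": "block"},
}

_recent_requests = defaultdict(deque)  # client id -> timestamps in the last minute

def decide(client_id: str, category: str) -> str:
    """Return 'allow' or 'block' for one request under its category's policy."""
    policy = POLICIES.get(category, POLICIES["unknown_scraper"])
    if policy["action"] == "allow":
        return "allow"
    if policy["action"] == "block":
        return "block"
    # Sliding one-minute window rate limit for everything else.
    now = time.monotonic()
    window = _recent_requests[client_id]
    while window and now - window[0] > 60:
        window.popleft()
    if len(window) >= policy["max_per_minute"]:
        return "block"
    window.append(now)
    return "allow"
```

A sliding window keeps the throttle simple for illustration; a production system would typically persist this state in a shared store rather than process memory.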
Explore Content Theft Issues in Our Webinar
We recently hosted a webinar focusing on web scraping and content theft. You can watch it on demand now to gain valuable insights from three perspectives:
- Lizzy Eccles (Principal Consultant), who works closely with clients to help understand bot traffic affecting their operations.
- Cyril Noel-Tagoe (Principal Security Researcher), who investigates attacker communities to stay ahead of emerging bot threats.
- Kaylea Haynes (Lead Data Scientist), who develops tools and algorithms used to classify and control bot traffic across our customers’ websites, apps, and APIs.