Published: 21/05/2021

Web Crawler

What is a web crawler? A web crawler, also known as a spider or spiderbot, is an Internet bot that systematically browses the World Wide Web.

What is site crawling? Site crawling is the use of web crawlers to gather information from the internet. For example, a search engine may index content found on websites and provide that content in response to queries submitted by users.

What are the main use cases of web crawlers?

Indexing

The main goal of web crawling is to provide an up-to-date directory and database of all available sites on the Web. For example, Google’s crawlers visit billions of pages per day, examining each page for links to pages that have not yet been discovered in order to keep its list of available websites up to date. The frequency with which Google visits a particular site can vary from minutes for large sites to years for very small ones.

Data Mining

Web crawling can also be used in data mining, which is the process of extracting information from large volumes of data, usually in order to identify useful patterns, knowledge or insights about hidden relationships within a dataset.

Site Health

Web crawlers also help webmasters find broken hyperlinks on their pages and fix them, or identify when the content on one page no longer matches that of a linked page on an external website.
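As a rough illustration of the broken-link use case, the Python sketch below requests each hyperlink and reports any that return an error status. The list of links, the example.com URLs and the timeout value are assumptions made for the example; in practice the links would come from crawling your own pages.

```python
# Sketch of a broken-link check: request each hyperlink and report any
# that fail or return an error status (e.g. 404).
import urllib.request
import urllib.error

# Placeholder links for illustration only.
links = [
    "https://example.com/",
    "https://example.com/this-page-does-not-exist",
]

for url in links:
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code            # e.g. 404 for a broken link
    except urllib.error.URLError:
        status = None                # DNS failure, timeout, etc.

    if status is None or status >= 400:
        print(f"Broken link: {url} (status={status})")
```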

Security

Web crawlers may also be employed for security reasons, for example, to verify links which appear online but do not actually lead anywhere, or links that appear manipulated with the intent to mislead users about their destination and purpose.
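As a simple illustration of this kind of check, the sketch below flags links whose visible text looks like a URL but whose href points at a different host, a common trick for misleading users about a link's destination. The class name and the sample HTML are assumptions made for the example.

```python
# Sketch: flag links whose visible text is a URL that does not match the
# actual link target (a common phishing pattern).
from html.parser import HTMLParser
from urllib.parse import urlparse

class MisleadingLinkChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_href = None
        self.findings = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")

    def handle_data(self, data):
        text = data.strip()
        if self.current_href and text.startswith("http"):
            shown = urlparse(text).netloc          # host the user sees
            actual = urlparse(self.current_href).netloc  # host the link goes to
            if shown and actual and shown != actual:
                self.findings.append((text, self.current_href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None

# Illustrative HTML snippet (an assumption, not real data).
checker = MisleadingLinkChecker()
checker.feed('<a href="https://evil.example.net/login">https://yourbank.example.com</a>')
print(checker.findings)
```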

How does a web crawler work?

A web crawler starts with a list of URLs to visit, known as the seed list (sometimes called the spider’s start page). The crawler visits each URL in turn, looks at what it finds there, and does one or more of the following activities (a minimal sketch in Python follows the list below):

  • Copies links from that page into its list of URLs still to visit (the crawl frontier)
  • Follows those links recursively until all reachable pages have been visited
  • Adds any new pages found along the way that aren’t already in its database
  • Extracts information from those pages, typically text and hyperlinks, according to the rules the crawler has been configured with
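To make the steps above concrete, here is a minimal breadth-first crawler sketch in Python. The seed URL, page limit and politeness delay are illustrative assumptions; a production crawler would also honour robots.txt, normalise and deduplicate URLs more carefully, and handle many more edge cases.

```python
# Minimal breadth-first crawler sketch (illustrative only).
import time
import urllib.request
import urllib.parse
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=50, delay=1.0):
    frontier = deque([seed_url])   # URLs waiting to be visited (the seed list)
    visited = set()                # URLs already fetched (the crawler's database)

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue               # skip pages that fail to load
        visited.add(url)

        # Extract hyperlinks and add newly discovered pages to the frontier.
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urllib.parse.urljoin(url, link)   # resolve relative links
            if absolute not in visited:
                frontier.append(absolute)

        time.sleep(delay)          # be polite: don't hammer the server
    return visited

if __name__ == "__main__":
    pages = crawl("https://example.com")   # placeholder seed URL
    print(f"Visited {len(pages)} pages")
```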

Good web crawlers vs bad web crawlers

  • Good web crawlers go through all of your website’s pages to find out what is new or updated, with no malicious intent; they don’t steal any data from you and do not violate privacy policies.
  • Bad web crawlers visit your website continuously because their aim is to harvest private personal information about its visitors without permission. These crawlers should be closely monitored for security purposes, because if they find something sensitive, such as credit card numbers, it could seriously compromise someone’s identity or financial status.

How to block web crawlers and prevent them from entering your site

  • Use the robots exclusion protocol (REP) by publishing a robots.txt file that tells search engines not to crawl your site or index it on their results pages; compliance is voluntary, so well-behaved crawlers obey it but malicious ones may ignore it (see the robots.txt sketch after this list)
  • Serve a 403 Forbidden error in place of public-facing content you do not want spiders to access when a crawler is detected, or force an SSL connection for all traffic to make automated access harder (see the blocking sketch after this list)
  • Block the user agents of known crawlers, scrapers and downloaders, such as Googlebot, Baiduspider, BingBot, etc., or filter their traffic by IP address using an IP filtering agent
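As an illustration of the robots exclusion protocol mentioned in the first point, the sketch below parses a small set of robots.txt rules with Python's standard urllib.robotparser and checks what each agent may fetch. The rules, bot names and URLs are assumptions for the example, and remember that hostile bots can simply ignore them.

```python
import urllib.robotparser

# Illustrative robots.txt rules (an assumption; adapt them to your own site):
rules = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("BadBot", "https://example.com/index.html"))         # False: fully blocked
print(rp.can_fetch("GoodBot", "https://example.com/private/page.html")) # False: /private/ is disallowed
print(rp.can_fetch("GoodBot", "https://example.com/index.html"))        # True: allowed by default
```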
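And as a rough sketch of server-side blocking, the minimal WSGI application below returns a 403 Forbidden response to any request whose User-Agent matches a block list. The bot names and port are assumptions; real deployments usually enforce this at the web server, CDN or bot-management layer rather than in application code.

```python
# Minimal sketch: block requests by User-Agent and answer with 403 Forbidden.
from wsgiref.simple_server import make_server

BLOCKED_AGENTS = {"badbot", "scrapybot", "downloader"}   # illustrative names only

def app(environ, start_response):
    user_agent = environ.get("HTTP_USER_AGENT", "").lower()
    if any(bot in user_agent for bot in BLOCKED_AGENTS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human visitor"]

if __name__ == "__main__":
    with make_server("", 8000, app) as server:   # port 8000 is an assumption
        server.serve_forever()
```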
