WebCrawler sample

This page is set up to show how Python can be used to build a web crawler. A web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing. Google, Bing, DuckDuckGo and other web search engines all implement some form of crawler to build their indexes. This is a simple, real-time example written in Python. It only scans for links set up with HTML anchor tags and href attributes, and it is limited to 25 links per scan of a website, but it shows that the task can be performed simply and effectively with Python.

In the text box below, type in the website to scan and the number of links (1 to 25) to scan, then click Submit. The process will fetch the website entered and make a list of all links found on that first page. It will then scan that second layer of links, then the third, and so on, until it either runs out of links to scan or hits the limit entered.
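
The layered scan just described is a breadth-first traversal. The sketch below is a minimal illustration of that loop, not the code behind this page; get_links() and normalize() are hypothetical helpers, sketched in the snippet further down, that fetch a page's links and reduce equivalent URLs to a single key.

    from collections import deque

    def crawl(start_url, max_links=25):
        # Breadth-first scan: visit the start page, queue every link found,
        # then repeat for each queued link until the limit entered is hit.
        queue = deque([start_url])
        seen = {normalize(start_url)}   # normalize() and get_links() are
        scanned = []                    # hypothetical helpers, sketched below
        while queue and len(scanned) < max_links:
            url = queue.popleft()
            scanned.append(url)
            for link in get_links(url):
                key = normalize(link)
                if key not in seen:     # never queue the same page twice
                    seen.add(key)
                    queue.append(link)
        return scanned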

Code snippet that handles the crux of the process

The code for this snippet is fairly straightforward. It retrieves the HTML for the given link. The regular expression (LINK_REGEX) at the top finds every anchor tag with an href attribute in the HTML, and the resulting links are stored in a queue. Once all the links are saved in the queue, the code processes each sublink to find more links. The code will not scan the same link twice: it keeps track of visited links and checks that a link is never added to the queue more than once. It also cleans up links to eliminate duplicates; for example, www.google.com, http://www.google.com, https://www.google.com and http://www.google.com/ are all treated as the same link. The code stops when the queue finally empties or the maximum number of levels of links to be scanned is reached. Even a simple page like the Google home page can yield hundreds of links, depending on how many levels you choose.
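
The actual snippet is not reproduced here, but a minimal sketch consistent with the description above might look like the following. The regular expression is a simplified stand-in for LINK_REGEX rather than the exact pattern used by this page, and get_links() and normalize() are the hypothetical helpers referenced in the earlier sketch.

    import re
    import urllib.request
    from urllib.parse import urljoin

    # Simplified stand-in for the LINK_REGEX described above: captures the
    # href value of every anchor tag in a page's HTML.
    LINK_REGEX = re.compile(r'<a\s[^>]*href=["\']([^"\']+)["\']', re.IGNORECASE)

    def normalize(url):
        # Reduce equivalent spellings of one address to a single key so that
        # http://www.google.com, https://www.google.com and
        # http://www.google.com/ all count as the same link.
        url = url.rstrip('/').lower()
        for prefix in ('https://', 'http://'):
            if url.startswith(prefix):
                url = url[len(prefix):]
                break
        return url

    def get_links(url):
        # Fetch the HTML for the given link and return the absolute
        # URLs of every anchor tag found in it.
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                html = response.read().decode('utf-8', errors='ignore')
        except Exception:
            return []                   # unreachable pages contribute no links
        links = []
        for href in LINK_REGEX.findall(html):
            absolute = urljoin(url, href)   # resolve relative hrefs
            if absolute.startswith('http'):
                links.append(absolute)
        return links

With these helpers in place, the earlier crawl() sketch becomes runnable: crawl('https://www.python.org', 25), for example, would return up to the first 25 unique pages reached from that starting point.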