A robot is an automated tool for exploring and retrieving files from the web. You can find out more about how robots work in Chapter 8 of my book The Web Architect's Handbook, or The World Wide Web Robots, Wanderers, and Spiders home page.

This page contains links to robot-related programs I have written.

Obsolete Projects

Withdrawn. I wrote this robot in 1993/94, way back when the web was a lot smaller than it is today (and I knew a lot less about perl). It's not robust and should not be trusted. If you insist, the sources are available here.

Websnarf was written as a learning exercise in Perl (4.036 at that time). It was also intended as an experiment in web document mirroring. Better mirror tools are now readily available. One development goal was never realized: for websnarf to automagically re-write local URLs within retrieved documents, so that a downloaded web would be usable on the machine websnarf was running on. (I ran into problems with arbitrary record-length limits in dbm files, and had to abandon the project.)

Current Projects

These are a collection of small robots and robot-like tools I am currently writing. They are implemented in Perl, using the highly sophisticated object-oriented toolkits for web and HTTP access that are available from CPAN (the Comprehensive Perl Archive Network).

I'm writing these tools partly as a learning exercise in object-oriented programming, and partly because they may serve a useful function in illuminating several areas; notably how robots work, how to develop software tools using rapid prototyping in perl, and how web structure can be mapped from a central point.

Two design features of the robots below need emphasizing:

  1. These are HTTP-based robots; that is, they retrieve files via the Hypertext Transfer Protocol (and any other protocols that the Perl LWP installation knows about). Thus, they can only identify and test links. They say nothing about the actual filesystem stored on a given web server; they can't tell you, for example, if you have a problem with orphan files (files not linked to from elsewhere in your web), or if you have a problem with file partitioning (for example, twelve thousand files stored in one directory).
  2. These robots do not store any information about the documents they retrieve for future use. While you could modify CharlieSpider/0.3 to parse HTML files and do something with their contents fairly easily, that's not a central design goal. Rather, the robots are designed to probe for broken links. One consequence is that CharlieSpiders 0.2 and 0.3 only send HEAD requests to sites other than the one hosting their starting URL. Another issue is that to convert them into search robots, it would be necessary to build some kind of caching mechanism so that URLs that have been previously probed would thereafter be left alone (at least until any Expires: directive has expired).

A tool for verifying links via the web. CharlieSpider is a UNIX command-line tool; it takes a single parameter (a URL to check). Given a valid URL, CharlieSpider retrieves the target file, parses it as HTML, extracts any links in it, expands them (if they are relative URLs), and tests whether they are available by issuing a HEAD request. It then reports on what it found.

This program is not an HTML checker, but provides a quick check for link integrity in HTML pages. (It is the functional core of the more sophisticated tools described below).
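The core loop just described — fetch, parse, extract, expand, probe — can be sketched in a few lines of LWP-based Perl. This is an illustrative sketch, not CharlieSpider itself; it assumes the LWP, HTML::Parser and URI::URL modules from CPAN, and does only minimal error handling.

```perl
#!/usr/bin/perl
# Sketch of a one-page link checker in the style of CharlieSpider/0.1:
# fetch one document, extract its links, expand relative URLs against
# the starting URL, and probe each target with a HEAD request.
use LWP::UserAgent;
use HTTP::Request;
use HTML::LinkExtor;
use URI::URL;

my $start = shift @ARGV or die "usage: $0 <url>\n";
my $ua    = LWP::UserAgent->new;

my $response = $ua->request(HTTP::Request->new(GET => $start));
die "Can't fetch $start: ", $response->code, "\n"
    unless $response->is_success;

# Collect every link attribute (HREF=, SRC=, ...) as the page is parsed.
my @links;
my $parser = HTML::LinkExtor->new(sub {
    my ($tag, %attrs) = @_;
    push @links, values %attrs;
});
$parser->parse($response->content);

foreach my $link (@links) {
    my $abs  = url($link, $start)->abs;   # expand relative URLs
    my $head = $ua->request(HTTP::Request->new(HEAD => $abs));
    printf "%-6s %s\n", ($head->is_success ? "ok" : "BROKEN"), $abs;
}
```

Note that this expands relative links against the starting URL rather than the document's true base; that shortcut is exactly the base-handling bug discussed further down the page.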

A variant on CharlieSpider/0.1, for verifying local websites. Given a valid URL, CharlieSpider/0.2 retrieves the target file, parses it as HTML, extracts any links in it, expands them (if they are relative URLs), and tests whether they are available by issuing a HEAD request. It then repeats the process recursively on any local links it found, and reports on any failed HTTP requests (which are often indicative of broken links in the web).
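The recursive step can be summarized in a short sketch. The helper names here (extract_links, is_local, probe_head) are hypothetical stand-ins for the version 0.1 machinery, not functions from the actual source; the point is the %seen hash, without which the spider would loop forever on pages that link to each other.

```perl
# Recursion in the style of CharlieSpider/0.2 (sketch; helper names
# are hypothetical). A %seen hash guarantees each URL is probed once.
my %seen;

sub check_page {
    my ($url) = @_;
    return if $seen{$url}++;          # never visit the same URL twice
    my @links = extract_links($url);  # GET + parse, as in version 0.1
    foreach my $link (@links) {
        if (is_local($link)) {
            check_page($link);        # recurse into the local web
        } else {
            probe_head($link);        # remote sites get a HEAD only
        }
    }
}
```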

A variant on this has been re-implemented as a CGI script with a forms-based interface. Source code.

A variant on CharlieSpider/0.2, this version runs as a CGI script and obeys the robot exclusion protocol. This is particularly important, as it means that CharlieSpider/0.3 is sufficiently well house-trained to let out on the web at large; it won't go where it isn't wanted. You can use it, or see the annotated source code.
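Obeying the robot exclusion protocol is largely a matter of swapping user-agent classes: LWP::RobotUA is a drop-in replacement for LWP::UserAgent that fetches and honours each site's /robots.txt automatically. A sketch (the URL and contact address are placeholders):

```perl
# House-training a spider with LWP::RobotUA (from CPAN). The module
# consults /robots.txt before each request and refuses forbidden ones.
use LWP::RobotUA;

my $ua = LWP::RobotUA->new('CharlieSpider/0.3', 'webmaster@example.com');
$ua->delay(1);   # wait at least 1 minute between hits on one server

my $response = $ua->request(HTTP::Request->new(HEAD => 'http://www.example.com/'));
# Requests disallowed by robots.txt come back as 403 Forbidden without
# ever touching the remote server -- the robot won't go where it isn't
# wanted.
print $response->code, " ", $response->message, "\n";
```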

Fixing a bug

There are two subtle bugs in the three robots above:
  1. Runaway recursion
    Suppose you tell CharlieSpider/0.3 to probe from this web page. It's reasonable to expect it to drill down, below this page (http://$server/webbook/robot/index.html). It is somewhat less reasonable to expect it to start probing other durectories, at higher levels on the server. But this site 'leaks'; it has back-references pointing to higher levels (http://$server/webbook/), and consequently the search can spread out into directories it was never meant to explore.
  2. Base handling
    The spiders described above take a current URL, retrieve the document it refers to, and build URLs for other files by appending relative paths to the current URL. This ignores the Content-Base: HTTP header and the <BASE HREF="..."> tag, both of which can modify the notional base URL of a document.
In CharlieSpider/0.4 (under construction) we'll see a much more robust URL expander that takes both these problems into account.
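One way such an expander might look (a sketch of the idea, not the 0.4 code, assuming the URI::URL module and a $root variable holding the starting URL):

```perl
# Sketch of a URL expander addressing both bugs. $response is the
# HTTP::Response for the page the link came from; $root is the URL the
# search started at.
use URI::URL;

sub expand_link {
    my ($link, $response, $root) = @_;
    # $response->base consults the Content-Base: header and any
    # <BASE HREF="..."> tag before falling back on the request URL,
    # which fixes the base-handling bug.
    my $abs = url($link, $response->base)->abs->as_string;
    # Refuse to climb above the starting point, which stops the
    # runaway recursion through "leaky" back-references.
    return undef unless index($abs, $root) == 0;
    return $abs;
}
```

The prefix test is deliberately crude: anything that doesn't begin with the starting URL is treated as out of bounds and merely HEAD-probed rather than recursed into.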

Future Projects

Doesn't exist yet. The robots described in Current Projects are HTTP-based link-followers. A different type of robot is conceivable, for web structure analysis on a local server.

Such a spider will traverse the filesystem stored on a large web server, maintaining a database of files and their links, along with a reference count for each file. It will trace broken links and report on orphan files, disconnected weblets, and cyclic sections of the web. (As such, it will be useful for maintaining large websites; see, for example, Hantsweb -- a system containing > 116,000 HTML documents.)
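One way such a spider might begin is sketched below: walk the document tree with File::Find, tally inbound references, and flag files that nobody links to. The document root is a hypothetical path, and the sketch glosses over the real work of mapping link URLs back onto filesystem paths.

```perl
# Sketch of a filesystem-based orphan detector. Assumes HTML::LinkExtor
# from CPAN; $docroot is a hypothetical server document root.
use File::Find;
use HTML::LinkExtor;

my $docroot = '/usr/local/etc/httpd/htdocs';
my %refcount;

find(sub {
    return unless /\.html?$/;
    my $file = $File::Find::name;
    $refcount{$file} ||= 0;            # ensure every file has an entry
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attrs) = @_;
        # A real tool must resolve each link URL to a filesystem path
        # before counting; this sketch counts the raw link strings.
        $refcount{$_}++ for values %attrs;
    });
    $parser->parse_file($file);
}, $docroot);

# Files whose count is still zero are orphans: nothing links to them.
print "orphan: $_\n" for grep { $refcount{$_} == 0 } keys %refcount;
```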