A robot is an automated tool for exploring and retrieving files from
the web. You can find out more about how robots work in Chapter 8 of
my book, The Web Architect's Handbook, or on The
World Wide Web Robots, Wanderers, and Spiders home page.
This page contains links to robot-related programs I have written.
Obsolete Projects
- Websnarf
- Withdrawn. I wrote this robot in 1993/94, way back when the
web was a lot smaller than it is today (and I knew a lot less about Perl).
It's not robust and should not
be trusted. If you insist, the sources are available
here.
Websnarf was written as a learning exercise in Perl (4.036 at that time).
It was also intended as an experiment in web document mirroring. Better
mirror tools are now readily available. One development goal was never
realized: having websnarf automagically re-write local URLs within
retrieved documents so that a downloaded web would be usable on the
machine websnarf was running on. (I ran into problems with
arbitrary record-length limits in dbm files, and had to abandon the
project.)
Current Projects
These are a collection of small robots and robot-like tools I am currently
writing. They are implemented in
Perl, using the highly sophisticated
object-oriented toolkits for web and HTTP access that are available from
CPAN (the Comprehensive Perl
Archive Network).
I'm writing these tools partly as a learning exercise in object-oriented
programming, and partly because they may serve a useful function in
illuminating several areas: notably, how robots work, how to develop
software tools by rapid prototyping in Perl, and how web structure
can be mapped from a central point.
Two design features of the robots below need emphasizing:
-
These are HTTP-based robots; that is, they retrieve files via
the HyperText Transfer Protocol (and any other protocols that the
Perl LWP installation knows about). Thus, they can only identify
and test links. They say nothing about the actual filesystem
stored on a given web server; they can't tell you, for example, if
you have a problem with orphan files (files not linked to from elsewhere
in your web), or if you have a problem with file partitioning (for
example, twelve thousand files stored in one directory).
-
These robots do not store any information about the documents
they retrieve for future use. While you could modify CharlieSpider/0.3 to
parse HTML files and do something with their contents fairly easily,
that's not a central design goal. Rather, the robots are designed to probe for
broken links. One consequence is that CharlieSpiders 0.2 and 0.3
only send HEAD requests to sites other than the one hosting their
starting URL (see the sketch after this list). Another issue is that
to convert them into search robots,
it would be necessary to build some kind of caching mechanism so that
URLs that have been previously probed would thereafter be left
alone (at least until any Expires: directive has expired).
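To make the HEAD-for-foreign-hosts rule concrete, here is a minimal
sketch using LWP::UserAgent and URI from CPAN. The URL and agent name
are made up for illustration; this isn't the actual CharlieSpider source.

    use LWP::UserAgent;
    use HTTP::Request;
    use URI;

    # Hypothetical starting URL; the real spider takes this as a parameter.
    my $start = URI->new('http://www.example.com/webbook/robot/index.html');
    my $ua    = LWP::UserAgent->new(agent => 'CharlieSpider-sketch');

    sub probe {
        my $url = URI->new(shift);
        # Full GET (so the page can be parsed for links) only on the
        # home host; a HEAD request is enough to test any other link.
        my $method = ($url->host eq $start->host) ? 'GET' : 'HEAD';
        my $resp   = $ua->request(HTTP::Request->new($method => $url));
        return $resp->code;    # e.g. 200 for success, 404 for a broken link
    }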
- CharlieSpider/0.1
- A tool for verifying links via the web. CharlieSpider is a UNIX command-line
tool; it takes a single parameter (a URL to check). Given a valid URL,
CharlieSpider retrieves the target file, parses it as HTML, extracts any
links in it, expands them (if they are relative URLs), and tests whether
they are available by issuing a HEAD request. It then reports on what it
found.
This program is not an HTML checker, but provides a quick check for
link integrity in HTML pages. (It is the functional core of the
more sophisticated tools described below).
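In outline, and glossing over error handling, the core logic looks
something like the sketch below (assuming LWP::UserAgent, HTML::LinkExtor
and URI from CPAN; the code is illustrative, not the actual
CharlieSpider source):

    use LWP::UserAgent;
    use HTML::LinkExtor;
    use URI;

    my $url = shift @ARGV or die "usage: $0 <url>\n";
    my $ua  = LWP::UserAgent->new;

    my $resp = $ua->get($url);
    die "can't fetch $url: ", $resp->status_line, "\n"
        unless $resp->is_success;

    # Collect the href/src attribute values from every tag in the page.
    my @links;
    my $parser = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @links, values %attr;
    });
    $parser->parse($resp->decoded_content);

    # Expand relative URLs against the page's URL, then HEAD-check each.
    for my $link (@links) {
        my $abs = URI->new_abs($link, $url);
        next unless $abs->scheme =~ /^https?$/;
        printf "%s %s\n", $ua->head($abs)->code, $abs;
    }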
- CharlieSpider/0.2
- A variant on CharlieSpider/0.1,
for verifying local websites. Given a valid URL, CharlieSpider/0.2
retrieves the target file, parses it as HTML, extracts any
links in it, expands them (if they are relative URLs), and tests whether
they are available by issuing a HEAD request. It then repeats the process
recursively on any local links it found, and reports on any failed HTTP
requests (which are often indicative of broken links in the web).
A variant on this has been re-implemented as a CGI script
with a forms-based interface. Source code.
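The recursive step can be sketched like this; check_page() is a
hypothetical helper standing in for the fetch/parse/HEAD-check logic of
0.1, and the %seen hash stops the spider probing the same URL twice:

    use URI;

    my $start_url = 'http://www.example.com/';    # illustrative only
    my $home      = URI->new($start_url)->host;
    my %seen;

    sub crawl {
        my $url = shift;
        return if $seen{$url}++;                  # already probed
        # check_page() (hypothetical) HEAD-checks every link on the
        # page, reports failures, and returns the absolute links found.
        my @links = check_page($url);
        crawl($_) for grep { URI->new($_)->host eq $home } @links;
    }

    crawl($start_url);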
- CharlieSpider/0.3
- A variant on CharlieSpider/0.2, this version
runs as a CGI script and obeys the
robot
exclusion protocol. This is
particularly important, as it means that CharlieSpider/0.3 is sufficiently
well house-trained to be let out on the web at large; it won't go where
it isn't wanted. You can
use it,
or see the annotated source code.
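The exclusion check itself is straightforward with the WWW::RobotRules
module from libwww-perl; the sketch below refetches robots.txt on every
call for brevity, where real code would cache it per site. (LWP also
ships LWP::RobotUA, a UserAgent subclass that does this bookkeeping for
you.)

    use LWP::UserAgent;
    use WWW::RobotRules;
    use URI;

    my $agent = 'CharlieSpider-sketch';           # illustrative name
    my $ua    = LWP::UserAgent->new(agent => $agent);
    my $rules = WWW::RobotRules->new($agent);

    sub allowed {
        my $url    = URI->new(shift);
        my $robots = $url->scheme . '://' . $url->host_port . '/robots.txt';
        my $resp   = $ua->get($robots);
        # An unreadable or missing robots.txt means no restrictions.
        $rules->parse($robots, $resp->is_success ? $resp->decoded_content : '');
        return $rules->allowed($url);
    }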
Fixing two bugs
There are two subtle bugs in the three robots above:
-
Runaway recursion
Suppose you tell CharlieSpider/0.3 to probe from this web page. It's reasonable
to expect it to drill down, below this page
(http://$server/webbook/robot/index.html). It is somewhat less
reasonable to expect it to start probing other directories, at higher
levels on the server. But this site 'leaks'; it has back-references
pointing to higher levels (http://$server/webbook/), and
consequently the search can spread out into directories it was never meant
to explore.
-
Base handling
The spiders described above take a current URL, retrieve the document
it refers to, and build URLs for other files by appending relative
paths to the current URL. This ignores the Content-base:
HTTP directive, and the <BASE HREF=...> tag, both of which can modify
the notional base URL.
In CharlieSpider/0.4 (under construction) we'll see a much more
robust URL expander that takes both these problems into account.
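For the curious, here is a rough sketch of what such an expander might
look like (again using LWP and URI; the URL is made up, and none of
this is the actual 0.4 source). Relative links are resolved against the
response's real base, and anything outside the starting directory is
refused:

    use LWP::UserAgent;
    use URI;

    my $start_url = 'http://www.example.com/webbook/robot/index.html';
    (my $prefix = $start_url) =~ s{[^/]*$}{};    # .../webbook/robot/

    # Expand a link found in the response $resp against the right base,
    # and decide whether it is in scope for the recursive search.
    sub expand {
        my ($link, $resp) = @_;
        # $resp->base honours the Content-base: directive and any
        # <BASE HREF=...> tag, falling back to the request URL.
        my $abs = URI->new_abs($link, $resp->base)->canonical->as_string;
        # Refuse to leak upwards: only URLs under the start directory.
        return index($abs, $prefix) == 0 ? $abs : undef;
    }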
Future Projects
- CharlieSpider/2
- Doesn't exist yet.
The robots described in Current Projects are HTTP-based
link-followers. A different
type of robot is conceivable, for web structure analysis on a local server.
Such a spider will traverse a filesystem stored on a large web server,
maintaining a database of files and their links, along with reference
counts for each file. It will trace broken links and report on orphan
files and weblets, and on cyclic sections in the web. (As such, it will be
useful for maintaining large websites; see, for example,
Hantsweb -- a system containing
more than 116,000 HTML documents.)
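Nothing is written yet, but the approach might be sketched like this,
using the core File::Find module plus HTML::LinkExtor. The document
root is hypothetical, and the link resolution is deliberately naive (no
../ normalisation, no server aliases):

    use File::Find;
    use HTML::LinkExtor;

    my $root = '/var/www/htdocs';      # hypothetical document root
    my %refs;                          # file => number of inbound links

    find(sub {
        return unless /\.html?$/;
        my $file = $File::Find::name;
        $refs{$file} ||= 0;            # make sure every file appears

        open my $fh, '<', $file or return;
        my $html = do { local $/; <$fh> };

        HTML::LinkExtor->new(sub {
            my ($tag, %attr) = @_;
            for my $link (values %attr) {
                next if $link =~ m{^[a-z][a-z0-9+.-]*:}i;  # skip absolute URLs
                $refs{"$File::Find::dir/$link"}++;         # count the reference
            }
        })->parse($html);
    }, $root);

    # Files nothing points at are orphans.
    print "orphan: $_\n" for grep { $refs{$_} == 0 } sort keys %refs;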