Essays

Big CGI Applications

[ Comments ] [ Copyright ] [ Main contents ]

Writing a small CGI script is easy.

Big ones, however, offer an entirely different set of challenges to the programmer. This is the first of two articles explaining the pros and cons of writing big, complex, CGI scripts in Perl -- although the principles are mostly applicable to any other CGI development environment.

I was recently called in to write a web-based system administration interface for a firewall-like product: McAfee's WebShield. This system is basically a stand-alone PC with two ethernet ports; one of these is connected to the outside world (via a router), and the other connects to the internal network. All network traffic between the outside world and an internal network is thus funnelled through this system, which scans it for viruses.

One of the requirements for this product was that the users (or rather, the system administrators) should be able to administer it entirely through their web browser.

As you can imagine, administering a firewall product is not a simple job. Even after heroic attempts had been made to make it into a turnkey product, there were about forty distinct tasks that the users might be called upon to do. A menu-based interface already existed, written using the UNIX CURSES library to provide a suite of fill-out forms relating to these tasks; my job was to build a CGI system functionally equivalent to a large system administration utility.

Application or Tool?

There is a big difference between a one-shot tool and an application. A one-shot tool is something like the common or garden form-mail script; an HTML form that takes some input and a corresponding CGI script that emails the input to a designated recipient. A tool has no flow-of-control to speak of; it simply takes some input information and does a job with it. In contrast, a big application like a shopping-cart system, a database, or a system administration interface, is much more complex; the user interacts with it, changes the state of some underlying data structures, and has a wide range of tasks they can accomplish. In turn, the program must keep an eye on the data the user is dealing with -- in the case of a retail system, for example, it needs to know what goods the user has selected, and what state their financial transaction is at.

Keep an eye on that weasel-word "state". It's critically important to CGI programming -- and it makes the difference between a toy and a three-month job.

From the outside, an application looks like a big shiny toy -- you can play with it, twiddle the controls, wind it up, and watch it do things. When it stops doing things, you wind it up some more.

From the inside, the application looks a little different. In order for it to be possible for you to interact with it, it needs to have some idea of who you are, what you're doing, and where it's been. It needs a memory. That memory is, in computer terms, referred to as "state"; a program may exist in numerous states, which are defined in terms of the internal variables it knows about, and the processes that are going on inside it in response to the user's inputs. User inputs cause a program to change its state, at which point maybe different user inputs will be needed.

Maintaining State

In the last paragraph, I used two terms: "process" and "program". If you're a computer programmer you might have sat up and paid attention, because interchanging those two terms is usually a mistake; they have a very specific meaning in computer science, and you could argue that I've misused them. But as I hope to show, the web makes the normal certainties flexible ... so let's define some terms.

A "program" is a bunch of instructions that a computer's CPU understands. It is, however, no more than data; data consisting of machine instructions. Start feeding a program to a computer so that it executes it, and the computer will begin stashing data in memory, executing instructions, evaluating branch conditions, and so on; a long process that usually can follow a number of different courses. This is a "process"; a program in the act of being executed, along with all its associated data, variables, files, and so forth. It's a bit like the difference between the script of a stage play, and an actual performance.

No two theatrical performances are identical, but software processes differ even more radically; they have a nasty habit of interacting with users, who enter data and thus cause them to go different ways. As data is entered, the state the program is in changes.

Normal software processes run continuously (at least insofar as they are concerned -- the operating system may have ideas about multi-tasking, but the processes themselves can't actually see this happening). But the world wide web is, paradoxically, a discontinuous medium. All HTTP transactions are unique and state-free; the outcome of one transaction has no impact on the state of the HTTP server, or the outcome of the next request.

This leads to a big problem when you try building an interactive application on the web. Simply put, a web server receives a request, runs a CGI script to service it (or reads and sends a file), then closes the connection. This isn't very interactive! So if you want to track the state of a user's session on a web server, you need to do it yourself -- or make it entirely unnecessary.

There are no good solutions to keeping track of state information in a web-based application. Yes, you can do it; but there are drawbacks. Here's a list of possible methods:

Cookies
A cookie is a piece of data that can be passed back and forth between a web browser and a server in the headers of HTTP requests and responses. You can store a unique number in a cookie, and use this as the key to a database of user state information on the server; when your CGI script receives a request containing a cookie, it knows where to go to find the appropriate state information. (Problem: users can refuse to accept a cookie, or -- on appropriate operating systems -- cause them to be dumped in the bit bucket. And people are justifiably paranoid about cookies, which have been over-used by some marketing operations and abused as tools for gathering personal information.)

Weird URLs
A uniform resource locator is first and foremost a way of referring to the location of some information resource on the Web. However, it doesn't have to be just the name of a file or the address of a CGI script. You can append information to a URL following a question-mark, and this information may be retrieved by a CGI script if it is invoked via the GET method. So you can put all your state information in the URL your users click on to talk to your program. (Problem: it's visible to the users, and prone to attempts at forgery.)

Internet address
All HTTP transactions carry with them the internet address from which they originate. It's tempting to assume that, because so many people have personal workstations, all connections from a given IP address are part of the same serial transaction. (Problem: it ain't necessarily so. I am aware of at least one site where a firewall with a single IP address acts as a proxy gateway for no fewer than 18,000 users. Mistake that for a single-user workstation and your entire billing and invoicing system is going to collapse.)

Embedded protocols
An embedded protocol is some client/server protocol embedded within a web session, which is intended to keep track of user state information. For example, you could use a Java applet to interface with a remote database; the web is used to download and initialize the applet, which then opens its own socket back to the database server and takes over the session. (Problem: this isn't the web any more -- this is something else. Not everyone who can use the web can use Java applets; this isn't a solution to the web-state conundrum so much as it is an attempt to sidestep the issue.)

Persistent connections
HTTP/1.1 permits a connection to stay open, so that more than one document can flow down it. This doesn't give you persistent CGI state maintenance, but it does allow you to provide the illusion of a process on the server which is continually changing (rather than discretely changing) from the point of view of the client.

Persistent CGI objects
Normal CGI objects are loaded in response to a request, execute in a set environment, send results to the web server for transmission back to the client, then die. A mechanism for persistent CGI objects is provided by the FastCGI protocol. FastCGI is similar to CGI, except that where CGI scripts follow a serial life cycle of loading, executing, and terminating, FCGI scripts (like this one) execute in a loop. The script is loaded by the server when the server first runs, and hangs around waiting for a request. When a request comes in, the FCGI processes it, then goes back to waiting for another request. There's nothing to prevent an FCGI from retaining state information between requests. (Problem: race conditions and deadlock may occur. FCGI scripts that attempt to retain state information internally -- for example, by stashing it in a shared memory area -- are vulnerable to the possibility that two requests for the script may come in rapid succession and overwrite each other's data; programmers who intend to use FCGI for state maintenance need to pay a lot of attention to resource locking.)
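
To make the cookie technique concrete, here is a minimal sketch of the "unique number as a key into server-side state" idea. Everything named here (parse_cookies, session_for, handle_request, the cookie name "session") is invented for illustration, and the in-memory hash is only a stand-in for a real on-disk database.

```perl
use strict;
use warnings;

# Stand-in for the server-side database of user state. A plain CGI process
# exits after every request, so a real script would keep this on disk
# (a DBM file, say); the in-memory hash only keeps the sketch short.
my %SESSIONS;

# Split a Cookie header such as "session=abc; theme=dark" into a hash.
sub parse_cookies {
    my ($header) = @_;
    my %cookie;
    for my $pair (split /;\s*/, defined $header ? $header : '') {
        my ($name, $value) = split /=/, $pair, 2;
        $cookie{$name} = $value if defined $value;
    }
    return %cookie;
}

# Hypothetical key generator: fine for a demo, far too guessable for
# production use.
sub new_session_id {
    return sprintf '%08x%08x', time, int rand 0xffffffff;
}

# Return the session key and its state record, creating both when the
# browser sent no cookie or sent one we do not recognise.
sub session_for {
    my ($cookie_header) = @_;
    my %cookie = parse_cookies($cookie_header);
    my $id     = $cookie{session};
    unless (defined $id && exists $SESSIONS{$id}) {
        $id = new_session_id();
        $SESSIONS{$id} = {};
    }
    return ($id, $SESSIONS{$id});
}

# One request: look up the state, change it, and hand the key back.
sub handle_request {
    my ($cookie_header) = @_;
    my ($id, $state) = session_for($cookie_header);
    $state->{visits}++;
    return "Set-Cookie: session=$id; path=/\r\n"
         . "Content-Type: text/plain\r\n\r\n"
         . "Visit number $state->{visits}\n";
}
```

In a CGI script the web server passes the incoming cookie in the HTTP_COOKIE environment variable, so the script would end with something like print handle_request($ENV{HTTP_COOKIE}). Note that the cookie carries only the key; the state itself never leaves the server.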

As you can see, all these alternatives have their drawbacks. In my opinion, the most promising for the future is FCGI; although it has some drawbacks it is more or less compatible with many existing scripts, it provides a good performance improvement, and with some judicious (read: paranoid) programming techniques it could yield dividends.
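
As an illustration of the kind of paranoia I mean, here is a sketch of a read-modify-write cycle guarded by an exclusive lock. It uses a lock on a plain file rather than a shared memory area, purely because that is the simplest thing to show; bump_counter and the one-number file format are made up for the example.

```perl
use strict;
use warnings;
use Fcntl qw(:DEFAULT :flock :seek);

# Add one to the number stored in $path and return the new value. The
# exclusive lock is held across the whole read-modify-write cycle, so two
# requests arriving together are serialised instead of overwriting each
# other's data.
sub bump_counter {
    my ($path) = @_;
    sysopen(my $fh, $path, O_RDWR | O_CREAT) or die "open $path: $!";
    flock($fh, LOCK_EX)                      or die "lock $path: $!";

    # Read the current value; an empty or unreadable file counts as zero.
    my $line  = <$fh>;
    my $count = (defined $line && $line =~ /^(\d+)/) ? $1 : 0;
    $count++;

    # Rewrite the file in place while still holding the lock.
    seek($fh, 0, SEEK_SET) or die "seek $path: $!";
    truncate($fh, 0)       or die "truncate $path: $!";
    print {$fh} "$count\n";

    # Closing the handle flushes the data and releases the lock.
    close($fh) or die "close $path: $!";
    return $count;
}
```

The important design point is that the lock is taken before the read, not just before the write; locking only the write still lets two requests read the same old value.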

But that's not how I did it: I prefer to tackle the problem in an entirely different way. Which is, quite simply, to avoid having to track state information wherever possible -- by making the necessary minimum of state information implicit in the structure of the CGI application itself.

Defining an API

The first thing I had to do was to design a framework that would allow me to carry out some basic administration work on a Linux system with additional virus scanning extensions.

Most configuration issues on Linux are handled by editing text files that contain arcane instructions that tell various servers or daemon processes how to behave. Having edited such a file, you then need to restart the appropriate process.

It seemed sensible to define an application-programming-interface (API) that would do the dirty work -- editing configuration files, restarting daemons -- and which I could then manipulate from my CGI scripts. Thus, changes to the CGI scripts wouldn't necessarily impact in any way on the actual administration process itself -- and I could use the same software in different CGI scripts, too.

This actually worked quite well. Perl is an ideal language for munching on textual configuration files; I defined a top-level class with some standardized methods (including an AUTOLOAD method to handle storage of arbitrarily-named fields in appropriate record structures), then some specific sub-classes to handle the file types I was interested in. (The sub-classes handled such mundane issues as parsing the records in each file and storing them, and writing an object's data back out to a file.) While this generic approach is insufficiently abstract for some more complex items -- I'd hate to try to force it to deal with DNS tables, for example -- it was more than adequate for tweaking the necessary configuration files on a WebShield box. (In the interests of brevity, copyright control, and security I am not going to describe those specific files.)
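
The shape of that class hierarchy can be sketched as follows. This is not the WebShield code: ConfigRecord, KeyValueFile, and the "name = value" file format are all invented for illustration, and a real subclass would read from and write to actual files rather than strings.

```perl
use strict;
use warnings;

# Hypothetical top-level class: one object holds the fields of one record.
package ConfigRecord;

our $AUTOLOAD;

sub new {
    my ($class, %fields) = @_;
    return bless { fields => {%fields} }, $class;
}

# Any method call not defined elsewhere is treated as a field accessor:
# $rec->hostname reads the field, $rec->hostname('fw1') writes it.
# (Fields named after real methods, such as "new" or "can", would need
# special handling that this sketch leaves out.)
sub AUTOLOAD {
    my ($self, @args) = @_;
    (my $name = $AUTOLOAD) =~ s/.*:://;   # strip the package qualifier
    return if $name eq 'DESTROY';         # object destruction is not a field
    die "no class method named $name\n" unless ref $self;
    $self->{fields}{$name} = $args[0] if @args;
    return $self->{fields}{$name};
}

sub field_names {
    my ($self) = @_;
    return sort keys %{ $self->{fields} };
}

# Hypothetical sub-class for files made of "name = value" lines.
package KeyValueFile;

our @ISA = ('ConfigRecord');

# Build an object from the text of a file, skipping anything that is
# not a "name = value" line (comments, blank lines).
sub parse {
    my ($class, $text) = @_;
    my $self = $class->new;
    for my $line (split /\n/, $text) {
        next unless $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/;
        my ($name, $value) = ($1, $2);
        $self->$name($value);
    }
    return $self;
}

# Turn the object's data back into file text.
sub as_text {
    my ($self) = @_;
    my $text = '';
    for my $name ($self->field_names) {
        $text .= "$name = " . $self->$name() . "\n";
    }
    return $text;
}

package main;

my $conf = KeyValueFile->parse("hostname = fw1\n# a comment\nscan_mail = yes\n");
$conf->scan_mail('no');
print $conf->as_text;   # prints "hostname = fw1" then "scan_mail = no"
```

A CGI script built on classes like these never touches the file format directly; it calls accessors and asks the object to write itself out, which is what lets the same library serve several scripts.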

It also seemed sensible to keep a small amount of state information on hand. The overall application (based on the original CURSES-type utility) consists of a set of menus that glue together forms; each form governs one aspect of the configuration of the system. It's reasonable to design a system administration interface in such a manner that the state of each administration form relates purely to the current user interaction, and the underlying state of the system administration API, rather than to the state of any other forms. In other words: if you go into form A and press a few buttons (but don't commit any changes to the underlying system configuration), then go into form B, when you next visit form A your unsaved changes will have been lost.

It is relatively easy to track state information within a single CGI script by using the Perl 5 CGI module; this takes care of the grunt work, providing an object-oriented wrapper at the level of the interaction between a user and a CGI program, rather than at the much lower level of a single HTTP transaction. This gives the necessary level of state tracking without requiring any extra programming effort (such as transferring state information between different CGI scripts).
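
To show roughly what kind of grunt work is being taken care of, here is a hand-rolled sketch of carrying a form's state from one request to the next within a single script: decode what the browser submitted, then write it back into the next page as hidden fields. This is not how the CGI module is implemented, and the function names are my own; with the module you would not write any of this yourself.

```perl
use strict;
use warnings;

# Escape text for safe use inside HTML attribute values.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Render the current state as hidden form fields, one per line.
sub hidden_fields {
    my (%state) = @_;
    return join "\n", map {
        sprintf '<input type="hidden" name="%s" value="%s">',
            escape_html($_), escape_html($state{$_})
    } sort keys %state;
}

# Decode a URL-encoded form submission ("a=1&b=two+words") into a hash.
sub parse_form {
    my ($encoded) = @_;
    my %field;
    for my $pair (split /[&;]/, $encoded) {
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                             # "+" encodes a space
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;   # "%3C" encodes "<"
        }
        $field{$name} = $value;
    }
    return %field;
}

# Round trip: what came in on this request goes out in the next form.
my %state = parse_form('host=fw1&note=a+%3C+b');
print hidden_fields(%state), "\n";
```

Because the state lives in the page the user is looking at, it disappears as soon as they navigate to a different form, which is exactly the behaviour described above for form A and form B.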

I'll discuss object-oriented APIs elsewhere (in an article on the construction of interfaces to client/server systems). For now, let's just say that it makes life a lot easier in the long run, especially if you need to manipulate the same complex data structures at several different points within your application.

Glue

So far so good; I've described a simple approach to writing a cart-load of CGI scripts designed to handle various aspects of system administration, and how to minimize the need to spend time and brain cells on tracking user state information. But what makes this bunch of scripts into an application? And how are they organized?

The other half of the puzzle (which I'm going to release some source code for, when I've tidied it up a bit) is a tool for gluing CGI scripts together.

In a nutshell: if you can keep state changes simple -- within a single form dialog and its resulting output -- then you need to provide navigation glue to give the users the illusion that they're interacting with a monolithic application. In reality, they're dealing with a hierarchy of small scripts, but it doesn't look or feel that way. All the scripts read and write certain common configuration files; thus, if the user changes the interface using one form, the next time they look at another form the changes will show up. It's like the phenomenon of persistence of vision; the human eye can't distinguish discrete successive events at an interval of less than about forty milliseconds, so it interprets them as uniform motion.

The web glue is a macro processing program. It traverses a directory tree, examining each directory in turn as it descends. In each directory it expects to find a data file, plus zero or more other directories and CGI scripts. The data file tells it what the CGI scripts are named -- what links to them should be captioned with in the web of the user interface -- plus, optionally, one or two other things.

As the macro processor descends the directory tree, it builds an output directory tree somewhere else on the same computer. It then populates the output directory with CGI scripts, copied from the source directory tree. As it copies them, it replaces embedded tags (typically looking like %this%) with something more appropriate; links to adjacent files, for example, or an HTML toolbar. The tags go inside strings defined in the Perl scripts, so that you can say in your code something like:

    my ($toolbar) = "%toolbar%";
    print $toolbar;

When the processor sees this, it will attempt to execute a macro named %toolbar%, and insert the results of that macro inside the double-quotes in the Perl source. It then copies the modified Perl script into the destination directory.

It's a bit like using the C pre-processor, except that cpp feeds its output into a C compiler directly; macro-processed perl is itself an executable, so it needs to be exported into a destination area.
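
The substitution step itself is small. The following is a sketch of the idea rather than the tool I wrote: it assumes tags always match %word%, represents each macro as a hypothetical code reference in a hash, and leaves unknown tags untouched. Because the result lands between double quotes in Perl source, it also escapes the characters that would otherwise break or interpolate inside the string.

```perl
use strict;
use warnings;

# Make text safe to drop between double quotes in Perl source:
# backslash-escape \ " $ and @.
sub quote_for_perl {
    my ($text) = @_;
    $text =~ s/([\\"\$\@])/\\$1/g;
    return $text;
}

# Replace each %name% tag with the output of the macro of that name.
# Tags with no matching macro are copied through unchanged.
sub expand_macros {
    my ($source, $macros) = @_;
    $source =~ s{%(\w+)%}{
        my $name = $1;
        exists $macros->{$name}
            ? quote_for_perl($macros->{$name}->())
            : "%$name%"
    }ge;
    return $source;
}

# A made-up macro table with a single entry.
my %macros = (
    toolbar => sub { '<a href="/admin/">Admin</a>' },
);

my $line = 'my ($toolbar) = "%toolbar%";';
print expand_macros($line, \%macros), "\n";
# prints: my ($toolbar) = "<a href=\"/admin/\">Admin</a>";
```

A full processor would wrap this in the directory walk: read each source script, expand it, and write the result to the matching place in the output tree.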

The system I built has a number of built-in macros. As it traverses the directory tree, it builds a data structure (a tree, actually) containing the names of all the files and directories in the application. The macro processor uses this tree to build (and interpolate into the CGI scripts) URLs that point to other scripts and directories. The output looks rather like this web site: sparse textual toolbars that direct the user to all the parents of the current page or script. This provides a good level of context, and makes it easy to orient yourself in a deep tree-structured site.
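
A toolbar of that kind can be generated from nothing more than a script's location and the captions collected during the traversal. In this sketch the captions sit in a flat hash keyed by directory URL, which is a simplification of the tree described above; the directory names, captions, and toolbar_for are all invented.

```perl
use strict;
use warnings;

# Hypothetical captions, keyed by directory URL, standing in for what the
# per-directory data files would supply.
my %CAPTION = (
    '/'             => 'Main menu',
    '/network/'     => 'Network',
    '/network/dns/' => 'Name servers',
);

# Build a textual toolbar linking to every parent of the given script.
sub toolbar_for {
    my ($script_url) = @_;
    my @dirs = grep { length } split m{/}, $script_url;
    pop @dirs;                       # the last element is the script itself

    my $url   = '/';
    my @links = (qq{<a href="$url">$CAPTION{$url}</a>});
    for my $dir (@dirs) {
        $url .= "$dir/";
        # Fall back to the directory name when no caption was supplied.
        my $caption = exists $CAPTION{$url} ? $CAPTION{$url} : $dir;
        push @links, qq{<a href="$url">$caption</a>};
    }
    return join ' | ', @links;
}

print toolbar_for('/network/dns/edit.cgi'), "\n";
```

Since the toolbar depends only on where the script sits in the tree, it can be computed once at build time and baked into the script, which is why moving a script only requires re-running the processor.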

(In addition, the macros can be used to execute arbitrary chunks of Perl code on the CGI scripts -- but that's another matter that I won't go into for now.)

The main benefit of a system like this is ease of use. Want to move a form-based dialog somewhere else in your application? You simply move a perl script elsewhere in the directory tree, and edit the configuration files in its initial and final directories. (In the next version even this will not be necessary, once I rig the macro processor to get its initialization information from the scripts themselves.) I've been able to add several forms to a big application, and restructure it, in about five minutes flat. This simply isn't possible using conventional multi-form state-tracking CGI scripts (as I'll describe in another article).

Configuration management

Configuration management is the process of keeping the many files that go into an application in step with one another. Accidentally overwriting a recent version of some critical common library with an older version can trigger a cascade of small disasters elsewhere in an application; and that's just one example of what can go wrong when you don't have an adequate version control system to manage different releases of your files. Textbooks have been written on this topic, and for good reason. In a nutshell, it boils down to this:

Version control
The process of keeping track of changes made to your files as you work on them, add functionality, and make modifications.
Configuration management
The process of keeping different programs and libraries interoperable; of ensuring that you can track which version of program A calls what version of library B, so that whenever you (or someone else) need a copy of A you know that they also have access to the right version of B to work with it.

I have a panacea that I unreservedly recommend if you're working on a small project. It's called XEmacs (formerly known as Lucid Emacs): a development of GNU Emacs, optimized for X Windows. If you are developing CGI scripts on a UNIX platform, you've really got a choice of two text editor families: vi family editors and emacs family editors. I strongly dislike Emacs keystrokes (I'm convinced that their bad ergonomics is responsible for the prevalence of repetitive strain injuries among programmers), and until relatively recently I've been using Elvis 2.0. (Under X11, Elvis gives you a modern vi environment -- with goodies like online help, multiple buffers and windows, and syntax colouring for Perl.) But I'm pretty much convinced that, with viper-mode (a vi keystroke emulation mode) and cperl-mode (an advanced Perl editing mode), XEmacs offers a better solution for editing CGI scripts. And the killer is VC -- Emacs' built-in version control interface.

VC is, simply put, one of the easiest version control interfaces I've ever met. It sits on top of GNU RCS and provides Emacs with a version control menu. It's documented thoroughly in the O'Reilly book Learning GNU Emacs. It's a life-saver: you just get into the habit of periodically checking in, then checking out again, whatever file you're working on (preferably at a stage when it's known to be working), so that if something goes horribly wrong you can roll back to an earlier version.

Configuration management is somewhat harder. Unless you want to go the whole hog and use a system like CVS (which is overkill for small projects, and has a somewhat steep learning curve), you might want to experiment with embedding RCS version strings in Perl variables and having your applications check their libraries for appropriateness at load time. This should, at least, ensure that your program won't try to run with the wrong version of a support library underneath it -- giving you a chance to notice and correct the error, before it subtly corrupts your data. I confess I usually take a cavalier approach to configuration management -- mostly because I don't usually work on CGI projects which are beyond the grasp of a single programmer. Cooperative development has other priorities, and if you need to work with more than one other programmer I strongly recommend using an industrial strength configuration management tool as well as version control systems for managing individual files.
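
Here is a sketch of that load-time check. It assumes each library keeps an RCS $Revision$ keyword in a package variable; the variable name RCS_REVISION, the AdminLib package, and the revision numbers are all invented for the example.

```perl
use strict;
use warnings;

# A library carries its revision in a package variable. RCS rewrites the
# text between the dollar signs at check-in; "1.12" is an invented value.
package AdminLib;
our $RCS_REVISION = '$Revision: 1.12 $';

package main;

# Compare two dotted revision numbers numerically, field by field, so
# that 1.12 sorts after 1.9.
sub revision_cmp {
    my @left  = split /\./, $_[0];
    my @right = split /\./, $_[1];
    while (@left || @right) {
        my $l = @left  ? shift @left  : 0;
        my $r = @right ? shift @right : 0;
        return $l <=> $r if $l != $r;
    }
    return 0;
}

# Die unless $package declares a revision of at least $minimum.
sub require_revision {
    my ($package, $minimum) = @_;
    my $string = do { no strict 'refs'; ${"${package}::RCS_REVISION"} };
    my ($have) = defined $string ? $string =~ /\$Revision:\s*([\d.]+)/ : ();
    die "$package carries no RCS revision string\n" unless defined $have;
    die "$package is at revision $have, need $minimum or later\n"
        if revision_cmp($have, $minimum) < 0;
    return $have;
}

# An application would run a check like this right after loading the library.
print "AdminLib revision ", require_revision('AdminLib', '1.9'), "\n";
# prints: AdminLib revision 1.12
```

The check only proves the library is new enough, not that it is compatible; it is a tripwire for the "older copy overwrote a newer one" accident, not a substitute for a real configuration management tool.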

The working system

The Web administration interface to WebShield took perhaps twelve programmer-weeks of full-time work to write. It ran to some 38 CGI scripts, 11 libraries, and two support and development tools. In total, it took some 7500 lines of Perl code (not including comment blocks).

The fact that such a system works at all -- much less, as part of a solid, shipping, commercial product -- speaks volumes about the importance of good design in building CGI applications. It also tells a lot about the utility of Perl and the Web as an environment for rapid application development. Finally, it indicates the usefulness of the approach I've described above: split the application into lots of small dialogs that don't pass complex state information between them.

As a side-note, I've just completed a CGI script that does pass hidden variables and state information back and forth among about four different forms. (It's intended for an application in which users may enter data in several different categories, all of which affect one another, before they can commit any changes -- hence the requirement for state transitions.) The increase in complexity was astonishing, and somewhat frightening: it reads like spaghetti code, because, as a state machine, it is spaghetti. (Remember, at the heart of any state machine is a transition diagram, which is best modeled in high-level software as a whole squirming bundle of GOTO statements. Ugly and hard to maintain, but efficient.) At a guess, if I'd had to build the WebShield interface in the same manner, it would have taken more than six months (and I'd still be hard at work on it).
