Chapter 5: Introducing Perl

CGI Scripting in Perl




Perl can be used to implement CGI scripts to automate form handling. In this section, we will see a simple CGI handler written in Perl. (This version is specific to UNIX; the MacHTTP server uses a rather different mechanism for communicating with scripts, although bindings for MacPerl are available.)

To recap: an HTTP server that supports the Common Gateway Interface exports several environment variables that can be read by scripts executed under it. Scripts usually reside in a designated directory (often called cgi-bin) under the server's home directory, although they can be located elsewhere. They can be invoked by HTTP requests; usually either a GET request, with an appended query string, or a POST request, with the body of the query in the body of the request. (Note that you can also trigger a script by sending a HEAD request, or using some other non-typical request method -- this frequently causes badly written scripts to fall over.)

The first thing a CGI script in Perl needs to do is to establish what query method is being used, and to obtain the query text. As we have seen, Perl can read the UNIX environment as if it were an associative array, %ENV. For example, $ENV{"foo"} contains the value of the variable named "foo". The CGI specification requires that several environment variables be set; the main ones are:

REQUEST_METHOD: Either GET or POST -- the method used to submit the request. (Might also be a HEAD request, but such requests do not usually require any additional processing.)

QUERY_STRING: If a GET request was submitted, this variable contains the text of the URL following the initial "?". (The part before the "?" points to the CGI script; the part following it consists of a query.)

CONTENT_LENGTH: If a POST request was submitted, this variable contains the length (in bytes) of the query. The CGI script can then read the query from standard input. (Note that the script might hang if it tries to read more than CONTENT_LENGTH bytes.)

CONTENT_TYPE: If a POST request was submitted and this variable is set, it contains the MIME content type of the data in the request.

These are the important variables. There are some others that might be used: notably REMOTE_HOST (the host making the request), REMOTE_ADDR (the IP address of the remote host), REMOTE_USER (the authenticated User ID of the user making the request), and AUTH_TYPE (the authentication method used to validate the user). However, we will see some discussion of these variables later; they are not necessary in this example.

if ($ENV{'REQUEST_METHOD'} eq "GET") {

	$request = $ENV{'QUERY_STRING'};

} elsif ($ENV{'REQUEST_METHOD'} eq "POST") {
	read(STDIN, $request,$ENV{'CONTENT_LENGTH'}) 
		|| die "Could not get query\n";

} 

To get the query, the code fragment above first checks to see if the environment variable REQUEST_METHOD is of type GET.

If it is, it obtains the query string from the variable QUERY_STRING and stashes it in $request.

If it is not a GET request, then a check is made for a POST method. If the query has been posted, the script needs to read CONTENT_LENGTH bytes into the variable $request, from the standard input (STDIN); this is carried out by the read() function.

There is a die() call which is only executed if the read() fails.
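
The fragment above silently does nothing for a HEAD request or any other unexpected method. As a minimal sketch of one way to make that case explicit, the logic can be wrapped in a subroutine. The name get_request, and the idea of passing the environment and the input filehandle as arguments so that the code can be tried without a web server, are assumptions made for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the raw
# query text, or nothing for methods that carry no query we want to read.
sub get_request {
    my ($env, $in) = @_;          # $env: hash ref such as \%ENV; $in: input filehandle
    my $method = $env->{'REQUEST_METHOD'} || '';

    if ($method eq 'GET') {
        my $query = $env->{'QUERY_STRING'};
        return defined $query ? $query : '';
    }
    if ($method eq 'POST') {
        my $length = $env->{'CONTENT_LENGTH'} || 0;
        return '' if $length == 0;            # nothing was posted
        my $request = '';
        my $got = read($in, $request, $length);
        die "Could not get query\n" unless defined $got && $got == $length;
        return $request;
    }
    return;                                   # HEAD or anything else
}
```

In a real CGI script the call would look like get_request(\%ENV, \*STDIN); a caller can then test the result with defined() to decide whether there is a query to process.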

Now we have the query, what can we do with it?

A typical complete query URL looks something like this:

/cgi-bin/dog-bite-report.pl?size=medium&bite=bad

This can be divided into two parts: the CGI script itself:

/cgi-bin/dog-bite-report.pl

and its parameters, the query:

size=medium&bite=bad

Variable=Value pairs are separated by ampersands "&". Non-alphanumeric characters are represented by their hexadecimal value (as %nn) in the Latin-1 codeset; for example, %20 for space (ASCII/Latin-1 character 32 in decimal, 20 in hex). Note that a space may also arrive encoded as a plus sign "+"; this is done by form submissions as well as by the (obsolescent) ISINDEX tag.
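
To see what this encoding looks like from the sending side, here is a minimal sketch of an encoder. The subroutine url_encode is hypothetical and deliberately simplified: it escapes every character that is not a letter, digit, or space, whereas real browsers leave a few punctuation characters unescaped.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a string roughly as it would appear in a query.
sub url_encode {
    my ($text) = @_;
    # Replace each character other than a letter, digit, or space with %nn,
    # where nn is its character code as two uppercase hex digits.
    $text =~ s/([^A-Za-z0-9 ])/sprintf("%%%02X", ord($1))/ge;
    $text =~ s/ /+/g;                         # spaces become "+"
    return $text;
}

print url_encode("very bad & sore"), "\n";    # prints very+bad+%26+sore
```

The decoding steps discussed in this section simply reverse these two substitutions.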

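So we need to split the query into its variable=value pairs, decode the "+" and %nn escapes, and store the results somewhere convenient.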

Typically, Perl scripts use an associative array to store the parameters to a CGI script; it's a convenient format, because you can store an arbitrary range of named values and check to see if they exist, and if so, what their value is.

@parameter_list = split(/&/,$request);      # split string on '&' characters
foreach (@parameter_list) {                 # foreach variable=value pair
    s/\+/ /g;                               # replace "+" with a space
    s/%([0-9A-F][0-9A-F])/pack("c",hex($1))/ge;
                                            # COMPLEX: replace any two hex digits
                                            # preceded by a percent sign with
                                            # the character they encode. See the
                                            # perl "pack()" function for details
}

Firstly, we chop the query string up, dividing it on each &, and placing the results in an array of strings called @parameter_list.

Secondly, we loop over the array. Each element in the array is in turn implicitly assigned to $_ (the current workspace), and the substitution commands in the loop are applied to $_.

The first substitution is relatively simple. s/\+/ /g means, "substitute all literal "+" signs, replacing them with a space. Repeat globally." The "g" modifier ensures that if there is more than one "+" in the workspace, all of them will be processed. The backslash escape before the "+" is required because the "+" symbol is normally a regular expression metacharacter (meaning "one or more of the preceding item"); its special meaning must be removed in order to make the s/// command look for it as a literal character.

The second substitution is a lot more ambitious. Let's look at it again:

s/%([0-9A-F][0-9A-F])/pack("c",hex($1))/ge;

The first part, the string we are searching for, is a regular expression consisting of a percent sign followed by two hexadecimal digits (the numerals 0-9 and the letters A through F). The parentheses group the two characters following the "%" sign, and they can be referred to later in the same search/replace operation as $1, because this is the first regular expression group to be encountered in the pattern.

(If we had the search expression /%([0-9A-F])([0-9A-F])/, the first character would be $1 and the second would be $2.)

The "replace" part of the substitution is more complex. In this case, the pack() function is called, taking as an argument the numeric value of the two hexadecimal digits matched in the search expression. hex() converts a string of hexadecimal digits to the number it represents; pack() is used to take a value and pack it into some sort of data structure. In this example, the "c" template tells pack() to store the number as a single char -- a one-character string. (Strictly, "c" is a signed char and "C" an unsigned one; for character codes below 128 the result is the same.)

The /ge at the end of the expression tells Perl to apply the substitution globally (g) to the string, rather than stopping at the first match, and to evaluate (e) the replacement as a Perl expression, rather than treating it as a literal replacement string.

This is about as hairy as Perl regular expressions ever get. If you understood it, consider yourself an expert.
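
It may help to run the pieces of that substitution one at a time. The following is a small sketch; the variable names and the sample string are chosen purely for illustration.

```perl
use strict;
use warnings;

my $n  = hex("41");           # the hexadecimal string "41" is the number 65
my $ch = pack("c", $n);       # the number 65 becomes the one-character string "A"
print "$n $ch\n";             # prints 65 A

my $text    = "dog%20bite%3Dbad";
my $decoded = $text;          # work on a copy so the original is kept
$decoded =~ s/%([0-9A-F][0-9A-F])/pack("c",hex($1))/ge;
print "$decoded\n";           # prints dog bite=bad
```

Each time the pattern matches, $1 holds the two digits just found ("20", then "3D"), and the result of the pack() call is what gets substituted in.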

Now we have an array, @parameter_list, containing name=value pairs for each parameter passed in the GET or POST request. How do we turn that into a neat, handy, associative array?

foreach (@parameter_list) {
    ($name, $value) = split(/=/);
    $passed{$name} = $value;
}

This little bit of code iterates over the variable=value pairs in @parameter_list. Each entry in @parameter_list is loaded into $_ in turn. It then splits the entry on the "=" sign, stashing the two halves in $name and $value. Finally, we add the entry to the associative array %passed.

If we check the value of $passed{"something"}, it will either contain nothing (indicating that no variable called something was passed in the HTTP request), or it will return the value associated with something. So the array %passed is effectively an index to the variables passed to our script.
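
"Contains nothing" can mean two different things: the name was never passed, or it was passed with an empty value (as in bite=). Perl's exists() distinguishes the two. The sketch below uses a hypothetical subroutine, describe, and a hand-built hash standing in for %passed.

```perl
use strict;
use warnings;

# Hypothetical helper: report how a named parameter arrived.
sub describe {
    my ($params, $name) = @_;
    return "not passed" unless exists $params->{$name};
    my $value = $params->{$name};
    return "passed, but empty" if !defined $value || $value eq '';
    return "passed: $value";
}

# Stand-in for %passed after parsing the query size=medium&bite=
my %passed = (size => "medium", bite => undef);

print describe(\%passed, "size"), "\n";      # prints passed: medium
print describe(\%passed, "bite"), "\n";      # prints passed, but empty
print describe(\%passed, "colour"), "\n";    # prints not passed
```

The undef for bite reflects what split(/=/) does with the entry "bite=": it returns only the name, so $value is left undefined.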

Of course, there are some potential pitfalls in the code listed above. In particular, the split(/=/) command runs the risk of going awry if one of the hexadecimal escapes in the entry being split decodes to an equals sign -- after decoding there would be an item like foo=bar=quux in @parameter_list, and this would not fit neatly into our final associative array. Therefore, it makes sense to do the splitting before we expand the characters. Putting it all together we get something like this:

if ($ENV{'REQUEST_METHOD'} eq "GET") {                 # if we're receiving a GET
    $request = $ENV{'QUERY_STRING'};                   # the request is passed in the
                                                       # environment variable
                                                       # QUERY_STRING
} elsif ($ENV{'REQUEST_METHOD'} eq "POST") {           # if receiving a POST request
                                                       # the length of the posted data
                                                       # is passed in CONTENT_LENGTH,
                                                       # and it is read from stdin
    read(STDIN, $request,$ENV{'CONTENT_LENGTH'})
        || die "Could not get query\n";
}
@parameter_list = split(/&/,$request);                 # split the parameters up
foreach (@parameter_list) {                            # for each variable=value pair
    ($name, $value) = split(/=/);                      # split it into name and value
    $name =~ s/\+/ /g;                                 # replace "+" with spaces
    $name =~ s/%([0-9A-F][0-9A-F])/pack("c",hex($1))/ge;
                                                       # replace %nn with characters
    $value =~ s/\+/ /g;                                # repeat for the value ...
    $value =~ s/%([0-9A-F][0-9A-F])/pack("c",hex($1))/ge;
    $value =~ s/:/\\:/g;                               # escape colons in the value
    if (!defined $passed{$name}) {                     # first value for this name
        $passed{$name} = $value;
    } else {                                           # repeated name: append
        $passed{$name} .= ":$value";
    }
}

This has the advantage of saving us an extra pass over the parameter list, although it would have to be a remarkably huge CGI request for this to affect the performance of the average server.
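
The parsing half of the listing can be tried out without a web server by wrapping it in a subroutine and feeding it a query string directly. This is a sketch under a few assumptions: the name parse_query is hypothetical, the split is limited to two fields so that a stray "=" in a value survives, and a missing value is treated as an empty string. Colon escaping is left out to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the parsing steps; returns a hash of parameters.
sub parse_query {
    my ($request) = @_;
    my %passed;
    foreach my $pair (split(/&/, $request)) {
        next if $pair eq '';                          # skip empty entries ("a=1&&b=2")
        my ($name, $value) = split(/=/, $pair, 2);    # split before decoding
        $value = '' unless defined $value;            # "name" or "name=" with no value
        foreach ($name, $value) {                     # decode both halves in place
            s/\+/ /g;
            s/%([0-9A-F][0-9A-F])/pack("c",hex($1))/ge;
        }
        if (!defined $passed{$name}) {
            $passed{$name} = $value;
        } else {
            $passed{$name} .= ":$value";
        }
    }
    return %passed;
}

my %passed = parse_query("size=medium&bite=very+bad&note=a%3Db");
print "$_ => $passed{$_}\n" foreach sort keys %passed;
# prints:
# bite => very bad
# note => a=b
# size => medium
```

Because the split happens before decoding, the encoded "=" in note=a%3Db ends up in the value rather than confusing the split.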

Note the final test inside the loop:

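$value =~ s/:/\\:/g;                # escape any colons in the value itself
if (!defined $passed{$name}) {      # first value seen for this name
    $passed{$name} = $value;
} else {                            # repeated name: append after a colon
    $passed{$name} .= ":$value";
}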

This covers the possibility that our query string includes several values for one variable. For example, a multiple selection list might return something like:

files=first.txt&files=third.txt&files=fourth.txt

In this code fragment, if the %passed array doesn't contain a value for the key files, one is created; if the variable already exists, the new value is tagged onto the end of it, separated by a colon. (If $value contains a colon, the colon is replaced by "\:", to prevent confusion. It also covers backslash-escaped colons by additionally escaping them, like "\\:".) To get the list of values out of a variable (named by $key) like this and into an array, you do something like:

@parameters = split(/(?<!\\):/, $passed{$key});   # split on colons not preceded by "\"
foreach (@parameters) {
    s/\\:/:/g;                                    # restore the escaped colons
}
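
As a check that the colon scheme round-trips, the two fragments can be paired up as subroutines. This is a sketch only: the names add_value and get_values are hypothetical, every value is escaped before it is stored (including the first), and the split uses a negative lookbehind, (?<!\\), so that the character in front of a separating colon is not consumed. Lookbehind needs a reasonably recent Perl 5.

```perl
use strict;
use warnings;

# Hypothetical helper: store one more value under $name, colon-separated.
sub add_value {
    my ($passed, $name, $value) = @_;
    $value =~ s/:/\\:/g;                       # escape literal colons in the value
    if (!defined $passed->{$name}) {
        $passed->{$name} = $value;
    } else {
        $passed->{$name} .= ":$value";
    }
}

# Hypothetical helper: recover the list of values stored under $key.
sub get_values {
    my ($passed, $key) = @_;
    my @parameters = split(/(?<!\\):/, $passed->{$key});   # unescaped colons only
    foreach (@parameters) {
        s/\\:/:/g;                             # restore escaped colons
    }
    return @parameters;
}

my %passed;
add_value(\%passed, "files", $_) foreach ("first.txt", "c:third.txt", "fourth.txt");
print join(" | ", get_values(\%passed, "files")), "\n";
# prints first.txt | c:third.txt | fourth.txt
```

One case this simple scheme still cannot represent unambiguously is a value that ends in a backslash, since the backslash would then sit directly in front of a separating colon.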

