Chapter 5: Introducing Perl

Regular Expressions




Perl lets you do just about everything to an input line that you can do in awk, and a few extra things besides. It does this by using regular expressions, a compact notation for representing patterns of characters. Several functions are provided for working on regular expressions:
tr/// translates character ranges
s/// performs sed-like string editing
=~ binds a pattern to a variable

By default, everything happens to $_, although you can specify operations on other variables.

For example:

$fred = "This is a sample line of text";
$fred =~ s/of text/of green elephants/; 
print $fred;

prints:

This is a sample line of green elephants

The command s/of text/of green elephants/ replaces the first occurrence of the expression "of text" with the expression "of green elephants". (To replace every occurrence, add the g modifier: s/of text/of green elephants/g.) Both of these regular expressions simply represent the characters they consist of.

By default, the s/// command is applied to $_, but by using $fred =~ (command) we have bound the effects of the pattern-matching command to $fred.
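As a minimal sketch of these points, the following compares a plain s/// with one that uses the g modifier, and then runs s/// with no binding so that it works on $_. The variable names ($first, $all, $default) and the sample strings are illustrative only.

```perl
use strict;
use warnings;

# Without the g modifier, s/// replaces only the first match.
my $first = "one fish, two fish";
$first =~ s/fish/cat/;

# With the g modifier, every match is replaced.
my $all = "one fish, two fish";
$all =~ s/fish/cat/g;

# With no =~ binding, s/// is applied to $_.
local $_ = "red fish";
s/red/blue/;
my $default = $_;

print "$first\n";    # one cat, two fish
print "$all\n";      # one cat, two cat
print "$default\n";  # blue fish
```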

We can do more than apply simple string operations. Perl recognizes UNIX-style regular expressions.

A regular expression is a pattern that matches a string of characters, rather than a literal string of characters. That is, it is a kind of template for text; it describes the form the text must take, rather than the exact content. A simple text search, such as /food/, must match the letters "f", "o", "o", "d", in that order. But a regular expression can specify something like the letter "f", followed by one to three "o"s, then some other terminating letter (other than a "z"). For this reason, regular expressions make it easier to search for something where you don't know precisely what the word is, but know part of it (or how it is formed).

Regular expressions are formed using ordinary characters, some of which are given special meanings. Thus, the string "food" can be used to make an exact text search, matching the letters "f", "o", "o", "d", even though it is interpreted as a regular expression. The letters simply represent themselves. However, a number of punctuation characters have special meanings. If you really want to search for a literal asterisk, for example, you can override its special meaning by "escaping" it with a preceding backslash "\".


Regular expressions in a nutshell:

In the beginning, a special meaning was assigned to the full stop ".". It meant, "any character goes here".

For example, sh.t can stand for shot or shut (or a couple of other words that don't belong in a reference book).

Because this wasn't much use for matching a real full stop, a convention was agreed: the backslash "\" would henceforth mean "remove the special meaning of the next character". So \. means "remove the special meaning of the following "." metacharacter", in other words, match a ".".

So sh\.t can stand only for the literal text sh.t.
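Here is a small sketch of the difference between the two patterns. The word list is made up for illustration, and the patterns are wrapped in ^ and $ (which anchor a match to the start and end of the string) so that each one has to match the whole word.

```perl
use strict;
use warnings;

my @candidates = ("shot", "shut", "sh.t", "shoot");

# An unescaped "." matches any single character, including a real full stop.
my @any_char = grep { /^sh.t$/ } @candidates;

# An escaped "\." matches only a literal full stop.
my @literal = grep { /^sh\.t$/ } @candidates;

print "sh.t  matches: @any_char\n";   # shot shut sh.t
print "sh\\.t matches: @literal\n";   # sh.t
```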

"\\" means "match a literal backslash".

The "*" was next to be picked on. It meant, "zero or more of the preceding character". So ".*" is short for "zero or more of anything".

For example, sh.t can only match one random character (shot, etcetera), but sh.*t can match sht or shot, or shoot, or even shoooooot.

Because you need to type "..*" to match "one or more of anything", an abbreviation was added: "+", which means "match one or more of the preceding character". (So "..*" is roughly equivalent to ".+", although there are some subtle differences I won't go into here.)

For example, fo+d matches fod, food, foood, and so on, but fo*d can also match fd (by matching "f", "no o's", "d").

You can specify a precise number of matches. {n} means "match precisely n occurrences of the preceding expression", while {m,n} means "match from m to n occurrences of the preceding expression". For example:

e{1}

matches a single "e", on its own, while:

e{1,3}

matches one to three e's.
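The repetition operators described so far can be tried out with a sketch like the one below. The word list is invented for the purpose, and each pattern is wrapped in ^ and $ (start and end of the string) so that it must account for the whole word rather than just part of it.

```perl
use strict;
use warnings;

my @words = qw(sht shot shoot fd fod food foood);

my @dot   = grep { /^sh.t$/     } @words;  # exactly one character in the middle
my @star  = grep { /^sh.*t$/    } @words;  # zero or more characters in the middle
my @plus  = grep { /^fo+d$/     } @words;  # one or more "o"s
my @range = grep { /^fo{1,2}d$/ } @words;  # one or two "o"s

print "sh.t     : @dot\n";    # shot
print "sh.*t    : @star\n";   # sht shot shoot
print "fo+d     : @plus\n";   # fod food foood
print "fo{1,2}d : @range\n";  # fod food
```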

Expressions can be grouped by putting them in round brackets (parentheses). You can match any of a set of grouped expressions by separating them with a vertical bar character (which means "or"):

(red)|(blue)|(green)

matches any of those sub-expressions.
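As a quick sketch, the pattern can be used to pick the matching words out of a list. The list of colours is illustrative; the extra outer parentheses and the ^ and $ anchors are there so that the alternation has to match a whole word.

```perl
use strict;
use warnings;

my @colours = qw(red blue green yellow);

# Keep only the words that are exactly one of the three alternatives.
my @matched = grep { /^((red)|(blue)|(green))$/ } @colours;

print "@matched\n";   # red blue green
```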

Finally, the idea of matching one of a set of characters seemed handy - so the square brackets "[..]" were co-opted. Put some characters in square brackets, and they will match any one character (as long as it's one of them). For example:

[abdef]

matches "a" or "f" but not "c".

A range of characters can be represented in brackets: for example:

[a-ds]

means "match any character from "a" to "d" inclusive, and "s", but nothing else".

By putting a "^" at the start of a character set, you can negate its meaning. For example:

[^a-ds]

matches any character not in "[a-ds]".
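The two character sets above can be compared side by side with a sketch like this one. The list of single letters is made up for the example, and ^ and $ outside the square brackets anchor the match to the whole string.

```perl
use strict;
use warnings;

my @letters = qw(a b c d e s t);

# [a-ds] accepts "a" to "d" and "s"; [^a-ds] accepts everything else.
my @in_set     = grep { /^[a-ds]$/  } @letters;
my @not_in_set = grep { /^[^a-ds]$/ } @letters;

print "in set:     @in_set\n";      # a b c d s
print "not in set: @not_in_set\n";  # e t
```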

In addition, Perl recognizes the following extra expressions:
\w any alphanumeric character (including "_")
\W any non-alphanumeric character
\b a word boundary, that is, the position between a \w character and a \W character (within a character class, \b represents a backspace character rather than a word boundary)
\B any position that is not a word boundary
\s any whitespace character
\S any non-whitespace character
\d any digit
\D any non-digit
^ matches the beginning of a line (if it occurs unescaped and outside square brackets)
$ matches the end of a line (if it appears unescaped)
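A few of these shorthand expressions are sketched below on an invented sample line. The g modifier, used in list context, returns every match rather than just the first; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = "Order 66 shipped on day 7";

my @numbers = $line =~ /\d+/g;       # every run of digits
my @words   = $line =~ /\b\w+\b/g;   # every run of word characters

my $starts_with_order = ($line =~ /^Order/) ? 1 : 0;  # anchored at the start
my $ends_with_digit   = ($line =~ /\d$/)    ? 1 : 0;  # anchored at the end

# Squeeze each run of whitespace down to a single space.
(my $squeezed = "a   b \t c") =~ s/\s+/ /g;

print "numbers: @numbers\n";             # 66 7
print "word count: ", scalar(@words), "\n";  # 6
print "squeezed: $squeezed\n";           # a b c
```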

Built-in variables

Perl has a huge number of built-in variables, like awk. Some of these are useful for setting up the internal configuration of a Perl program; for example, it is possible to use $[ to set the base subscript of an array. Normally, $[ is 0, but for compatibility with some systems it is useful to be able to set $[ to 1. Another example is $/, which specifies the input record separator. When reading from a file, Perl reads in a record delimited by the string specified by $/. Normally this is a newline character, but it can be reset to a different value if the file being read is not a UNIX text file. Similarly, $\ sets the output record separator. Normally this variable is unset -- Perl does not automatically print a newline after each print statement -- but if you set it to "\n" Perl will automatically add a newline after each print.
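The effect of $/ and $\ can be seen in the sketch below. To keep the example self-contained it reads from and writes to in-memory file handles (opened on references to the scalars $data and $output) instead of real files; those names, and the semicolon-separated sample data, are assumptions made for the illustration.

```perl
use strict;
use warnings;

my $data = "alpha;beta;gamma";
my @records;
{
    local $/ = ";";    # input records now end at a semicolon
    open(my $in, '<', \$data) or die "cannot open in-memory file: $!";
    while (my $record = <$in>) {
        chomp $record;             # chomp strips the current value of $/
        push @records, $record;
    }
    close $in;
}

my $output = '';
{
    local $\ = "\n";   # appended automatically after every print
    open(my $out, '>', \$output) or die "cannot open in-memory file: $!";
    print $out "first";
    print $out "second";
    close $out;
}

print "records: @records\n";   # alpha beta gamma
print $output;                 # "first" and "second", each on its own line
```

Using local inside a block restores the previous values of $/ and $\ when the block ends, so the rest of the program is not affected.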

Here are some of the other, more commonly used built-in variables:
$. The current input line number of the current file handle. For example:
foreach (<>) {
    print $_;
}
print $.;

prints the standard input, then the number of lines read from it.

$_ The current workspace. By default, if you don't specify a variable most Perl commands are applied to $_.
$` The string preceding whatever was matched by the last pattern search.
$' The string following whatever was matched by the last pattern search.
$/ The input record separator. Normally a newline "\n" character.
$\ The output record separator. By default, this is undefined. It is printed after every output record.
$, The output field separator. Normally this is undefined. If you set it to some value, that value will be printed between each output field.

For example:

$, = ":";
$a = $b = $c = $d = 1;
print $a, $b, $c, $d;

results in:

1:1:1:1

$# Controls the output format for numbers. Perl has an exact equivalent of C's printf() function, and by default the simpler print function formats numbers as %.20g. You can reset this by assigning a printf-style format specifier, to change the precision with which numbers are printed. (This variable was deprecated in Perl 5 and removed in Perl 5.10; use printf instead.)
$0 Contains the name of the script being executed.
$@ The error message from the last eval command. If set, it indicates that the eval failed.

( eval($foo) executes $foo as a Perl script in its own right. You can use this as a way of trapping errors in your programs.)

@ARGV An array containing the command-line arguments to the script. For example, if the script test.pl contains:

foreach (@ARGV) {
    print "$_ ";
}

Then if you run it by typing

test.pl 1 2 3

it will print:

1 2 3

Note that @ARGV does not contain the name of the script itself; that's $0.

%ENV An associative array containing the environment for the running program. You can create new environment variables by assigning a new value in %ENV.

For example, $ENV{"REQUEST_METHOD"} returns the HTTP method used to run a CGI script from the environment (typically GET or POST).
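Two of the variables in this list are easier to understand when you see them in use, so here is a rough sketch of $` and $' after a pattern match, and of $@ after an eval. The sample strings and variable names are illustrative only.

```perl
use strict;
use warnings;

# $` and $' hold the text on either side of the last successful match.
my $text = "This is a sample line of text";
my ($before, $after) = ('', '');
if ($text =~ /sample/) {
    ($before, $after) = ($`, $');
}

# eval runs a string as Perl code; if that code dies, the message lands in $@.
my $bad_code = 'die "something went wrong\n";';
eval $bad_code;
my $error = $@;

# After a successful eval, $@ is the empty string.
my $sum = eval '2 + 3';
my $succeeded = ($@ eq '') ? 1 : 0;

print "before: [$before]\n";   # [This is a ]
print "after:  [$after]\n";    # [ line of text]
print "error:  $error";        # something went wrong
print "sum:    $sum\n";        # 5
```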



Operating systems

One of the original design goals of Perl was to replace the UNIX shells -- specifically the Bourne and Korn shells -- as the tool of choice for writing a certain kind of application. Typically, large shell scripts tend to run slowly because they rely on external programs (UNIX tools) to do most of their work.

A lot of the standard UNIX tools were written to give access to a specific low-level feature of the UNIX operating system. For example, ls lists files by reading a directory file, identifying the inode (index node) associated with each filename in the directory, then executing a stat() function call to obtain information about the inode for each file. (stat() returns information about a file, such as its length, modification time, and so on.)

Perl has all the functions needed to implement ls built in; it has a suite of functions (opendir(), readdir(), closedir() and so on) for reading directories, and it has its own version of stat() for obtaining information about files. (This is useful, because Perl can run on platforms other than UNIX, including some which don't have inodes or anything similar; stat() in Perl on these platforms simply returns the equivalent information about a file.)
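A very rough sketch of an ls-like listing built from these functions might look like the following. So that it behaves the same wherever it is run, it lists a scratch directory created with the standard File::Temp module and containing a single made-up file, hello.txt; in real use you would set $dir to the directory you want to list.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Set up a scratch directory containing one small file to list.
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>', "$dir/hello.txt") or die "cannot create file: $!";
print $fh "hello";
close $fh;

# Read the directory entries, skipping "." and "..".
opendir(my $dh, $dir) or die "cannot open $dir: $!";
my @names = sort grep { $_ ne "." && $_ ne ".." } readdir($dh);
closedir($dh);

# stat() returns a list; element 7 is the size in bytes, element 9 the
# last-modified time.
my %size;
for my $name (@names) {
    my @info = stat("$dir/$name") or next;   # skip entries that cannot be examined
    $size{$name} = $info[7];
    printf "%8d  %s  %s\n", $info[7], scalar(localtime($info[9])), $name;
}
```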

In fact, Perl has a whole plethora of low-level commands for interfacing with operating system services. Using system(), Perl can execute other programs (assuming the operating system it is running under knows how to do this). Using sysread() and syswrite() it can read and write raw data from a file, controlling the number of bytes read. Just about the entire UNIX library of function calls for handling shared memory, semaphores, locks on files, and so forth is implemented -- as are pipes and internet domain sockets. (One particular library shipped with the Perl runtime, chat2.pl, comes in handy here: it allows us to write easy text-level TCP/IP applications that connect to a port and chat with the server on that port. We'll see more of this later.)
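As a small sketch of two of these calls, the example below writes and reads raw bytes with syswrite() and sysread(), then runs another program with system(). The scratch file comes from the standard File::Temp module, and the external program is simply a second copy of the Perl interpreter (its path is held in $^X) told to exit with status 3; both choices are assumptions made to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# syswrite() returns the number of bytes actually written.
my ($fh, $path) = tempfile(UNLINK => 1);
my $written = syswrite($fh, "0123456789");
close $fh;

# sysread() reads at most the requested number of bytes and returns the
# number it actually read.
open(my $in, '<', $path) or die "cannot open $path: $!";
my ($chunk, $rest) = ('', '');
my $got  = sysread($in, $chunk, 4);
my $more = sysread($in, $rest, 100);
close $in;

# system() runs another program and returns its wait status; the exit
# code is held in the high byte.
my $status    = system($^X, "-e", "exit 3");
my $exit_code = $status >> 8;

print "wrote $written bytes\n";          # 10
print "first read:  $got ($chunk)\n";    # 4 (0123)
print "second read: $more ($rest)\n";    # 6 (456789)
print "exit code:   $exit_code\n";       # 3
```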

Perl's operating system interaction hooks are somewhat daunting; there are over a hundred such commands built-in. They do confer a major advantage: any file, process, or memory management operation that you would normally accomplish in a shell script by invoking an external program can probably be accomplished within Perl by a judicious use of the more obscure language features.

