Regular Expressions



In the preceding chapter we covered a minimal subset of the Perl language -- just enough Perl to write simple programs. Perl is a large and sometimes confusing language, and I've deliberately omitted any mention of some of its most powerful features because I wanted to keep the initial tour of the language simple and accessible. Now we've covered some fundamentals, it's time to begin introducing the more complex constructs.

In this chapter, we're going to start by examining strings -- sequences of text. First, we'll see a couple of useful tools for extracting chunks from strings. Then we'll look at regular expressions, a language-within-a-language for processing text patterns. Regular expressions can be used to search strings for specific contents, and to replace them with something else. Finally, we'll examine Perl's facilities for handling large volumes of textual (and other) data.


What's a string?

First, we need to clarify a few concepts that we haven't looked into too closely. Let's go back to that very first Perl program:

 print "hello world!\n"; 

"hello world!\n" is a string. A string is a sequence of characters: to be precise, it's zero or more characters. Unlike some languages (C, for example) Perl lets you dynamically change the length of a string: a string is as long as you want it to be, and grows or shrinks if you insert or delete characters from it. When you give a string a name, you've got a scalar variable.

Let's go a bit deeper. A ``character'' in this context means a bit more than just a letter or a number. UNIX (and most other operating systems) store data as a series of bits, clumped into groups called words. An eight-bit-long group is usually called a ``byte''; UNIX uses a word length of eight bits, so a word is equivalent to a byte, in this context. Words are used to represent text or other information: for textual purposes, each word is used to represent a binary integer, the value of which is mapped to a specific character via a codeset. A codeset is a set of numbers with corresponding character values: probably the most widely known codeset is ASCII, a seven-bit codeset that maps the values 0 to 127 onto a variety of letters, numbers, symbols, and control codes. (Modern UNIX-like systems, such as Linux or Solaris, usually use the ISO 8859/1 codeset: an extended 8-bit version of ASCII that includes the accented characters required for English and most Western European languages.)

Perl strings contain characters -- or rather, eight bit long words. Perl doesn't really mind what kind of data you store in a string; it doesn't differentiate between binary and text information. All the information Perl needs about a scalar (such as where it's stashed in the computer's memory, and how long it is) is external to the string itself, unlike C (where the end of a string is indicated by a byte with the value of zero). This means you can do interesting things like read an entire binary file into a scalar variable, open a new file handle, and spit it back out. (C programs don't like doing this kind of thing using strings -- the end-of-string character is liable to occur more than once in most binary files, fooling a naive program into prematurely truncating the file. The same goes for some other languages, such as Tcl.)
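
As a small sketch of this point (the variable names are made up for illustration), here is a string with a zero byte in the middle of it. Perl keeps track of the length separately, so the zero byte is just another piece of data:

```perl
use strict;
use warnings;

# Three bytes: "a", a zero byte, then "b".
my $data = "a\0b";

# Perl records the length separately, so the zero byte is ordinary data
# and does not end the string.
my $length = length($data);
print "length is $length\n";
```

A C routine such as strlen() would stop at the zero byte and report a length of one.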


Finding patterns

In the last chapter, we wrote a simple program that compared strings:

  if ($a eq "roast potato") {
      print "your choice of vegetable is $a\n";
  }

This is all very well, but what if $a contains, not ``roast potato'', but ``Roast potato'', or ``potatoes'', or ``fries''? Or even ``roast potato '' (note the extra space at the end)?

The ``eq'' operator is a very blunt instrument. It works more or less like this (pace the C strcmp() function):

We have two strings, string 1 and string 2. To compare them, first check their length: if they're not the same, then obviously the strings differ. If they're the same length, take the first character in each string: if they differ, obviously the words differ. If they're the same, compare the next two characters: repeat until the strings are exhausted. If we get to the end, obviously they're the same.

``eq'' only detects exact matches. It's not obvious how we can modify it to look for a substring -- for example, to see if a target scalar contains ``potato'' -- without wrapping it in a lot of loops. (And for reasons we'll examine later, lots of loops are A Bad Thing.)

Obviously, we could add more if statements, but that would get tiresome very quickly. What we need is some kind of way of recognizing an approximation of a potato -- the essence of potato-ness, so to speak. It would also be useful to be able to detect synonyms, so that ``fries'', ``spuds'', and similar terms would also work. And such an approximator should be compact -- it shouldn't require fifty lines of ``if'' statements to recognize the humble pomme de terre.

This is where regular expressions come in.

Regular expressions are a compact notation for matching patterns in text. In the example above, we're seeing if $a is identical with ``roast potato''. In the example below, we're seeing if it matches the pattern /roast potato/:

  if ($a =~ /roast potato/) {
      print "I found a roast spud in your food\n";
  }

The =~ operator is special: it's like eq, but it means ``return true if the preceding object matches the following expression''. It's a match operator: it doesn't insist, like eq, that the preceding and following expressions should be identical, just sufficiently similar.

The slashes surrounding ``roast potato'' delimit the expression. And what we're trying to match is not the literal text "roast potato"; it's a pattern consisting of the letters [r][o][a][s][t][<space>][p][o][t][a][t][o], in that order.

(Actually, the slashes are a special kind of quoting convention, indicating that the text inside them is a pattern, not a literal string. We'll examine the various quoting operators later.)

The =~ operator hunts for an 'r', then an 'o', then an 'a', and so on, through the target string. If it fails to match one of the terms of the regular expression, it doesn't simply stop and return false; it starts again, from the beginning of the expression, with the next letter in the target. So it doesn't return true only if $a is exactly the same as its search expression; it returns true if $a contains a match for the expression. $a can contain an entire menu, for all =~ cares: what it returns depends on whether it contains the target sequence of patterns.
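
To make the difference concrete, here is a minimal sketch that runs both tests over a few sample strings. The list of orders and the variable names are invented for the purpose:

```perl
use strict;
use warnings;

my @orders = (
    "roast potato",
    "roast potato ",               # note the trailing space
    "one roast potato, please",
    "Roast potato",                # capital R
);

my (@exact, @contains);
for my $order (@orders) {
    push(@exact,    $order) if $order eq "roast potato";     # identical strings only
    push(@contains, $order) if $order =~ /roast potato/;     # pattern found anywhere
}

print scalar(@exact), " exact, ", scalar(@contains), " containing\n";
```

Only the first string passes the eq test. The pattern match accepts the first three, but still rejects the capitalized version, which is the problem we tackle next.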

Of course, "roast potato" and /roast potato/ are effectively the same. In the syntax of Perl regular expressions, a letter represents itself: /a/ is a pattern that matches the character 'a'. But there are other, non-alphanumeric characters, and Perl uses some of them to represent wildcards or special expressions.

Suppose we are unsure whether our target string may be capitalized. Regular expressions give us a couple of ways to see if our potatoes are important:

  if ($a =~ /[Pp]otato/) {

  if ($a =~ /potato/i) {

In the first form, we see [Pp]. This is a character set: a special pattern that matches one character from among its members (in this instance, ``P'' and ``p''). Effectively, /[Pp]otato/ means ``match a single 'p' or 'P', followed by [o][t][a][t][o]''.

As an aside, there is a no-match operator: !~. !~ evaluates to true if the pattern following it does not match the scalar being tested.

  $a !~  /potato/

is equivalent to:

  !($a =~ /potato/)

(As you may have gathered, the exclamation mark in Perl is usually pronounced ``NOT''!)
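
A quick way to convince yourself is to evaluate the no-match operator alongside an ordinary match with a NOT in front of it. This is only a sketch, and $dish is a made-up variable:

```perl
use strict;
use warnings;

my $dish = "boiled rice";

# The no-match operator...
my $with_operator = ($dish !~ /potato/)    ? "true" : "false";

# ...and an ordinary match, negated as a whole.
my $with_not      = (!($dish =~ /potato/)) ? "true" : "false";

print "operator: $with_operator, negated match: $with_not\n";
```

Both tests come out true for a dish with no potato in it.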

NOTE: sets
A set consists of one or more characters. For example:

  [abc]

is the set of the characters ``a'', ``b'', and ``c''. As a regular expression it doesn't match the string [a][b][c], it matches any one of its members.

Suppose we want to talk about the set of all lowercase letters. We can write this as:

  [abcdefghijklmnopqrstuvwxyz]

but there's a compact notation, using the range operator:

  [a-z]

The dash means ``everything from 'a' to 'z' in the current codeset''.

(Of course, it is possible to have a dash in a set of characters, because '-' is itself a character. To do this, it should be the first character in the set: [-ab] is the set ``[-] or [a] or [b]''.)

We can also specify inverse sets. To match any character except [a] or [b] or [c], we do this:

  [^abc]

The leading caret ``^'' denotes that the set contains every permissible character except the ones specified.

(If you want to be picky and write a set containing a caret and a dash, just do it like this: [-^]. Both symbols are positionally sensitive in the set notation; a ``-'' loses its special meaning if it appears at the beginning of a set, and a ``^'' only has a special meaning when it appears at the beginning.)

Sets let us write compact specifications of the form of a string. For example:

  [A-Za-z0-9_]

is the set of all characters that are legal in Perl variable names.

To represent unprintable characters (those that aren't mapped to letters, numbers, or punctuation by your computer's codeset), we use their octal (base eight) value, preceded by a backslash and a zero: for example, the character with decimal value 32 (usually a space character in ASCII), has the octal value 40; to represent it in a set we would write it as \040. So we could write the set of all characters from ASCII 0 to 32 (decimal) as:

  [\000-\040]

(We can also represent characters by their hexadecimal value, using a leading \x ; for example, ASCII 32 (decimal) is 20 in hexadecimal (base sixteen), so can be represented as \x20.)

It's important to remember the backslash ``\''; it means either ``ignore the special meaning of the following character'', or, ``the following character is an escape sequence''. Two backslashes \\ mean ``one backslash literal''; a backslash preceding a slash (\/) means ``one literal slash'' rather than the usual pattern delimiter.

Finally, there is one special set that's worth remembering: the set of all characters is represented by a period ``.''. A period matches anything at all with one exception: the newline. (This is a bit of a complex topic: see ``Newline handling'' elsewhere in this chapter.)
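
The set notation is easy to try out on single-character strings. The following sketch (all the variable names are invented) sets each variable to 1 if the match succeeds and 0 if it fails:

```perl
use strict;
use warnings;

my $in_range   = ("q"  =~ /[a-z]/)       ? 1 : 0;   # a lowercase letter is in the range
my $not_range  = ("Q"  =~ /[a-z]/)       ? 1 : 0;   # an uppercase letter is not
my $inverse    = ("d"  =~ /[^abc]/)      ? 1 : 0;   # anything except a, b or c
my $dash       = ("-"  =~ /[-ab]/)       ? 1 : 0;   # a leading dash is a literal dash
my $control    = ("\t" =~ /[\000-\040]/) ? 1 : 0;   # tab (ASCII 9) lies between 0 and 32
my $hex_space  = (" "  =~ /[\x20]/)      ? 1 : 0;   # a space, written in hexadecimal
my $dot_letter = ("x"  =~ /./)           ? 1 : 0;   # the period matches a letter
my $dot_nl     = ("\n" =~ /./)           ? 1 : 0;   # but not a newline

print "$in_range $not_range $inverse $dash $control $hex_space $dot_letter $dot_nl\n";
```

Every test succeeds except the second (``Q'' is not lowercase) and the last (a period does not match a newline).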

In the second form, we see a pattern modifier; the modifier /i (case insensitive). All the letters in the pattern are tested for either uppercase or lowercase versions. We could equally well write:

  if ($a =~ /POTATO/i) {

because /POTATO/i and /potato/i are equivalent.

NOTE: modifiers
The meaning of a regular expression can be modified by special flags that follow their end delimiter (trailing slash). The five commonest flags are:

/i
Case insensitive. Letters in the expression are matched for both uppercase and lowercase occurrences.

/x
Extend legibility. Normally, a space in a regular expression matches a space character in the target string; a tab or other whitespace character also acts as a pattern. When the /x modifier is applied, whitespace in the expression is ignored, unless it is preceded by an escape (backslash) character. (So the pattern /roast potato/ is equivalent to /roast\ potato/x.) Also, the # character is treated as a comment marker rather than a character to match.

The point of this is that it becomes possible to insert tabs, spaces, carriage returns, and comments inside regular expressions. Big regular expressions rapidly become illegible without space and comments; using the /x modifier lets us re-write for readability.

/s
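Treat the string as a single line. Normally the period ``.'' matches any character except a newline; with /s it matches newlines too, so a pattern such as /.+/ can run across line breaks.

/m
Treat the string as multiple lines. The ^ and $ positional expressions match at the beginning and end of every line within the string, not just at the beginning and end of the whole string.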

/g
Match globally. The =~ operator returns true after the first successful match: this modifier forces the pattern to be matched repeatedly until no more matches are available in the target string. (This is important in search/replace operations rather than simple matching).

See the perlop and perlre man pages for further details and additional obscure modifiers.
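
Here is a short sketch that exercises each of the five modifiers against a two-line string. The sample text and variable names are invented; each flag variable is 1 for a match and 0 for a failure:

```perl
use strict;
use warnings;

my $text = "Roast\nPOTATO";    # two lines in one scalar

my $case_i    = ($text =~ /potato/i)       ? 1 : 0;   # /i ignores case
my $dot_plain = ($text =~ /Roast.POTATO/)  ? 1 : 0;   # "." refuses the newline
my $dot_s     = ($text =~ /Roast.POTATO/s) ? 1 : 0;   # with /s, "." accepts it
my $bol_plain = ($text =~ /^POTATO/)       ? 1 : 0;   # ^ means start of string
my $bol_m     = ($text =~ /^POTATO/m)      ? 1 : 0;   # with /m, also start of a line

# /x lets us lay the pattern out with spaces and comments.
my $spaced = ($text =~ / roast    # first word
                         \s       # any whitespace, including the newline
                         potato   # second word
                       /ix) ? 1 : 0;

# In list context, /g returns every match rather than just the first.
my @all_a = ("banana" =~ /a/g);

print "$case_i $dot_plain $dot_s $bol_plain $bol_m $spaced ", scalar(@all_a), "\n";
```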

We can also look for alternative expressions, such as ``fry'', ``fries'', or ``spud''. For example:

  if ($a =~ /(potato)|(fries)|(spud)/i ) {

The brackets in the expression don't match literal brackets in the target string; they group sub-expressions. ``potato'', ``fries'', and ``spud'' are separate sub-expressions; the bar ``|'' between each sub-expression means OR. So this version is looking for ``potato'' or ``fries'' or ``spud'', in a case-insensitive manner.

``fries'' isn't going to match ``fry'' or ``french fry'' or ``french fries''. So we need to concentrate on generalizing this expression a bit further. We're going to use the /x modifier, so we can split the expression up to make it a bit more readable; then we're going to look for either ``fry'' or ``fries'' with an optional preceding ``french'':

  if ($a =~ / potato                   | # literal "potato"
              (french\ )*fr(y|ies)+    | # pattern for french fries
             spuds*
             /ix)  { ...

What does this mean?

The first sub-expression ``potato'' is fairly clear. It's followed immediately by ``|'', meaning ``or''. The second clause bears some explanation:

            (french\ )*fr(y|ies)+    | # pattern for french fries

First, we look for the sub-expression ``french ''. This is followed by ``*'', which means ``match the preceding expression zero or more times''. Because we bracketed ``french '', it applies to the whole word, not just the previous character (`` ''). Then we go on to match an [f], an [r], then a sub-expression: [y] or [i][e][s] followed by ``+''. The ``+'' expression means ``match the preceding expression one or more times''.

So what we're looking for is:

        "french " zero or more times, 
  then  "fr",
  then  "y" or "ies" one or more times
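
To see the whole expression at work, we can run it over a short menu. This is only a sketch: the menu items and the @hits array are made up for the occasion.

```perl
use strict;
use warnings;

my @menu = ("Potato salad", "french fries", "a single fry", "two spuds", "boiled rice");

my @hits;
for my $item (@menu) {
    if ($item =~ / potato                  | # literal "potato"
                   (french\ )*fr(y|ies)+   | # optional "french ", then fry or fries
                   spuds*                    # "spud" plus any number of "s"
                 /ix) {
        push(@hits, $item);
    }
}

print join("; ", @hits), "\n";
```

Everything on the menu is recognized as a potato dish except the boiled rice.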

NOTE: Controlling the number of matches
The ``+'' and ``*'' expressions (one or more and zero or more) are examples of special expressions that govern how many times a match is made. They apply to the immediately preceding expression -- be it a literal character, a set, or a group in brackets -- and act as quantifiers.

Of course, they're not perfect; and we can run into some pathological cases. Take the 'banana' problem as an example:

  "banana" =~ /bana(na)+/

Returns true, because a banana is [b][a][n][a] followed by an ``na''. But this pattern will also match strings like:

  bananana
  banananana
  bananananana

and so on.

For more precise control over the number of matches, we need the quantifier expressions:

  {m}
  {m,n}
  {m,}

where m is an exact or minimum number of matches and n is a maximum number.

For example:

  "banana" =~ /ba(na){2}/

matches [b][a] followed by precisely two occurrences of the grouping ``na'':

  /ba(na){3,6}/

matches [b][a] followed by three to six occurrences of ``na''. And:

  /ba(na){0,2}/

matches [b][a] followed by up to two occurrences of ``na''.

(Note that a pattern can match anywhere inside the target string. As ``banana'' is a substring of ``banananana'', /ba(na){0,2}/ still gets a match if you feed it ``banananana''. So if you want to match no more than two ``na'''s, you need to add a pattern meaning ``NO more na's'' to the end of your expression.)

Thus, we have various options for constraining the number of times an expression is matched:

*
matches zero or more times (equivalent to {0,})

+
matches one or more times (equivalent to {1,})

{m}
matches precisely m times

{m,n}
matches from m to n times

{m,}
matches m or more times (no upper 'n' value)
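
One way to pin down the count, sketched below, is to tie the pattern to both ends of the string with the ^ and $ positional expressions (described in the next section), so that nothing is allowed before the ``ba'' or after the last ``na''. The word list and array names are invented:

```perl
use strict;
use warnings;

my @words = ("ba", "bana", "banana", "bananana");

my (@loose, @strict);
for my $word (@words) {
    # Unanchored: succeeds if the pattern occurs anywhere in the word.
    push(@loose,  $word) if $word =~ /ba(na){0,2}/;

    # Anchored: the whole word must be "ba" plus at most two "na"s.
    push(@strict, $word) if $word =~ /^ba(na){0,2}$/;
}

print "loose:  @loose\n";
print "strict: @strict\n";
```

The unanchored pattern accepts all four words; the anchored one rejects ``bananana'', which has three ``na'''s.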


White space

So far, we've been looking for potatoes in a line of text. But where do they appear? We have a choice:

  in the middle of a line of text
  at the beginning of a line
  at the end of a line
  straddling two lines 

Perl gives us special regular expression operators that match a position in a string, rather than a character or set.

These positional expressions are distinct from the ones we've seen so far. Expressions such as /.{3,4}/ (match three to four members of the set of all characters) or /foo|bar/ (match the string foo or the string bar) represent characters. The positional expressions represent positions in a string. They are:

^
bind to the beginning of a line

$
bind to the end of a line

Thus, /^In the beginning/ matches any string that starts with ``In the beginning'', while /the end\.$/ matches any string that ends with ``the end.'' (note the backslash escape to remove the special meaning of the period in the regular expression).

We can also look for word boundaries and the beginning or end of strings. The special expressions below don't actually match characters: they match the boundary between groups of different types of character:

\b
match a word boundary

\B
match something that isn't a word boundary

\A
match the beginning of a string

\Z
match the end of a string (or just before a newline at the very end of it)

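A word boundary is a point between two characters, such that one of them is a word character (from the set [A-Za-z0-9_]; a shorthand expression for such characters is \w) and the other is a non-word character (i.e. from the complementary set, [^A-Za-z0-9_]; the shorthand is \W). The beginning or end of the string also counts as a boundary if the character next to it is a word character.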

The \A and \Z expressions are used instead of ^ and $ when matching against multi-line strings with the /m modifier. They match only at the beginning and end of the string, not at the beginning or end of each line in the string.
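
The positional expressions are easiest to understand by trying them. In this sketch (the sample strings and variable names are invented) each variable is 1 if the match succeeds and 0 if it fails:

```perl
use strict;
use warnings;

my $verse = "In the beginning was the end.";

my $starts = ($verse =~ /^In the beginning/) ? 1 : 0;   # tied to the start
my $ends   = ($verse =~ /the end\.$/)        ? 1 : 0;   # tied to the end
my $middle = ($verse =~ /^the end/)          ? 1 : 0;   # "the end" is not at the start

# \b insists on a word boundary; \B insists that there isn't one.
my $whole_word = ("one mile on" =~ /\bmile\b/) ? 1 : 0;   # "mile" stands alone
my $buried     = ("she smiled"  =~ /\bmile\b/) ? 1 : 0;   # "mile" is inside "smiled"
my $inside     = ("she smiled"  =~ /\Bmile/)   ? 1 : 0;   # no boundary before the "m"

print "$starts $ends $middle $whole_word $buried $inside\n";
```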

An illustration of the difference between the $ position indicator and \n is in order. Consider a scalar that contains the following:

  Twas brillig, and the slithey toves\n
  did gyre and gimbal in the wabe

(where \n is a newline).

If we're searching for occurrences of ``toves did'', we're going to miss them unless we use an expression like:

  $jabberwocky =~ /toves(\s)did/is

We look for ``toves'', a white-space character, then ``did''. \s matches any whitespace character, and a newline counts as whitespace just like a space or a tab, so the search doesn't get confused by the line break. The /s modifier goes a step further: it lets the period ``.'' match a newline as well, so that when we're matching with /s we can handle multiple lines in a single string as if they were a single line.

The /m flag is a little different: it treats the string as more than one line, so that ^ and $ also match immediately after and immediately before each \n inside the string. To see the difference the /m flag makes in action, try this code fragment:

  if ($jabberwocky =~ /(.+)\Z/igm) {
      print "found match [$1] in input [$jabberwocky]\n";
  } else {
      print "pattern not matched\n";
  }
 

Then change the search modifiers /igm to /igs.
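
If you'd rather see the two side by side, here is a self-contained sketch. It uses list assignment to collect whatever the brackets matched, and adds a third pattern (not part of the fragment above) to show ^ matching at the start of each line under /m:

```perl
use strict;
use warnings;

my $jabberwocky = "Twas brillig, and the slithey toves\n"
                . "did gyre and gimbal in the wabe";

# Without /s the period stops at the newline, so only the last line
# can stretch all the way to the end of the string.
my ($last_line) = $jabberwocky =~ /(.+)\Z/m;

# With /s the period swallows the newline, so we get the lot.
my ($everything) = $jabberwocky =~ /(.+)\Z/s;

# With /m and /g, ^ matches at the start of every line.
my @first_words = $jabberwocky =~ /^(\w+)/mg;

print "last line:   [$last_line]\n";
print "everything:  [$everything]\n";
print "first words: [@first_words]\n";
```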

NOTE: Perl escape sequences
We've seen numerous examples of escape sequences in Perl strings and patterns. They all begin with a backslash ``\''; there's a reason for this. Pattern matching requires a way of distinguishing literal text from regular expressions. As not all the characters in any codeset are printable, rather than try to use some means of encoding their character values, Perl assigns mnemonics to them: for example, \n for newline, or \t for tab. On a UNIX system these are actual single characters, with the decimal values 10 and 9, but these values are by no means platform-independent or even memorable; hence the mnemonics.

Escape sequences are interpreted by Perl if they occur inside double-quoted strings or regular expressions (which are equivalent to double-quoted strings, except that they are patterns rather than string literals). The interpretation process (called interpolation) replaces escape characters with their literal value, and replaces the name of a variable with its contents.

Perl has lots of escape sequences; they are largely inherited from C, and they are documented in the perlre manpage, but here are the commonest:

\n
newline

\t
tab

\r
return

\f
form-feed

\a
alarm bell

\e
escape

\c[
control character

\w
word character

\W
non-word character

\s
whitespace character

\S
non-whitespace character

\d
digit character

\D
non-digit character

In addition, some escape sequences change the way subsequent regular expressions are processed:

\
remove the special meaning (if any) from the following regular expression (e.g. \* matches a literal ``*''; \( matches an opening bracket, and \\ matches a literal backslash.)

\l
lowercase the next character

\u
uppercase the next character

\L
lowercase the string until the next \E (end)

\U
uppercase the string until the next \E (end)

\Q
quote all special regular expression characters until the next \E (e.g., \Q^$*\E is equivalent to \^\$\*).
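
The case-changing and quoting escapes are worth a quick experiment. In this sketch, $name and $literal are invented variables; the last three tests show what \Q does to a period inside an interpolated variable:

```perl
use strict;
use warnings;

my $name  = "perl";
my $shout = "\U$name\E rocks";    # uppercase up to the \E
my $title = "\u$name";            # uppercase the next character only
my $quiet = "\LSHOUTING\E";       # lowercase up to the \E

my $literal = "a.c";
my $wild   = ("abc" =~ /$literal/)     ? 1 : 0;   # the period acts as a wildcard
my $quoted = ("abc" =~ /\Q$literal\E/) ? 1 : 0;   # the period must be a real period
my $exact  = ("a.c" =~ /\Q$literal\E/) ? 1 : 0;   # and here it is

print "$shout, $title, $quiet\n";
print "$wild $quoted $exact\n";
```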


Pieces of string

All the above escape sequences, character sets, and quantifiers notwithstanding, we've still only scratched the surface of regular expressions. What we've seen so far is the mechanism for finding patterns in strings. Regular expressions in Perl let us do a lot more: we can use them to modify text as well as simply matching it.

NOTE
In addition to the regular expressions discussed in this chapter, Perl has functions (inherited from BASIC, no less) for directly modifying strings. See the perlfunc manpage for details of these functions: notably substr() which, given a numerical offset into a string, slices, dices, and returns a substring. It isn't pattern matching, but it's still sometimes useful, especially when working with fixed-length fields in some kind of input data stream.

It's also worth knowing about chomp(). Given a string, chomp() determines if it ends in a newline; if so, it removes the newline character from the end of it. Chomp exists to make it easier to read text files; you can do it with a regular expression instead, but chomp($line) is much more readable.
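
As a rough illustration, suppose we have a record with an eight-character date followed by a ten-character item name. The layout, like the variable names, is invented for this sketch:

```perl
use strict;
use warnings;

# A fixed-width record: 8 characters of date, 10 of item name, then a newline.
my $record = "19990131potato    \n";

chomp($record);                      # strip the trailing newline, if there is one

my $date = substr($record, 0, 8);    # 8 characters starting at offset 0
my $item = substr($record, 8, 6);    # 6 characters starting at offset 8

print "[$date] [$item] length ", length($record), "\n";
```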

The first thing we can do with the pattern matching operator =~ is to work out what we've matched.

Consider the multi-line string from Jabberwocky again, and the following expression:

  $jabberwocky =~ /toves(\s)did/is;

The =~ operator returns true if we've found the pattern in the variable $jabberwocky; it returns false if the pattern isn't there.

But if the pattern is found, it also sets some special variables:

$&
contains the entire matched string

$`
contains everything in the string before the match

$'
contains everything in the string after the match

So if we say:

  $jabberwocky =~ /toves(\s|$)did/im;
  print "[$`][$&][$']\n";

We see:

  [Twas brillig, and the slithey ][toves
  did][ gyre and gimbal in the wabe]

(Note that the newline is part of the matched string.)


Replacing patterns

To change the target string, we use the substitute (s///) expression rather than the simple match // expression. It looks like this:

  $scalar_to_change =~ s/search_for_this/replace_with_that/;

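Where replace_with_that is not a pattern but a replacement string; it is interpolated like a double-quoted string, so it can contain variables and escape sequences. For example, to remove the toves, we can write: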

  $jabberwocky =~ s/toves//is;
  print $jabberwocky;

  Twas brillig, and the slithey 
  did gyre and gimbal in the wabe

This is useful, but not very precise. We can get better control over the patterns we are manipulating by using the grouping operators (brackets). If we try and match a pattern with groupings within it, the value of the first group is assigned to the special variable $1; the value of the second group goes into $2, and so on, while $+ contains the value of the last bracket match. For example:

  $jabberwocky =~ s/.+(toves)\s+(did).+/$2, $1/is;
  print $jabberwocky;

results in:

  did, toves

(To see if you understand grouping using brackets, try this fragment without the leading and trailing ``.+''. See if it behaves the way you expect it to.)
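
Here is another small sketch of grouping, this time re-ordering the parts of a date. The date format and variable names are assumptions made for the example; note the escaped slashes in the replacement:

```perl
use strict;
use warnings;

my $date = "1999-01-31";

# After a successful match, each pair of brackets has filled in $1, $2, $3.
my ($year, $month, $day);
if ($date =~ /(\d+)-(\d+)-(\d+)/) {
    ($year, $month, $day) = ($1, $2, $3);
}

# The same groups can be used in the replacement part of s///.
# We work on a copy so that $date itself is left alone.
(my $reordered = $date) =~ s/(\d+)-(\d+)-(\d+)/$3\/$2\/$1/;

print "year $year, month $month, day $day\n";
print "$reordered\n";
```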

There's an additional regular expression parameter we need to see at this point: /g. /g is short for 'global'. Normally, when carrying out a search/replace operation on a string, Perl executes it just once. However, if you add a trailing /g, it will keep chewing on the target string until it can't make any more substitutions. For example, if we want to change ``the'' to ``a'' using:

  $jabberwocky =~ s/the/a/is;

It results in:

  Twas brillig, and a slithey toves
  did gyre and gimbal in the wabe

But if we add the /g modifier:

  $jabberwocky =~ s/the/a/isg;

It returns:

  Twas brillig, and a sliay toves
  did gyre and gimbal in a wabe

The moral of this example is: keep track of those word boundaries when making global changes! Otherwise you're going to end up with some interesting results. For example:

  $my_target = 
    "She smiled at him. 'I travelled two miles today', she said.";
  $my_target =~ s/mile/kilometre/g;
  print $my_target;

  She skilometred at him. 'I travelled two kilometres today', she said.
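
One possible cure, sketched here, is to fence the pattern in with \b so that it can only match a whole word. The careless version is included for comparison, and both work on copies of the (hypothetical) $my_target string:

```perl
use strict;
use warnings;

my $my_target =
    "She smiled at him. 'I travelled two miles today', she said.";

# Careless: "mile" is replaced wherever it turns up, even inside "smiled".
(my $careless = $my_target) =~ s/mile/kilometre/g;

# Careful: \b on either side means only the whole word "mile" or "miles"
# is replaced; the optional "s" is captured and put back.
(my $careful = $my_target) =~ s/\bmile(s?)\b/kilometre$1/g;

print "$careless\n";
print "$careful\n";
```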

NOTE: Greed and parsimony
Perl's regular expressions are greedy. For example, suppose we want to excise everything before the first occurrence of the word ``the'' in Jabberwocky. It looks logical to do this:

  $jabberwocky =~ s/.+(the.+\Z)/$1/is;

However, when we look at $jabberwocky, we get a nasty surprise: instead of:

  the slithey toves
  did gyre and gimbal in the wabe

it contains:

  the wabe

Why?

The regular expression is greedy: an unlimited-length pattern like .+ matches the longest possible span of characters. There is more than one occurrence of ``the'' in the string (including the one buried in ``slithey''), and the last of them is on the second line. So the pattern ``.+'' munches up most of the string, rather than laying off after the first match.

We can avoid this by using non-greedy matching: follow a quantifier (such as * or +) with a question mark ``?'' and it will match the minimum number of occurrences possible without causing the expression to fail. So:

  $jabberwocky =~ s/.+?(the.+\Z)/$1/is;

will result in:

  the slithey toves
  did gyre and gimbal in the wabe
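
The same effect shows up on a much smaller scale when picking out quoted text. In this sketch (the sample string and variable names are invented), both patterns look for something between double quotes:

```perl
use strict;
use warnings;

my $text = 'say "one" and "two"';

# Greedy: .+ runs on to the last double quote it can find.
my ($greedy) = $text =~ /"(.+)"/;

# Non-greedy: .+? stops at the first double quote that lets the match succeed.
my ($frugal) = $text =~ /"(.+?)"/;

print "greedy: [$greedy]\n";
print "frugal: [$frugal]\n";
```

The greedy version captures everything from the first quote mark to the last; the non-greedy version captures just ``one''.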

There's one last modifier that is really useful in the s/// construct: the /e modifier.

Normally, if we write:

  $a =~ s/something/else/g;

We expect all occurrences of /something/ in $a to be replaced by ``else''.

But we can also say:

  $a =~ s/something/myfunc($&)/ge;

The /e modifier tells Perl that the replacement is not a string, but an expression to be evaluated. In this example, $& (the text that was matched) is fed to myfunc() (whatever it is), and the result is substituted in place of whatever /something/ matched.

For example ...

  #!/usr/bin/perl
  $my_target = "She smiled at him. 'I travelled 2 miles today', she said.";
  $my_target =~ s/
                  (.+\W)          # look for some text
                  (.+?)           # followed by just one word
                  (\Wmiles)       # then "miles"
                  (\W.+)          # then some more words
                /$1 . ($2 * 1.61) . " kilometres" . $4/ex;
  print $my_target, "\n";

What we're doing here is matching a sentence, then instead of routinely replacing a pattern with a string we're feeding part of the match into an expression ($2 * 1.61; there are 1.61 kilometres to the mile) and gluing the result back together with the rest of the sentence, using the string concatenation operator ``.''. The value of the expression replaces the matched text. The output looks like:

  She smiled at him. 'I travelled 3.22 kilometres today', she said.
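
The replacement expression can just as easily call a function of our own. Here is a sketch along those lines: miles_to_km() is a made-up helper, and the pattern assumes the distance is written as a whole number followed by the word ``miles'':

```perl
use strict;
use warnings;

# Hypothetical helper: convert miles to kilometres.
sub miles_to_km {
    my ($miles) = @_;
    return $miles * 1.61;
}

my $text = "I travelled 2 miles today and 10 miles yesterday";

# /e evaluates the replacement as Perl code; /g repeats it for every match.
# We work on a copy so that $text itself is left alone.
(my $metric = $text) =~ s/(\d+) miles/miles_to_km($1) . " kilometres"/ge;

print "$metric\n";
```

Because of the /g, both distances in the sentence are converted in a single pass.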


Perl quoting conventions

Perl uses quoting to determine whether or not escape characters are interpolated in a string, whether a string consists of literal characters or a pattern, and whether or not to replace variable names in a string with the corresponding values.

It's worth briefly noting Perl's different types of quoting for future reference. They're fully explained in the perlop manpage; here's an example of the commonest forms:

  $thing = "world";   # puts the literal string world in $thing
  print "hello $thing\n";

results in:

  hello world

whereas:

  print 'hello, $thing\n';

gives:

  hello, $thing\n

And:

  print `hello, $thing\n`;

results in an error message from the shell, something like:

  sh: hello,: command not found

The reason for this is the different way different types of quote are interpolated:

single quotes
No interpolation occurs. All characters are string literals.

double quotes
Escape characters and variables are interpolated.

back-quotes
Escape characters and variables are interpolated and the resulting literal string is passed to the shell to be run as a UNIX command: the standard output from the command replaces the quoted string.

slashes
The string is subject to interpolation as if double-quoted; it is then treated as a regular expression rather than a string literal.

It's possible to quote some strings using different types of delimiting quote marks, using special generic quoting operators. For example, qw{fred joe} is equivalent to writing the list ('fred', 'joe'); qw{} means ``quote words'', taking each whitespace-separated element in the bracketed block and single-quoting it (so no interpolation occurs). This and the other generic quote operators are documented in the perlop manual page.

We're not restricted to using slashes to delimit patterns, but we're using slashes in this chapter for clarity. You can set up a different pattern delimiter character, to make things easier if you're searching for something that contains a lot of slashes; see the perlop manual page for details.
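
For instance, here is a sketch of the same search written with slashes and with braces, using the m{} and s{}{} forms of the match and substitute operators. The path and the variable names are invented:

```perl
use strict;
use warnings;

my $path = "/usr/local/bin/perl";

# With slashes as delimiters, every slash in the pattern needs a backslash.
my $with_slashes = ($path =~ /\/usr\/local\//) ? 1 : 0;

# With braces as delimiters, the slashes can stand as they are.
my $with_braces  = ($path =~ m{/usr/local/})   ? 1 : 0;

# The substitute operator accepts other delimiters too.
(my $moved = $path) =~ s{/usr/local}{/opt};

# qw{} builds a list of words without any quote marks or commas.
my @names = qw{fred joe};

print "$with_slashes $with_braces $moved @names\n";
```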


Scanning large quantities of text

In the last chapter, we looked at a simple program for opening a file and counting the lines and characters in it. We're now in a position to do some much more ambitious operations on such a file: to count words, to search for words or phrases, and to edit or otherwise mangle the input data into something useful to us.

Perl has two functions which come in handy for breaking up strings into sub-components, and assembling groups of scalars back into one lump: split() and join(). split() takes a regular expression and a string as arguments; it scans the string for the regular expression, and splits it up wherever it finds a match, putting the results in an array (which it returns). join() does the opposite: it takes a string and an array, and joins the elements of the array together into a single scalar (using the string argument as a spacer).

For example:

  @words = split(/\W/, $line);

$line is a line of text. split() searches for the pattern \W (a non-word character) and splits $line every time it finds a match; the results are returned in @words. (Thus, this expression crudely splits a line into its constituent words.)

If you invoke split() in a scalar context, it returns the number of items it found in the target string. For example:

  $sentence = "This is a sentence";
  $words    = split(/\W/, $sentence);

sets $words to 4.

You can also invoke split() without supplying a string to split; it just assumes you want to chow down on $_, the default scalar.

You can get back the original line (more or less) using join():

  $line = join(" ", @words);

(Note that we're joining the items in @words on a space character, but we split $line on the generic regular expression matching a non-word character. So our reconstituted $line isn't exactly the same as the original item.)
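
The sketch below shows just how crude the split on \W is. A comma followed by a space counts as two separate non-word characters, so an empty item appears between them; splitting on \W+ (one or more non-word characters) is one way to avoid that. The sample line and array names are invented:

```perl
use strict;
use warnings;

my $line = "Twas brillig, and the slithey toves";

# Splitting on a single non-word character leaves an empty item
# between the comma and the space that follows it.
my @crude = split(/\W/, $line);

# Splitting on a run of non-word characters does not.
my @words = split(/\W+/, $line);

# Joining on a space gives back the words, minus the punctuation.
my $rejoined = join(" ", @words);

print scalar(@crude), " items, ", scalar(@words), " words\n";
print "$rejoined\n";
```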

It's fairly easy, now we know how to use split(), to figure out a simple program to count the number of words in a text file:

  #!/usr/local/bin/perl
  #
  # wc.pl -- count words in a text file
  #
  while (<>) {
      $words += split(/\W/); # split words in $_ 
  }
  print "stdin contained $words words\n";

Here we're using split() to split the default scalar $_, and we're doing it in a scalar context -- i.e., looking for the number of items returned by splitting the target string. The operator += adds the term on the right of the operator to the one on the left; $a += $b means the same as $a = $a + $b. So we're using $words to keep a running total of the number of words encountered while there are lines to be read from the default input filehandle <> (the standard input). When we've finished reading input data we have a total number of words to print.

NOTE: grep() vs map()
split() and join() let us convert scalars to arrays and arrays to scalars. We can also do pattern matches and more complex operations on an entire array.

One particularly common problem in text processing is to isolate a pattern in a file. UNIX gives us a family of commands to do this with -- all of them variants on a tool called grep (allegedly an acronym for global regular expression print).

We could search an input file for lines matching a pattern using a simple loop, and store the matches in an array. For example:

  while (<>) {
      if ($_ =~ /pattern/) {
          push(@found, $_);
      }
  }

(push() takes a scalar and sticks it on the end of an array; in the example above, we've glommed $_ onto an array called @found.)

However, this is somewhat verbose. Perl provides a more concise way of expressing this idea: a command called grep(), which applies an expression to every element in an array in turn, and returns another array consisting of those elements for which the expression was true (that is, for which it returned something other than zero, the empty string, or an undefined value). For example:

  @found = grep(/pattern/, (<>));

grep() scans the standard input for /pattern/. (grep() expects a list, so <> is evaluated in a list context and returns every line at once; the brackets around it just make the grouping obvious.) Every line on the standard input is scanned in turn, and every time the pattern match evaluates to TRUE the line is added to the output array @found.

(As with split(), if you evaluate grep() in a scalar context it returns the number of items for which the expression evaluated to true.)

This expression is more or less equivalent to the longer loop statement above. It is less efficient, however, because it treats the standard input as an array: this forces Perl to read the whole file in, then scan through it using grep(). The while() structure, in contrast, operates one line at a time -- Perl checks for a pattern match as it scans through the standard input one line at a time, and doesn't try to treat it as an array. This is important if we happen to be feeding a fifty megabyte logfile to our Perl script -- one way, Perl only needs to hang onto a line or so at any time, but the other way, Perl has to swallow the entire thing before it can begin digesting it.

(So the moral of this story is: don't use grep() on filehandles, or indeed on anything except an array you've already created.)
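A short sketch of grep() used as recommended, on an array that already exists. The array contents and names are made up for illustration.

```perl
use strict;
use warnings;

my @log = ("error: disk full", "ok", "error: timeout", "ok");

my @found = grep(/error/, @log);   # list context: the matching elements
my $count = grep(/error/, @log);   # scalar context: how many matched

print "$count matching lines\n";   # 2 matching lines
print "$_\n" for @found;
```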

One point about grep(); it isn't limited to searching. Though the example above indicates how to use a pattern matching expression with grep(), you can in fact feed in just about any kind of Perl expression you like. For example:

  @nonblank = grep( ($words = split(/\W/)), (<>));

Each line on the standard input is assigned to $_ and run through split(); the scalar value of split() (which is non-zero if the line contains any words) decides whether the line is kept. @nonblank thus contains every line of the standard input that has at least one word on it. (Note that grep() hands back the lines themselves, not the word counts.)

Although grep() makes a handy loop construct for applying simple expressions to an array, it has one failing: if the expression evaluates to false for an element, that element doesn't appear in the array returned by grep(), and in any case all you ever get back are the original elements. This isn't satisfactory for all circumstances, so Perl has a more general function called map(). map() looks like grep(), except that it applies its expression to every element in a target array and places the value of the expression in its output array, regardless of whether that value is true or false. map() is best thought of as a tool for mapping between two sets (arrays) via a function. For example:

  @unique = map { $a{$_}++ ? "" : $_ } @input_text;

This is a somewhat devious expression (and for reasons that will be explained is not necessarily a good example to emulate) ...

@input_text is an array of words. Each element in turn is assigned to $_. This map() function uses an associative array called %a. If %a doesn't already contain an entry $a{$_}, one is created and postincremented (so that it has a value of 1); if $a{$_} already exists, its value increases. The expression:

  $a{$_}++ ? "" : $_

tests the value $a{$_} had before the increment. Turning this on its head: if we haven't previously met $_, $a{$_} doesn't exist. It is undefined and therefore false, so the expression returns $_; the postincrement then brings $a{$_} into existence with a value of 1. If $a{$_} already has a value of 1 or more, the test is true and the expression returns an empty string ``''.

Thus, we only return $_ if we haven't met it before in @input_text.

Because we're applying this expression via map(), the array @unique receives either a word from @input_text, or an empty scalar (if that word has already been encountered).

Another way to look at this is to try to unroll it into a foreach() loop containing the expression we're evaluating:

  foreach $line (@input_text) {
      if ( defined($a{$line})) { # does %a contain an entry for $line?
          $output = "";          # if so, $output is empty
          $a{$line}++;           # increment the entry for $line
      } else {
          $output = $line;       # $output contains our current line
          $a{$line} = 1;         # create an entry in %a for $line
      }
      push (@unique, $output);   # stuff $output onto our results array
  }
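For comparison, a common way to write this in Perl is with grep() rather than map(). This is a sketch of that alternative, not the expression discussed above: it drops repeated words altogether instead of leaving empty strings in their place. The hash name %seen and the sample words are our own.

```perl
use strict;
use warnings;

my @input_text = qw(the cat sat on the mat the end);

my %seen;   # counts how many times each word has turned up
# Keep an element only the first time we meet it: $seen{$_} is
# false (undefined) on the first visit, and is then incremented.
my @unique = grep { !$seen{$_}++ } @input_text;

print "@unique\n";   # the cat sat on mat end
```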

It should be clear that map() and grep() let us get away with one line where otherwise we'd need ten. On the other hand, they also make it very easy to write hopelessly obscure and incomprehensible code. This doesn't matter if you're writing a fire-and-forget script that will be used once then thrown away, but if you're writing maintainable software (or working on a project as a member of a team) it's anathema in stylistic terms. Most especially, nested expressions of the kind shown above, and the use of map() or grep() for their side-effect of being implicit loops rather than for mapping between two arrays via a function, are a Bad Thing.

Despite the caveats in the note about map() and grep(), these two functions generalize our pattern-matching technique and allow us to process arrays (and indeed entire files) as single entities.
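As a hedged sketch of the difference between the two functions, assuming a small array of made-up lines: map() hands back the value of its expression for each element, while grep() hands back the elements themselves.

```perl
use strict;
use warnings;

my @lines = ("one two three", "", "four five");

# map(): one result per element, here the number of words on each line.
my @count = map { my @w = split(/\W+/, $_); scalar(@w) } @lines;

# grep(): only the elements that pass the test, here lines with a word on them.
my @nonblank = grep { /\w/ } @lines;

print "@count\n";                  # 3 0 2
print scalar(@nonblank), "\n";     # 2
```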

Corresponding facilities exist to process strings; and there is an additional function (not hitherto described: if you want to look it up it's called tr///, and it works a bit like s///) to translate between character sets.
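For the curious, a brief sketch of tr/// in action, using a sample string of our own. tr/// works on individual characters rather than patterns: it replaces each character in the first list with the corresponding one in the second, and returns the number of characters it matched.

```perl
use strict;
use warnings;

my $text = "Hello World";                # made-up sample text

# Copy $text, then translate lower-case letters to upper-case in the copy.
(my $upper = $text) =~ tr/a-z/A-Z/;

# With an empty replacement list nothing is changed, but the
# return value still counts the matching characters.
my $vowels = ($text =~ tr/aeiou//);

print "$upper\n";                        # HELLO WORLD
print "$vowels lower-case vowels\n";     # 3 lower-case vowels
```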

Using the pattern matching conventions we've examined, we are in a position to scan files and arrays for complex data and do interesting things with it. Perl doesn't have the strong data typing of C or Pascal or a conventional programming language (for which reason it is often grouped with string-processing 'scripting' languages like Tcl, Bash, AppleScript, or even the DOS batch language); however, by judicious use of regular expressions you can tell whether you're handling an integer, or a floating point number, or a string, or a sentence, or even a PGP-signed electronic mail message. Regular expressions are extremely flexible and the lack of strong typing means that it is much easier to cook, bake, extrude, mangle, bend, spindle, and mutilate textual data in Perl than in more conventional languages.
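A rough sketch of that kind of type-sniffing follows. The two patterns are simplified assumptions of ours, written for illustration; they don't cover every numeric format Perl itself will accept.

```perl
use strict;
use warnings;

my %kind;   # what each sample item looks like
for my $item ("42", "-3.14", "6.02e23", "hello") {
    if ($item =~ /^[+-]?\d+$/) {
        # optional sign, then nothing but digits
        $kind{$item} = "integer";
    } elsif ($item =~ /^[+-]?\d*\.?\d+([eE][+-]?\d+)?$/) {
        # optional sign, optional decimal point, optional exponent
        $kind{$item} = "floating point number";
    } else {
        $kind{$item} = "string";
    }
    print "$item looks like: $kind{$item}\n";
}
```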

Note that there's a lot more to regular expressions than we've covered in this brief introduction. If you want further information, you might want to look for a copy of ``Mastering Regular Expressions'' (Jeffrey E. F. Friedl, O'Reilly and Associates, pub. 1997: ISBN 1-56592-257-3). If you're really enthusiastic, see ``Compilers: Principles, Techniques, and Tools'' (Aho, Sethi, and Ullman, Addison-Wesley 1986, ISBN 0-201-10194-7).

Remember that acronym, ``Practical Extraction and Report Language''? This is the ``Extraction'' bit. The rest will come in later.


Exercises

  1. Write a program that reads a stream of text from the standard input and builds a dictionary of all the words it read. Note: the input text consists of zero or more words per line. The output should be a list of all the words, and the number of times they appear in the file, sorted alphabetically. (You may need to look up the sort() function.)

  2. From the above program, write a derivative that prints the words in the file, sorted by frequency -- commonest words first. You should use at most two extra lines of code. (You may want to investigate the reverse operator.)

  3. Re-write programs (1) and (2) using, at most, two lines of Perl code.

  4. Why is this inefficient or undesirable?

