Perl: Terse and Ugly?




One of the biggest truisms about Perl is that There Is More Than One Way To Do It (whatever It may be).

It's like this. You're looking for a way of Doing Something. And lo, it ain't easy to figure out how to go about it, because you're fairly inexperienced in using this language. So you appeal on comp.lang.perl.misc for help, and someone answers with two very terse-looking lines of mangled code. Someone else volunteers a slightly different, but broken, version that is at least readable. Then along comes Randal or Tom or one of the other Perl Gods, and they deliver a half-line-long command that resembles line noise, is three times as efficient as the other solutions, and leaves you scratching your head.

Why?

There's actually a reason why this happens, and it's all to do with the semantics of the Perl programming language -- its ability to apply a command to different data types (scalars and arrays) and do different things in context. Many operators work on scalar variables, so it strikes a beginner as sensible to wrap them in a loop:

foreach $myvar (@some_data_array) {
    $myresult = &do_something_in_a_scalar_context($myvar);
    push (@output_data, $myresult);
}
This is the First Post on the road to enlightenment: it works, but it's flawed.

It's flawed because it's not necessary: Perl has functions specifically designed to apply some operation to every element of a list and return a new list. grep() returns every member of the input for which the operation returned TRUE; map() returns the result of applying the operation to every member, regardless of what it returned. We'll see grep() at work later in this article.
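As a rough sketch of the difference between the two (the word list and the variable names here are made up purely for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample data, not from the original article.
my @words = qw(camel llama alpaca yak);

# map: collect the result of the block for every element.
my @lengths = map { length($_) } @words;

# grep: keep only the elements for which the block is true.
my @long_words = grep { length($_) > 4 } @words;

print "@lengths\n";      # 5 5 6 3
print "@long_words\n";   # camel llama alpaca
```

Either call does the work of the whole foreach-and-push loop in a single statement.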

It's also flawed because the place it usually crops up is here:

foreach $myvar (<SOME_INPUT_FILE>) {
foreach is the commonest loop iterator in Perl; its loop variable is set in turn to each member of the list that follows it. If you put a filehandle in angle brackets there, it's evaluated in list context and hands back every remaining line at once. So the line above sets $myvar to each line in SOME_INPUT_FILE in turn. But this is bad: Perl evaluates the contents of the parentheses before it runs the loop, so if you feed it a filehandle like this it reads the entire file into memory before looping over it.

In real life, if you want to do this sort of thing to a file it's far better to use while; while can also be made to loop on a filehandle, but it doesn't chow down on the whole thing before it starts looping:

while ($myvar = <SOME_INPUT_FILE>) { # read a line from SOME_INPUT_FILE
And it doesn't store the whole contents of the file in an anonymous array when it runs.
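Here is a minimal sketch of the line-at-a-time pattern. To keep it self-contained it reads from an in-memory filehandle opened on a string rather than a real file; $text, $in, $count and $longest are names invented for the example.

```perl
use strict;
use warnings;

# Stand-in for a real file: an in-memory filehandle opened on a string.
my $text = "alpha\nbeta\ngamma\n";
open(my $in, '<', \$text) or die "can't open in-memory file: $!";

my $count   = 0;
my $longest = '';
while (my $line = <$in>) {      # one line per trip round the loop
    chomp $line;
    $count++;
    $longest = $line if length($line) > length($longest);
}
close($in);

print "$count lines, longest is '$longest'\n";   # 3 lines, longest is 'alpha'
```

Only the current line and a couple of scalars are held at any moment, however big the input is.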

Leaving aside the issue of memory efficiency when reading from filehandles, it's amazing what you can do by using Perl to digest whole arrays in one go. Someone recently posted on comp.lang.perl.misc:

>Once I have sorted a list, how can I remove any duplicate entries.
>
>I was wondering if there is an easy way, as opposed to parsing
>through the whole list and comparing each entry.
There are several ways of doing this; however, it's better to eliminate the duplicates before you sort the list, since that leaves the sort with fewer items to chew on.

For starters: how do you identify duplicates in a list?

This is pretty much an FAQ. You do it by building an index of the items in the list. For each item, you keep a count of how often it appears. If it occurs more than once, obviously it's duplicated. The best place to keep such an index is in a hash (associative array):

foreach $entity (@inputlist) {
    $index{$entity}++;
}
Once we've looped through the items in @inputlist once, %index contains an entry for each unique item; its value is the number of times that item was read during the foreach loop.

We can now fetch the keys of %index, knowing that they form a list with one entry per unique item -- we can safely discard the values stored in %index because we're doing a uniqueness filter, not a frequency count. Then all we need to do is sort:

@outputlist = sort keys (%index);
And this is our second-stage program.
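Put together, the two steps might look something like this. The fruit list is invented sample data, and strict, warnings and my are added so the sketch runs cleanly on its own.

```perl
use strict;
use warnings;

# Hypothetical input, with a few repeats.
my @inputlist = qw(pear apple fig apple pear apple);

# Step 1: build the index, counting each item as we go.
my %index;
foreach my $entity (@inputlist) {
    $index{$entity}++;
}

# Step 2: the keys are the unique items; sort them.
my @outputlist = sort keys %index;

print "@outputlist\n";                      # apple fig pear
print "apple seen $index{apple} times\n";   # apple seen 3 times
```

The counts are still sitting in %index afterwards if a frequency table turns out to be useful after all.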

As I mentioned above, though, this ain't particularly terse. grep() is a better tool than foreach at doing this kind of loop. So is while. What we really want to do is this:

@outputarray = grep( ( ($h{$_}++ == 0) || 0 ), @inputarray );
What does this mean?

Well, according to the Camel, grep takes two arguments: an expression and a list. For each item in the list, it sets $_ to that item and evaluates the expression. It returns another list, containing all those items in the input for which the expression was true.

What we're doing here is applying to each item of @inputarray the following expression:

( ($h{$_}++ == 0) || 0 )
Don't be fooled: %h is a hash. $h{$_} is the entry in the hash for the current item in @inputarray. $h{$_}++ means "add one to $h{$_}". If $h{$_} doesn't already exist, this brings it into existence and gives it a value of 1; if it does exist, it adds 1 to it.

Now this whole thing is grouped in an expression:

($h{$_}++ == 0)
The catch is that this ++ is a post-increment: the value handed to == is the one $h{$_} held before 1 was added to it. So if we've just added a new item to %h, the old value counts as 0 and this expression is true. Otherwise the old value is 1 or more and the expression is false; the || 0 just makes sure that "false" comes out as a plain 0. (Comparing against 1 instead, which is the tempting thing to write, would pick out each item on its second appearance and throw away everything that occurs only once.)

In short: this expression evaluates to 1 if we haven't met the current item in @inputarray before, and 0 otherwise. If it evaluates to 1, it goes on the @outputarray returned by grep().

We can tighten it up a little further:

@outputarray = sort grep( ( ($h{$_}++ == 0) || 0 ), @inputarray );
And we now have a sorted array containing only unique items.
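A runnable sketch of the same filter follows. It is written with grep's block form and the widely used !$seen{$_}++ spelling, which is an assumption about taste rather than anything the Camel insists on. The first half prints what post-increment actually returns, since that is what the test depends on; %count, %seen and the sample data are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input, with a few repeats.
my @inputarray = qw(pear apple fig apple pear apple);

# Post-increment hands back the OLD count: 0 on first sight, then 1, 2, ...
my %count;
my @old_values = map { $count{$_}++ } @inputarray;
print "@old_values\n";    # 0 0 0 1 1 2

# So "not yet seen" is simply "old count is false".
my %seen;
my @outputarray = sort grep { !$seen{$_}++ } @inputarray;
print "@outputarray\n";   # apple fig pear
```

An item that appears only once (fig, here) survives the filter, which is a handy thing to check whenever you write one of these by hand.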

We're going to have to adopt a slightly different approach to doing this on a file, because grep is as bad as foreach when running against a file: it slurps the whole thing into an anonymous array before execution. But we just use a while statement:

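while (<>) { ($h{$_}++ == 0) && push (@outputarray, $_); }
Here we're using while as an iterator: it reads one line at a time into $_, and a line is pushed onto @outputarray only the first time it turns up. Note that @outputarray still needs a sort afterwards.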
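To try the streaming version without a file to hand, here is a sketch that reads from an in-memory filehandle instead of <>. The sample text and the names $text, $in and $line are assumptions made for the example.

```perl
use strict;
use warnings;

# Stand-in for the input file: an in-memory filehandle opened on a string.
my $text = "pear\napple\nfig\napple\npear\n";
open(my $in, '<', \$text) or die "can't open in-memory file: $!";

my (%h, @outputarray);
while (my $line = <$in>) {
    # Keep the line only on its first appearance.
    push @outputarray, $line unless $h{$line}++;
}
close($in);

print sort @outputarray;   # apple, fig, pear, one per line
```

Only the unique lines and their counts are kept in memory, not the whole input.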

Finally, there are some really sick things you can do with regular expressions. As Claudio Calvelli commented, "if you can't do it in a regular expression it isn't Perl":

print("".(($line=join("",<>))=~s/.*\n/$h{$&}++?"":$&/ge,$line));
or (this boggles description and defies explanation):
print((($line=join("",<>))=~s/.*\n/index($`,$&)>=$[?"":$&/ge&&$line));
And finally (this is inexcusable):
($l=join("",<>))=~s/.*\n/index($`,$&)>=$[||print$&/ge;
(Thanks to Claudio Calvelli for these examples. The nurse will be along with your Chlorpromazine shortly.)

