Saturday, December 29, 2007

Perl

Practical Extraction and Report Language

Windows ActivePerl from ActiveState

Llama Book "Learning Perl, Fourth Edition, by Randal L. Schwartz, Tom Phoenix, and brian d foy".

In Perl, comments run from a pound sign (#) to the end of the line. (There are no "block comments" in Perl.)

#!/usr/bin/perl
On Unix systems, if the first two characters on the first line of a text file are "#!", then what follows is the name of the program that executes the rest of the file.

On non-Unix systems, it's traditional (and even useful) to make the first line say #!perl.

The perl interpreter compiles and then runs your program in one user step.
    $ perl my_program
When you run your program, Perl's internal compiler first runs through your entire source, turning it into internal bytecode, which is an internal data structure representing the program. Perl's bytecode engine takes over and runs the bytecode.

Perl uses a number or a string as a scalar interchangeably.

All Numbers Have the Same Format Internally: Perl computes with double-precision floating-point values.

exponentiation operator **

7.25e45 # 7.25 times 10 to the 45th power
2**3 # two to the third power

Literal strings come in two different flavors: single-quoted string literals and double-quoted string literals.

String values can be concatenated with the . operator
"hello" . ' ' . "world" # same as 'hello world'

A special string operator is the string repetition operator, consisting of the single lowercase letter x.
"fred" x 3 # is "fredfredfred"
5 x 4 # is really "5" x 4, which is "5555"

Perl automatically converts between numbers and strings as needed. How does it know whether a number or a string is needed? It all depends on the operator being used on the scalar value. If a numeric value is given when a string value is needed (say, for string concatenation), the numeric value expands into whatever string would have been printed for that number.

"Z" . 5 * 7 # same as "Z" . 35, or "Z35"

To run your program with warnings turned on, use the -w option on the command line:

    $ perl -w my_program
Or, if you always want warnings, you may request them on the #! line:
    #!/usr/bin/perl -w

Scalar variable names begin with a dollar sign followed by what we'll call a
Perl identifier: a letter or underscore, and then possibly more letters, or digits,
or underscores. Scalar variables in Perl are always referenced with the leading $.

You can give print a series of values, separated by commas:
print "The answer is ", 6 * 7, ".\n";

When a string literal is double-quoted, it is subject to variable interpolation besides being checked for backslash escapes.

Perl provides a delimiter for the variable name in a manner similar to the shell. Enclose the name of the variable in a pair of curly braces. Or you can end that part of the string and start another part of the string with a concatenation operator:

Table 2-3. Numeric and string comparison operators

Comparison

Numeric

String

Equal

= =

eq

Not equal

!=

ne

Less than

<

lt

Greater than

>

gt

Less than or equal to

<=

le

Greater than or equal to

>=

ge


----------








----------

 @lines = ; # read standard input in list context
When the input is coming from a file, this will read the rest of the file. But how can there be an end-of-file when the input comes from the keyboard? On Unix and similar systems, including Linux and Mac OS X, you'll normally type a Ctrl-D to indicate to the system that there's no more input. The special character is never seen by Perl, though it may be echoed to the screen. On DOS/Windows systems, use Ctrl-Z instead. You'll need to check the documentation for your system or ask your local expert if it's different from these.

Invoke a subroutine from within any expression by using the subroutine name (with the ampersand). Whatever calculation is last performed in a subroutine is automatically also the return value. So be careful when adding additional code to a subroutine since the last expression evaluated will be the return value. "The last expression evaluated" really means the last expression evaluated, rather than the last line of text. The parameter list must be stored into some array variable for the subroutine to use it, and Perl uses the array @_ for this purpose. The @_ variable is private to the subroutine. if there's a global value in @_, it is saved before the subroutine is invoked and restored to its previous value upon return from the subroutine.

By default, all variables in Perl are global variables. But you can create private variables called lexical variables at any time with the my operator.

my($m, $n) = @_; # Name the subroutine parameters

Nearly every subroutine will start with a line much like that one, naming its parameters.

"name" of the array in a scalar context to find out the number of array elements.

The my operator
The scope of a lexical variable's name is limited to the smallest enclosing block or file.

using my( ) with parentheses, remember that without the parentheses, my only declares a single lexical variable.

my $fred, $barney; # WRONG! Fails to declare
$barney my($fred, $barney); # declares both

Any new variable will start out empty: undef for scalars or the empty list for arrays.

use strict;
The use strict pragma tells Perl's internal compiler that it should enforce some good programming rules for the rest of this block or source file. It's better to use it from the start when it's needed. Perl will insist that you declare every new variable, usually done with my. Most people recommend that programs that are longer than a screenful of text generally need use strict. We encourage you to include use strict in your programs as often as possible.

if the subroutine has the same name as a Perl built-in, you must use the ampersand to call it. With an ampersand, you're sure to call the subroutine; without it, you can get the subroutine only if there's no built-in with the same name. Until you know the names of all of Perl's built-in functions, always use the ampersand on function calls

The least you can return is nothing at all. A return with no arguments will return undef in a scalar context or an empty list in a list context. This can be useful for an error return from a subroutine, signalling to the caller that a more meaningful return value is unavailable.

Perl has a shortcut for

while (defined($_ = <>))
{
print "I saw $_";
}

it won't read a line into $_ by default. It works only if there's nothing but the line-input operator in the conditional of a while loop; If you put anything else into the conditional expression, this shortcut won't apply.

foreach (<>) { print "I saw $_"; }
it has to read all of the input before the loop can start running since foreach needs a list context to work on. So we prefer while expression.

Input from the Diamond Operator < . >

Since the diamond operator is generally used to process all of the input, it's typically a mistake to use it in more than one place in your program. Remember, the diamond operator reads the input, but the input itself is generally found in $_ (by default).

The diamond operator will go to the next file automatically, much like what you'd expect from cat or another standard utility.

while (defined($line = < . >)) { chomp($line); print "It was $line that I saw!\n"; }

The diamond operator isn't looking at the invocation arguments. it works from the @ARGV array. This array is a special array preset by the Perl interpreter as the list of the invocation arguments. In other words, this is like any other array, but when your program starts, @ARGV is already stuffed full of the list of invocation arguments.

The print operator takes a list of values and sends each item (as a string, of course) to standard output in turn, one after another. It doesn't add any extra characters before, after, or in between the items.

Perl is interpolating an array into a string, so it puts spaces between the elements.

Generally, if your strings contain newlines, you'll simply want to print them:

    print @array;
But if they don't contain newlines, you'll generally want to add one at the end:
    print "@array\n";
This is the problem with the optional parentheses; sometimes, we humans forget where the parentheses belong. When there are no parentheses, print is a list operator, printing all of the items in the following list, which is what you'd expect. But when the first thing after print is a open parenthesis, print is a function call, and it will print only what's found inside the parentheses. Since that line had parentheses, it's the same to Perl as if you'd said this:
    ( print(2+3) ) * 4;  # Oops!

To print a number, generally use %g, which automatically chooses floating-point, integer, or even exponential notation as needed.

printf "%g %g %g\n", 5/2, 51/17, 51 ** 17; # 2.5 3 1.0683e+29

In Perl, printf is most often used for columnar data since most formats accept a field width. If the data won't fit, the field will generally be expanded as needed. A negative field width is left-justified (in any of these conversions). The %f conversion (floating-point) rounds off its output as needed and lets you request a certain number of digits after the decimal point. To print a real percent sign, use %%, which is special in that it uses no element from the list.

A filehandle is the name in a Perl program for an I/O connection between your Perl process and the outside world. That is, it's the name of a connection and not necessarily the name of a file. Once again, as with labels, the recommendation from Larry is that you use all uppercase letters in the name of your filehandle. Perl uses six special filehandle names for its own purposes: STDIN, STDOUT, STDERR, DATA, ARGV, and ARGVOUT.

Every program that runs on Unix (and many other modern operating systems) has an exit status, telling if it was successful. Traditionally, it is zero for success and a nonzero value for failure.

The die function prints out the message you give it (to the standard error stream, where such messages should go) and ensures that your program exits with a nonzero exit status.

  if ( ! open LOG, ">>logfile") {
die "Cannot create logfile: $!";
}
In general, when the system refuses to do something we've requested (like opening a file), $! will give you a reason (perhaps "permission denied" or "file not found," in this case). This human-readable complaint message will be available in Perl's special variable $!. But if you use die to indicate an error that is not the failure of a system request, don't include $! since it will generally hold an unrelated message left over from something Perl did internally. It will hold a useful value only immediately after a failed system request. A successful request won't leave anything useful there.

If you don't want the line number and file revealed, make sure that the dying words have a newline on the end. That is, another way you could use die is with a trailing newline on the message:

    if (@ARGV < style="font-weight: bold; color: rgb(204, 0, 0);">\n";
}
As a rule of thumb, put the newline on messages that indicate a usage error and leave it off when the error might be something you want to track down during debugging.

You should always check the return value of open since the rest of the program is relying upon its success.

The warn function works as die does, except for that last step because it doesn't quit the program. But it adds the program name and line number if needed, and it prints the message to standard error as die would.

that default filehandle may be changed with the select operator. Also by default, the output to each filehandle is buffered. Setting the special $| variable to 1 will set the selected filehandle (the one selected at the time the variable is modified) to flush the buffer after each output operation.

If you want to ensure the logfile gets its entries immediately, in case you might be reading the log to monitor progress of your long-running program, you could use something like this:


    select LOG;
$| = 1; # don't keep LOG entries sitting in the buffer
select STDOUT;
# ...time passes, babies learn to walk, tectonic plates shift, and then . . .
print LOG "This gets written to the LOG at once!\n";

If one of the three system filehandles STDIN, STDOUT, or STDERRfails to reopen, Perl kindly restores the original one. how you could send error messages to a file rather than to your program's standard error stream

   # Send errors to my private error log
if ( ! open STDERR, ">>/home/barney/.error_log") {
die "Can't open error log for append: $!";
}

Hashes

the keys must be unique.
$hash{$some_key}
This is similar to what we used for array access, but here we use curly braces instead of square brackets around the subscript (key).

Accessing outside the hash gives undef.

To refer to the entire hash, use the percent sign ("%") as a prefix.

a hash may be converted into a list and back again. Assigning to a hash is a list-context assignment, where the list is made of key/value pairs. The value of the hash (in a list context) is a list of key/value pairs. The pairs won't necessarily be in the same order as the original list. The order is jumbled because Perl keeps the key/value pairs in an order that's convenient for Perl so it can look up any item quickly.

It's rare to do so, but a hash may be copied using this syntax:

    %new_hash = %old_hash;

In the Perl grammar, any time that you need a comma ( , ), you can use the big arrow (=>) instead.

The keys function yields a list of all the keys in a hash, and the values function gives the corresponding values. If there are no elements to the hash, then either function returns an empty list:

    my %hash = ("a" => 1, "b" => 2, "c" => 3);
my @k = keys %hash;
my @v = values %hash;
To see if a key exists in the hash, (whether someone has a library card), use the exists function,

The delete function removes the given key (and its corresponding value) from the hash.


$ grep 'flint.*stone' chapter*.txt

Don't confuse regular expressions with shell filename-matching patterns, called globs. A typical glob is what you use when you type *.pm to the Unix shell to match all filenames that end in .pm. The previous example uses a glob of chapter*.txt. (You may have noticed that you had to quote the pattern to prevent the shell from treating it like a glob.)

To match a pattern (regular expression) against the contents of $_, put the pattern between a pair of forward slashes (/) as we do here:

    $_ = "yabba dabba doo";
if (/abba/) {
print "It matched!\n";
}

All of the usual backslash escapes that you can put into double-quoted strings are available in patterns, so you could use the pattern /coke\tsprite/ to match the eleven characters of coke, a tab, and sprite.

The dot always matches exactly one character (except the newline). a backslash in front of any metacharacter makes it nonspecial. So, the pattern /3\.14159/ doesn't have a wildcard character. The star (*) means to match the preceding item zero or more times. .* will match any character, any number of times. The plus (+) means to match the preceding item one or more times. The question mark (?), which means that the preceding item is optional in that it may occur once or not at all.

Quantifiers: 1) * 2) + 3) ? All three of these quantifiers must follow something since they tell how many times the previous item may repeat.

grouping: /(fred)+/

A character class, a list of possible characters inside square brackets ([ ]), matches any single character from within the class. A caret ("^") at the start of the character class negates it. the first hyphen in /HAL-[0-9]+/ doesn't need a backslash because hyphens aren't special outside a character class.

Some character classes appear so frequently that they have shortcuts. For example, the character class for any digit, [0-9], may be abbreviated as \d. The \s shortcut is good for whitespace. It's the same as [\f\t\n\r ], which is a class containing the five whitespace characters: form-feed, tab, newline, carriage return, and the space character.

\d \w \s and the opposite \D \W \S
Any of these shortcuts will work in place of a character class (standing on their own in a pattern) or inside the square brackets of a larger character class. Another compound character class is [\d\D], which means any digit or any non-digit, which means any character at all. This is a common way to match any character, even a newline as opposed to ., which matches any character except a newline.

The shortcut is that if you choose the forward slash as the delimiter, you may omit the initial m. This is a shortcut for the m// (pattern match) operator. Since Perl folks love to avoid typing extra characters, you'll see most pattern matches written using slashes, as in /fred/.

Choose a delimiter that doesn't appear in your pattern. It's common to use curly braces as the delimiter.

Optional Modifiers /s /i /x. Those flags, may be appended as a group right after the ending delimiter of a regular expression to change its behavior from the default.

Anchors: ^ $ \b \B

The caret anchor (^) marks the beginning of the string, and the dollar sign ($) marks the end. you can use both of these anchors to ensure that the pattern matches an entire string. /^\s*$/, which matches a blank line. But this "blank" line may include some whitespace characters, like tabs and spaces, which are invisible. The word-boundary anchor, \b, matches at either end of a word. This is similar to the feature often called "match whole words only" in a word processor's search command. The nonword-boundary anchor is \B.

Matching against $_ is merely the default; the binding operator (=~) tells Perl to match the pattern on the right against the string on the left, instead of matching against $_.

The match variables

$1 $2 $3 ...
parentheses also trigger the regular expression engine's memory. The memory holds the part of the string matched by the part of the pattern inside parentheses. If there are more than one pair of parentheses, there will be more than one memory. Each regular expression memory holds part of the original string. Since these variables hold strings, they are scalar variables; in Perl, they have names like $1 and $2.

An unsuccessful match leaves the previous memories intact, but a successful one resets them all. This correctly implies that you shouldn't use these match variables unless the match succeeded; otherwise, you could be seeing a memory from some previous pattern. This is another reason a pattern match is almost always found in the conditional expression of an if or while:

$&, $`, and $'.
Whatever came before the matched section is in $`, and whatever was after it is in $'. They have the same scope as the numbered match variables. Generally, that means they'll stay around until the next successful pattern match. The price is that once you use any one of these automatic match variables anywhere in your entire program, other regular expressions will run a little more slowly. Instead, they'll use a workaround. For example, if the only one you need is $&, put parentheses around the whole pattern and use $1 instead.

quantifiers
* {0,}
+ {1,}
? {0,1}

Substitution m///
It's true if a substitution was successful; otherwise it's false.

@fields = split /separator/, $string;

The split operator drags the pattern through a string and returns a list of fields (substrings) that were separated by the separators.

You could even have an empty field if there were two delimiters together. Leading empty fields are always returned, but trailing empty fields are discarded.

The default for split is to break up $_ on whitespace:

    my @fields = split;  # like split /\s+/, $_;

my $result = join $glue, @pieces;

there will be one fewer piece of glue than the number of items in the list.

This means there may be no glue at all if the list doesn't have at least two elements. The first argument to join is the glue, which may be any string. The remaining arguments are a list of pieces. join puts the glue string between the pieces and returns the resulting string.

Though split and join work well together, don't forget that the first argument to join is always a string, not a pattern.

When a pattern match (m//) is used in a list context, the return value is a list of the memory variables created in the match or an empty list if the match failed.

Greedy quantifier
Regular expression engines do a lot of backtracking like that, trying every different way of fitting the pattern to the string until one of them succeeds or until none of them has.

non-greedy /fred.+?Barney/. Prefers to match as few times as possible, rather than as many as possible.
Since the non-greedy form of the plus was +? and the non-greedy form of
the star was *?, you've probably realized the other two quantifiers
look similar. The non-greedy form of any curly-brace quantifier looks the same,
but with a question mark after the closing brace, like {5,10}? or
{8,

The variable $^I. By default it's undef, and everything is
normal. But when it's set to some string, it makes the diamond operator
(<>) more magical than usual.

$^I = ".bak";

Let's say it's time for the diamond to open our file
fred03.dat. It opens it like before but renames it, calling it
fred03.dat.bak.We have the same file open, but
it has a different name on the disk. Next, the diamond creates a new file and
gives it the name fred03.dat. That's okay because we weren't using that
name anymore. Now the diamond selects the new file as the default for output, so
anything we print will go into that file. The while loop will read a line from the old file,
update that, and print it out to the new file. This program can update hundreds
of files in a few seconds on a typical machine.
Pretty powerful, huh?
Some folks use a tilde (~) as the value for $^I since that
resembles what emacs does for backup files.
Another possible value for $^I is the empty string. This enables
in-place editing but doesn't save the original data in a backup file.

non-capturing parentheses, and we write them with a special sequence. We add a
question mark and a colon after the opening parenthesis, (?:), and that tells Perl we only use
these parentheses for grouping.


chapter 10


The idiomatic way of opening a file in Perl looks like this:

open CHAPTER, $filename
or die "Can't open '$filename': $!";

Since using angle brackets means both filehandle reading and globbing, how does
Perl decide which of the two operators to use? Well, a filehandle has to be a
Perl identifier. If the item between the angle brackets is strictly a Perl
identifier, it'll be a filehandle read; otherwise, it'll be a globbing
operation.

A directory handle looks and acts like a filehandle. You open it (with
opendir instead of open), you read from it (with
readdir instead of readline), and you close it (with
closedir instead of close). Instead of reading the contents of a file, you're reading the names of files (and other things) in a directory as in
this example:

  my $dir_to_process = "/etc";
opendir DH, $dir_to_process or die "Cannot open $dir_to_process: $!";
foreach $file (readdir DH) {
print "one file in $dir_to_process is $file\n";
}
closedir DH;

Like filehandles, directory handles are automatically closed at the end of the program or if the directory handle is reopened onto another directory.

Since unlink takes a list, and the glob function returns a list, we can combine the two to delete many files at once:
    unlink glob "*.o";
Here's a little-known Unix fact. You can have a file that you can't read, write, or execute, maybe you don't even own the file, that is, it's somebody else's file altogether but you can still delete the file. That's because the permission to unlink a file doesn't depend upon the permission bits on the file; it's the permission bits on the directory that contains the file that matter.

Those first two lines inside the loop can be combined (and often are):

    (my $newfile = $file) =~ s/\.old$/.new/;

This works to declare $newfile, copy its initial value from $file, and modify $newfile with the substitution.

Unix model of files and directories
Each file is stored in a numbered inode, which we can think of as a particular piece of disk real estate. One file might be stored in inode 613, and another in inode 7033. A directory is a special kind of file, maintained by the system. Essentially, it is a table of filenames and their inode numbers. Along with the other things in the directory, there are two special directory entries. One is . (called "dot"), which is the name of that directory; and the other is .. ("dot-dot"), which is the directory one step higher in the hierarchy. How can the system tell that a particular inode is available? Each inode holds a number called its link count. The link count for a directory should always be at least two: its listing in its parent directory and its listing in itself.

There's another rule about the links in directory listings: the inode numbers in a given directory listing refer to inodes on that same mounted volume.

Another restriction on links is they can't make new names for directories because the directories are arranged in a hierarchy.

So, links can't be added to directories, and they can't cross from one mounted volume to another.

A symbolic link (also called a soft link to distinguish it from the true or hard links that we've been talking about up to now) is a special entry in a directory that tells the system to look elsewhere.

A symbolic link can freely cross mounted filesystems or provide a new name for a directory unlike a hard link. A symbolic link can point to any filename, one in this directory or in another oneeven to a file that doesn't exist. But that means a soft link can't keep data from being lost as a hard link can since the symlink doesn't contribute to the link count.

The rmdir operator fails for non-empty directories. As a first pass, you can attempt to delete the contents of the directory with unlink, then try to remove what should now be an empty directory. For example, suppose we need a place to write many temporary files during the execution of a program:
    my $temp_dir = "/tmp/scratch_$$";       # based on process ID; see the text
mkdir $temp_dir, 0700 or die "cannot create $temp_dir: $!";
...
# use $temp_dir as location of all temporary files
...
unlink glob "$temp_dir/* $temp_dir/.*"; # delete contents of $temp_dir
rmdir $temp_dir; # delete now-empty directory

The initial temporary directory name includes the current process ID, which is unique for every running process and is accessed with the $$ variable (similar to the shell).

Using the program's name as well as the process ID is common, so if the program is called quarry, the directory would be something like /tmp/quarry_$$.














No comments: