Regex
from perldoc perlretut... A regular expression is simply a string that describes a pattern. Patterns are in common use these days; examples are the patterns typed into a search engine to find web pages and the patterns used to list files in a directory, e.g., ls *.txt or dir *.*. In Perl, the patterns described by regular expressions are used to search strings, extract desired parts of strings, and to do search and replace operations.
Find...
Find the position of all matches
$str = <<EOS;
Tell his soul with sorrow laden if,
within the distant Aidenn,
It shall clasp a sainted maiden
whom the angels name Lenore --
Clasp a rare and radiant maiden
whom the angels name Lenore."
Quoth the Raven, "Nevermore."
EOS
while($str =~ m/\wen\b/g) {print pos($str), " " };
Here we want to know the position of any word that ends with "en" (that is, we want to match a word character followed by "en", followed by a word boundary). If we had wanted the first position of a known word we could have used Perl's index() function, but since we are searching for several occurrences of a pattern, the pos() function is a better choice. In a search with the g modifier, pos() returns the offset at which the last match ended. Perl tracks this offset anyway, because each repeated g search resumes right after the previous match. One trap with the pattern we used is that pos() reports where the match ends, not where it begins. If you want to know where the matched word begins, use this pattern instead: \b(?=\w+en\b), which uses the look-ahead wrapper (?=...). This changes the meaning slightly to "match a word boundary if it comes before a word ending in 'en'." Now we are matching the preceding word boundary, not the word itself, so the match has zero width and pos() returns the position of the first character of the word, which is exactly what we want!
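To see the difference between the two patterns side by side, here is a minimal sketch. The sample string and the variable names are made up for illustration. The last loop is an alternative that is not part of the trick above: it reads the start offset of the match from Perl's built-in @- array ($-[0]) instead of relying on a look-ahead.

```perl
use strict;
use warnings;

# Hypothetical sample string, chosen so the offsets are easy to check by hand.
my $text = "laden maiden open";

my (@ends, @starts, @starts_alt);

# Plain pattern: pos() reports the offset just past each match.
push @ends, pos($text) while $text =~ /\wen\b/g;

# Look-ahead pattern: the match is zero-width, so pos() is the word start.
push @starts, pos($text) while $text =~ /\b(?=\w+en\b)/g;

# Alternative: match the whole word and read its start offset from @-.
push @starts_alt, $-[0] while $text =~ /\w+en\b/g;

print "ends:       @ends\n";
print "starts:     @starts\n";
print "starts_alt: @starts_alt\n";
```

Each failed g search resets pos() for the string, which is why the second and third loops start again from the beginning.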
Find and highlight doubled words
$str = "Thus joyful Troy maintained the the watch of night...";
$str =~ s/(\b(\w+)\s+\2\b)/[$1]/g;
print $str;
This pattern is designed to match a word followed by white-space and then the same word again. We define it as a word boundary (\b), followed by one or more word characters (\w+), followed by one or more white-space characters (\s+), followed by a backreference to whatever the second parenthesized group matched (\2), followed by another word boundary. The whole match is then replaced by itself wrapped in square brackets.
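The closing \b is easy to overlook, so here is a small sketch of what it buys us. The test strings are invented for illustration. The first contains two real doubled words. The second contains "the theme", which would be a false hit if the pattern did not insist on a word boundary after the repeated word.

```perl
use strict;
use warnings;

# Hypothetical inputs: one with genuine doubles, one with a near miss.
my @inputs = ("the the theme is is clear", "the theme");
my @marked;

for my $text (@inputs) {
    # Copy first, then substitute on the copy so the input stays intact.
    (my $copy = $text) =~ s/(\b(\w+)\s+\2\b)/[$1]/g;
    push @marked, $copy;
}

print "$_\n" for @marked;
```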
Find sub calls and evaluate them
$_ = "guess_name() is your name.";
s/(\b\w+\(\))/$1/eeg;
print;
sub guess_name {
return "Rumpelstiltskin";
}
This trick demonstrates an unusual use of the evaluate modifier on a substitution. The e modifier treats the replacement as Perl code and evaluates it before substituting the result for the matched text. In this case we are looking for something that looks like a sub call with no arguments, that is, "a word boundary, followed by one or more word characters, followed by an opening and a closing parenthesis." Since we wrapped the pattern in parentheses, the matched string is stored in the special variable $1 (notice we had to escape the inner parentheses so that Perl treats them as literal characters rather than as grouping). The first e evaluates the replacement expression $1, which yields the string "guess_name()". The second e evaluates that string as code, which calls the sub and yields "Rumpelstiltskin".
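Because the second e runs whatever text happens to match, you may prefer a variation that only calls subs you have listed yourself. The sketch below is a different technique from the ee trick, not a restatement of it: it uses a single e and a lookup hash. The hash name %allowed and the name unknown_sub are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical whitelist of subs that the text is allowed to call.
my %allowed = (
    guess_name => sub { return "Rumpelstiltskin" },
);

my $text = "guess_name() is your name; unknown_sub() stays.";

# One e modifier: the replacement is a Perl expression, evaluated once.
# Known names are called; anything else is put back unchanged.
$text =~ s{\b(\w+)\(\)}{
    exists $allowed{$1} ? $allowed{$1}->() : $1 . '()'
}eg;

print "$text\n";
```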
Count...
Count the letters in a string
$str = "And now to Xanthus' gliding stream they dove...";
$count = $str =~ s/([a-z])/$1/gi;
print $count;
If you merely want a count of every character in a string, use the built-in function length(), but this trick lets you count only the characters in a particular range (such as the letters a through z). The trap that novice regexers fall into is assuming that a substitution (s///) bound to a string with =~ returns the modified string. In fact it modifies its left-hand argument in place and returns the number of substitutions made. We can turn this to our advantage by substituting every character in the range (denoted by the class [a-z]) with itself. This is done with the special variable $1, which holds the text matched by the first parenthesized group in the search pattern. The end result is that the string is left unaltered and we get a count of the characters in the given range. You may have noticed the two modifiers on the pattern: g (global) so that every character is tested, and i (ignore case) so that both lowercase and uppercase letters match.
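For comparison, here is a sketch that runs the substitution trick next to two other ways of getting the same count: the tr/// operator with an empty replacement list, and a g match in list context. These two are alternatives added for illustration, not part of the trick itself. The variable names are made up.

```perl
use strict;
use warnings;

my $text     = "And now to Xanthus' gliding stream they dove...";
my $original = $text;

# The trick from above: replace each letter with itself, keep the count.
my $by_subst = ($text =~ s/([a-z])/$1/gi);

# tr/// with an empty replacement list counts without changing the string.
my $by_tr = ($text =~ tr/a-zA-Z//);

# A g match in list context, forced to a count by the empty-list assignment.
my $by_match = () = $text =~ /[a-z]/gi;

print "s///: $by_subst  tr///: $by_tr  m//g: $by_match\n";
print "string unchanged\n" if $text eq $original;
```

All three should agree, and the string should come through untouched.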
Count the words in a string
$str = "And now to Xanthus' gliding stream they dove...";
$count = $str =~ s/((^|\s)\S)/$1/g;
print $count;
This trick is just a variation on the previous character-counting trick. This time we need a search pattern that matches whatever we consider a "word" to be. For our purposes we will say we have found a word each time a white-space character (\s) is followed by a non-white-space character (\S). There is a trap to avoid, however: the very first word in a string will usually not match this pattern, since it usually isn't preceded by a white-space character. To handle this special case we add an alternative to the pattern. The caret ^ matches the beginning of the string we are searching. With this pattern we are counting non-white-space characters that either follow white-space or sit at the beginning of the string. The result is a count of all the word beginnings in our string.
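Here is a sketch that wraps the same idea in a helper and checks it against a few awkward inputs (leading white-space, runs of spaces, an empty string). The helper name count_words is made up, and it counts with a g match in list context rather than a substitution, which is a small change from the trick above. The split ' ' comparison is added only as a cross-check.

```perl
use strict;
use warnings;

# Hypothetical helper: counts word beginnings, i.e. a non-white-space
# character at the start of the string or right after white-space.
sub count_words {
    my ($text) = @_;
    my $count = () = $text =~ /(?:^|\s)\S/g;
    return $count;
}

for my $text ("And now to Xanthus' gliding stream they dove...",
              "  two  words ",
              "") {
    my @words = split ' ', $text;   # cross-check: split on white-space
    printf "%-50s regex=%d split=%d\n",
        "'$text'", count_words($text), scalar @words;
}
```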