Perl Regular Expressions


How it is used

Reference:

Perl Metacharacter Summary

Items that match a single character

. dot Match any one character (except newline, unless the /s modifier is used)
[...] character class Match any character listed
[^...] negated character class Match any character not listed
\t tab Match HT or TAB character
\n new line Match LF or NL character
\r return Match CR character
\f form feed Match FF (Form Feed) character
\a alarm Match BELL character
\e escape Match ESC character
\nnn Character in octal, e.g. \033 Match equivalent character
\xnn Character in hexadecimal, e.g. \x1B Match equivalent character
\cX Control character, e.g. \c[ (ESC) or \cA Match the corresponding control character
\l lowercase next character
\u uppercase next character
\L lowercase characters till \E
\U uppercase characters till \E
\E end case modification
\Q quote (disable) pattern metacharacters till \E
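
The last few entries are easiest to see in action. The following is a minimal sketch, not part of the original notes: the variable names and sample strings are made up for illustration. It shows \Q...\E protecting an interpolated string, and the case-modification escapes applied inside double-quoted strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: a literal string that happens to contain metacharacters.
my $literal = 'a.b*c';

# Without \Q the "." and "*" keep their special meaning, so unrelated text matches.
my $loose  = ('aXbbbc' =~ /^$literal$/)     ? 1 : 0;
# With \Q...\E the interpolated text is matched character for character.
my $strict = ('aXbbbc' =~ /^\Q$literal\E$/) ? 1 : 0;
my $exact  = ('a.b*c'  =~ /^\Q$literal\E$/) ? 1 : 0;
print "loose=$loose strict=$strict exact=$exact\n";

# Case-modification escapes act on the text that follows them.
my $name  = 'perl';
my $title = "\u$name";            # first letter uppercased
my $shout = "\U$name\E rocks";    # uppercased up to \E
my $quiet = "\LHELLO\E World";    # lowercased up to \E
print "$title / $shout / $quiet\n";
```

Only the unquoted pattern accepts 'aXbbbc'; with \Q...\E the pattern accepts nothing but the literal text 'a.b*c'.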

Example 1: character class
if ($string =~ /[01][0-9]/) {
     print "$string contains digits 00 to 19\n";
} else {
     print "$string does not contain digits 00 to 19\n";
}
 

Example 2: negated character class
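# [A-z] would also admit the punctuation between "Z" and "a" ([ \ ] ^ _ `), so list both ranges
if ($string =~ /[^A-Za-z]/) { print "$string contains non-letter characters\n"; }
else { print "$string does not contain non-letter characters.\n"; }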
 

Class Shorthand: Items that match a single character in a predefined character class

\w  Match a "word" character (alphanumeric  plus "_")
\W  Match a non-word character
\s  Match a whitespace character
\S  Match a non-whitespace character
\d  Match a digit character
\D  Match a non-digit character
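
As a small sketch of the shorthand classes (the sample line and variable names are invented for illustration), the following pulls words and digits out of a string, splits it on whitespace, and strips non-word characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line mixing word characters, digits, and whitespace.
my $line = "id_42 = 3.14\tok";

my @words  = $line =~ /(\w+)/g;    # runs of letters, digits, and "_"
my @digits = $line =~ /(\d+)/g;    # runs of digits only
my @fields = split /\s+/, $line;   # split on runs of spaces or tabs
(my $packed = $line) =~ s/\W//g;   # delete every non-word character

print "words:  @words\n";
print "digits: @digits\n";
print "fields: @fields\n";
print "packed: $packed\n";
```

Note that \w treats "id_42" as one word because "_" and digits are word characters, while "." splits "3.14" into two.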

Quantifiers: Items appended to provide "Counting"

* Match 0 or more times
+ Match 1 or more times
? Match 0 or 1 times
{n} Match exactly n times
{n,} Match at least n times
{n,m} Match at least n but no more than m times (write it without a space after the comma)
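
To compare the quantifiers side by side, here is a minimal sketch (the sample strings and the %result hash are illustrative only) that records which quantified patterns accept strings with zero, one, two, and four b's.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical samples: "a", some number of "b", then "c".
my @samples = ('ac', 'abc', 'abbc', 'abbbbc');
my %result;

for my $s (@samples) {
    my @hits;
    push @hits, 'b*'     if $s =~ /^ab*c$/;       # 0 or more
    push @hits, 'b+'     if $s =~ /^ab+c$/;       # 1 or more
    push @hits, 'b?'     if $s =~ /^ab?c$/;       # 0 or 1
    push @hits, 'b{2}'   if $s =~ /^ab{2}c$/;     # exactly 2
    push @hits, 'b{2,}'  if $s =~ /^ab{2,}c$/;    # at least 2
    push @hits, 'b{1,3}' if $s =~ /^ab{1,3}c$/;   # between 1 and 3
    $result{$s} = join ' ', @hits;
    print "$s: $result{$s}\n";
}
```

The anchors ^ and $ matter here: without them, b{2} would also succeed inside 'abbbbc'.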

Items That Match Positions

^ Caret, Match start of the string (with /m, multiline matching, it also matches at the start of every line)
$ Match end of the string, or before a newline at the end (with /m, multiline matching, it also matches at the end of every line)
\b Match a word boundary
\B Match a non-(word boundary)
\A  Match only at beginning of string
\Z  Match only at end of string, or before newline at the end
\z  Match only at end of string
\G Match only where previous m//g left off (works only with /g)
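
The position items can be tried out with a short sketch like the one below. The test strings and variable names are assumptions made for the example, not taken from the course programs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "one\ntwo\nthree\n";

# Without /m, ^ and $ anchor to the whole string, so no single line matches.
my @whole = $text =~ /^(\w+)$/g;
# With /m, ^ and $ also match at every embedded line start and end.
my @lines = $text =~ /^(\w+)$/mg;

# \Z tolerates one trailing newline; \z insists on the very end of the string.
my $big_z   = ($text =~ /three\Z/) ? 1 : 0;
my $small_z = ($text =~ /three\z/) ? 1 : 0;

# \b keeps "cat" from matching inside "concat" or "cats".
my $cats = () = "cat concat cats" =~ /\bcat\b/g;

# \G restarts each match exactly where the previous /g match ended,
# so the loop stops at the first non-digit instead of skipping ahead.
my $mixed = "123abc456";
my @lead;
push @lead, $1 while $mixed =~ /\G(\d)/g;

print "whole=", scalar(@whole), " lines=@lines\n";
print "\\Z=$big_z \\z=$small_z cats=$cats lead=@lead\n";
```

Dropping \G from the last pattern would let the loop continue past "abc" and collect 4, 5, and 6 as well.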

Grouping and Alternation

 
| Alternation, Match either expression it separates
(...) Limit scope of alternation, Provide grouping for the quantifiers, Capture matched substrings for backreferences.
\1, \2, ... Backreference, Match text previously matched within first, second, ..., set of parentheses.
(?:...) Grouping only, non-capturing parentheses
(?=...) Positive lookahead, non-capturing parentheses
(?!...) Negative lookahead, non-capturing parentheses
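
A brief sketch of grouping, alternation, and backreferences follows. The sample strings are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parentheses limit the alternation to the one letter: gr(a|e)y, not "gra" or "ey".
my @greys = "gray grey griy" =~ /\b(gr(?:a|e)y)\b/g;

# \1 must repeat whatever the first group captured: this finds a doubled word.
my ($doubled) = "this is is a test" =~ /\b(\w+) \1\b/;

# (?:...) groups without capturing, so the name lands in $1, not $2.
my ($who) = "reply-to: bob" =~ /(?:return|reply)-to: (\w+)/;

# Grouping lets a quantifier apply to a whole sub-pattern.
my $repeated = ("abcabcabc" =~ /^(?:abc){3}$/) ? 1 : 0;

print "greys=@greys doubled=$doubled who=$who repeated=$repeated\n";
```

The leading \b in the doubled-word pattern prevents a false hit on the "is" inside "this".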

Example 3: string contains only digits
if ($string =~ /^\d+$/) {
      print "$string contains only digits.<BR>\n";
} else {
      print "$string does not contain only digits.<BR>\n";
}

Example 4: contain IP address

foreach $string (@testdata) {
   if ($string =~ /(\d+)(\.\d+){3}/) {
       print "$string", ' matches  /(\d+)(\.\d+){3}/', "\n";
   } else {
       print "$string", ' does not match  /(\d+)(\.\d+){3}/', "\n";
   }
   # if ($string !~ /([^.]+)\.([^.]+)\.([^.]+)\.([^.]+)/) {
   # a.b.c.d will be considered as legal ip address
   # without ^ and $ below -123.235.1.248 is a legal ip address
   if ($string !~ /^([\d]+)\.([\d]+)\.([\d]+)\.([\d]+)$/) {
      print "$string not an IP address\n";
      next;
   }
   $notIP = 0;
   foreach $s (($1, $2, $3, $4)) {
      print "s=$s;";
      if (0 > $s || $s > 255) {
          $notIP = 1;
           last;
      }
   }
   if ($notIP) { print "\n$string is not an IP address\n"; }
   else { print "\n$string is an IP address\n"; }
}

Example 5: Extract URL fields
$url = param('url');
print "url=$url<BR>\n";
$url =~ m|(\w+)://([^/:]+)(:\d+)?/(.*)|;  # use m|...| so that we do not need to use a lot of "\/"
$protocol = $1;
$domainName = $2;
$uri = "/" . $4;
print "\$3=$3<BR>\n";
if ($3 =~ /:(\d+)/) { $portNo = $1} else { $portNo = 80}
print "protocol=$protocol<BR>domainName=$domainName<BR>
portNo=$portNo<BR> uri=$uri<BR>\n";

The above code was used in checkurl.pl to parse the fields of the following URL:

URL:

Greedy vs. Non-Greedy Matching

Putting a ? after a quantifier (*?, +?, ??, {n,m}?) makes it non-greedy: it matches as few characters as possible instead of as many as possible.
Example 6: lazyquote.pl
#!/usr/bin/perl
print "Illustrate the lazy operator?\nHere is the text:\n";
$text = "The name \"McDonald's\" is said \"makudonarudo\" in Japanese";
print "$text\n";
$text =~ /(".*")/;
print "/(\".*\")/ matches $1\n";
$text =~ /(".*?")/;
print "/(\".*?\")/ matches $1\n";
$text =~ /("[^"]*")/;
print "/(\"[^\"]*\")/ matches $1\n";

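Example 7: /re(?:turn-to: |ply-to: )/ can be faster than /(?:return-to|reply-to): / because the common prefix "re" is matched once, before the alternation is tried.
/Bill(?= The Cat| Clinton)/ matches 'Bill', but only if it is followed by ' The Cat' or ' Clinton'. The lookahead text is not part of the match.

/OH \d+(?!\.)/ matches 'OH 44272' in full when no '.' follows. Lookahead is zero-width and non-capturing: it consumes no characters and does not put anything into $1.
/OH \d+(?=[^.])/ applied to 'OH 44272' matches only 'OH 4427': the lookahead requires one more character that is not '.', so \d+ gives back the last digit 2.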
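
The lookahead behaviour described above can be checked with a short sketch. The test strings beyond 'OH 44272' are assumptions added for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Positive lookahead: "Bill" counts only when one of the two names follows.
my $people = "Bill Clinton, Bill Gates, Bill The Cat";
my $bills  = () = $people =~ /Bill(?= The Cat| Clinton)/g;

# Negative lookahead with nothing after the number: the whole number matches.
my ($at_end)     = "OH 44272"  =~ /(OH \d+(?!\.))/;
# With a trailing ".", \d+ backtracks one digit so the next character is not ".".
my ($before_dot) = "OH 44272." =~ /(OH \d+(?!\.))/;
# Positive lookahead for a non-"." character needs a character to exist,
# so at the end of the string \d+ gives back the final digit.
my ($needs_char) = "OH 44272"  =~ /(OH \d+(?=[^.]))/;

print "bills=$bills\n";
print "at_end=$at_end before_dot=$before_dot needs_char=$needs_char\n";
```

The second and third OH cases show that a lookahead does not stop the digits before it from backtracking, which is easy to overlook when validating numbers.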
 

Modes (modifiers), appended at the end of the regular expression

 
i ignore case
g global; in a substitution s/.../.../g, replace every match instead of only the first
m multiline matching mode (^ and $ also match at the start and end of every line)
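
A minimal sketch of the three modes, using an invented three-line string, counts matches with and without each modifier.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Perl\nperl\nPERL\n";

my $exact_case = () = $text =~ /perl/g;       # case-sensitive: only "perl"
my $any_case   = () = $text =~ /perl/ig;      # /i: all three spellings
my $no_m       = () = $text =~ /^perl$/ig;    # ^ and $ span the whole string
my $with_m     = () = $text =~ /^perl$/img;   # /m: ^ and $ match on each line

(my $first = $text) =~ s/perl/X/i;    # without /g only the first match changes
(my $every = $text) =~ s/perl/X/ig;   # with /g every match changes

print "exact=$exact_case any=$any_case no_m=$no_m with_m=$with_m\n";
print "first: ", join('|', split /\n/, $first), "\n";
print "every: ", join('|', split /\n/, $every), "\n";
```

The "= () =" idiom forces list context so that the scalar receives the number of matches.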

Example 8: $var =~ s/\bJeff\b/Jeff/igm;
Try removing any combination of the i, g, and m modes in the following program and observe the effect.
#!/usr/bin/perl

$text = "JeFFerson JEFF jeff\nJeFF\t JefF\nJEff JEFf\n";

print "text=$text\n";
$text =~ s/^\bJeff\b/Jeff/igm;
print "resulting text=$text";

Example 9: Extracting the URLs from the href and src attributes in an HTML file.
#!/usr/bin/perl
use CGI qw(:standard);
print header();

$file = param('file');
print "file=$file<br>\n";
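open(IN, '<', $file) or die "Cannot open $file: $!";   # three-argument open; stop if the file is missing
@lines=<IN>;
$text = join "", @lines;   # each line already ends with its own "\n"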

@srcs=($text =~ m|src\s*=\s*\"([^\"]+)\"|ig);
@hrefs=($text =~ m|href\s*=\s*\"([^\"]+)\"|ig);

print "<P>list of href values<BR>\n";
$count = 1;
foreach $href (@hrefs) {
   print "$count href=$href<BR>\n";
   $count++;
}
print "<P>list of src values<BR>\n";
$count = 1;   # restart numbering for the src list
foreach $src (@srcs) {
   print "$count src=$src<BR>\n";
   $count++;
}
close(IN);

http://cs.uccs.edu/cgi-bin/cs301/listurl.pl?file=CS301F98photo.html

http://cs.uccs.edu/cgi-bin/cs301/listurl.pl?file=test.html
test.html content:
<a href=   "test.html"> <img src ="test.jpg">
<a href=
"http://cs.uccs.edu/~cs301/perl/re.htm">
<img src=
"http://cs.uccs.edu/~cs301/images/chow.jpg">