Regular Expressions

Excerpted from "How to Program in CGI with Perl 5.0" by Stephen Lines


A regular expression consists of zero or more alternative patterns, which are strings of elements. Patterns are separated by the vertical bar character ( | ). An element is either an atom, quantified or unquantified, or an assertion. An unquantified atom always matches a single character, whereas a quantified atom can match zero or more characters. An assertion matches a contextual condition, such as the beginning or the end of a string. A regular expression matches a string if any of its patterns matches some part of that string, element-for-element. Testing always proceeds from left to right and stops at the first complete match. The elements relevant to the MVD are described below.
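The alternation and left-to-right rules above can be tried out in a short sketch. It uses Python's re module as a stand-in for Perl, which is an assumption: the two agree on the elements shown here, but they are not identical in every detail. The helper name first_match is hypothetical.

```python
import re

def first_match(pattern, text):
    """Return (start, matched_text) for the leftmost match, or None."""
    m = re.search(pattern, text)
    return (m.start(), m.group()) if m else None

# Two alternative patterns separated by |; either one may match part of the string.
# The match that begins furthest to the left is the one reported.
leftmost = first_match(r"ab|cd", "xxcdab")
print(leftmost)   # (2, 'cd')

# When two alternatives could match at the same position, the first one listed wins.
same_start = first_match(r"a|ab", "ab")
print(same_start)   # (0, 'a')

# No alternative matches anywhere in the string.
no_match = first_match(r"ab|cd", "xyz")
print(no_match)   # None
```

Reordering the alternatives as ab|a would let the second example match ab instead, which is why the order of alternatives can matter.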

Unquantified Atoms

As an unquantified atom, each character matches itself, unless it is one of the special characters +, ?, ., *, ^, $, (, ), [, ], {, }, |, or \ (not including the commas, which are used here only for readability). The meaning of these special characters will become apparent below. To match one of them as a literal character, precede it with a backslash to "escape" its special meaning. For example, the special character . (period) is a wildcard that matches any single character, but \. matches only a period. In general, a preceding \ escapes the special meaning of any non-alphanumeric character, but it converts most alphanumeric characters into special atoms or assertions. Thus \\ matches a single backslash character. Some of the special atoms are enumerated below, and match as follows:

. (period)    Matches any single character except a newline (\n).

\w              Matches any alphanumeric character, including _.

\W             Matches any non-alphanumeric character, excluding _.

\s               Matches one whitespace character; that is, a tab, newline, vertical tab,
                 form feed, carriage return, or space (ASCII 9 through 13 and 32), which
                 are individually matched by \t, \n, \v, \f, \r, and \40, respectively.

\S              Matches one non-whitespace character.

\d              Matches a digit, 0 through 9.

\D             Matches any character that is not a digit.

\NNN        Matches the character specified by the 2- or 3-digit octal number NNN.
                 For example \177 matches the DEL character (ASCII 127).

\xXX         Matches the character represented by hexadecimal value XX; for example,
                 \xA9 matches the copyright character © (ISO Latin-1 169).

\cC           Matches the control character Ctrl-C, where C is any single character; for
                example, \cH matches a backspace (ASCII 8). This atom is the same as
                \NNN, where NNN is the octal value of ord(C) - 64.

[S]            Matches any character in the class S, where S is specified as a string of
                literal characters (as in [abc$%^&]), a range of characters in ASCII order
                (as in [a-z]), or any combination thereof (as in [a-c$-&^]). Most of the
                special characters lose their special meanings inside the square brackets,
                but a hyphen must be escaped as \- unless it is the first or last character
                of the class, \b matches a backspace (\010) rather than a word boundary,
                and most other backslashed characters retain their special meanings as atoms.
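A few of the atoms listed above can be checked with a short sketch. It again uses Python's re module as a stand-in for Perl (an assumption). The re.ASCII flag is passed so that \w, \d, and \s are limited to the ASCII characters described here. Python's re module does not accept the \cC notation, so the backspace is written in octal instead.

```python
import re

A = re.ASCII  # restrict \w, \d, \s to ASCII, matching the descriptions above

# A bare period is a wildcard; an escaped period matches only a period.
wildcard_hits = re.findall(r"a.c", "abc a.c")      # ['abc', 'a.c']
literal_hits = re.findall(r"a\.c", "abc a.c")      # ['a.c']

# A doubled backslash in the pattern matches one literal backslash.
backslash_hits = re.findall(r"\\", "C:\\temp\\x")  # two backslashes found

# Shorthand atoms and their complements.
word_chars = re.findall(r"\w", "a_1-!", A)         # ['a', '_', '1']
non_word_chars = re.findall(r"\W", "a_1-!", A)     # ['-', '!']
digits = re.findall(r"\d", "x42y", A)              # ['4', '2']
spaces = re.findall(r"\s", "a\tb c\n", A)          # ['\t', ' ', '\n']

# Octal and hexadecimal notations for single characters.
del_found = bool(re.search(r"\177", chr(127)))         # DEL, ASCII 127
copyright_found = bool(re.search(r"\xA9", "\u00a9"))   # ISO Latin-1 169
backspace_found = bool(re.search(r"\010", "\b"))       # backspace, ASCII 8

# Character classes: literals, ranges, and a leading ^ for negation.
in_class = re.findall(r"[a-c$%]", "abz$%d")        # ['a', 'b', '$', '%']
not_in_class = re.findall(r"[^a-c]", "abcd")       # ['d']
```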

Quantifiers and Quantified Atoms

The regular expression quantifiers are the special characters +, *, and ?, and the expressions {N}, {N,}, and {N,M}. A quantified atom is an atom that is followed by a quantifier. If A is any atom, A+ matches A one or more times; that is, it matches one or more adjacent substrings that each match A individually. Similarly, A* matches A zero or more times, and A? matches zero or one occurrence of A. Furthermore, A{N} matches A exactly N times, A{N,} matches A N or more times, and A{N,M} matches a minimum of N and a maximum of M occurrences of A. A quantified atom matches as many characters as possible, unless a ? is appended to the quantifier, in which case the atom matches the smallest substring allowed by the context.
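The difference between the default "as many as possible" behavior and the ?-suffixed "as few as possible" behavior is easiest to see side by side. The following is a minimal sketch using Python's re module as a stand-in for Perl (an assumption); the sample strings are made up for illustration.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Default behavior: .+ takes as many characters as possible.
greedy = re.search(r"<.+>", text).group()
# With ? appended: .+? takes as few characters as the context allows.
lazy = re.search(r"<.+?>", text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>

# Counted quantifiers, tested against whole strings with re.fullmatch.
exact = [s for s in ["abc", "abbc", "abbbc"] if re.fullmatch(r"ab{2}c", s)]
ranged = [s for s in ["ac", "abc", "abbbc", "abbbbc"] if re.fullmatch(r"ab{1,3}c", s)]
at_least = [s for s in ["abc", "abbc", "abbbbc"] if re.fullmatch(r"ab{2,}c", s)]

print(exact)     # ['abbc']
print(ranged)    # ['abc', 'abbbc']
print(at_least)  # ['abbc', 'abbbbc']
```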

Assertions

An assertion is different from an atom in that it doesn't match any characters but rather matches a contextual condition, such as a difference between two adjacent characters. Assertions match as follows:

\A            Matches the beginning of a string.

\Z            Matches the end of a string.

^             For our purposes this is the same as \A.

$             Likewise, the same as \Z.

\b            Matches a word boundary.

\B            Matches a non-boundary.
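Because assertions consume no characters, their effect shows up in where a match is allowed to start or end. The sketch below illustrates this with Python's re module as a stand-in for Perl (an assumption; the two differ in how \Z and $ treat a trailing newline, which the sample strings avoid).

```python
import re

# \A (or ^) anchors to the beginning, \Z (or $) to the end.
at_start = bool(re.search(r"\Aabc", "abcdef"))   # True
at_end = bool(re.search(r"abc\Z", "abcdef"))     # False
at_end_ok = bool(re.search(r"def$", "abcdef"))   # True

sentence = "cat concat cats"

# \b requires a word boundary: only the standalone word at index 0 qualifies.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", sentence)]
# \B requires a non-boundary: only the cat inside concat (index 7) qualifies.
inside_word = [m.start() for m in re.finditer(r"\Bcat", sentence)]

print(whole_word)   # [0]
print(inside_word)  # [7]

# An assertion on its own matches a position, not a character.
boundary_width = len(re.search(r"\b", "cat").group())
print(boundary_width)  # 0
```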

  

Examples of Regular Expressions

abc                abc anywhere in the search string.

^abc              abc at the beginning of the string.

abc$              abc at the end of the string.

ab|cd             ab or cd.

a(b|c)d           a followed by b or c, then d (abd or acd, not abcd).

ab{3}c           a followed by exactly 3 b's, then by c. This is the same as abbbc.

ab{1,3}c        a followed by 1, 2, or 3 b's; then by c. This is the same as abb?b?c.

ab?c              a followed by c with an optional b in between (ac or abc). This is
                     the same as ab{0,1}c.

ab*c              a followed by zero or more b's, then c (ac, abc, abbc etc.). This is
                     the same as ab{0,}c.

ab+c              a followed by one or more b's, then c (abc, abbc, etc.). This is the
                     same as ab{1,}c.

[abc]             Any single character in the bracketed class, namely, a or b or c. This is the
                     same as [a-c] and a|b|c.

[abc]+           Any string of one or more characters from the bracketed class
                     (a, b, c, aa, ab, ac, ba, bb, bc, etc.).

[^abc]           Any single character not in the class inside the brackets. (Note that the ^
                     character has a different special meaning at the beginning of a character
                     class than at the beginning of a pattern. In the interior of a character
                     class, ^ matches itself; elsewhere in a pattern it must be escaped as \^
                     to match a literal caret.)

\w+                Any string of one or more alphanumeric characters, including _. This is the
                     same as [0-9A-Z_a-z]+.

\W+               Any string of one or more non-alphanumeric characters. This is the same as
                     [^\w]+.

abe\b             abe followed by a word boundary (the zero-width position between
                      alphanumeric and non-alphanumeric characters, that is, between
                      characters matched by \w and \W); this expression will not match the
                      abe in abecedarian.

.                     Any single character except a newline (\n).
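Several of the "this is the same as" claims in the examples above can be spot-checked against a handful of sample strings. The sketch below does so with Python's re module as a stand-in for Perl (an assumption). The sample list and the helper name matched are hypothetical, and agreement on these samples illustrates the equivalences without proving them.

```python
import re

samples = ["ac", "abc", "abbc", "abbbc", "abbbbc", "abd", "acd", "abcd"]

def matched(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

# Each pair lists an example pattern and the form it is said to equal.
pairs = [
    (r"ab{3}c", r"abbbc"),
    (r"ab{1,3}c", r"abb?b?c"),
    (r"ab?c", r"ab{0,1}c"),
    (r"ab*c", r"ab{0,}c"),
    (r"ab+c", r"ab{1,}c"),
]
for left, right in pairs:
    print(left, right, matched(left) == matched(right))

# Grouped alternation: b or c between a and d, but not both.
grouped = matched(r"a(b|c)d")
print(grouped)  # ['abd', 'acd']
```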


Mariposa's Variorum Dadabase
