
Regex Procedures




Various Regular Expression Procedures

Learn these 15 reserved symbols first, please.

  1. * is zero or more of preceding
  2. \ means escape
  3. | means or
  4. { } matches preceding X times
  5. [ ] matches any one character in the set or range
  6. ^ not range; also line's beginning
  7. $ is line's end; $1 is a backreference
  8. - range; also command line option
  9. =~ does match
  10. !~ doesn't match
  11. . is any single character
  12. + is 1 or more of preceding
  13. ( ) groups for backreference
  14. ? is 0 or 1 of preceding
  15. / / encloses a match pattern

Let's look closely at Regular Expressions for Linux shell programming first.


*
Zero or more of preceding character.
X*   -- Problem! "Zero" X's would match just about anything!
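XX*   -- Holy cow, this matches a single X followed by zero or more X's, i.e. one or more X's.
The asterisk may be used to squeeze multiple blank spaces in a file down to one: s/  */ /g (two spaces before the asterisk; with only one space the pattern would also match "zero" spaces, everywhere).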
.*
"Period asterisk" combination is used frequently to match zero or more occurrences of any character.
The match is greedy in that it looks for the longest possible pattern.
So s/e.*e/ZZZ/ substitutes -- WITHIN THE CURRENT LINE -- the longest stretch between two "e's" it can find.
E.g. The Unix operating system was pioneered by Ken
becomes ThZZZn -- it takes the longest stretch it can find, not the shortest, and makes its substitution there.
So s/.*// empties an entire line of text (the line itself remains, blank).
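The greedy behaviour above can be sketched in Python, whose `re` module is used here only as a stand-in for sed or Perl (an assumption; the page itself gives only s/// commands). The variable names are illustrative.

```python
import re

line = "The Unix operating system was pioneered by Ken"

# Greedy: e.*e runs from the first "e" on the line to the last one.
greedy = re.sub(r"e.*e", "ZZZ", line, count=1)

# One space followed by zero or more spaces: squeezes runs of blanks.
squeezed = re.sub(r"  *", " ", "too    many   spaces")

print(greedy)
print(squeezed)
```

The first "e" is in "The" and the last is in "Ken", so only "Th" and "n" survive around the replacement.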
\
Escape a special character. In shell programming the backslash may escape parens and other special characters.
\<
Forces match to beginning of word.
\>
End of word.
\1
Back-references an earlier use of (x) and means "match whatever (x) matched here too". In the second compartment of a Perl substitution, $1 accomplishes the same thing (sed uses \1 there as well).
|
xyz|abc
Matches either xyz or abc
(xy|ab)c
Either xyc or abc
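A small sketch of the two alternation forms, again assuming Python's `re` as a stand-in for the shell tools; the test words are made up for illustration.

```python
import re

# xyz|abc: the whole of either alternative.
either = re.compile(r"xyz|abc")
# (xy|ab)c: the group limits the choice to the first two letters.
grouped = re.compile(r"(xy|ab)c")

candidates = ["xyz", "abc", "xyc"]
hits_either = [w for w in candidates if either.fullmatch(w)]
hits_grouped = [w for w in candidates if grouped.fullmatch(w)]

print(hits_either)
print(hits_grouped)
```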
{ . , . } or \{ . , . \}
Matches the preceding item a set number of times: { min,max }
For example, X{1,10} matches the preceding X a minimum of one time and a maximum of ten, BUT it's greedy too: if it has a chance, it will take all ten: XXXXXXXXXX
[A-Za-z]{3,6}
This matches alphas from three characters to six characters.
.{10}
Here we match exactly ten characters.
Z{4,}
At least four "Z's" must be matched.
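The brace counts above can be checked with a short sketch. It assumes Python's `re`, which uses the unescaped { } form; the sample strings are invented for the example.

```python
import re

# {1,10} is greedy: offered twelve X's, it takes ten, not one.
taken = re.match(r"X{1,10}", "X" * 12).group()

# [A-Za-z]{3,6}: three to six letters.
word_ok = re.fullmatch(r"[A-Za-z]{3,6}", "regex") is not None
too_short = re.fullmatch(r"[A-Za-z]{3,6}", "re") is not None

# .{10}: exactly ten characters; Z{4,}: at least four Z's.
ten = re.fullmatch(r".{10}", "0123456789") is not None
three_z = re.search(r"Z{4,}", "ZZZ") is not None

print(taken, word_ok, too_short, ten, three_z)
```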
[ ]
To match a choice of characters.
[tT]   -- Lower or uppercase "T" match.
[0-9]   -- Range must be from low to high.
[^0-9]   -- Caret means Not.
^
Start of line or [^not].
^$   -- Matches empty line.
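As a rough illustration of brackets and the two meanings of the caret, here is a sketch using Python's `re` (an assumption, chosen as a stand-in for sed or Perl) on made-up input.

```python
import re

# [tT]: either case of "t".
t_words = re.findall(r"[tT]he", "The cat and the hat")

# [0-9]: any single digit; [^0-9]: any single character that is not a digit.
digits = re.findall(r"[0-9]", "a1b22")
only_digits = re.sub(r"[^0-9]", "", "a1b22")

# ^$: a line with nothing between its start and its end.
lines = ["one", "", "two"]
blank_indexes = [i for i, text in enumerate(lines) if re.search(r"^$", text)]

print(t_words, digits, only_digits, blank_indexes)
```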
,
as used in { 3 , 5 }
-
Range as used in the [ ... ] construct.
$
End of line.
$1 -- contains backreference made with parens
( ... )
The parens save a match for later re-use. You can recover the saved match in $1, $2, $3 ... $9 (or more in Perl).
(xyz)+
One or more occurrences of whatever xyz matches.
(xyz)?
Zero or one occurrences of whatever xyz matches.
(xyz)*
Zero or more occurrences of whatever xyz matches.
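Saved groups and grouped quantifiers can be sketched as follows. This assumes Python's `re`, where \1 and \2 in the replacement string play the role that $1 and $2 play in Perl; the sample text is invented.

```python
import re

# (\w+) \1: the \1 must repeat whatever the first group matched.
doubled = re.search(r"\b(\w+) \1\b", "it is is a typo").group(1)

# In the replacement, \1 and \2 recover the saved matches.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# Quantifiers applied to a whole group rather than one character.
plus = re.fullmatch(r"(xyz)+", "xyzxyz") is not None
optional = re.fullmatch(r"a(xyz)?b", "ab") is not None

print(doubled, swapped, plus, optional)
```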
+
One or more occurrences of preceding character.
?
Zero or one of preceding character.
.
Any single character (normally excluding the newline).
/ ... /
Encloses search pattern.


Study Guide

A summary of some Regular Expression reserved characters. (Must confirm)

*
preceding matched zero or more
\
escape
|
or
{   }
preceding n times
[   ]
one character from a set or range
^
first or not range
-
option or range
$
x$
match end
$1
saved backreference
=~
binds a variable to a match or substitution and evals t/f
!~
same binding, negated: true when the match fails
(   )
makes backreference
.
match single character
+
preceding matched one or more
?
preceding matched zero or one
/   /
match (m //), substitute (s / / /) or translate (tr / / /)
modifiers
i insensitive; o compile once; m multiple lines; s single line; x extended
arguments
-i.bu edits in place, keeping a backup with the .bu extension; -p prints each processed line; -e executes Perl code from the command line

Regular Expression terms ... but not reserved words.

backreference
save for later re-use
backtrack
regex works by scanning and partially rescanning the line
compartment
compartments one and two are the code between the slashes, as in s/compartment one/compartment two/
match
left-side to m/^$/
regex
Regular Expression
result status
its consistency is questionable, but with this code
$status = ($words =~ /error/g);
$status becomes "1" (YES) or null (empty)
substitute
replaces left with right s/left/right/
same as above s"left"right"
translate
e.g. turns lower case to upper: tr/a-z/A-Z/ (no brackets; tr takes character lists, not regex classes)
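The match status, substitute and translate terms can be approximated in Python. This is only an analogue under stated assumptions: `re.search`, `re.sub` and `str.translate` stand in for Perl's m//, s/// and tr///, and the sample strings are made up.

```python
import re
import string

words = "disk error on sda"

# Rough analogue of testing $words against /error/: true or false.
status = bool(re.search(r"error", words))

# Rough analogue of s/left/right/.
replaced = re.sub(r"left", "right", "turn left")

# Rough analogue of a lower-to-upper translation such as tr/a-z/A-Z/.
table = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)
upper = "shout".translate(table)

print(status, replaced, upper)
```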

Escapes

We touched escape pairs earlier. Remember this is the key commonly called "backslash".
\1
backreference match construct
\b
word boundary
\d
digit
\s
whitespace character
\w
word character (letter, digit or underscore)
\A
beginning of string
\B
not word boundary
\D
not digit
\S
not whitespace
\W
not a word character
\Z
end of string
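A few of these escapes in action, sketched with Python's `re`, which shares the \d, \s, \w, \b, \B and \A escapes with Perl (\Z is left out because its meaning differs slightly between the two). The sample text is invented.

```python
import re

text = "Order 66 shipped, order 7 pending"

digits = re.findall(r"\d+", text)                   # runs of digits
words = re.findall(r"\w+", "a_b c-d")               # underscore counts as a word character
parts = re.split(r"\s+", "one  two\tthree")         # split on any whitespace
whole = re.findall(r"\bcat\b", "cat concat cat.")   # "cat" as a whole word
inner = re.findall(r"\Bcat\b", "cat concat cat.")   # "cat" ending a longer word
at_start = re.search(r"\AOrder", text) is not None  # anchored to start of string

print(digits, words, parts, whole, inner, at_start)
```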

Some common usages:

s/<\/*.>//g
strips single-character HTML tags such as <b>, </b> and <p>
s/<[^>]*>|$//gi
strips HTML
m/\bapples?\b/i
boundary apple and boundary apples, case insensitive
s/.*//
empties the line (the blank line remains)
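Two of these usages, sketched with Python's `re` as a stand-in for the Perl one-liners. The HTML fragment and the fruit sentence are invented, and the tag pattern is a simplification that will not cope with every real-world page.

```python
import re

# Strip anything that looks like a tag: "<", any run of non-">", then ">".
html = "<p>Regex <b>rocks</b></p>"
stripped = re.sub(r"<[^>]*>", "", html)

# \bapples?\b with case ignored: "apple" or "apples" as whole words only.
apple = re.compile(r"\bapples?\b", re.IGNORECASE)
found = apple.findall("Apple pie, two APPLES, pineapple, applesauce")

print(stripped)
print(found)
```

The word boundaries are what keep "pineapple" and "applesauce" out of the result.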

Other ...

How are new lines treated?
Linux
"LF" \012
DOS
"CRLF" \015\012
Mac (classic Mac OS)
"CR" \015

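The three conventions above can be normalised with one substitution. This is a minimal sketch assuming Python's `re`; the helper name `to_unix` is hypothetical.

```python
import re

def to_unix(text: str) -> str:
    """Convert CRLF (\\015\\012) and lone CR (\\015) line endings to LF (\\012)."""
    # CRLF is listed first so it is replaced as a pair, not as CR then LF.
    return re.sub(r"\r\n|\r", "\n", text)

mixed = "dos\r\nmac\runix\n"
print(repr(to_unix(mixed)))
```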
  1. G. Stafford - RegEx
    from www.uniforum.chi.il.us