PLT MzScheme: Language Manual
Figure 1: Grammar for regular expressions
The format of a regular expression is specified by the grammar in Figure 1. A few subtle points about the regexp language are worth noting:
When an opening square bracket (``['') that starts a range is immediately followed by a closing square bracket (``]''), then the closing square bracket is part of the range, instead of ending an empty range. For example, "[]a]" matches any string that contains a lowercase ``a'' or a closing square bracket. A dash (``-'') at the start or end of a range is treated specially in the same way: it is a literal member of the range rather than a separator between two endpoints.
When a caret (``^'') or dollar sign (``$'') appears in the middle of a regular expression (not in a range), the resulting regexp is legal even though it is usually not matchable. For example, "a$b" is unmatchable, because no string can contain the letter ``b'' after the end of the string. In contrast, "a$b*" matches any string that ends with a lowercase ``a'', since zero ``b''s will match the part of the regexp after ``$''.
A backslash (``\'') in a regexp pattern specified with a Scheme string literal must be protected with an additional backslash. For example, the string "\\." describes a pattern that matches any string containing a period. In this case, the first backslash protects the second to generate a Scheme string containing two characters; the second backslash (which is the first character in the actual string value) protects the period in the regexp pattern.
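The three points above can be checked with a few small matches. This is only an illustrative sketch using regexp-match as described later in this section; the test strings are made up for the example.

```scheme
;; 1. A "]" directly after "[" is a member of the range.
(regexp-match "[]a]" "x]y")   ; => '("]")
(regexp-match "[]a]" "xyz")   ; => #f

;; 2. A "$" in the middle of a pattern is legal but rarely matchable.
(regexp-match "a$b" "ab")     ; => #f
(regexp-match "a$b*" "cba")   ; => '("a")

;; 3. The Scheme literal "\\." is the two-character pattern \.
(string-length "\\.")         ; => 2
(regexp-match "\\." "a.b")    ; => '(".")
```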
The regular expression procedures are:
(regexp string) takes a string representation of a regular expression and compiles it into a regexp value. Other regular expression procedures accept either a string or a regexp value as the matching pattern. If a regular expression string is used multiple times, it is faster to compile the string once to a regexp value and use it for repeated matches instead of using the string each time. The object-name procedure (see section 6.2.4) returns the source string for a regexp value.
(regexp? v) returns #t if v is a regexp value created by regexp, #f otherwise.
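As a brief sketch of compiling a pattern once and reusing it (the name digits and the sample strings are hypothetical, chosen only for illustration):

```scheme
;; Compile the pattern once; reuse the regexp value for every match.
(define digits (regexp "[0-9]+"))

(regexp? digits)                   ; => #t
(regexp? "[0-9]+")                 ; => #f  (a string is not a regexp value)

(regexp-match digits "room 42")    ; => '("42")
(regexp-match digits "floor 7")    ; => '("7")

;; A string pattern gives the same result, but is compiled on each use.
(regexp-match "[0-9]+" "room 42")  ; => '("42")

;; Recover the source string from the compiled value.
(object-name digits)               ; => "[0-9]+"
```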
(regexp-match pattern string [start-k end-k output-port]) attempts to match pattern (a string or a regexp value) to a portion of string; see below for information on using an input port in place of string.
The optional start-k and end-k arguments select a substring of string for matching, and the default is the entire string. The end-k argument can be #f, which is the same as not supplying end-k. The matcher finds a portion of string that matches pattern and is closest to the start of the selected substring.
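A small sketch of how start-k and end-k narrow the search; the sample string is made up for illustration:

```scheme
;; Character positions: a=0 1=1 b=2 2=3 2=4 c=5 3=6 3=7 3=8
(regexp-match "[0-9]+" "a1b22c333")       ; => '("1")
(regexp-match "[0-9]+" "a1b22c333" 2)     ; => '("22")
(regexp-match "[0-9]+" "a1b22c333" 5 7)   ; => '("3")   (only "c3" is searched)
(regexp-match "[0-9]+" "a1b22c333" 5 #f)  ; => '("333") (#f means no end limit)
```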
If the match fails, #f is returned. If the match succeeds, a list containing strings, and possibly #f, is returned. The first string in this list is the portion of string that matched pattern. If two portions of string can match pattern, then the match that starts earliest is found.
Additional strings are returned in the list if pattern contains parenthesized sub-expressions; matches for the sub-expressions are provided in the order of the opening parentheses in pattern. When sub-expressions occur in branches of an ``or'' (``|''), in a ``zero or more'' pattern (``*''), or in a ``zero or one'' pattern (``?''), a #f is returned for the expression if it did not contribute to the final match. When a single sub-expression occurs in a ``zero or more'' pattern (``*'') or a ``one or more'' pattern (``+'') and is used multiple times in a match, then the rightmost match associated with the sub-expression is returned in the list.
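The shape of the result list can be illustrated with a few patterns; these are illustrative inputs, not taken from the manual:

```scheme
;; Sub-expressions are reported in order of their opening parentheses.
(regexp-match "((a)(b))" "ab")   ; => '("ab" "ab" "a" "b")

;; A sub-expression that did not contribute to the match yields #f.
(regexp-match "(a)|(b)" "b")     ; => '("b" #f "b")
(regexp-match "(a)?b" "b")       ; => '("b" #f)

;; A repeated sub-expression reports its rightmost match.
(regexp-match "(a|b)*c" "abc")   ; => '("abc" "b")

;; The earliest-starting match wins.
(regexp-match "b+" "abbabbb")    ; => '("bb")
```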
If the optional output-port is provided, the part of string that precedes the match is written to the port. All of string up to end-k is written to the port if no match is found. This functionality is not especially useful, but it is provided for consistency with regexp-match on input ports.
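A minimal sketch of the output-port argument with a string, using string ports to capture what is written (the names out and out2 are hypothetical):

```scheme
;; On success, the text before the match is written to the port.
(define out (open-output-string))
(regexp-match "c+" "abccd" 0 #f out)   ; => '("cc")
(get-output-string out)                ; => "ab"

;; On failure, the whole searched string is written.
(define out2 (open-output-string))
(regexp-match "z" "abccd" 0 #f out2)   ; => #f
(get-output-string out2)               ; => "abccd"
```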
(regexp-match pattern input-port [start-k end-k output-port]) is similar to regexp-match with a string (see above), except that the match is found in the stream of characters produced by input-port. The optional start-k argument indicates the number of characters to skip before matching pattern, and end-k indicates the maximum number of characters to consider (including skipped characters). The end-k argument can be #f, which is the same as not supplying end-k. The default is to skip no characters and read until the end-of-file if necessary. If the end-of-file is reached before start-k characters are skipped, the match fails.
In pattern, a start-of-string caret (``^'') refers to the first read position after skipping, and the end-of-string dollar sign (``$'') refers to the end-kth read character or the end of file, whichever comes first.
The optional output-port receives all characters that precede a match in the input port, or up to end-k characters (by default the entire stream) if no match is found.
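As a rough sketch of matching on an input port, the following uses string ports so that the effect on the stream is easy to observe. The port names and the sample text are assumptions made for the example.

```scheme
(define in (open-input-string "key=value"))
(define skipped (open-output-string))

;; Search the stream for "="; everything before the match goes to skipped.
(regexp-match "=" in 0 #f skipped)

;; The characters preceding the match were copied to the output port.
(get-output-string skipped)   ; => "key"

;; Everything up to and including the match has been consumed,
;; so the next read starts just after the "=".
(read-char in)                ; => #\v
```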
When matching an input port stream, all characters up to and including the match are eventually read from the port, but matching proceeds by first peeking characters from the port (using peek-string-avail!; see section 11.2.1), and then (re-)reading characters to discard them after the match result is determined. The matcher peeks in blocking mode only as far as necessary to determine a match, but it may peek extra characters to fill an internal buffer if immediately available (i.e., without blocking). Greedy repeat operators in pattern, such as ``*'' or ``+'', tend to force reading the entire content of the port to determine a match.
If the port is read simultaneously by another thread, or if the port is a custom port with inconsistent reading and peeking procedures (see section 11.1.6), then the characters that are peeked and used for matching may be different than the characters read and discarded after the match completes. The matcher inspects only the peeked characters.
(regexp-match-positions pattern string-or-input-port [start-k end-k output-port]) is like regexp-match, but returns a list of number pairs (and #f) instead of a list of strings. Each pair of numbers refers to a range of characters in string-or-input-port in a substring-compatible manner for strings, independent of start-k. In the case of an input port, the returned positions indicate the number of characters that were read before the first matching character.
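A short sketch of how the returned pairs line up with substring; the names s and pos are hypothetical:

```scheme
(define s "a-12--345b")

;; Each pair is (start . end), suitable for passing to substring.
(define pos (car (regexp-match-positions "[0-9]+" s)))
pos                                   ; => '(2 . 4)
(substring s (car pos) (cdr pos))     ; => "12"

;; Positions are relative to the whole string, even when start-k is given.
(regexp-match-positions "[0-9]+" s 4) ; => '((6 . 9))
```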
(regexp-match-peek pattern input-port [start-k end-k]) is like regexp-match on input ports, but only peeks characters from input-port instead of reading them.
(regexp-match-peek-positions pattern input-port [start-k end-k]) is like regexp-match-positions on input ports, but only peeks characters from input-port instead of reading them.
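The difference between peeking and reading can be sketched with a string port (the port name in is hypothetical): after a peeking match, the port still delivers its first character.

```scheme
(define in (open-input-string "abc"))

;; Peek for "b": the match is located, but nothing is consumed.
(regexp-match-peek-positions "b" in)   ; => '((1 . 2))

;; The stream is still positioned at its first character.
(read-char in)                         ; => #\a
```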
(regexp-replace pattern string insert-string) performs a match using pattern on string and then returns a string in which the matching portion of string is replaced with insert-string. If pattern matches no part of string, then string is returned unmodified.
If insert-string contains ``&'', then ``&'' is replaced with the matching portion of string before it is substituted into string. If insert-string contains ``\n'' (for some integer n), then it is replaced with the nth matching sub-expression from string [15]. ``&'' and ``\0'' are synonymous. If the nth sub-expression was not used in the match or if n is greater than the number of sub-expressions in pattern, then ``\n'' is replaced with the empty string.
A literal ``&'' or ``\'' is specified as ``\&'' or ``\\'', respectively. If insert-string contains ``\$'', then ``\$'' is replaced with the empty string. (This can be used to terminate a number n following a backslash.) If a ``\'' is followed by anything other than a digit, ``&'', ``\'', or ``$'', then the ``\'' is treated as ``\0''.
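The insert-string rules can be sketched as follows. Each backslash is doubled because the patterns are written as Scheme string literals; the sample strings are made up for illustration.

```scheme
;; "&" and "\0" both stand for the whole match.
(regexp-replace "b+" "abbbc" "[&]")        ; => "a[bbb]c"
(regexp-replace "b+" "abbbc" "[\\0]")      ; => "a[bbb]c"

;; "\&" is a literal ampersand.
(regexp-replace "b+" "abbbc" "[\\&]")      ; => "a[&]c"

;; An unused sub-expression is replaced with the empty string.
(regexp-replace "(b)(x)?" "abc" "<\\2>")   ; => "a<>c"

;; "\$" is empty, so it separates "\1" from a following digit.
(regexp-replace "(b)" "abc" "\\1\\$0")     ; => "ab0c"

;; No match: the string is returned unmodified.
(regexp-replace "z" "abc" "!")             ; => "abc"
```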
(regexp-replace* pattern string insert-string) is the same as regexp-replace, except that every instance of pattern in string is replaced with insert-string. Only non-overlapping instances of pattern in the original string are replaced, so instances of pattern within inserted strings are not replaced recursively.
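A brief sketch of the non-overlapping, non-recursive behavior, using made-up inputs:

```scheme
;; Only the first instance is replaced by regexp-replace.
(regexp-replace  "a" "banana" "aa")   ; => "baanana"

;; Every instance is replaced by regexp-replace*; the inserted "aa"
;; is not searched again, so the result does not grow without bound.
(regexp-replace* "a" "banana" "aa")   ; => "baanaanaa"

;; Instances do not overlap: "aaa" contains only one "aa" match here.
(regexp-replace* "aa" "aaa" "b")      ; => "ba"
```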
Examples:
(define r (regexp "(-[0-9]*)+"))
(regexp-match r "a-12--345b") ; => '("-12--345" "-345")
(regexp-match-positions r "a-12--345b") ; => '((1 . 9) (5 . 9))
(regexp-match "x+" "12345") ; => #f
(regexp-replace "mi" "mi casa" "su") ; => "su casa"
(define r2 (regexp "([Mm])i ([a-zA-Z]*)"))
(define insert "\\1y \\2")
(regexp-replace r2 "Mi Casa" insert) ; => "My Casa"
(regexp-replace r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza Mi Mi Mi"
(regexp-replace* r2 "mi cerveza Mi Mi Mi" insert) ; => "my cerveza My Mi Mi"
[15] The backslash is a character in the string, so an extra backslash is required to specify the string as a Scheme constant. For example, the Scheme constant "\\1" is ``\1''.