9.2 Matching Regexp Patterns

The regexp-match-positions function takes a regexp pattern and a text string, and it returns a match if the regexp matches (some part of) the text string, or #f if the regexp did not match the string. A successful match produces a list of index pairs.

Examples:
> (regexp-match-positions #rx"brain" "bird")

#f

> (regexp-match-positions #rx"needle" "hay needle stack")

'((4 . 10))

In the second example, the integers 4 and 10 identify the substring that was matched. The 4 is the starting (inclusive) index, and 10 the ending (exclusive) index of the matching substring:

> (substring "hay needle stack" 4 10)

"needle"

In this first example, regexp-match-positions’s return list contains only one index pair, and that pair represents the entire substring matched by the regexp. When we discuss subpatterns later, we will see how a single match operation can yield a list of submatches.

The regexp-match-positions function takes optional third and fourth arguments that specify the indices of the text string within which the matching should take place.

> (regexp-match-positions
   #rx"needle"
   "his needle stack -- my needle stack -- her needle stack"
   20 39)

'((23 . 29))

Note that the returned indices are still reckoned relative to the full text string.

The regexp-match function is like regexp-match-positions, but instead of returning index pairs, it returns the matching substrings:

> (regexp-match #rx"brain" "bird")

#f

> (regexp-match #rx"needle" "hay needle stack")

'("needle")

When regexp-match is used with byte-string regexp, the result is a matching byte substring:

> (regexp-match #rx#"needle" #"hay needle stack")

'(#"needle")

A byte-string regexp can be applied to a string, and a string regexp can be applied to a byte string. In both cases, the result is a byte string. Internally, all regexp matching is in terms of bytes, and a string regexp is expanded to a regexp that matches UTF-8 encodings of characters. For maximum efficiency, use byte-string matching instead of string, since matching bytes directly avoids UTF-8 encodings.

If you have data that is in a port, there’s no need to first read it into a string. Functions like regexp-match can match on the port directly:

> (define-values (i o) (make-pipe))
> (write "hay needle stack" o)
> (close-output-port o)
> (regexp-match #rx#"needle" i)

'(#"needle")

The regexp-match? function is like regexp-match-positions, but simply returns a boolean indicating whether the match succeeded:

> (regexp-match? #rx"brain" "bird")

#f

> (regexp-match? #rx"needle" "hay needle stack")

#t

The regexp-split function takes two arguments, a regexp pattern and a text string, and it returns a list of substrings of the text string; the pattern identifies the delimiter separating the substrings.

> (regexp-split #rx":" "/bin:/usr/bin:/usr/bin/X11:/usr/local/bin")

'("/bin" "/usr/bin" "/usr/bin/X11" "/usr/local/bin")

> (regexp-split #rx" " "pea soup")

'("pea" "soup")

If the first argument matches empty strings, then the list of all the single-character substrings is returned.

> (regexp-split #rx"" "smithereens")

'("" "s" "m" "i" "t" "h" "e" "r" "e" "e" "n" "s" "")

Thus, to identify one-or-more spaces as the delimiter, take care to use the regexp #rx" +", not #rx" *".

> (regexp-split #rx" +" "split pea     soup")

'("split" "pea" "soup")

> (regexp-split #rx" *" "split pea     soup")

'("" "s" "p" "l" "i" "t" "" "p" "e" "a" "" "s" "o" "u" "p" "")

The regexp-replace function replaces the matched portion of the text string by another string. The first argument is the pattern, the second the text string, and the third is either the string to be inserted or a procedure to convert matches to the insert string.

> (regexp-replace #rx"te" "liberte" "ty")

"liberty"

> (regexp-replace #rx"." "racket" string-upcase)

"Racket"

If the pattern doesn’t occur in the text string, the returned string is identical to the text string.

The regexp-replace* function replaces all matches in the text string by the insert string:

> (regexp-replace* #rx"te" "liberte egalite fraternite" "ty")

"liberty egality fratyrnity"

> (regexp-replace* #rx"[ds]" "drracket" string-upcase)

"Drracket"