9.6 Clusters

Clustering – enclosure within parens (...) – identifies the enclosed subpattern as a single entity. It causes the matcher to capture the submatch, or the portion of the string matching the subpattern, in addition to the overall match:

> (regexp-match #rx"([a-z]+) ([0-9]+), ([0-9]+)" "jan 1, 1970")
'("jan 1, 1970" "jan" "1" "1970")

Clustering also causes a following quantifier to treat the entire enclosed subpattern as an entity:

> (regexp-match #rx"(poo )*" "poo poo platter")
'("poo poo " "poo ")

The number of submatches returned is always equal to the number of subpatterns specified in the regexp, even if a particular subpattern happens to match more than one substring or no substring at all.

> (regexp-match #rx"([a-z ]+;)*" "lather; rinse; repeat;")
'("lather; rinse; repeat;" " repeat;")

Here, the *-quantified subpattern matches three times, but it is the last submatch that is returned.

It is also possible for a quantified subpattern to fail to match, even if the overall pattern matches. In such cases, the failing submatch is represented by #f

  > (define date-re
      ; match `month year' or `month day, year';
      ; subpattern matches day, if present
      #rx"([a-z]+) +([0-9]+,)? *([0-9]+)")
  > (regexp-match date-re "jan 1, 1970")
  '("jan 1, 1970" "jan" "1," "1970")
  > (regexp-match date-re "jan 1970")
  '("jan 1970" "jan" #f "1970")

9.6.1 Backreferences

Submatches can be used in the insert string argument of the procedures regexp-replace and regexp-replace*. The insert string can use \n as a backreference to refer back to the nth submatch, which is the substring that matched the nth subpattern. A \0 refers to the entire match, and it can also be specified as \&.

  > (regexp-replace #rx"_(.+?)_"
      "the _nina_, the _pinta_, and the _santa maria_"
      "*\\1*")
  "the *nina*, the _pinta_, and the _santa maria_"
  > (regexp-replace* #rx"_(.+?)_"
      "the _nina_, the _pinta_, and the _santa maria_"
      "*\\1*")
  "the *nina*, the *pinta*, and the *santa maria*"
  > (regexp-replace #px"(\\S+) (\\S+) (\\S+)"
      "eat to live"
      "\\3 \\2 \\1")
  "live to eat"

Use \\ in the insert string to specify a literal backslash. Also, \$ stands for an empty string, and is useful for separating a backreference \n from an immediately following number.

Backreferences can also be used within a #px pattern to refer back to an already matched subpattern in the pattern. \n stands for an exact repeat of the nth submatch. Note that \0, which is useful in an insert string, makes no sense within the regexp pattern, because the entire regexp has not matched yet that you could refer back to it.}

  > (regexp-match #px"([a-z]+) and \\1"
                  "billions and billions")
  '("billions and billions" "billions")

Note that the backreference is not simply a repeat of the previous subpattern. Rather it is a repeat of the particular substring already matched by the subpattern.

In the above example, the backreference can only match billions. It will not match millions, even though the subpattern it harks back to – ([a-z]+) – would have had no problem doing so:

  > (regexp-match #px"([a-z]+) and \\1"
                  "billions and millions")
  #f

The following example corrects doubled words:

  > (regexp-replace* #px"(\\S+) \\1"
      (string-append "now is the the time for all good men to "
                     "to come to the aid of of the party")
      "\\1")
  "now is the time for all good men to come to the aid of the party"

The following example marks all immediately repeating patterns in a number string:

  > (regexp-replace* #px"(\\d+)\\1"
      "123340983242432420980980234"
      "{\\1,\\1}")
  "12{3,3}40983{24,24}3242{098,098}0234"

9.6.2 Non-capturing Clusters

It is often required to specify a cluster (typically for quantification) but without triggering the capture of submatch information. Such clusters are called non-capturing. To create a non-capturing cluster, use (?: instead of ( as the cluster opener.

In the following example, a non-capturing cluster eliminates the “directory” portion of a given Unix pathname, and a capturing cluster identifies the basename.

But don’t parse paths with regexps. Use functions like split-path, instead.

  > (regexp-match #rx"^(?:[a-z]*/)*([a-z]+)$"
                  "/usr/local/bin/racket")
  '("/usr/local/bin/racket" "racket")

9.6.3 Cloisters

The location between the ? and the : of a non-capturing cluster is called a cloister. You can put modifiers there that will cause the enclustered subpattern to be treated specially. The modifier i causes the subpattern to match case-insensitively:

The term cloister is a useful, if terminally cute, coinage from the abbots of Perl.

> (regexp-match #rx"(?i:hearth)" "HeartH")
'("HeartH")

The modifier m causes the subpattern to match in multi-line mode, where . does not match a newline character, ^ can match just after a newline, and $ can match just before a newline.

  > (regexp-match #rx"." "\na\n")
  '("\n")
  > (regexp-match #rx"(?m:.)" "\na\n")
  '("a")
  > (regexp-match #rx"^A plan$" "A man\nA plan\nA canal")
  #f
  > (regexp-match #rx"(?m:^A plan$)" "A man\nA plan\nA canal")
  '("A plan")

You can put more than one modifier in the cloister:

> (regexp-match #rx"(?mi:^A Plan$)" "a man\na plan\na canal")
'("a plan")

A minus sign before a modifier inverts its meaning. Thus, you can use -i in a subcluster to overturn the case-insensitivities caused by an enclosing cluster.

  > (regexp-match #rx"(?i:the (?-i:TeX)book)"
                  "The TeXbook")
  '("The TeXbook")

The above regexp will allow any casing for the and book, but it insists that TeX not be differently cased.

top← prev up next →

1	Welcome to Racket
2	Racket Essentials
3	Built-In Datatypes
4	Expressions and Definitions
5	Programmer-Defined Datatypes
6	Modules
7	Contracts
8	Input and Output
9	Regular Expressions
10	Exceptions and Control
11	Iterations and Comprehensions
12	Pattern Matching
13	Classes and Objects
14	Units (Components)
15	Reflection and Dynamic Evaluation
16	Macros
17	Creating Languages
18	Performance
19	Running and Creating Executables
20	Compilation and Configuration
21	More Libraries
22	Dialects of Racket and Scheme
	Bibliography
	Index

9.1	Writing Regexp Patterns
9.2	Matching Regexp Patterns
9.3	Basic Assertions
9.4	Characters and Character Classes
9.5	Quantifiers
9.6	Clusters
9.7	Alternation
9.8	Backtracking
9.9	Looking Ahead and Behind
9.10	An Extended Example

9.6.1	Backreferences
9.6.2	Non-capturing Clusters
9.6.3	Cloisters