diff options
author | Kaz Kylheku <kaz@kylheku.com> | 2010-01-05 18:27:50 -0800 |
---|---|---|
committer | Kaz Kylheku <kaz@kylheku.com> | 2010-01-05 18:27:50 -0800 |
commit | b553f54bca43cda8ac0eaca5c21469dd36415d2b (patch) | |
tree | f2ccf026256a0cc8ee869ab62f1e964872c50a23 /txr.1 | |
parent | cd5959194340d95f08abf8230d7f5e5a72728eb9 (diff) | |
download | txr-b553f54bca43cda8ac0eaca5c21469dd36415d2b.tar.gz txr-b553f54bca43cda8ac0eaca5c21469dd36415d2b.tar.bz2 txr-b553f54bca43cda8ac0eaca5c21469dd36415d2b.zip |
Implemented the regular expression ~ and & operators.
This turns out to be easy to do in NFA land.
The complement of an NFA has exactly the same number
and configuration of states and transitions, except
that the states have an inverted meaning; and furthermore,
failed character transitions are routed to an extra
state (which in this impelmentation is permanently
allocated and shared by all regexes). The regex &
is implemented trivially using DeMorgan's.
Also, bugfix: regular expressions like A|B|C are allowed
now by the syntax, rather than constituting syntax error.
Previously, this would have been entered as (A|B)|C.
Diffstat (limited to 'txr.1')
-rw-r--r-- | txr.1 | 46 |
1 files changed, 35 insertions, 11 deletions
@@ -643,17 +643,21 @@ If RE is a regular expression, then so is (RE). The contents of parentheses denote one regular expression unit, so that for instance in (RE)*, the * operator applies to the entire parenthesized group. .IP (RE)? -optionally matches the preceding regular expression (RE). +optionally match the preceding regular expression (RE). .IP (RE)+ -matches the preceding expression one or more times. +match the preceding expression one or more times. .IP (RE)* -matches the preceding expression zero or more times. +match the preceding expression zero or more times. +.IP ~(RE) +match the complement of the preceding expression; i.e. match +those texts that (RE) does not match. .IP (RE1)(RE2) Two consecutive regular expressions denote catenation: the left expression must match, and then the right. - .IP (RE1)|(RE2) -matches either the expression RE1 or RE2. +match either the expression RE1 or RE2. +.IP (RE1)&(RE2) +match both the expression RE1 and RE2 simultaneously. .PP Any of the special characters, including the delimiting /, can be escaped with @@ -674,14 +678,34 @@ The postfix operators ?, + and * have the second highest precedence, and associate left to right, so that in A+?*, the * applies to A+?, and the ? applies to A+. -Catenation is on the next lower precedence rung, so that AB? means "match A, -and then optionally B" not "match A and B, as one optional unit". The latter -must be written (AB)? using parentheses to override precedence. +The unary complement operator has the next lower precedence, so +that ~A* means the ~(A*): "match the all text that is not matched by zero +or more repetitions of A", not "match zero or more times the text +not matched by A". + +Catenation is on the next lower precedence rung, so that AB? means A(B?), or +"match A, and then optionally B", not "match A and B, as one optional +unit". The latter must be written (AB)? using parentheses to override +precedence. + +The precedence of conjunction (&) is lower than catenation, but not as low as +that of disjunction, thus AB The disjunction operator | has the lowest precedence, lower than catenation. -Thus abc|def means "match abc, or match def". The meaning "match ab, -then c or d, then ef" must be expressed as ab(c|d)ef, or using -a character class: ab[cd]ef. +Thus ABC|DEF means "match ABC, or match DEF". The meaning "match AB, +then C or D, then EF" must be expressed as AB(C|D)EF, or using +a character class: AB[CD]EF. + +The regular expression + + ABC*&~DEF+|Z + +Is parsed as if it were grouped with parentheses like this: + + ((AB(C*))&(~(DE(F+))))|Z + +The main constituent is the | operator, whose left side is an & expression, +et cetera. In .b txr, |