From 25e0952bbee726a4de283b2ae6efa49b49c5cf0b Mon Sep 17 00:00:00 2001
From: Kaz Kylheku <kaz@kylheku.com>
Date: Thu, 14 Jan 2010 14:37:44 -0800
Subject: * txr.1: Fix accidental .b, which should have been .B. Revised
 description of regex operators. Added section on intersection and complement,
 which may not be familiar to regex users.

---
 txr.1 | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 120 insertions(+), 9 deletions(-)

(limited to 'txr.1')

diff --git a/txr.1 b/txr.1
index 25f8ac51..4ced9699 100644
--- a/txr.1
+++ b/txr.1
@@ -421,7 +421,7 @@ encounters an invalid bytes in the UTF-8 input, what happens depends on the
 context in which this occurs. In a query, comments are read without regard
 for encoding, so invalid encoding bytes are not detected. A comment is
 simply a sequence of bytes terminated by a newline.  Invalid
-encoding bytes in signficant query text are diagnosed as syntax errors.
+encoding bytes in significant query text are diagnosed as syntax errors.
 When the scanner is faced with input that isn't a valid multibyte character, it
 issues an error message, skips one byte, and resumes scanning.
 
@@ -647,17 +647,23 @@ optionally match the preceding regular expression (RE).
 .IP (RE)+
 match the preceding expression one or more times.
 .IP (RE)*
-match the preceding expression zero or more times.
+match the preceding expression zero or more times. This operator
+is sometimes called the "Kleene operator".
 .IP ~(RE)
 match the complement of the preceding expression; i.e. match
-those texts that (RE) does not match.
+those texts that (RE) does not match. This operator is called complement,
+or logical not.
 .IP (RE1)(RE2)
 Two consecutive regular expressions denote catenation:
 the left expression must match, and then the right.
 .IP (RE1)|(RE2)
-match either the expression RE1 or RE2.
+match either the expression RE1 or RE2. This operator is called union,
+logical or, or disjunction.
 .IP (RE1)&(RE2)
-match both the expression RE1 and RE2 simultaneously.
+match both the expression RE1 and RE2 simultaneously; i.e. the
+matching text must be one of the texts which are in the intersection of the set
+of texts matched by RE1 and the set matched by RE2. This operator is called
+intersection, logical and, or conjunction.
 
 .PP
 Any of the special characters, including the delimiting /,  can be escaped with
@@ -688,10 +694,10 @@ Catenation is on the next lower precedence rung, so that AB? means A(B?), or
 unit".  The latter must be written (AB)?  using parentheses to override
 precedence.
 
-The precedence of conjunction (&) is lower than catenation, but not as low as
-that of disjunction, thus AB
+The precedence of intersection (&) is lower than that of catenation, but not as
+low as that of union, thus AB
 
-The disjunction operator | has the lowest precedence, lower than catenation.
+The union operator (|) has the lowest precedence, lower than catenation.
 Thus ABC|DEF means "match ABC, or match DEF". The meaning "match AB,
 then C or D, then EF" must be expressed as AB(C|D)EF, or using
 a character class: AB[CD]EF.
@@ -708,7 +714,7 @@ The main constituent is the | operator, whose left side is an & expression,
 et cetera.
 
 In
-.b txr,
+.B txr,
 regular expression matches do not span multiple lines. There is no way
 to match a newline character since it's simply not internally represented in
 the data.
@@ -747,6 +753,111 @@ regular expression matches. The regular expression matches anywhere,
 including the empty string after the last character, which is
 the rightmost place. Thus variable A fetches the entire line.
 
+.SS Notes on Exotic Regular Expressions
+
+Users familiar with regular expressions may not be familiar with the complement
+and intersection operators, which are often absent from text processing tools
+that support regular expressions.  
+
+Regexp intersection is not essential; it may be obtained from complement and
+union as follows, since DeMorgan's law applies to regular expression algebra:
+(R1)&(R2) = ~(~(R1)|~(R2)). (The complement of the union of the complements of
+R1 and R2 constitutes the intersection. This law works because the regular
+expression operators denote set operations in a straightforward way. A regular
+expression denotes a set of strings (a potentially infinite one) in a condensed
+way. The union of two regular expressions R1 | R2 denotes the union of the set
+of texts denoted by R1 and that denoted by R2. Similarly R1 & R2 denotes a set
+intersection, and ~R denotes a set complement. Thus algebraic laws
+that apply to set operations apply to regular expressions. It's useful to keep
+in mind this relationship between regular expressions and sets in understanding
+intersection and complement.
+
+Given a finite set of strings, like the set { "abc", "def" }, which corresponds
+to the regular expression (abc|def), the complement is set which contains an
+infinite number of strings: it consists of all possible strings except "abc"
+and "def". It includes the empty string, all strings of length 1, all strings
+of length 2, all strings of length 3 other than "abc" and "def", all strings of
+length 4, etc.  This means that a "harmless looking" expression like
+~(abc|def) can actually match arbitrarily long inputs.
+
+How about matching only three-character-long strings other than "abc" or "def"?
+To express this, regex intersection can be used: these strings are the
+intersection of the set of all three-character strings, and the set of all
+strings which are not "abc" or "def". The straightforward set-based reasoning
+leads us to this: ...&~(abc|def).   This A&~B idiom is also called set
+difference, sometimes notated with a minus sign: A-B (which is not
+supported in 
+.B txr
+regular expression syntax).  Elements which are in the set A, but not B, are
+those elements which are in the intersection of A with the complement of B.
+This is similar to the arithmetic rule A - B = A + -B: subtraction is
+equivalent to addition of the additive inverse. Set difference is a useful
+tool: it enables us to write a positive match which captures a more general set
+than what is intended (but one whose regular expression is far simpler
+than a positive match for the exact set we want), then we can intersect
+intersect this over-generalized set with the complemented set of
+another regular expression which matches the particulars that we wish excluded.
+
+It is important to understand that any regular expression which uses the
+complement or intersection operator can be replaced by an expression which
+matches the same texts without using these operators. But such equivalent
+expressions are often much more complicated, cumbersome, and difficult to
+develop, debug and understand.  Such equivalent expressions do not actually
+. B express 
+the concept of what they match; they merely capture the equivalent set.
+The use of complement and intersection can lead to natural expressions.
+
+For instance, using complement, we can write a straightforward regular
+expression which matches C language comments. A C language
+comment is the digraph /*, followed by any string which does not contain the
+Examples of valid comments are /**/, /* abc */ or /***/. But C
+comments do not nest (cannot contain comments), so that 
+/* /* nested */ */ actually consists of the comment /* /* nested */,
+which is followed by the trailing junk */.
+Our simple characterization of interior part of a C comment as a string
+which does not contain the terminating digraph makes use of the
+complement, and can be expressed using the complemented regular expression like
+this: (~.*[*][/].*).  That is to say, strings which contain */ are matched by
+the expression .*[*][/].*: zero or more arbitrary characters, followed by
+*/, followed by zero or more arbitrary characters. Therefore, the complement of this expression matches all other strings: those which do not contain */.
+These strings up the inside of a C comment between the /* and */.
+
+The equivalent simple regex is much more complicated.
+Without complement, we must somehow write a positive match for all strings such
+that we avoid matching */. Obviously, sequences of characters other than *
+are included: [^*]*.   Occurrences of * are also allowed, but only if followed
+by something other than a slash, so let's include this via union:
+([^*]|[*][^/])*. Alas, already, we have a bug in this expression. The
+subexpression [*][^/] can match "**", since a * is not a /.  If the next
+character in the input is /, we missed a comment close.  To fix the problem we
+revise to this: ([^*]|[*][^*/])* (the interior of a C language comment is a any
+mixture of zero or more non-asterisks, or digraphs consisting of an asterisk
+followed by something other than a slash or another asterisk).  Oops, now we
+have a problem again. What if two asterisks occur in a comment?  They are not
+matched by [^*], and they are not matched by [*][^*/]. Actually, our regex must
+not simply match asterisk-non-asterisk digraphs, but rather sequences of one or
+more asterisks followed by a non-asterisk: ([^*]|[*]*[^*/])*.  This is still
+not right, because, for instance, it fails to match the interior of a comment
+which is terminated by asterisks, including the simple test cases where
+the comment interior is nothing but asterisks.  We have no provision in our
+expression for this case; the expression requires all runs of asterisks to be
+followed by something which is not a slash or asterisk. The way to fix this is
+to add on a subexpression which optionally matches a run of zero or more
+interior asterisks before the comment close: ([^*]|[*]*[^*/])*[*]*.   Thus the
+semi-final regular expression is: [/][*]([^*]|[*]*[^*/])*[*]*[*][/]. (A C
+comment is an interior string enclosed in /* */, where this interior part
+consists of a mixture of non-asterisk characters, as well as runs of asterisk
+characters which are terminated by a character other than a slash, except
+for possibly one rightmost run of asterisks which extends to the end of the
+interior, touching the comment close. Phew!) One final simplification is
+possible: the tail part [*]*[*][/]  can be reduced to [*]+[/] such that the
+final run of asterisks is regarded as part of an extended comment terminator
+which consists of one or more asterisks followed by a slash.  Clearly, working
+without regular expression complement is not impossible, but difficult and
+error-prone.  Complemented matches without complement may require regex
+experience combined with cunning, perseverance and possibly multiple cycles of
+trial-and-error involving numerous test cases.
+
 .SS Directives
 
 The general syntax of a directive is:
-- 
cgit v1.2.3