Diffstat (limited to 'txr.1')
-rw-r--r-- txr.1 319
1 files changed, 300 insertions, 19 deletions
diff --git a/txr.1 b/txr.1
index 191def67..e1a67248 100644
--- a/txr.1
+++ b/txr.1
@@ -21,7 +21,7 @@
.\"IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
.\"WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
-.TH txr 1 2009-09-09 "txr v. 013" "Text Extraction Utility"
+.TH txr 1 2009-09-09 "txr v. 014" "Text Extraction Utility"
.SH NAME
txr \- text extractor
.SH SYNOPSIS
@@ -601,8 +601,9 @@ The general syntax of a directive is:
@EXPR
where expr is a parenthesized list of subexpressions. A subexpression
-is an symbol, number, regular expression, or a parenthesized expression.
-So, examples of valid directives are:
+is a symbol, number, string literal, character literal, regular expression, or
+a parenthesized expression. So, examples of syntactically valid directives
+are:
@(banana)
@@ -610,11 +611,18 @@ So, examples of valid directives are:
@( a (b (c d) (e ) ))
+ @("apple" 'b' 3)
+
@(a /[a-z]*/ b)
A symbol is lexically the same thing as a variable and the same rules
apply. Tokens that look like numbers are treated as numbers.
+String and character literals are delimited by double and single quotes,
+respectively, and may not span multiple lines. Character literals must contain
+exactly one character. Character and numeric escapes may be used within
+literals to escape the quotes, and to denote control characters.
+
Some directives are involved in structuring the overall syntax of the query.
There are syntactic constraints that depend on the directive. For instance the
@@ -630,7 +638,7 @@ Continue matching in another file.
.IP @(block)
The remaining query is treated as an anonymous or named block.
Blocks may be referenced by @(accept) and @(fail) directives.
-Blocks are discussed in the section Blocks below.
+Blocks are discussed in the section BLOCKS below.
.IP @(skip)
Treat the remaining query as a subquery unit, and search the lines of
@@ -651,7 +659,15 @@ Match some clauses in parallel. Each one must match.
Match some clauses in parallel. None must match.
.IP @(maybe)
-Match some clauses in parallel. None must match.
+Match some clauses in parallel, which may or may not match.
+No failure occurs if none match.
+
+.IP @(cases)
+Match some clauses sequentially, stopping if one of them
+matches successfully.
+
+.IP @(define NAME ( ARGUMENTS ...))
+Introduces a function. Functions are discussed in the FUNCTIONS section below.
.IP @(collect)
Search the data for multiple matches of a clause. Collect the
@@ -671,17 +687,17 @@ Separator of clauses for @(some), @(all), and @(none).
Equivalent to @(and). Choice is stylistic.
.IP @(end)
-Required terminator for @(some), @(all), @(none), @(maybe), @(collect),
-@(output), and @(repeat).
+Required terminator for @(some), @(all), @(none), @(maybe), @(cases),
+@(collect), @(output), and @(repeat).
.IP @(fail)
Terminate the processing of a block, as if it were a failed match.
-Blocks are discussed in the section Blocks below.
+Blocks are discussed in the section BLOCKS below.
.IP @(accept)
Terminate the processing of a block, as if it were a successful match.
What bindings emerge may depend on the kind of block: collect
-has special semantics. Blocks are discussed in the section Blocks below.
+has special semantics. Blocks are discussed in the section BLOCKS below.
.IP @(flatten)
Normalizes a set of specified variables to one-dimensional lists. Those
@@ -890,7 +906,7 @@ line "a dark". The @(some) clause combines the text line "it",
and a @(none) clause which contains just one clause consisting of
the line "was".
-The semantics of the some, all, none and maybe directives is:
+The semantics of the some, all, none, maybe and cases directives is:
.IP @(all)
Each of the clauses is matched at the current position. If any of the
@@ -914,12 +930,20 @@ The directive succeeds even if all of the clauses fail.
Whatever bindings are found in any of the clauses are
retained.
-When a @(some) or @(all) directive matches successfully, or a @(maybe)
-directive matches something, the query advances by the greatest number of lines
-matched in any of the subclauses. For instance if there are two subclauses, and
-one of them matches three lines, but the other one matches five lines, then the
-overall clause is considered to have made a five line match at its position. If
-more directives follow, they begin matching five lines down from that position.
+.IP @(cases)
+The clauses are matched, in order, at the current position.
+If any clause matches, the matching stops and the bindings
+collected from that clause are retained. Any remaining clauses
+after that one are not processed. If no clause matches, the
+directive fails, and produces no bindings.
+
+When a @(some), @(all), or @(cases) directive matches successfully, or a
+@(maybe) directive matches in at least one of its clauses, the query advances
+by the greatest number of lines matched in any of the subclauses. For instance
+if there are two subclauses, and one of them matches three lines, but the other
+one matches five lines, then the overall clause is considered to have made a
+five line match at its position. If more directives follow, they begin matching
+five lines down from that position.
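
The line-advance rule above can be illustrated with a minimal Python sketch. It is not part of txr.1 or of the txr implementation. The helpers line and seq are hypothetical, a clause is modeled as a function returning the number of lines it matched (or None on failure), and bindings are left out entirely. For @(cases), only the first matching clause is processed, so the sketch assumes the advance is that clause's own length.

```python
def line(text):
    """Hypothetical clause: matches one input line equal to `text`."""
    def clause(lines, pos):
        return 1 if pos < len(lines) and lines[pos] == text else None
    return clause

def seq(*clauses):
    """Hypothetical clause: matches the given clauses one after another."""
    def clause(lines, pos):
        total = 0
        for c in clauses:
            n = c(lines, pos + total)
            if n is None:
                return None
            total += n
        return total
    return clause

def some(clauses, lines, pos):
    """Try every clause; succeed if any matches, advancing by the longest."""
    counts = [n for n in (c(lines, pos) for c in clauses) if n is not None]
    return max(counts) if counts else None

def maybe(clauses, lines, pos):
    """Like some, but never fails: advances by zero lines if nothing matches."""
    counts = [n for n in (c(lines, pos) for c in clauses) if n is not None]
    return max(counts, default=0)

def cases(clauses, lines, pos):
    """Try clauses in order and stop at the first one that matches."""
    for c in clauses:
        n = c(lines, pos)
        if n is not None:
            return n
    return None

data = ["a", "b", "c", "d", "e", "f"]
three = seq(line("a"), line("b"), line("c"))
five = seq(*(line(ch) for ch in "abcde"))

advance_some = some([three, five], data, 0)    # longest of the matches
advance_cases = cases([three, five], data, 0)  # first match wins
```

In this model, a directive following the some construct would resume at data[5], while one following the cases construct would resume at data[3].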
.SS The Collect Directive
@@ -1253,6 +1277,15 @@ to match B, or the bind fails. Matching means that either
- A and B are lists and are either identical, or one is
found as substructure within the other.
+The right hand side does not have to be a variable. It may be some other
+object, like a string, or list of strings, et cetera. For instance
+
+ @(bind A "ab\tc")
+
+will bind the string "ab\tc" (the letter a, b, a tab character, and c)
+to the variable A if A is unbound. If A is bound, this will fail unless
+A already contains an identical string.
+
The left hand side of a bind can be a nested list pattern containing variables.
The last item of a list at any nesting level can be preceded by a dot, which
means that the variable matches the rest of the list from that position.
@@ -1280,7 +1313,8 @@ The @(block NAME) directive introduces a named block, except when the name is
the word nil. The @(block) directive introduces an unnamed block, equivalent
to @(block nil).
-The @(skip) and @(collect) directives introduce implicit anonymous blocks.
+The @(skip) and @(collect) directives introduce implicit anonymous blocks,
+as do function bodies.
.SS Block Scope
@@ -1384,12 +1418,12 @@ that block until this point emerge from that block.
.IP @(accept)
Immediately terminate the innermost enclosing anonymous block, as if
-that block successfully mached. Any bindings established within
+that block successfully matched. Any bindings established within
that block until this point emerge from that block.
If the implicit block introduced by @(skip) is terminated in this manner,
this has the effect of causing the skip itself to succeed, as if
-all of the trailing material succesfully matched.
+all of the trailing material successfully matched.
If the implicit block associated with a @(collect) is terminated this way,
then the collection stops. All bindings collected in the current iteration of
@@ -1544,6 +1578,253 @@ The second clause grabs four lines, which is the longest match.
And so, the next line of input available for matching is 5, which goes
to the @second variable.
+.SH FUNCTIONS
+
+.SS Introduction
+
+.B txr
+functions allow a query to be structured to avoid repetition.
+On a theoretical note, because
+.B txr
+functions support recursion, functions enable txr to match some
+kinds of patterns which exhibit self-embedding, or nesting,
+and thus cannot be described by a regular language.
+
+Functions in
+.B txr
+are not exactly like functions in mathematics or functional languages, and are
+not like procedures in imperative programming languages. They are not exactly
+like macros either. What it means for a
+.B txr
+function to take arguments and produce a result is different from
+the conventional notion of a function.
+
+A
+.B txr
+function may have one or more parameters. When such a function is invoked, an
+argument must be specified for each parameter. However, a special behavior is
+at play here. Namely, some or all of the argument expressions may be unbound
+variables. In that case, the corresponding parameters behave like unbound
+variables also. Thus
+.B txr
+function calls can transmit the "unbound" state from argument to parameter.
+
+It should be mentioned that functions have access to all bindings that are
+visible in the caller; functions may refer to variables which are not
+mentioned in their parameter list.
+
+With regard to returning,
+.B txr
+functions are also unconventional. If the function fails, then the function
+call is considered to have failed. The function call behaves like a kind of
+match; if the function fails, then the call is like a failed match.
+
+When a function call succeeds, then the bindings emanating from that function
+are processed specially. Firstly, any bindings for variables which do not
+correspond to one of the function's parameters are thrown away. Functions may
+internally bind arbitrary variables in order to get their job done, but only
+those variables which are named in the function's parameter list may propagate
+out of the function call. Thus, a function with no parameters can only indicate
+matching success or failure, but not produce any bindings. Secondly,
+variables do not propagate out of the function directly, but undergo
+a renaming. For each parameter which went into the function as an unbound
+variable (because its corresponding argument was an unbound variable),
+if that parameter now has a value, that value is bound onto the corresponding
+argument.
+
+Example:
+
+ @(define collect_words (list))
+ @(coll)@{list /[^ \t]+/}@(end)
+ @(end)
+
+The above function "collect_words" contains a query which collects words from a
+line (sequences of characters other than space or tab), into the list variable
+called "list". This variable is named in the parameter list of the function,
+therefore, its value, if it has one, is permitted to escape from the function
+call.
+
+Suppose the input data is:
+
+ Fine summer day
+
+and the function is called like this:
+
+ @(collect_words wordlist)
+
+The result is:
+
+ wordlist[0]=Fine
+ wordlist[1]=summer
+ wordlist[2]=day
+
+How it works is that in the function call @(collect_words wordlist),
+"wordlist" is an unbound variable. The parameter corresponding to that
+unbound variable is the parameter "list". Therefore, that parameter
+is unbound over the body of the function. The function body collects the
+words of "Fine summer day" into the variable "list", and then
+yields that binding. Then the function call completes by
+noticing that the function parameter "list" now has a binding, and
+that the corresponding argument "wordlist" has no binding. The binding
+is thus transferred to the "wordlist" variable. After that, the
+bindings produced by the function are thrown away. The only enduring
+effects are:
+
+.IP -
+the function matched and consumed some input; and
+
+.IP -
+the function succeeded; and
+
+.IP -
+the wordlist variable now has a binding.
+.PP
+
+Another way to understand the parameter behavior is that function
+parameters behave like proxies which represent their arguments. If an argument
+is an established value, such as a character string or bound variable, the
+parameter is a proxy for that value and behaves just like that value. If an
+argument is an unbound variable, the function parameter acts as a proxy
+representing that unbound variable. The effect of binding the proxy is
+that the variable becomes bound, an effect which is settled when the
+function goes out of scope.
+
+Within the function, both the original variable and the proxy are
+visible simultaneously, and are independent. What if a function binds both of
+them? Suppose a function has a parameter called P, which is called
+with an argument A, and then in the function @A and @P are bound. This is
+permitted, and they can even be bound to different values. However, when the
+function terminates, the local binding of A simply disappears (because,
+remember, the symbol A is not a member of the list of parameters).
+Only the value bound to P emerges, and is bound to A, which still appears
+unbound at that point.
+
+.SS Definition Syntax
+
+A function definition begins with a @(define ...) directive which must be the
+only element in its line. The define must be followed by a symbol, which is the
+name of the function being defined. After the symbol, there is an optional
+parenthesized parameter list. If there is no such list, or if the list is specified
+as () or the symbol "nil" then the function has no parameters. Examples of
+valid define syntax are:
+
+ @(define foo)
+ @(define bar ())
+ @(define match (a b c))
+
+The define directive may be followed directly by the @(end) directive,
+also on a line by itself, in which case the function has an empty body.
+Or it may be followed by one or more query lines and then @(end).
+What is between a @(define ...) and its matching @(end) constitutes the
+function body.
+
+Functions may be nested within function bodies. Such local functions have
+dynamic scope. They are visible in the function body in which they are defined,
+and in any functions invoked from that body.
+
+The body of a function is an anonymous block. (See BLOCKS above).
+
+The following trivial function b produces no bindings and has a body which
+simply matches the line "begin".
+
+ @(define b)
+ begin
+ @(end)
+
+Thus the call:
+
+ @(b)
+
+matches an input line "begin".
+
+.SS Call Syntax
+
+A function is invoked by a compound directive whose first symbol is the name of
+that function. Additional elements in the directive are the arguments.
+Arguments may be symbols, or other objects like string and character
+literals.
+
+Example:
+
+ Query: @(define pair (a b))
+ @a @b
+ @(end)
+ @(pair first second)
+ @(pair "ice" cream)
+
+ Data: one two
+ ice milk
+
+ Output: first="one"
+ second="two"
+ cream="milk"
+
+The first call to the function takes the line "one two". The parameter "a"
+takes "one" and parameter "b" takes "two". These are rebound to the arguments
+first and second. The second call to the function binds the "a" parameter
+to the word "ice", and the "b" parameter is unbound, because the
+corresponding argument "cream" is unbound. Thus inside the function, @a
+is forced to match "ice". Then a space is matched and @b collects the text
+"milk". When the function returns, the unbound "cream" variable gets this value.
+
+If a symbol occurs multiple times in the argument list, it constrains
+both parameters to bind to the same value. That is to say, all parameters
+which, in the body of the function, bind a value, and which are all derived
+from the same argument symbol must bind to the same value. This is settled when
+the function terminates, not while it is matching. Example:
+
+ Query: @(define pair (a b))
+ @a @b
+ @(end)
+ @(pair same same)
+
+ Data: one two
+
+ Output: [query fails, prints "false"]
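
The argument and parameter rules described in this section can be modeled with a short Python sketch. It is not part of txr.1 and not how txr is implemented. The names Var, call and pair_body are hypothetical, a function body is modeled as a Python callable that receives an environment and returns a new one (or None on failure), and the body of the "pair" function is hard-wired to split one line on a space.

```python
class Var:
    """Hypothetical marker for an argument that is a variable name."""
    def __init__(self, name):
        self.name = name

def call(params, body, args, caller_env):
    """Model of a function call: returns the caller's new bindings, or None."""
    if len(params) != len(args):
        raise TypeError("an argument is required for each parameter")
    env = dict(caller_env)        # the callee sees the caller's bindings
    pending = []                  # (parameter, argument name) for unbound arguments
    for param, arg in zip(params, args):
        env.pop(param, None)      # a parameter hides a caller variable of the same name
        if isinstance(arg, Var):
            if arg.name in caller_env:
                env[param] = caller_env[arg.name]
            else:
                pending.append((param, arg.name))   # unbound state is transmitted
        else:
            env[param] = arg      # literal argument, e.g. a string
    result = body(env)
    if result is None:
        return None               # a failed function is a failed match
    out = dict(caller_env)        # all other callee bindings are thrown away
    for param, name in pending:
        if param in result:
            if name in out and out[name] != result[param]:
                return None       # same argument symbol, conflicting values
            out[name] = result[param]
    return out

def pair_body(line):
    """Stand-in for the body '@a @b' applied to one input line."""
    def body(env):
        words = line.split(" ")
        if len(words) != 2:
            return None
        env = dict(env)
        env["scratch"] = line     # a local binding that must not escape
        for param, word in zip(("a", "b"), words):
            if env.setdefault(param, word) != word:
                return None       # parameter already bound to something else
        return env
    return body

first_call = call(["a", "b"], pair_body("one two"), [Var("first"), Var("second")], {})
second_call = call(["a", "b"], pair_body("ice milk"), ["ice", Var("cream")], {})
same_call = call(["a", "b"], pair_body("one two"), [Var("same"), Var("same")], {})
```

The three calls mirror the two "pair" examples above: the first yields bindings for first and second, the second yields only cream, and the third fails because "same" would have to hold both "one" and "two". The conflict is detected after the body has run, which follows the statement that this is settled when the function terminates.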
+
+.SS Nested Functions
+
+Function definitions may appear in a function. Such definitions
+are visible in all functions which are invoked from the body
+(and not necessarily enclosed in the body). In other words, the
+scope is dynamic, not lexical. Inner definitions shadow outer
+definitions. This means that a caller can redirect the function
+calls that take place in a callee, by defining local functions
+which capture the references.
+
+Example:
+
+ Query: @(define which)
+ @ (fun)
+ @(end)
+ @(define fun)
+ @ (output)
+ toplevel fun!
+ @ (end)
+ @(end)
+ @(define callee)
+ @ (define fun)
+ @ (output)
+ local fun!
+ @ (end)
+ @ (end)
+ @ (which)
+ @(end)
+ @(callee)
+ @(which)
+
+ Output: local fun!
+ toplevel fun!
+
+Here, the function "which" is defined, which calls "fun".
+A toplevel definition of "fun" is introduced which
+outputs "toplevel fun!". The function "callee" provides its own
+local definition of "fun" which outputs "local fun!" before
+calling "which".
+When callee is invoked, it calls @(which), whose @(fun) call is routed to
+callee's local definition. When @(which) is called directly from the top
+level, its @(fun) call goes to the toplevel definition.
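
The dynamic scoping described above can be mimicked in Python with an explicit stack of definition tables. This is an illustrative model only, not part of txr.1. The names scopes, define and call are hypothetical, and the sketch assumes that each call pushes a fresh table which is discarded when the call returns.

```python
output = []
scopes = [{}]   # stack of definition tables; the last entry is the innermost

def define(name, fn):
    """Record a definition in the innermost active table."""
    scopes[-1][name] = fn

def call(name):
    """Look the name up from the innermost table outward, then invoke it."""
    for table in reversed(scopes):
        if name in table:
            scopes.append({})     # definitions made during the call live here
            try:
                table[name]()
            finally:
                scopes.pop()      # local definitions vanish on return
            return
    raise NameError(name)

def which():
    call("fun")

def toplevel_fun():
    output.append("toplevel fun!")

def callee():
    # Local definition: visible to everything invoked from this call.
    define("fun", lambda: output.append("local fun!"))
    call("which")

define("which", which)
define("fun", toplevel_fun)
define("callee", callee)

call("callee")
call("which")
```

Because lookup walks the call stack rather than the place where "which" was written, the call made through callee finds the local "fun" first, while the direct call finds only the toplevel one.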
+
.SH OUTPUT
A