Diffstat (limited to 'txr.1')
-rw-r--r-- | txr.1 | 319 |
1 files changed, 300 insertions, 19 deletions
@@ -21,7 +21,7 @@ .\"IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED .\"WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. -.TH txr 1 2009-09-09 "txr v. 013" "Text Extraction Utility" +.TH txr 1 2009-09-09 "txr v. 014" "Text Extraction Utility" .SH NAME txr \- text extractor .SH SYNOPSIS @@ -601,8 +601,9 @@ The general syntax of a directive is: @EXPR where expr is a parenthesized list of subexpressions. A subexpression -is an symbol, number, regular expression, or a parenthesized expression. -So, examples of valid directives are: +is a symbol, number, string literal, character literal, regular expression, or +a parenthesized expression. So, examples of syntactically valid directives +are: @(banana) @@ -610,11 +611,18 @@ So, examples of valid directives are: @( a (b (c d) (e ) )) + @("apple" 'b' 3) + @(a /[a-z]*/ b) A symbol is lexically the same thing as a variable and the same rules apply. Tokens that look like numbers are treated as numbers. +String and character literals are delimited by double and single quotes, +respectively, and may not span multiple lines. Character literals must contain +exactly one character. Character and numeric escapes may be used within +literals to escape the quotes, and to denote control characters. + Some directives are involved in structuring the overall syntax of the query. There are syntactic constraints that depend on the directive. For instance the @@ -630,7 +638,7 @@ Continue matching in another file. .IP @(block) The remaining query is treated as an anonymous or named block. Blocks may be referenced by @(accept) and @(fail) directives. -Blocks are discussed in the section Blocks below. +Blocks are discussed in the section BLOCKS below. .IP @(skip) Treat the remaining query as a subquery unit, and search the lines of @@ -651,7 +659,15 @@ Match some clauses in parallel. Each one must match. Match some clauses in parallel. None must match. .IP @(maybe) -Match some clauses in parallel. 
None must match. +Match some clauses in parallel, which may or may not match. +No failure occurs if none match. + +.IP @(cases) +Match some clauses sequentially, stopping if one of them +matches successfully. + +.IP @(define NAME ( ARGUMENTS ...)) +Introduces a function. Functions are discussed in the FUNCTIONS section below. .IP @(collect) Search the data for multiple matches of a clause. Collect the @@ -671,17 +687,17 @@ Separator of clauses for @(some), @(all), and @(none). Equivalent to @(and). Choice is stylistic. .IP @(end) -Required terminator for @(some), @(all), @(none), @(maybe), @(collect), -@(output), and @(repeat). +Required terminator for @(some), @(all), @(none), @(maybe), @(cases), +@(collect), @(output), and @(repeat). .IP @(fail) Terminate the processing of a block, as if it were a failed match. -Blocks are discussed in the section Blocks below. +Blocks are discussed in the section BLOCKS below. .IP @(accept) Terminate the processing of a block, as if it were a successful match. What bindings emerge may depend on the kind of block: collect -has special semantics. Blocks are discussed in the section Blocks below. +has special semantics. Blocks are discussed in the section BLOCKS below. .IP @(flatten) Normalizes a set of specified variables to one-dimensional lists. Those @@ -890,7 +906,7 @@ line "a dark". The @(some) clause combines the text line "it", and a @(none) clause which contains just one clause consisting of the line "was". -The semantics of the some, all, none and maybe directives is: +The semantics of the some, all, none, maybe and cases directives is: .IP @(all) Each of the clauses is matched at the current position. If any of the @@ -914,12 +930,20 @@ The directive succeeds even if all of the clauses fail. Whatever bindings are found in any of the clauses are retained. 
-When a @(some) or @(all) directive matches successfully, or a @(maybe) -directive matches something, the query advances by the greatest number of lines -matched in any of the subclauses. For instance if there are two subclauses, and -one of them matches three lines, but the other one matches five lines, then the -overall clause is considered to have made a five line match at its position. If -more directives follow, they begin matching five lines down from that position. +.IP @(cases) +The clauses are matched, in order, at the current position. +If any clause matches, the matching stops and the bindings +collected from that clause are retained. Any remaining clauses +after that one are not processed. If no clause matches, the +directive fails, and produces no bindings. + +When a @(some), @(all), or @(cases) directive matches successfully, or a +@(maybe) directive matches in at least one of its clauses, the query advances +by the greatest number of lines matched in any of the subclauses. For instance +if there are two subclauses, and one of them matches three lines, but the other +one matches five lines, then the overall clause is considered to have made a +five line match at its position. If more directives follow, they begin matching +five lines down from that position. .SS The Collect Directive @@ -1253,6 +1277,15 @@ to match B, or the bind fails. Matching means that either - A and B are lists and are either identical, or one is found as substructure within the other. +The right hand side does not have to be a variable. It may be some other +object, like a string, or list of strings, et cetera. For instance + + @(bind A "ab\tc") + +will bind the string "ab\tc" (the letter a, b, a tab character, and c) +to the variable A if A is unbound. If A is bound, this will fail unless +A already contains an identical string. + The left hand side of a bind can be a nested list pattern containing variables. 
The last item of a list at any nesting level can be preceded by a dot, which means that the variable matches the rest of the list from that position. @@ -1280,7 +1313,8 @@ The @(block NAME) directive introduces a named block, except when the name is the word nil. The @(block) directive introduces an unnamed block, equivalent to @(block nil). -The @(skip) and @(collect) directives introduce implicit anonymous blocks. +The @(skip) and @(collect) directives introduce implicit anonymous blocks, +as do function bodies. .SS Block Scope @@ -1384,12 +1418,12 @@ that block until this point emerge from that block. .IP @(accept) Immediately terminate the innermost enclosing anonymous block, as if -that block successfully mached. Any bindings established within +that block successfully matched. Any bindings established within that block until this point emerge from that block. If the implicit block introduced by @(skip) is terminated in this manner, this has the effect of causing the skip itself to succeed, as if -all of the trailing material succesfully matched. +all of the trailing material successfully matched. If the implicit block associated with a @(collect) is terminated this way, then the collection stops. All bindings collected in the current iteration of @@ -1544,6 +1578,253 @@ The second clause grabs four lines, which is the longest match. And so, the next line of input available for matching is 5, which goes to the @second variable. +.SH FUNCTIONS + +.SS Introduction + +.B txr +functions allow a query to be structured to avoid repetition. +On a theoretical note, because +.B txr +functions support recursion, functions enable txr to match some +kinds of patterns which exhibit self-embedding, or nesting, +and thus cannot be matched by a regular language. + +Functions in +.B txr +are not exactly like functions in mathematics or functional languages, and are +not like procedures in imperative programming languages. They are not exactly +like macros either. 
What it means for a +.B txr +function to take arguments and produce a result is different from +the conventional notion of a function. + +A +.B txr +function may have one or more parameters. When such a function is invoked, an +argument must be specified for each parameter. However, a special behavior is +at play here. Namely, some or all of the argument expressions may be unbound +variables. In that case, the corresponding parameters behave like unbound +variables also. Thus +.B txr +function calls can transmit the "unbound" state from argument to parameter. + +It should be mentioned that functions have access to all bindings that are +visible in the caller; functions may refer to variables which are not +mentioned in their parameter list. + +With regard to returning, +.B txr +functions are also unconventional. If the function fails, then the function +call is considered to have failed. The function call behaves like a kind of +match; if the function fails, then the call is like a failed match. + +When a function call succeeds, then the bindings emanating from that function +are processed specially. Firstly, any bindings for variables which do not +correspond to one of the function's parameters are thrown away. Functions may +internally bind arbitrary variables in order to get their job done, but only +those variables which are named in the function argument list may propagate out +of the function call. Thus, a function with no arguments can only indicate +matching success or failure, but not produce any bindings. Secondly, +variables do not propagate out of the function directly, but undergo +a renaming. For each parameter which went into the function as an unbound +variable (because its corresponding argument was an unbound variable), +if that parameter now has a value, that value is bound onto the corresponding +argument. 
+ +Example: + + @(define collect_words (list)) + @(coll)@{list /[^ \t]+/}@(end) + @(end) + +The above function "collect_words" contains a query which collects words from a +line (sequences of characters other than space or tab), into the list variable +called "list". This variable is named in the parameter list of the function; +therefore its value, if it has one, is permitted to escape from the function +call. + +Suppose the input data is: + + Fine summer day + +and the function is called like this: + + @(collect_words wordlist) + +The result is: + + wordlist[0]=Fine + wordlist[1]=summer + wordlist[2]=day + +How it works is that in the function call @(collect_words wordlist), +"wordlist" is an unbound variable. The parameter corresponding to that +unbound variable is the parameter "list". Therefore, that parameter +is unbound over the body of the function. The function body collects the +words of "Fine summer day" into the variable "list", and then +yields that binding. Then the function call completes by +noticing that the function parameter "list" now has a binding, and +that the corresponding argument "wordlist" has no binding. The binding +is thus transferred to the "wordlist" variable. After that, the +bindings produced by the function are thrown away. The only enduring +effects are: + +.IP - +the function matched and consumed some input; and + +.IP - +the function succeeded; and + +.IP - +the wordlist variable now has a binding. +.PP + +Another way to understand the parameter behavior is that function +parameters behave like proxies which represent their arguments. If an argument +is an established value, such as a character string or bound variable, the +parameter is a proxy for that value and behaves just like that value. If an +argument is an unbound variable, the function parameter acts as a proxy +representing that unbound variable. 
The effect of binding the proxy is +that the variable becomes bound, an effect which is settled when the +function goes out of scope. + +Within the function, both the original variable and the proxy are +visible simultaneously, and are independent. What if a function binds both of +them? Suppose a function has a parameter called P, which is called +with an argument A, and then in the function @A and @P are bound. This is +permitted, and they can even be bound to different values. However, when the +function terminates, the local binding of A simply disappears (because, +remember, the symbol A is not a member of the list of parameters). +Only the value bound to P emerges, and is bound to A, which still appears +unbound at that point. + +.SS Definition Syntax + +A function definition begins with a @(define ...) directive which must be the +only element in its line. The define must be followed by a symbol, which is the +name of the function being defined. After the symbol, there is a parenthesized +optional argument list. If there is no such list, or if the list is specified +as () or the symbol "nil" then the function has no parameters. Examples of +valid define syntax are: + + @(define foo) + @(define bar ()) + @(define match (a b c)) + +The define directive may be followed directly by the @(end) directive, +also on a line by itself, in which case the function has an empty body. +Or it may be followed by one or more query lines and then @(end). +What is between a @(define ...) and its matching @(end) constitutes the +function body. + +Functions may be nested within function bodies. Such local functions have +dynamic scope. They are visible in the function body in which they are defined, +and in any functions invoked from that body. + +The body of a function is an anonymous block. (See BLOCKS above). + +The following trivial function b produces no bindings and has a body which +simply matches the line "begin". 
+ + @(define b) + begin + @(end) + +Thus the call: + + @(b) + +matches an input line "begin". + +.SS Call Syntax + +A function is invoked by a compound directive whose first symbol is the name of +that function. Additional elements in the directive are the arguments. +Arguments may be symbols, or other objects like string and character +literals. + +Example: + + Query: @(define pair (a b)) + @a @b + @(end) + @(pair first second) + @(pair "ice" cream) + + Data: one two + ice milk + + Output: first="one" + second="two" + cream="milk" + +The first call to the function takes the line "one two". The parameter "a" +takes "one" and parameter "b" takes "two". These are rebound to the arguments +first and second. The second call to the function binds the "a" parameter +to the word "ice", and the "b" parameter is unbound, because the +corresponding argument "cream" is unbound. Thus inside the function, @a +is forced to match "ice". Then a space is matched and @b collects the text +"milk". When the function returns, the unbound "cream" variable gets this value. + +If a symbol occurs multiple times in the argument list, it constrains +the corresponding parameters to bind to the same value. That is to say, all parameters +which, in the body of the function, bind a value, and which are all derived +from the same argument symbol, must bind to the same value. This is settled when +the function terminates, not while it is matching. Example: + + Query: @(define pair (a b)) + @a @b + @(end) + @(pair same same) + + Data: one two + + Output: [query fails, prints "false"] + +.SS Nested Functions + +Function definitions may appear in a function. Such definitions +are visible in all functions which are invoked from the body +(and not necessarily enclosed in the body). In other words, the +scope is dynamic, not lexical. Inner definitions shadow outer +definitions. This means that a caller can redirect the function +calls that take place in a callee, by defining local functions +which capture the references. 
+ +Example: + + Query: @(define which) + @ (fun) + @(end) + @(define fun) + @ (output) + toplevel fun! + @ (end) + @(end) + @(define callee) + @ (define fun) + @ (output) + local fun! + @ (end) + @ (end) + @ (which) + @(end) + @(callee) + @(which) + + Output: local fun! + toplevel fun! + +Here, a function "which" is defined, which calls "fun". +A toplevel definition of "fun" is introduced which +outputs "toplevel fun!". +The function "callee" provides its own local definition +of "fun" which outputs "local fun!" before calling "which". +When callee is invoked, it calls @(which), whose @(fun) call is routed to +callee's local definition. When @(which) is called directly from the top +level, its @(fun) call goes to the toplevel definition. + .SH OUTPUT A |