Re: TXR Pattern Matching/ PEG Parsing

Author: Kaz Kylheku
Date:  
To: Yves Cloutier
CC: txr-users
Subject: Re: TXR Pattern Matching/ PEG Parsing

Hi Yves,

There isn't a sed- or awk-like "text filter" mode in TXR (yet) whereby we could catch certain text by patterns and do something with it, but have the non-matching text "pass through".  The big reason I have not put that in is that it has to work well within the framework of the pattern matching machine, which can back-track.

I have been toying with the idea of having a Snobol-like hack whereby a certain variable (say the symbol t) could have the behavior that if we assign to it or try to bind it, it sends the value to standard output instead. This would have to be used very carefully, because if it is embedded in a rule which is applied multiple times as the machine scans and back-tracks, duplicate output would be produced. The dilemma is: solve the duplicate output problem, or just leave it like that and have programmers fend for themselves.

About the example below and your question, what the parser gives us is a pattern function @(expr) which we can then use wherever we want. (Better versions of this take an argument so we get something out of it other than recognition.)

By the way, speaking about PEGs, I actually had no idea what PEGs are when I started this project! I read about PEGs one day and realized that this is what TXR is doing. :)

Suppose we wanted to filter a file and pass through its text, while recognizing an expression in lines that begin with ".foo:". We could do that with a repeat, say:

@(repeat)
@  (cases)
.foo: @(expr)
@    (bind out "whatever")
@  (or)
@out
@  (end)
@  (do (put-line out))
@(end)

The idea here is that various cases in the repeat all bind a common out variable, which we can spit out.  Various other approaches are possible, of course.
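
As a rough, untested sketch, the fragment above can be put together with the grammar definitions from your example (quoted below) to form one complete script. The "whatever" string is still just a placeholder, and the file names used afterwards are made up for illustration:

```txr
@; Grammar: horizontal pattern functions, taken from the quoted example.
@(define os)@/ */@(end)
@(define mulop)@(os)@/[*\/]/@(os)@(end)
@(define addop)@(os)@/[+\-]/@(os)@(end)
@(define number)@(os)@/[0-9]+/@(os)@(end)
@(define ident)@(os)@/[A-Za-z]+/@(os)@(end)
@(define factor)@(cases)(@(expr))@(or)@(number)@(or)@(ident)@(end)@(end)
@(define term)@(some)@(factor)@(or)@(factor)@(mulop)@(term)@(or)@(addop)@(factor)@(end)@(end)
@(define expr)@(some)@(term)@(or)@(term)@(addop)@(expr)@(end)@(end)
@; Filter: every line binds out in one case or the other, then gets printed.
@(repeat)
@  (cases)
.foo: @(expr)
@    (bind out "whatever")
@  (or)
@out
@  (end)
@  (do (put-line out))
@(end)
```

Saved as, say, filter.txr, this would be run as "txr filter.txr input.txt". The intent is that a line such as ".foo: 1 + 2 * x" is replaced by the placeholder, while any other line, including a ".foo:" line whose remainder is not a complete expression, falls through to the second case and is printed unchanged.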

In the web scraping script which pulls the Rosetta examples from the Rosetta site, I actually parse a subset of the MediaWiki markup language, pulled from the HTML. The markup is extracted in a simple-ish way and captured into a string. Then I do some PEG-style matching on that string, rather than directly on the stream.

We can use @(next :list ...) or @(next :string ...) to scan over a list of strings or over a single string, and there is also a way to call a pattern function out of TXR Lisp, using match-fun. It's a somewhat clumsy API, but it can be wrapped with functions or macros.  With these approaches we can do things in stages: pull out some coarse-grained area of an input, and then do some fine-grained work on it, so to speak.
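
A minimal, untested sketch of the second stage, assuming the coarse-grained stage has already captured some text into a variable. The variable name chunk and its key=value content are made up here, and a plain @(bind) stands in for the first stage:

```txr
@; Stand-in for stage one: pretend this text was captured from a larger input.
@(bind chunk "alpha=1\nbeta=2")
@; Stage two: make the captured string the input, then match it line by line.
@(next :string chunk)
@(collect)
@key=@val
@(end)
@; Report what the fine-grained pass found, one pair per line.
@(output)
@  (repeat)
@key -> @val
@  (end)
@(end)
```

The same shape should work with @(next :list ...) when the first stage produces a list of strings rather than one string.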

On 24.09.2014 17:28, Yves Cloutier wrote:

Hi List,

In the following example:

@(next :args)
@(define os)@/ */@(end)@; os -> optional space
@(define mulop)@(os)@/[*\/]/@(os)@(end)
@(define addop)@(os)@/[+\-]/@(os)@(end)
@(define number)@(os)@/[0-9]+/@(os)@(end)
@(define ident)@(os)@/[A-Za-z]+/@(os)@(end)
@(define factor)@(cases)(@(expr))@(or)@(number)@(or)@(ident)@(end)@(end)
@(define term)@(some)@(factor)@(or)@(factor)@(mulop)@(term)@(or)@(addop)@(factor)@(end)@(end)
@(define expr)@(some)@(term)@(or)@(term)@(addop)@(expr)@(end)@(end)
@(cases)
@  (expr)
@  (output)
parses!
@  (end)
@(or)
@  (expr)@bad
@  (output)
error starting at "@bad"
@  (end)
@(end)

I understand that we are defining a grammar which recognizes, in this case, a mathematical expression.

How would I define a grammar in a similar way, but also take into account "string" text that is not part of the grammar, and just let it be output as is when it is encountered?

In the above example:

@  (output)
parses!
@  (end)

outputs hard-coded text.  But how would I output text from a stream or input file, without binding it to variables?
Thank you! And again, thank you for your patience and hand-holding :)