* Makefile (%.expected): New implicit rule. Whenever a test
requires a .expected file that is missing, we create an empty
one. This file is treated as an intermediate by GNU Make,
which means that it is deleted when make terminates. (A sketch
of such a rule appears after the file list below.)
* tests/012/compile.tl: Some of the .tl files no longer have
an .expected file, so we have to test for that in the
catenating logic.
* tests/008/call-2.expected,
* tests/008/no-stdin-hang.expected,
* tests/011/macros-3.expected,
* tests/011/patmatch.expected,
* tests/012/aseq.expected,
* tests/012/ashwin.expected,
* tests/012/compile.expected,
* tests/012/cont.expected,
* tests/012/defset.expected,
* tests/012/ifa.expected,
* tests/012/oop-seq.expected,
* tests/012/parse.expected,
* tests/012/quasi.expected,
* tests/012/quine.expected,
* tests/012/seq.expected,
* tests/012/struct.expected,
* tests/012/stslot.expected,
* tests/014/dgram-stream.expected,
* tests/014/in6addr-str.expected,
* tests/014/inaddr-str.expected,
* tests/014/socket-basic.expected,
* tests/015/awk-fconv.expected,
* tests/015/split.expected,
* tests/015/trim.expected,
* tests/016/arith.expected,
* tests/016/ud-arith.expected,
* tests/017/ffi-misc.expected,
* tests/018/chmod.expected: Empty file deleted.
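For illustration only, such an implicit rule can be as simple
as the following sketch (the actual recipe in the Makefile may
differ):

# Sketch: a test that needs a missing .expected file gets an
# empty one.  Being produced by an implicit rule, the file is
# treated as an intermediate by GNU Make and is deleted again
# when make terminates.
%.expected:
	touch $@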
The main idea in this commit is to change a behavior of the
lexer, and take advantage of it in the parser. Currently, the
lexer recognizes a {UANYN} pattern in two places. That
pattern matches a UTF-8 character. The lexeme is passed to
the decoder, which is expected to produce exactly one wide
character. If the UTF-8 is bad (for instance, an encoding of
a code in the surrogate range), the decoder produces multiple
characters, mapping the offending bytes into the U+DCxx
range. In that case, these rules return ERRTOK instead of a
LITCHAR or REGCHAR. The idea is: why don't we just return
those characters as a TEXT token? Then we can incorporate
that token into the literal or regex.
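For illustration, the lexer-side change amounts to something
like the following sketch (decode_utf8, copy_wchars and the
yylval members are placeholder names, not TXR's actual code):

/* Sketch only; helper names and yylval members are placeholders. */
{UANYN} {
          wchar_t wchars[8];
          size_t n = decode_utf8(wchars, yytext, yyleng);

          if (n == 1) {
            yylval.chr = wchars[0];  /* good UTF-8: one wide char, as before */
            return LITCHAR;
          }
          /* bad UTF-8: the decoder mapped the bytes to several U+DCxx
             chars; return them as a TEXT token instead of ERRTOK. */
          yylval.lexeme = copy_wchars(wchars, n);
          return TEXT;
        }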
* parser.l (grammar): If a UANYN lexeme decodes to multiple
characters instead of the expected one, then produce a
TEXT token instead of complaining about invalid UTF-8 bytes.
* parser.y (regterm): Recognize a TEXT item as a regterm,
converting its string value to a compound node in the regex
AST, so it will be correctly treated as a fixed pattern (see
the sketch at the end of this entry).
(chrlit): A hash-backslash followed by a TEXT token, which
can now happen, is invalid; we diagnose it as invalid UTF-8.
(quasi_item): Remove the TEXT rule, because the litchars
constituent now generates TEXT.
(litchars, restlitchar): Recognize a TEXT item, similarly to
regterm.
* tests/012/parse.tl: New file.
* tests/012/parse.expected: Likewise.
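For illustration, the grammar-side handling described above
amounts to something like this sketch (the token name
HASH_BACKSLASH and the helper compound_from_string are
placeholders, not the actual parser.y definitions):

/* Sketch only; token and helper names are placeholders. */
regterm : TEXT   { /* several decoded chars: treat the string as a
                      fixed pattern by converting it to a compound
                      node over its characters */
                   $$ = compound_from_string($1); }
        ;

chrlit  : HASH_BACKSLASH TEXT
                 { /* #\ followed by multiple chars can only come
                      from invalid UTF-8 */
                   yyerror("invalid UTF-8 in character literal"); }
        ;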