From 4b088c75d89e8cbcdc07dec40036fd33995946d3 Mon Sep 17 00:00:00 2001
From: Kaz Kylheku
Date: Thu, 8 Apr 2021 17:49:39 -0700
Subject: parser: allow funny UTF-8 in regexes and literals.

The main idea in this commit is to change a behavior of the
lexer, and take advantage of it in the parser. Currently, the
lexer recognizes a {UANYN} pattern in two places. That pattern
matches a UTF-8 character. The lexeme is passed to the decoder,
which is expected to produce exactly one wide character. If the
UTF-8 is bad (for instance, a code in the surrogate pair range
U+DCxx) then the decoder will produce multiple characters. In
that case, these rules return ERRTOK instead of a LITCHAR or
REGCHAR. The idea is: why don't we just return those characters
as a TEXT token? Then we can just incorporate that into the
literal or regex.

* parser.l (grammar): If a UANYN lexeme decodes to multiple
characters instead of the expected one, then produce a TEXT
token instead of complaining about invalid UTF-8 bytes.

* parser.y (regterm): Recognize a TEXT item as a regterm,
converting its string value to a compound node in the regex
AST, so it will be correctly treated as a fixed pattern.
(chrlit): If a hash-backslash is followed by a TEXT token,
which can happen now, that is invalid; we diagnose that as
invalid UTF-8.
(quasi_item): Remove TEXT rule, because the litchars
constituent now generates TEXT.
(litchars, restlistchar): Recognize TEXT item, similarly to
regterm.

* tests/012/parse.tl: New file.

* tests/012/parse.expected: Likewise.
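
The lexer behavior described in this message can be sketched outside of
TXR. The following is a minimal Python illustration, not the project's
code: the real lexer and decoder are written in C, and the name classify
is hypothetical. It assumes that Python's "surrogateescape" error handler,
which maps each undecodable byte 0xNN to U+DCNN, is an acceptable stand-in
for the decoder; for the bytes ED B0 81 it yields the same three code
points that the new test case expects.

```python
# Hypothetical sketch; not TXR's implementation, which is in C.
def classify(lexeme: bytes):
    """Decode one lexeme and choose a token kind from the decoded length."""
    # Assumption: surrogateescape stands in for the decoder. It turns each
    # undecodable byte 0xNN into the lone surrogate U+DCNN.
    decoded = lexeme.decode("utf-8", errors="surrogateescape")
    # One decoded character is the normal case (LITCHAR here; the regex
    # rule would yield REGCHAR). More than one becomes TEXT, not ERRTOK.
    kind = "LITCHAR" if len(decoded) == 1 else "TEXT"
    return kind, decoded


if __name__ == "__main__":
    print(classify(b"a"))
    print(classify(bytes.fromhex("EDB081")))
```

The token kind depends only on how many characters the decoder produced,
which is why the parser can treat the TEXT case as a fixed string.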
---
 tests/012/parse.expected | 0
 tests/012/parse.tl       | 7 +++++++
 2 files changed, 7 insertions(+)
 create mode 100644 tests/012/parse.expected
 create mode 100644 tests/012/parse.tl

(limited to 'tests/012')

diff --git a/tests/012/parse.expected b/tests/012/parse.expected
new file mode 100644
index 00000000..e69de29b
diff --git a/tests/012/parse.tl b/tests/012/parse.tl
new file mode 100644
index 00000000..8e3e7afc
--- /dev/null
+++ b/tests/012/parse.tl
@@ -0,0 +1,7 @@
+(load "../common")
+
+(test (read `"@(str-buf #b'EDB081')"`)
+      "\xDCED\xDCB0\xDC81")
+
+(test (regex-parse (str-buf #b'EDB081'))
+      (compound "\xDCED\xDCB0\xDC81"))
--
cgit v1.2.3