parser: allow funny UTF-8 in regexes and literals.

The main idea in this commit is to change a behavior of the lexer and take advantage of it in the parser.

Currently, the lexer recognizes a {UANYN} pattern in two places. That pattern matches a UTF-8 character. The lexeme is passed to the decoder, which is expected to produce exactly one wide character. If the UTF-8 is bad (for instance, a code in the surrogate pair range U+DCxx), then the decoder will produce multiple characters. In that case, these rules return ERRTOK instead of a LITCHAR or REGCHAR.

The idea is: why don't we just return those characters as a TEXT token? Then we can just incorporate that into the literal or regex.

* parser.l (grammar): If a UANYN lexeme decodes to multiple characters instead of the expected one, then produce a TEXT token instead of complaining about invalid UTF-8 bytes.

* parser.y (regterm): Recognize a TEXT item as a regterm, converting its string value to a compound node in the regex AST, so it will be correctly treated as a fixed pattern.
(chrlit): If a hash-backslash is followed by a TEXT token, which can happen now, that is invalid; we diagnose that as invalid UTF-8.
(quasi_item): Remove TEXT rule, because the litchars constituent now generates TEXT.
(litchars, restlistchar): Recognize TEXT item, similarly to regterm.

* tests/012/parse.tl: New file.

* tests/012/parse.expected: Likewise.
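The lexer decision described above can be sketched outside of TXR. The following is only an illustrative Python analogue, not the code in parser.l: the function names `decode_lexeme` and `classify_lexeme` are hypothetical, and Python's `surrogateescape` error handler is assumed here as a stand-in for TXR's decoder, since it likewise maps each undecodable byte 0xNN to the code point U+DCNN.

```python
def decode_lexeme(raw: bytes) -> str:
    """Decode a lexeme, mapping each invalid byte 0xNN to U+DCNN.

    Assumption: Python's surrogateescape handler stands in for the
    TXR decoder; the two are similar in effect but are not the same code.
    """
    return raw.decode("utf-8", errors="surrogateescape")


def classify_lexeme(raw: bytes) -> tuple[str, str]:
    """Hypothetical model of the new lexer rule.

    One decoded character  -> an ordinary character token (LITCHAR here).
    More than one          -> a TEXT token carrying all decoded characters,
                              where the old behavior returned ERRTOK.
    """
    chars = decode_lexeme(raw)
    if len(chars) == 1:
        return ("LITCHAR", chars)
    return ("TEXT", chars)


if __name__ == "__main__":
    # Valid two-byte UTF-8 sequence: decodes to exactly one character.
    print(classify_lexeme(b"\xc3\xa9"))
    # ED B0 81 would encode a surrogate, which is invalid UTF-8, so each
    # byte is escaped separately and three characters come back.
    print(classify_lexeme(b"\xed\xb0\x81"))
```

The second call uses the same bytes, EDB081, as the new test case, and the three escaped code points correspond to the "\xDCED\xDCB0\xDC81" string expected there.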
author: Kaz Kylheku <kaz@kylheku.com> 2021-04-08 17:49:39 -0700
committer: Kaz Kylheku <kaz@kylheku.com> 2021-04-08 17:49:39 -0700
commit: 4b088c75d89e8cbcdc07dec40036fd33995946d3 (patch)
tree: f749fee637ea544c3f404a14e7294099968a4dbb /tests
parent: cea5c956486b8acae4bf5a23f0148d6b85d9acd3 (diff)
download: txr-4b088c75d89e8cbcdc07dec40036fd33995946d3.tar.gz
txr-4b088c75d89e8cbcdc07dec40036fd33995946d3.tar.bz2
txr-4b088c75d89e8cbcdc07dec40036fd33995946d3.zip
2 files changed, 7 insertions, 0 deletions
diff --git a/tests/012/parse.expected b/tests/012/parse.expected
new file mode 100644
index 00000000..e69de29b
--- /dev/null
+++ b/tests/012/parse.expected
diff --git a/tests/012/parse.tl b/tests/012/parse.tl
new file mode 100644
index 00000000..8e3e7afc
--- /dev/null
+++ b/tests/012/parse.tl
@@ -0,0 +1,7 @@
+(load "../common")
+
+(test (read `"@(str-buf #b'EDB081')"`)
+      "\xDCED\xDCB0\xDC81")
+
+(test (regex-parse (str-buf #b'EDB081'))
+      (compound "\xDCED\xDCB0\xDC81"))