The main idea in this commit is to change a behavior of the
lexer, and take advantage of it in the parser. Currently, the
lexer recognizes a {UANYN} pattern in two places. That
pattern matches a UTF-8 character. The lexeme is passed to
the decoder, which is expected to produce exactly one wide
character. If the UTF-8 is bad (for instance, it encodes
a code in the surrogate range, U+DCxx), then the decoder
produces multiple characters. In that case, these rules
return ERRTOK instead of a LITCHAR or REGCHAR. The idea
is: why not simply return those characters as a TEXT
token? Then the parser can incorporate them into the
literal or regex.
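
A rough sketch of the lexer-side change, in flex/C style
(the decoder interface, helper names, and semantic value
fields here are hypothetical stand-ins, not the verbatim
parser.l code):

  {UANYN} {
    wchar_t wch[8];
    /* hypothetical decoder: returns the number of wide
       characters produced from the UTF-8 lexeme */
    size_t n = utf8_decode_buf(wch, yytext, yyleng);

    if (n == 1) {
      yylval.chr = wch[0];
      return LITCHAR; /* or REGCHAR, in the regex state */
    }

    /* bad UTF-8: each offending byte decoded to a U+DCxx
       character; return them all as a single TEXT token
       rather than ERRTOK */
    yylval.lexeme = wcs_to_lexeme(wch, n); /* hypothetical */
    return TEXT;
  }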
* parser.l (grammar): If a UANYN lexeme decodes to multiple
characters instead of the expected one, then produce a
TEXT token instead of complaining about invalid UTF-8 bytes.
* parser.y (regterm): Recognize a TEXT item as a regterm,
converting its string value to a compound node in the regex
AST, so it will be correctly treated as a fixed pattern
(see the sketch after these entries).
(chrlit): If a hash-backslash is followed by a TEXT token,
which can happen now, that is invalid; we diagnose that
as invalid UTF-8.
(quasi_item): Remove TEXT rule, because the litchars
constituent now handles TEXT.
(litchars, restlistchar): Recognize TEXT item, similarly to
regterm.
* tests/012/parse.tl: New file.
* tests/012/parse.expected: Likewise.
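
The parser-side handling, sketched in bison style (the
rule shapes, helper functions, and diagnostic text are
illustrative, not the exact parser.y code):

  regterm : /* ... existing alternatives ... */
          | TEXT  { /* "ab" becomes the regex AST node
                       (compound #\a #\b), which behaves
                       as a fixed pattern */
                    $$ = cons(compound_s,
                              text_to_chr_list($1)); }
          ;

  chrlit : HASH_BACKSLASH TEXT
                  { /* multiple characters after #\ can
                       only mean invalid UTF-8 bytes */
                    yyerr("invalid UTF-8 in character literal");
                    $$ = nil; }
          ;

litchars and restlistchar would gain a TEXT alternative
of the same shape as the regterm one.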