* Makefile (%.expected): New implicit rule. Whenever a test
requires a .expected file that is missing, we create an empty
one. This file is treated as an intermediate by GNU Make,
which means that it is deleted when make terminates. (A sketch
of such a rule appears after the file list below.)
* tests/012/compile.tl: Some of the .tl files no longer have
an .expected file, so we have to test for that in the
catenating logic.
* tests/008/call-2.expected,
* tests/008/no-stdin-hang.expected,
* tests/011/macros-3.expected,
* tests/011/patmatch.expected,
* tests/012/aseq.expected,
* tests/012/ashwin.expected,
* tests/012/compile.expected,
* tests/012/cont.expected,
* tests/012/defset.expected,
* tests/012/ifa.expected,
* tests/012/oop-seq.expected,
* tests/012/parse.expected,
* tests/012/quasi.expected,
* tests/012/quine.expected,
* tests/012/seq.expected,
* tests/012/struct.expected,
* tests/012/stslot.expected,
* tests/014/dgram-stream.expected,
* tests/014/in6addr-str.expected,
* tests/014/inaddr-str.expected,
* tests/014/socket-basic.expected,
* tests/015/awk-fconv.expected,
* tests/015/split.expected,
* tests/015/trim.expected,
* tests/016/arith.expected,
* tests/016/ud-arith.expected,
* tests/017/ffi-misc.expected,
* tests/018/chmod.expected: Empty file deleted.
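For illustration only, such an implicit rule can be as simple
as the following sketch (the actual recipe in the Makefile may
differ):

# Sketch: a test that needs a missing .expected file gets an
# empty one.  Being produced by an implicit rule, the file is
# treated as an intermediate by GNU Make and is deleted again
# when make terminates.
%.expected:
	touch $@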
The main idea in this commit is to change a behavior of the
lexer, and take advantage of it in the parser. Currently, the
lexer recognizes a {UANYN} pattern in two places. That
pattern matches a UTF-8 character. The lexeme is passed to
the decoder, which is expected to produce exactly one wide
character. If the UTF-8 is bad (for instance, an encoding of
a code in the surrogate range), the decoder produces multiple
characters, mapping the offending bytes into the U+DCxx
range. In that case, these rules return ERRTOK instead of a
LITCHAR or REGCHAR. The idea is: why don't we just return
those characters as a TEXT token? Then we can incorporate
that token into the literal or regex.
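For illustration, the lexer-side change amounts to something
like the following sketch (decode_utf8, copy_wchars and the
yylval members are placeholder names, not TXR's actual code):

/* Sketch only; helper names and yylval members are placeholders. */
{UANYN} {
          wchar_t wchars[8];
          size_t n = decode_utf8(wchars, yytext, yyleng);

          if (n == 1) {
            yylval.chr = wchars[0];  /* good UTF-8: one wide char, as before */
            return LITCHAR;
          }
          /* bad UTF-8: the decoder mapped the bytes to several U+DCxx
             chars; return them as a TEXT token instead of ERRTOK. */
          yylval.lexeme = copy_wchars(wchars, n);
          return TEXT;
        }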
* parser.l (grammar): If a UANYN lexeme decodes to multiple
characters instead of the expected one, then produce a
TEXT token instead of complaining about invalid UTF-8 bytes.
* parser.y (regterm): Recognize a TEXT item as a regterm,
converting its string value to a compound node in the regex
AST, so it will be correctly treated as a fixed pattern (see
the sketch at the end of this entry).
(chrlit): A hash-backslash followed by a TEXT token, which
can now happen, is invalid; we diagnose it as invalid UTF-8.
(quasi_item): Remove the TEXT rule, because the litchars
constituent now generates TEXT.
(litchars, restlitchar): Recognize a TEXT item, similarly to
regterm.
* tests/012/parse.tl: New file.
* tests/012/parse.expected: Likewise.
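For illustration, the grammar-side handling described above
amounts to something like this sketch (the token name
HASH_BACKSLASH and the helper compound_from_string are
placeholders, not the actual parser.y definitions):

/* Sketch only; token and helper names are placeholders. */
regterm : TEXT   { /* several decoded chars: treat the string as a
                      fixed pattern by converting it to a compound
                      node over its characters */
                   $$ = compound_from_string($1); }
        ;

chrlit  : HASH_BACKSLASH TEXT
                 { /* #\ followed by multiple chars can only come
                      from invalid UTF-8 */
                   yyerror("invalid UTF-8 in character literal"); }
        ;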