From 378318ef010dfb15045dfadf242231793d1434de Mon Sep 17 00:00:00 2001
From: Kaz Kylheku <kaz@kylheku.com>
Date: Fri, 20 May 2022 22:11:06 -0700
Subject: utf8: bugfix: trailing char fragment ignored.

After "years of trouble-free operation", a bug was found in the
UTF-8 decoder. It violates the decoder's property that any
sequence of bytes decodes to some kind of string which encodes
back to the original bytes.

When the UTF-8 data prematurely ends in the middle of a valid
character, the decoder just drops that data as if it didn't
exist. So for instance the two-byte sequence E6 BC should
decode to "\xDCE6\xDCBC", since it is a fragment of a three-byte
UTF-8 sequence. It actually decodes to the empty string.

* utf8.c (utf8_from_buffer): When the buffer is exhausted, if we are
not in the utf8_init state, it means we were in the middle of a
UTF-8 sequence. Walk the bytes from the backtrack point to the end
of the buffer and store them into the string as U+DCxx codes.

* tests/012/buf.tl: Tests added for this via buf-str, str-buf.
---
 tests/012/buf.tl | 6 ++++++
 1 file changed, 6 insertions(+)

(limited to 'tests')
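
The utf8.c hunk is not part of this view, so the behavior described in
the commit message can be illustrated with a minimal standalone sketch.
This is not TXR's utf8_from_buffer: the function name
decode_utf8_lossless, its signature, and the simplified validity checks
are assumptions made for illustration only. It shows the two ideas the
message relies on: undecodable bytes become U+DCxx codes, and a sequence
cut off by the end of the buffer is emitted byte by byte from the
backtrack point instead of being dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not TXR's API. Decodes len bytes of UTF-8 from
   src into out, which must have room for len code points, and returns
   the number of code points written. A byte that cannot be decoded is
   stored as U+DC00 + byte so the original bytes stay recoverable. */
size_t decode_utf8_lossless(const unsigned char *src, size_t len,
                            uint32_t *out)
{
    size_t i = 0, n = 0;

    assert(src != NULL && out != NULL);

    while (i < len) {
        size_t start = i;              /* backtrack point */
        unsigned char b = src[i++];
        uint32_t cp, min;
        int need;

        if (b < 0x80) {                /* ASCII */
            out[n++] = b;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) {
            need = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            need = 2; cp = b & 0x0F; min = 0x800;
        } else if (b >= 0xF0 && b <= 0xF4) {
            need = 3; cp = b & 0x07; min = 0x10000;
        } else {                       /* stray continuation or bad lead */
            out[n++] = 0xDC00 | b;
            continue;
        }

        /* Consume as many continuation bytes as are available. */
        while (need > 0 && i < len && (src[i] & 0xC0) == 0x80) {
            cp = (cp << 6) | (src[i++] & 0x3F);
            need--;
        }

        if (need == 0 && cp >= min && cp <= 0x10FFFF &&
            !(cp >= 0xD800 && cp <= 0xDFFF)) {
            out[n++] = cp;             /* complete, valid character */
        } else if (need > 0 && i == len) {
            /* Buffer ended mid-sequence: keep the fragment by emitting
               every byte from the backtrack point as U+DCxx. */
            for (i = start; i < len; i++)
                out[n++] = 0xDC00 | src[i];
        } else {
            /* Invalid sequence: emit the lead byte as U+DCxx and resume
               decoding at the byte after it. */
            out[n++] = 0xDC00 | src[start];
            i = start + 1;
        }
    }

    return n;
}
```

Under these assumptions, the bytes E6 BC yield the two codes U+DCE6 and
U+DCBC, matching the str-buf test cases in the diff, while the complete
sequence E6 BC A2 yields a single code point.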

diff --git a/tests/012/buf.tl b/tests/012/buf.tl
index 1c8040d6..8f494264 100644
--- a/tests/012/buf.tl
+++ b/tests/012/buf.tl
@@ -2,3 +2,9 @@
 
 (vtest (uint-buf (make-buf 8 255 16)) (pred (expt 2 64)))
 (test (int-buf (make-buf 8 255 16)) -1)
+
+(mtest
+  (str-buf #b'E6BC') "\xDCE6\xDCBC"
+  (buf-str "\xDCE6\xDCBC") #b'E6BC'
+  (str-buf #b'E6') "\xDCE6"
+  (buf-str "\xDCE6") #b'E6')
-- 
cgit v1.2.3