author: Kaz Kylheku <kaz@kylheku.com> 2022-05-20 22:11:06 -0700
committer: Kaz Kylheku <kaz@kylheku.com> 2022-05-20 22:11:06 -0700
commit: 378318ef010dfb15045dfadf242231793d1434de
tree: bc8d0d4e8d4f2674b4065f18ddc48af1f31e23d6
parent: 0cff857c70c0f770259066d29a720f4404770558
utf8: bugfix: trailing char fragment ignored.
After "years of trouble-free operation" a bug in the UTF-8 decoder was found, which violates its property that any sequence of bytes will decode to some kind of string, which will encode to the original bytes.

When the UTF-8 data prematurely ends in the middle of a valid character, the decoder just drops that data as if it didn't exist. So for instance the two-byte sequence E6 BC should decode to "\xDCE6\xDCBC", since it is a fragment of a three-byte UTF-8 sequence. It actually decodes to the empty string.

* utf8.c (utf8_from_buffer): When the buffer is exhausted, if we are not in the utf8_init state, it means we were in the middle of a UTF-8 sequence. Walk the bytes from the backtrack point to the end of the buffer and store them into the string as U+DCxx codes.

* tests/012/buf.tl: Tests added for this via buf-str, str-buf.
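The behavior described above can be illustrated with a minimal C sketch. This is not the code in utf8.c: the function name sketch_decode, its signature, and its index-based backtracking are assumptions made for illustration only. It shows one way a decoder can map every undecodable byte xx to U+DCxx, including the bytes of a sequence cut short by the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not TXR's utf8_from_buffer.
 * Decodes n bytes from src into code points stored in out.
 * out must have room for n entries, because every input byte
 * yields at most one code point. Returns the code point count. */
size_t sketch_decode(const unsigned char *src, size_t n, uint32_t *out)
{
    size_t i = 0, count = 0;

    assert(src != NULL && out != NULL);

    while (i < n) {
        unsigned char b = src[i];
        size_t need, k;     /* continuation bytes expected / scan offset */
        uint32_t cp, min;   /* accumulated value / smallest non-overlong value */

        if (b < 0x80) { out[count++] = b; i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else { out[count++] = 0xDC00 | b; i++; continue; } /* stray byte */

        /* Consume continuation bytes; src[i] is the backtrack point. */
        for (k = 1; k <= need && i + k < n && (src[i + k] & 0xC0) == 0x80; k++)
            cp = (cp << 6) | (src[i + k] & 0x3F);

        if (k <= need) {
            if (i + k >= n) {
                /* Buffer exhausted mid-sequence: walk from the backtrack
                 * point to the end and emit each byte as U+DCxx. */
                while (i < n)
                    out[count++] = 0xDC00 | src[i++];
            } else {
                /* Bad continuation byte: escape the lead byte and rescan. */
                out[count++] = 0xDC00 | src[i++];
            }
            continue;
        }

        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            out[count++] = 0xDC00 | src[i++];
            continue;
        }

        out[count++] = cp;
        i += need + 1;
    }

    return count;
}
```

Under these assumptions, passing the two bytes E6 BC yields the two code points U+DCE6 and U+DCBC rather than an empty result, while the complete sequence E6 BC A2 yields the single code point U+6F22.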