|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
After "years of trouble-free operation" a bug in the UTF-8
decoder was found, which violates its property that any
sequence of bytes will decode to some kind of string, which
will encode to the original bytes.
When the UTF-8 data prematurely ends in the middle of a valid
character, the decoder just drops that data as if it didn't
exist. So for instance the two-byte sequence E6 BC should
decode to "\xDCE6\xDCBC", since it is a fragment of a three-byte
UTF-8 sequence. It actually decodes to the empty string.
* utf8.c (utf8_bfom_buffer): When the buffer is exhausted, if we are
not in the utf8_init state, it means we were in the middle of a
UTF-8 sequence. Walk the bytes from the backtrack point to the end
of the buffer and store them into the string as U+DCxx codes.
* tests/012/buf.tl: Tests added for this via buf-str, str-buf.
|