From 3b64319b10196425401d4d71f7ee1273e3bffe32 Mon Sep 17 00:00:00 2001
From: Kaz Kylheku
Date: Sat, 15 Feb 2014 00:19:15 -0800
Subject: A trivial change in the UTF-8 decoder allows TXR to handle null bytes in text.

* utf8.h (UTF8_ADMIT_NUL): New preprocessor symbol.
(utf8_decoder): New member, flags.

* utf8.c (utf8_decoder_init): Initialize flags to 0.
(utf8_decode): If a null byte is encountered in the input,
then convert it to 0xDC00, rather than keeping it as zero,
unless flags contains UTF8_ADMIT_NUL.

* txr.1: Document handling of null bytes.
---
 txr.1 | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

(limited to 'txr.1')

diff --git a/txr.1 b/txr.1
index dc692dd2..d69b8645 100644
--- a/txr.1
+++ b/txr.1
@@ -478,7 +478,7 @@ does not split the line into two; it's embedded into the line
 and thus cannot match anything. However, @\en may be useful in the @(cat)
 directive and in @(output).
 
-.SS International Characters
+.SS Character Handling and International Characters
 
 .B TXR
 represents text internally using wide characters, which are used to represent
@@ -519,6 +519,13 @@ mapping it to the Unicode character range U+DC00 through U+DCFF.
 The decoding resumes afresh at the following byte, expecting that byte
 to be the start of a UTF-8 code.
 
+Furthermore, because TXR internally uses a null-terminated character
+representation of strings which easily interoperates with C language
+interfaces, when a null character is read from a stream, TXR converts it to
+the code U+DC00. On output, this code converts back to a null byte,
+as explained in the previous paragraph. By means of this representational
+trick, TXR can handle textual data containing null bytes.
+
 .SS Regular Expression Directives
 
 In place of a piece of text (see section Text above), a regular expression
-- 
cgit v1.2.3
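
Note: the utf8.c and utf8.h hunks are not part of this path-limited view, so
the following is only a minimal sketch of the null-byte mapping that the
commit message describes, not the actual decoder. The names
sketch_decode_byte, sketch_encode_char and SKETCH_ADMIT_NUL are hypothetical,
the flag value is an assumption, and only single-byte (ASCII) input is
modeled.

```c
#include <wchar.h>

/* Hypothetical stand-in for UTF8_ADMIT_NUL; the real value is not shown
   in this patch view. */
#define SKETCH_ADMIT_NUL 1

/* Input side: map one ASCII-range byte to a wide character.
   A null byte becomes 0xDC00 unless the caller admits NUL via flags. */
wchar_t sketch_decode_byte(unsigned char byte, int flags)
{
  if (byte == 0 && (flags & SKETCH_ADMIT_NUL) == 0)
    return (wchar_t) 0xDC00;
  return (wchar_t) byte;
}

/* Output side: codes in the range U+DC00 through U+DCFF turn back into
   the raw byte held in their low eight bits, so U+DC00 yields a null byte.
   Other characters are assumed to be ASCII in this sketch. */
unsigned char sketch_encode_char(wchar_t ch)
{
  if (ch >= 0xDC00 && ch <= 0xDCFF)
    return (unsigned char) (ch & 0xFF);
  return (unsigned char) ch;
}
```

Under these assumptions, a null byte read with flags of 0 is stored as 0xDC00
and so never terminates a null-terminated wide string early, and writing
0xDC00 back out reproduces the original null byte.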