From 3b64319b10196425401d4d71f7ee1273e3bffe32 Mon Sep 17 00:00:00 2001
From: Kaz Kylheku <kaz@kylheku.com>
Date: Sat, 15 Feb 2014 00:19:15 -0800
Subject: A trivial change in the UTF-8 decoder allows TXR to handle null bytes
 in text.

* utf8.h (UTF8_ADMIT_NUL): New preprocessor symbol.
(utf8_decoder): New member, flags.

* utf8.c (utf8_decoder_init): Initialize flags to 0.
(utf8_decode): If a null byte is encountered in the input,
then convert it to 0xDC00, rather than keeping it as zero,
unless flags contains UTF8_ADMIT_NUL.

* txr.1: Document handling of null bytes.
---
 txr.1 | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

(limited to 'txr.1')
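
The utf8.c change described in the commit message is not shown in this
path-limited diff. As a rough illustration only, the mapping it describes
might be sketched as below. UTF8_ADMIT_NUL is the symbol named in the commit
message, but its value here is an assumption. sketch_decode_ascii and
sketch_encode_unit are hypothetical helpers, not TXR functions, and the
sketch handles only ASCII input bytes rather than full UTF-8.

```c
#include <stddef.h>
#include <wchar.h>

/* Symbol named in the commit message; the value 1 is an assumption. */
#define UTF8_ADMIT_NUL 1

/*
 * Hypothetical helper, not TXR's utf8_decode. Converts n ASCII bytes
 * (0x00..0x7F) to wide characters. A null byte becomes 0xDC00 unless
 * flags contains UTF8_ADMIT_NUL. The output is always null-terminated,
 * so out must have room for n + 1 elements.
 */
static size_t sketch_decode_ascii(const unsigned char *in, size_t n,
                                  int flags, wchar_t *out)
{
  size_t i;

  for (i = 0; i < n; i++) {
    unsigned char b = in[i];

    if (b == 0 && (flags & UTF8_ADMIT_NUL) == 0)
      out[i] = 0xDC00;        /* stand-in code for the null byte */
    else
      out[i] = (wchar_t) b;   /* ASCII maps to itself */
  }

  out[n] = 0;                 /* the only real terminator */
  return n;
}

/*
 * Hypothetical output-side inverse. Codes in the range 0xDC00..0xDCFF
 * turn back into the raw byte in their low 8 bits, so 0xDC00 becomes a
 * null byte. Returns -1 for any other code, which a real encoder would
 * emit as ordinary UTF-8.
 */
static int sketch_encode_unit(wchar_t ch)
{
  if (ch >= 0xDC00 && ch <= 0xDCFF)
    return (int) (ch & 0xFF);
  return -1;
}
```

With flags set to 0, the three input bytes 'a', 0, 'b' yield a wide string
whose length under wcslen is 3, because the embedded null is held as 0xDC00.
With UTF8_ADMIT_NUL, the same input yields a string that wcslen reports as
length 1, since the zero is kept and reads as a terminator.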

diff --git a/txr.1 b/txr.1
index dc692dd2..d69b8645 100644
--- a/txr.1
+++ b/txr.1
@@ -478,7 +478,7 @@ does not split the line into two; it's embedded into the line and
 thus cannot match anything. However, @\en may be useful in the @(cat)
 directive and in @(output).
 
-.SS International Characters
+.SS Character Handling and International Characters
 
 .B TXR
 represents text internally using wide characters, which are used to represent
@@ -519,6 +519,13 @@ mapping it to the Unicode character range U+DC00 through U+DCFF.  The decoding
 resumes afresh at the following byte, expecting that byte to be the start
 of a UTF-8 code.
 
+Furthermore, because TXR internally uses a null-terminated character
+representation of strings which easily interoperates with C language
+interfaces, when a null character is read from a stream, TXR converts it to
+the code U+DC00. On output, this code converts back to a null byte,
+as explained in the previous paragraph. By means of this representational
+trick, TXR can handle textual data containing null bytes.
+
 .SS Regular Expression Directives
 
 In place of a piece of text (see section Text above), a regular expression
-- 
cgit v1.2.3