author | Kaz Kylheku <kaz@kylheku.com> | 2014-02-15 00:19:15 -0800 |
---|---|---|
committer | Kaz Kylheku <kaz@kylheku.com> | 2014-02-15 00:19:15 -0800 |
commit | 3b64319b10196425401d4d71f7ee1273e3bffe32 (patch) | |
tree | 5197904de12a1b7a3d601fa468f4ab3514e0de2e /txr.1 | |
parent | 48fbe97484faad462a1fc52049d682fdaaa665a0 (diff) | |
A trivial change in the UTF-8 decoder allows TXR to handle null bytes
in text.
* utf8.h (UTF8_ADMIT_NUL): New preprocessor symbol.
(utf8_decoder): New member, flags.
* utf8.c (utf8_decoder_init): Initialize flags to 0.
(utf8_decode): If a null byte is encountered in the input,
then convert it to 0xDC00, rather than keeping it as zero,
unless flags contains UTF8_ADMIT_NUL.
* txr.1: Document handling of null bytes.
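The decoder change described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in utf8.c: the names `sketch_decoder`, `sketch_decode_ascii` and `SKETCH_ADMIT_NUL` are hypothetical, the flag value 1 is an assumption, and only the single-byte (ASCII) case is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical flag value; the commit only names the symbol UTF8_ADMIT_NUL. */
#define SKETCH_ADMIT_NUL 1

/* Illustrative stand-in for the decoder state. Only the existence of a
   flags member is taken from the commit message. */
struct sketch_decoder {
    int flags;
};

/* Map one input byte in the single-byte (ASCII) range to a wide character.
   A null byte becomes 0xDC00 unless the decoder was told to admit nulls. */
static wchar_t sketch_decode_ascii(const struct sketch_decoder *ud,
                                   unsigned char ch)
{
    assert(ud != NULL);
    assert(ch < 0x80);

    if (ch == 0 && (ud->flags & SKETCH_ADMIT_NUL) == 0)
        return (wchar_t) 0xDC00;

    return (wchar_t) ch;
}
```

With flags left at 0, as utf8_decoder_init is described as doing, a null byte never appears as a zero wide character, so decoded text can safely live in null-terminated strings.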
Diffstat (limited to 'txr.1')
-rw-r--r-- | txr.1 | 9 |
1 files changed, 8 insertions, 1 deletions
@@ -478,7 +478,7 @@ does not split the line into two; it's embedded into the line and thus
 cannot match anything. However, @\en may be useful in the @(cat) directive and
 in @(output).
-.SS International Characters
+.SS Character Handling and International Characters
 .B TXR
 represents text internally using wide characters, which are used to represent
@@ -519,6 +519,13 @@ mapping it to the Unicode character range U+DC00 through U+DCFF. The decoding
 resumes afresh at the following byte, expecting that byte to be the start
 of a UTF-8 code.
+Furthermore, because TXR internally uses a null-terminated character
+representation of strings which easily interoperates with C language
+interfaces, when a null character is read from a stream, TXR converts it to
+the code U+DC00. On output, this code converts back to a null byte,
+as explained in the previous paragraph. By means of this representational
+trick, TXR can handle textual data containing null bytes.
+
 .SS Regular Expression Directives
 In place of a piece of text (see section Text above), a regular expression
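The output side of the round trip documented in the second hunk might look something like the sketch below. The helper name `sketch_raw_byte` is hypothetical, and taking the low eight bits of the code as the byte value is an assumption; the manual text only confirms that U+DC00 converts back to a null byte and that the range U+DC00 through U+DCFF is used for such mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: if wc lies in U+DC00..U+DCFF, store the raw byte
   it stands for in *out and return 1. Otherwise leave *out untouched and
   return 0, meaning the caller should encode wc as ordinary UTF-8. */
static int sketch_raw_byte(wchar_t wc, unsigned char *out)
{
    assert(out != NULL);

    if (wc >= 0xDC00 && wc <= 0xDCFF) {
        /* Assumed mapping: the low eight bits are the original byte,
           so U+DC00 yields the null byte. */
        *out = (unsigned char) (wc & 0xFF);
        return 1;
    }

    return 0;
}
```

Under these assumptions, a null byte read from a stream becomes U+DC00 internally and is written out again as a single zero byte, so the data survives unchanged.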