From 1c78271501424b45dd4c676806690595ac7e5410 Mon Sep 17 00:00:00 2001
From: Kaz Kylheku
Date: Wed, 18 Nov 2009 09:28:03 -0800
Subject: txr.1: Clarified handling of UTF-8, now that it's precise and portable.

---
 ChangeLog |  4 ++++
 txr.1     | 23 +++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index 3098eb75..16d8ddd6 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,7 @@
+2009-11-18 Kaz Kylheku
+
+	txr.1: Clarified handling of UTF-8, now that it's precise and portable.
+
 2009-11-18 Kaz Kylheku
 
 	Version 023
diff --git a/txr.1 b/txr.1
index 6cd8a61f..c567097f 100644
--- a/txr.1
+++ b/txr.1
@@ -410,13 +410,28 @@ On some platforms, wide characters may be restricted to 16 bits, so that
 can only work with characters in the BMP (Basic Multilingual Plane)
 subset of Unicode.
 
+.B txr
+does not use the localization features of the system library;
+its handling of extended characters is not affected by environment variables
+like LANG and L_CTYPE. The program reads and writes only the UTF-8 encoding.
+
 If
 .B txr
 encounters an invalid bytes in the UTF-8 input, what happens depends on the
-context in which this occurs. Invalid bytes in a query are reported as errors.
-Invalid bytes in data are currently treated in an unspecified way. In
-the future, invalid bytes in data will be mapped to the Unicode codes
-U+DC00 through U+DCFF.
+context in which this occurs. In a query, comments are read without regard
+for encoding, so invalid encoding bytes are not detected. A comment is
+simply a sequence of bytes terminated by a newline. Invalid
+encoding bytes in signficant query text are diagnosed as syntax errors.
+When the scanner is faced with input that isn't a valid multibyte character, it
+issues an error message, skips one byte, and resumes scanning.
+
+Invalid bytes in data are treated as follows: when an invalid byte is
+encountered in the middle of a multibyte character, or if the input
+ends in the middle of a multibyte character, the UTF-8 decoder returns
+to the starting byte of the ill-formed multibyte character, and decodes just
+that byte, by mapping it to the Unicode character range U+DC00 through U+DCFF.
+The decoding resumes at the following character, expecting that byte to be the
+start of another multibyte character.
 
 .SS Variables
 
-- 
cgit v1.2.3
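
The data-decoding rule added to txr.1 in this patch can be illustrated with a short Python sketch. This is not txr's actual decoder; the function name decode_utf8_lenient is hypothetical. The man page text only covers a bad byte in the middle of a multibyte character and input that ends mid-character. Treating stray continuation bytes, impossible lead bytes, overlong forms and encoded surrogates the same way is an assumption made here to keep the sketch complete.

```python
def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte b to U+DC00 + b.

    Hypothetical helper illustrating the rule described in the patch.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # Plain ASCII byte: decodes to itself.
            out.append(chr(lead))
            i += 1
            continue
        # Work out how many continuation bytes the lead byte calls for,
        # its payload bits, and the smallest code point that legitimately
        # needs a sequence of this length.
        if 0xC2 <= lead <= 0xDF:
            need, code, lowest = 1, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            need, code, lowest = 2, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            need, code, lowest = 3, lead & 0x07, 0x10000
        else:
            # Assumption: a byte that cannot start a character is mapped
            # the same way as the start of an ill-formed character.
            out.append(chr(0xDC00 + lead))
            i += 1
            continue
        tail = data[i + 1 : i + 1 + need]
        # Fails if the input ends early or a non-continuation byte appears.
        ok = len(tail) == need and all(0x80 <= b <= 0xBF for b in tail)
        if ok:
            for b in tail:
                code = (code << 6) | (b & 0x3F)
            # Assumption: overlong forms, encoded surrogates and values
            # above U+10FFFF are also treated as ill-formed.
            ok = lowest <= code <= 0x10FFFF and not (0xD800 <= code <= 0xDFFF)
        if ok:
            out.append(chr(code))
            i += 1 + need
        else:
            # Return to the starting byte, decode just that byte, and
            # resume at the byte that follows it.
            out.append(chr(0xDC00 + lead))
            i += 1
    return "".join(out)
```

For example, ascii(decode_utf8_lenient(b"caf\xe9")) gives 'caf\udce9': the byte 0xE9 announces a three-byte character, the input ends before it is complete, and so that single byte is mapped to U+DCE9. Python's built-in "surrogateescape" error handler uses a comparable mapping for bytes 0x80 through 0xFF.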