txr.1: Clarified handling of UTF-8, now that it's precise and portable.

author: Kaz Kylheku <kaz@kylheku.com> 2009-11-18 09:28:03 -0800
committer: Kaz Kylheku <kaz@kylheku.com> 2009-11-18 09:28:03 -0800
commit: 1c78271501424b45dd4c676806690595ac7e5410 (patch)
tree: e60220bb53b9832811cb24618ce8b708778045db /txr.1
parent: a8aa4cfd18220daf169d80705eef13df9cb31747 (diff)
download: txr-1c78271501424b45dd4c676806690595ac7e5410.tar.gz
txr-1c78271501424b45dd4c676806690595ac7e5410.tar.bz2
txr-1c78271501424b45dd4c676806690595ac7e5410.zip
1 files changed, 19 insertions, 4 deletions
diff --git a/txr.1 b/txr.1
index 6cd8a61f..c567097f 100644
--- a/txr.1
+++ b/txr.1
@@ -410,13 +410,28 @@ On some platforms, wide characters may be restricted to 16 bits, so that
 can only work with characters in the BMP (Basic Multilingual Plane)
 subset of Unicode.
 
+.B txr
+does not use the localization features of the system library;
+its handling of extended characters is not affected by environment variables
+like LANG and LC_CTYPE. The program reads and writes only the UTF-8 encoding.
+
 If
 .B txr
 encounters an invalid byte in the UTF-8 input, what happens depends on the
-context in which this occurs. Invalid bytes in a query are reported as errors.
-Invalid bytes in data are currently treated in an unspecified way. In
-the future, invalid bytes in data will be mapped to the Unicode codes
-U+DC00 through U+DCFF.
+context in which this occurs. In a query, comments are read without regard
+for encoding, so invalid encoding bytes are not detected. A comment is
+simply a sequence of bytes terminated by a newline.  Invalid
+encoding bytes in significant query text are diagnosed as syntax errors.
+When the scanner is faced with input that isn't a valid multibyte character, it
+issues an error message, skips one byte, and resumes scanning.
+
+Invalid bytes in data are treated as follows: when an invalid byte is
+encountered in the middle of a multibyte character, or if the input
+ends in the middle of a multibyte character, the UTF-8 decoder returns
+to the starting byte of the ill-formed multibyte character, and decodes just
+that byte, by mapping it to the Unicode character range U+DC00 through U+DCFF.
+The decoding resumes at the following character, expecting that byte to be the
+start of another multibyte character.
 
 .SS Variables
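
The recovery rule for invalid bytes in data, as described in the added text, can be illustrated with a short Python sketch. This is not txr's C implementation. The function name is hypothetical, and the sketch assumes that a byte b is mapped to U+DC00 + b, and that a byte which cannot start a multibyte character at all is mapped the same way. Overlong forms and encoded surrogates are not rejected here.

```python
def decode_utf8_lenient(data):
    """Decode bytes to a list of code points (illustrative sketch only).

    An undecodable byte b becomes 0xDC00 + b (assumed mapping into
    U+DC00..U+DCFF); decoding then resumes at the very next byte.
    """
    out = []
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # Plain ASCII: one byte, one code point.
            out.append(lead)
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:
            need, cp = 1, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            need, cp = 2, lead & 0x0F
        elif 0xF0 <= lead <= 0xF7:
            need, cp = 3, lead & 0x07
        else:
            # Assumption: a stray continuation byte or an impossible
            # lead byte is mapped like any other invalid byte.
            out.append(0xDC00 + lead)
            i += 1
            continue
        tail = data[i + 1:i + 1 + need]
        if len(tail) == need and all(0x80 <= b <= 0xBF for b in tail):
            # Well-formed sequence: fold in six bits per continuation byte.
            for b in tail:
                cp = (cp << 6) | (b & 0x3F)
            out.append(cp)
            i += 1 + need
        else:
            # Ill-formed or truncated: return to the starting byte,
            # decode just that byte, and resume at the following one.
            out.append(0xDC00 + lead)
            i += 1
    return out
```

Under these assumptions, the truncated input `b"\xe2\x82"` decodes to the two code points U+DCE2 and U+DC82, while the complete sequence `b"\xe2\x82\xac"` decodes to U+20AC.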