From 928fb0df45be6cb12a63e9d1d43504be1a595f7d Mon Sep 17 00:00:00 2001
From: Kaz Kylheku <kaz@kylheku.com>
Date: Thu, 2 Feb 2012 17:28:15 -0800
Subject: * txr.1: UTF-8 handling clarified.

---
 ChangeLog |  4 ++++
 txr.1     | 28 ++++++++++++++++------------
 2 files changed, 20 insertions(+), 12 deletions(-)
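
Note: the data-decoding rule described in the txr.1 hunk below can be
illustrated with a minimal sketch. This is not the code in utf8.c. The
function name decode_utf8_lenient and its table of forms are hypothetical,
and it assumes that a byte which cannot start any sequence (for example a
stray continuation byte) is mapped the same way as the first byte of an
ill-formed sequence. Surrogates and values above U+10FFFF are not addressed
by the text, so the sketch does nothing special for them.

```python
def decode_utf8_lenient(data: bytes) -> list:
    """Return a list of code points; bad bytes become 0xDC00 + byte."""
    # (lead mask, lead pattern, continuation bytes, smallest non-overlong value)
    forms = [
        (0x80, 0x00, 0, 0x0),
        (0xE0, 0xC0, 1, 0x80),
        (0xF0, 0xE0, 2, 0x800),
        (0xF8, 0xF0, 3, 0x10000),
    ]
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        decoded = None
        for mask, pattern, ncont, minimum in forms:
            if (lead & mask) != pattern:
                continue
            tail = data[i + 1 : i + 1 + ncont]
            # Input ends mid-character, or a byte is not a continuation byte.
            if len(tail) < ncont or any((b & 0xC0) != 0x80 for b in tail):
                break
            cp = lead & ~mask & 0xFF
            for b in tail:
                cp = (cp << 6) | (b & 0x3F)
            if cp >= minimum:  # anything smaller is an overlong form
                decoded = (cp, 1 + ncont)
            break
        if decoded is None:
            # Go back to the starting byte, map just that byte into
            # U+DC00..U+DCFF, and resume at the very next byte.
            out.append(0xDC00 + lead)
            i += 1
        else:
            out.append(decoded[0])
            i += decoded[1]
    return out
```

Under these assumptions, the overlong two-byte encoding of NUL, C0 80,
decodes to the two code points U+DCC0 and U+DC80, because decoding resumes
at the byte immediately after the rejected starting byte.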

diff --git a/ChangeLog b/ChangeLog
index a0398638..608d20e8 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,7 @@
+2012-02-02  Kaz Kylheku  <kaz@kylheku.com>
+
+	* txr.1: UTF-8 handling clarified.
+
 2012-02-02  Kaz Kylheku  <kaz@kylheku.com>
 
 	* utf8.c (utf8_from_uc, utf8_decode): Impose a minium value on the
diff --git a/txr.1 b/txr.1
index 45a50fda..e5f05cd9 100644
--- a/txr.1
+++ b/txr.1
@@ -477,19 +477,23 @@ If
 .B TXR
 encounters an invalid bytes in the UTF-8 input, what happens depends on the
 context in which this occurs. In a query, comments are read without regard
-for encoding, so invalid encoding bytes are not detected. A comment is
-simply a sequence of bytes terminated by a newline.  Invalid
-encoding bytes in significant query text are diagnosed as syntax errors.
-When the scanner is faced with input that isn't a valid multibyte character, it
-issues an error message, skips one byte, and resumes scanning.
-
-Invalid bytes in data are treated as follows: when an invalid byte is
+for encoding, so invalid encoding bytes in comments are not detected. A comment
+is simply a sequence of bytes terminated by a newline.  In lexical elements
+which represent text, such as string literals, invalid or unexpected encoding
+bytes are treated as syntax errors. The scanner issues an error message,
+then discards a byte and resumes scanning.  Certain sequences pass through the
+scanner without triggering an error, namely some UTF-8 overlong sequences.
+These are caught when the lexeme is subject to UTF-8 decoding, and treated
+in the same manner as other UTF-8 data, described in the following paragraph.
+
+Invalid bytes in data are treated as follows. When an invalid byte is
 encountered in the middle of a multibyte character, or if the input
-ends in the middle of a multibyte character, the UTF-8 decoder returns
-to the starting byte of the ill-formed multibyte character, and decodes just
-that byte, by mapping it to the Unicode character range U+DC00 through U+DCFF.
-The decoding resumes at the following character, expecting that byte to be the
-start of another multibyte character.
+ends in the middle of a multibyte character, or if a character is extracted
+which is encoded as an overlong form, the UTF-8 decoder returns to the starting
+byte of the ill-formed multibyte character, and extracts just that byte,
+mapping it to the Unicode character range U+DC00 through U+DCFF.  The decoding
+resumes afresh at the following byte, expecting that byte to be the start
+of a UTF-8 code.
 
 .SS Regular Expression Directives
 
-- 
cgit v1.2.3