-rw-r--r-- ChangeLog | 4
-rw-r--r-- txr.1 | 28
2 files changed, 20 insertions, 12 deletions
diff --git a/ChangeLog b/ChangeLog
index a0398638..608d20e8 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
2012-02-02 Kaz Kylheku <kaz@kylheku.com>
+ * txr.1: UTF-8 handling clarified.
+
+2012-02-02 Kaz Kylheku <kaz@kylheku.com>
+
* utf8.c (utf8_from_uc, utf8_decode): Impose a minium value on the
decoded character based on which UTF-8 case it is from. This rejects
overlong forms.
diff --git a/txr.1 b/txr.1
index 45a50fda..e5f05cd9 100644
--- a/txr.1
+++ b/txr.1
@@ -477,19 +477,23 @@ If
.B TXR
encounters an invalid bytes in the UTF-8 input, what happens depends on the
context in which this occurs. In a query, comments are read without regard
-for encoding, so invalid encoding bytes are not detected. A comment is
-simply a sequence of bytes terminated by a newline. Invalid
-encoding bytes in significant query text are diagnosed as syntax errors.
-When the scanner is faced with input that isn't a valid multibyte character, it
-issues an error message, skips one byte, and resumes scanning.
-
-Invalid bytes in data are treated as follows: when an invalid byte is
+for encoding, so invalid encoding bytes in comments are not detected. A comment
+is simply a sequence of bytes terminated by a newline. In lexical elements
+which represent text, such as string literals, invalid or unexpected encoding
+bytes are treated as syntax errors. The scanner issues an error message,
+then discards a byte and resumes scanning. Certain sequences pass through the
+scanner without triggering an error, namely some UTF-8 overlong sequences.
+These are caught when when the lexeme is subject to UTF-8 decoding, and treated
+in the same manner as other UTF-8 data, described in the following paragraph.
+
+Invalid bytes in data are treated as follows. When an invalid byte is
encountered in the middle of a multibyte character, or if the input
-ends in the middle of a multibyte character, the UTF-8 decoder returns
-to the starting byte of the ill-formed multibyte character, and decodes just
-that byte, by mapping it to the Unicode character range U+DC00 through U+DCFF.
-The decoding resumes at the following character, expecting that byte to be the
-start of another multibyte character.
+ends in the middle of a multibyte character, or if a character is extracted
+which is encoded as an overlong form, the UTF-8 decoder returns to the starting
+byte of the ill-formed multibyte character, and extracts just that byte,
+mapping it to the Unicode character range U+DC00 through U+DCFF. The decoding
+resumes afresh at the following byte, expecting that byte to be the start
+of a UTF-8 code.
.SS Regular Expression Directives
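
The decoder behavior documented in the txr.1 hunk above can be illustrated with a minimal sketch. This is not the code in utf8.c: the function name decode_one, its signature, and the exact mapping (U+DC00 plus the byte value) are assumptions made for illustration. The sketch also treats an invalid lead byte the same way as an ill-formed sequence, which the manual text does not state explicitly, and it does not check for surrogates or values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not TXR's utf8.c): decode one character from
 * buf[0..len), store it in *out, and return the number of bytes consumed.
 * On an ill-formed sequence, only the starting byte is consumed and it is
 * mapped into U+DC00..U+DCFF, so the caller resumes at the next byte. */
size_t decode_one(const unsigned char *buf, size_t len, uint32_t *out)
{
    /* Smallest code point that legitimately needs 2, 3 or 4 bytes;
     * anything below this is an overlong form. Indexed by sequence length. */
    static const uint32_t min_value[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned char b0;
    size_t need, i;
    uint32_t ch;

    assert(buf != NULL && out != NULL && len > 0);
    b0 = buf[0];

    if (b0 < 0x80) {                      /* plain ASCII */
        *out = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {     /* 110xxxxx: 2-byte case */
        need = 2; ch = b0 & 0x1F;
    } else if ((b0 & 0xF0) == 0xE0) {     /* 1110xxxx: 3-byte case */
        need = 3; ch = b0 & 0x0F;
    } else if ((b0 & 0xF8) == 0xF0) {     /* 11110xxx: 4-byte case */
        need = 4; ch = b0 & 0x07;
    } else {
        goto bad;                         /* stray continuation or invalid lead byte */
    }

    if (need > len)
        goto bad;                         /* input ends mid-character */

    for (i = 1; i < need; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            goto bad;                     /* invalid byte mid-character */
        ch = (ch << 6) | (buf[i] & 0x3F);
    }

    if (ch < min_value[need])
        goto bad;                         /* overlong form */

    *out = ch;
    return need;

bad:
    *out = 0xDC00 + b0;                   /* assumed mapping into U+DC00..U+DCFF */
    return 1;
}
```

Under these assumptions, the overlong sequence C0 80 yields U+DCC0 with one byte consumed, and the next call starts at the byte 80, which is itself mapped to U+DC80.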