author | Kaz Kylheku <kaz@kylheku.com> | 2012-02-02 17:28:15 -0800 |
---|---|---|
committer | Kaz Kylheku <kaz@kylheku.com> | 2012-02-02 17:28:15 -0800 |
commit | 928fb0df45be6cb12a63e9d1d43504be1a595f7d (patch) | |
tree | eeb8911283d8a9fd299bcd7c12df44ad1eb4ea3e | |
parent | 97a34f6e5b04d4ce2eb3ee63f42d1375f4939de3 (diff) | |
* txr.1: UTF-8 handling clarified.
-rw-r--r-- | ChangeLog | 4 | ||||
-rw-r--r-- | txr.1 | 28 |
2 files changed, 20 insertions, 12 deletions
@@ -1,5 +1,9 @@
 2012-02-02  Kaz Kylheku  <kaz@kylheku.com>
 
+	* txr.1: UTF-8 handling clarified.
+
+2012-02-02  Kaz Kylheku  <kaz@kylheku.com>
+
 	* utf8.c (utf8_from_uc, utf8_decode): Impose a minium value on the
 	decoded character based on which UTF-8 case it is from. This
 	rejects overlong forms.

@@ -477,19 +477,23 @@ If
 .B TXR
 encounters an invalid bytes in the UTF-8 input, what happens depends on the
 context in which this occurs. In a query, comments are read without regard
-for encoding, so invalid encoding bytes are not detected. A comment is
-simply a sequence of bytes terminated by a newline. Invalid
-encoding bytes in significant query text are diagnosed as syntax errors.
-When the scanner is faced with input that isn't a valid multibyte character, it
-issues an error message, skips one byte, and resumes scanning.
-
-Invalid bytes in data are treated as follows: when an invalid byte is
+for encoding, so invalid encoding bytes in comments are not detected. A comment
+is simply a sequence of bytes terminated by a newline. In lexical elements
+which represent text, such as string literals, invalid or unexpected encoding
+bytes are treated as syntax errors. The scanner issues an error message,
+then discards a byte and resumes scanning. Certain sequences pass through the
+scanner without triggering an error, namely some UTF-8 overlong sequences.
+These are caught when when the lexeme is subject to UTF-8 decoding, and treated
+in the same manner as other UTF-8 data, described in the following paragraph.
+
+Invalid bytes in data are treated as follows. When an invalid byte is
 encountered in the middle of a multibyte character, or if the input
-ends in the middle of a multibyte character, the UTF-8 decoder returns
-to the starting byte of the ill-formed multibyte character, and decodes just
-that byte, by mapping it to the Unicode character range U+DC00 through U+DCFF.
-The decoding resumes at the following character, expecting that byte to be the
-start of another multibyte character.
+ends in the middle of a multibyte character, or if a character is extracted
+which is encoded as an overlong form, the UTF-8 decoder returns to the starting
+byte of the ill-formed multibyte character, and extracts just that byte,
+mapping it to the Unicode character range U+DC00 through U+DCFF. The decoding
+resumes afresh at the following byte, expecting that byte to be the start
+of a UTF-8 code.
 
 .SS Regular Expression Directives
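
The recovery rule described in the revised txr.1 text can be sketched in C roughly as follows. This is an illustrative sketch only, not the code in utf8.c: the function name `decode_utf8_lenient`, its signature, and the caller-supplied output buffer are all assumptions made for the example. It shows the three cases the text names (an invalid byte inside a multibyte character, input ending inside one, and an overlong form) all falling back to escaping the starting byte as U+DC00 plus the byte value and resuming at the next byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not TXR's actual decoder. Decodes len bytes from in
   into out, which must have room for len code points. Returns the number of
   code points written. Simplification: surrogates and values above U+10FFFF
   are not rejected here. */
size_t decode_utf8_lenient(const unsigned char *in, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);

    while (i < len) {
        unsigned char b = in[i];
        uint32_t ch, min;
        size_t need, k;

        if (b < 0x80) {                     /* plain ASCII byte */
            out[n++] = b;
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {    /* 2-byte form, minimum U+0080 */
            need = 1; ch = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {    /* 3-byte form, minimum U+0800 */
            need = 2; ch = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {    /* 4-byte form, minimum U+10000 */
            need = 3; ch = b & 0x07; min = 0x10000;
        } else {                            /* byte cannot start a character */
            out[n++] = 0xDC00 | b;
            i++;
            continue;
        }

        /* Consume continuation bytes; stop early on a bad byte or end of input. */
        for (k = 1; k <= need; k++) {
            if (i + k >= len || (in[i + k] & 0xC0) != 0x80)
                break;
            ch = (ch << 6) | (in[i + k] & 0x3F);
        }

        if (k <= need || ch < min) {
            /* Ill-formed or overlong: escape only the starting byte and
               resume decoding afresh at the byte that follows it. */
            out[n++] = 0xDC00 | b;
            i++;
        } else {
            out[n++] = ch;
            i += need + 1;
        }
    }
    return n;
}
```

In this sketch the overlong pair C0 80 yields two escaped code points, U+DCC0 and U+DC80, because the second byte is re-examined as a possible starting byte and is not valid as one. Keeping each rejected byte recoverable from the low eight bits of its escape code is the design point the U+DC00 through U+DCFF mapping appears to serve.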