author | Kaz Kylheku <kaz@kylheku.com> | 2009-11-12 16:34:27 -0800 |
---|---|---|
committer | Kaz Kylheku <kaz@kylheku.com> | 2009-11-12 16:34:27 -0800 |
commit | aa4420347f132039a3e37d6996d1e31096fc10de (patch) | |
tree | cfebd82beda9e272899efae5e5f5dcfb0fc767fd /txr.1 | |
parent | 52501f18487dbefaf0282f1bf1cc328b3fe1ab00 (diff) | |
Documenting extended characters in man page.
Cleaned up some more issues related to extended characters.
Diffstat (limited to 'txr.1')
-rw-r--r-- | txr.1 | 22 |
1 files changed, 22 insertions, 0 deletions
@@ -396,6 +396,28 @@ does not split the line into two; it's embedded into the line and thus
 cannot match anything. However, @\en may be useful in the @(cat) directive and
 in @(output).
 
+.SS International Characters
+
+.B txr
+represents text internally using wide characters, which are used to represent
+Unicode code points. The query language, as well as all data sources, are
+assumed to be in the UTF-8 encoding. In the query language, extended
+characters can be used directly in comments, literal text, string literals,
+quasiliterals and regular expressions. Extended characters can also be
+expressed indirectly using hexadecimal or octal escapes.
+On some platforms, wide characters may be restricted to 16 bits, so that
+.B txr
+can only work with characters in the BMP (Basic Multilingual Plane)
+subset of Unicode.
+
+If
+.B txr
+encounters an invalid bytes in the UTF-8 input, what happens depends on the
+context in which this occurs. Invalid bytes in a query are reported as errors.
+Invalid bytes in data are currently treated in an unspecified way. In
+the future, invalid bytes in data will be mapped to the Unicode codes
+U+DC00 through U+DCFF.
+
 .SS Variables
 
 Much of the query syntax consists of arbitrary text, which matches file data
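
The new man page text distinguishes strict handling (invalid bytes in a query are errors) from a planned lenient handling (invalid bytes in data mapped into U+DC00 through U+DCFF). The Python sketch below illustrates that distinction using Python's own `surrogateescape` error handler, which is an analogous scheme and not txr's implementation: Python maps each undecodable byte `b` to U+DC00 + `b`, which in practice covers only U+DC80 through U+DCFF, since bytes below 0x80 are always valid UTF-8. The helper names `decode_strict`, `decode_lenient` and `in_bmp` are hypothetical and exist only for this illustration.

```python
# Illustration only: uses Python's "surrogateescape" handler as an analogy.
# None of these helpers are part of txr.

def decode_strict(raw: bytes) -> str:
    """Decode UTF-8 and raise UnicodeDecodeError on any invalid byte,
    comparable to reporting invalid bytes in a query as errors."""
    return raw.decode("utf-8", errors="strict")


def decode_lenient(raw: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte b to code point U+DC00 + b,
    comparable to the mapping the man page plans for data sources."""
    return raw.decode("utf-8", errors="surrogateescape")


def in_bmp(text: str) -> bool:
    """Return True if every code point fits in 16 bits, i.e. lies in the
    Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# Valid UTF-8 for "café", a space, then a stray 0xFF byte.
sample = b"caf\xc3\xa9 \xff"

decoded = decode_lenient(sample)
# The stray byte 0xFF shows up as the last code point, U+DCFF.
print([hex(ord(ch)) for ch in decoded])

try:
    decode_strict(sample)
except UnicodeDecodeError as err:
    print("strict decoding rejected the input:", err.reason)
```

One property of this kind of mapping is that it is reversible: encoding `decoded` again with the same error handler yields the original bytes, so data containing invalid sequences can pass through unchanged. Whether txr's eventual scheme shares that property is not stated in the commit.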