path: root/HACKING
author Kaz Kylheku <kaz@kylheku.com> 2011-10-10 12:09:27 -0700
committer Kaz Kylheku <kaz@kylheku.com> 2011-10-10 12:09:27 -0700
commit f2140e4f219bdc37cb3ce1ff143d81965f0e29ef
tree aa42e31ca66b79addf0e797e8e723cacf30f5763 /HACKING
parent 76fd394d4f7667f054138bd4f994a96d60933436
* HACKING: Documented portability hacks for narrow wchar_t.
Diffstat (limited to 'HACKING')
-rw-r--r-- HACKING 60
1 files changed, 59 insertions, 1 deletions
diff --git a/HACKING b/HACKING
index e925566b..69c29927 100644
--- a/HACKING
+++ b/HACKING
@@ -17,6 +17,7 @@ CONTENTS:
2.3 The COBJ type
2.4 Strings
2.4.1 Encapsulated C Strings
+2.4.2 Representation Hacks for 2 Byte wchar_t
3. Garbage Collection
3.1 Root Pointers
3.2 GC-safe Code
@@ -28,6 +29,7 @@ CONTENTS:
4.3. Debugging GC Issues
4.4. Valgrind: Your Friend
+
0. Overview
This is an internals guide to someone who wants to understand, and possibly
@@ -235,7 +237,7 @@ The object system provides three kinds of strings: encapsulated
C strings, regular strings and lazy strings (type tags LIT, STR and LSTR,
respectively).
-2.4.1 Encapsulated C Strings
+2.4.2 Encapsulated C Strings
The design of the dynamic type system recognizes that programs contain literals
and static strings, and that sometimes transient strings are used which
@@ -279,6 +281,62 @@ string. Note that it is okay if garbage objects contain auto_str values, which
refer to strings that no longer exist, because the garbage collector will
recognize these pointers by their type tag and not use them.
+2.4.1 Representation Hacks for 2 Byte wchar_t
+
+On some systems (notably Cygwin), the wide character type wchar_t is only
+two bytes wide, and string literals and arrays are only two-byte aligned.
+This creates a problem: we need a two-bit type tag in the pointer, but a
+two-byte-aligned pointer has only one spare low-order bit.
+
+It turns out that this is not a problem provided that we can ensure that no two
+distinct string objects share the same four byte word, and if we're willing to
+incur a small performance penalty to find the beginning of the string when we
+need it.
+
+On these systems, what we do is add a null character at the beginning of the
+string, and an extra one at the end. So the literal L"abc" is actually
+represented by L"\0" L"abc" L"\0". We then take the pointer to the 'a'
+character as the string, which falls into one of two cases: it is either
+four-byte aligned (case 1), or it is two-byte aligned (case 2). Either way, it
+falls into some four byte cell, either at its base or at its third byte. When
+we add the tag bits 11 (TAG_LIT), we make this pointer point to the fourth byte
+(byte 3) of the four byte cell. To recover the pointer, we remove the tag
+(replace it with bits 00), which leaves us pointing to the base of the
+four-byte cell. The string either starts there (case 1) or two bytes higher
+(case 2). The case is distinguished by looking at the pointed-at wchar_t. If it
+is the null character, then the pointer is incremented to the next character.
+
+The padding at the end of the string ensures that this trick works for the
+null string, where the test for the null character always succeeds.
+
+The lit macro, which existed before this hack, takes care of doing this so most
+code doesn't know the difference.
+
+The new wli macro helps manage this representation when access is needed to C
+string literals which are assigned to wchar_t * variables, and also provides
+type safety by using a different pointer type for strings which have been
+treated with the padding.
+
+ const wchli_t *abc = wli("abc"); /* special type */
+
+ val abc_obj = static_str(abc); /* good: requires const wchli_t * pointer */
+
+ val xyz_obj = static_str(L"xyz"); /* error */
+
+ val def_obj = static_str(lit("abc")); /* error */
+
+The wini and wref macros manage this representation when character arrays are
+used. The wini macro abstracts away the initializer, so the programmer doesn't
+have to be aware of the extra null characters:
+
+ wchar_t abc[] = wini("abc"); /* potentially six wchar_t units! */
+
+ wchar_t *ptr_a = wref(abc); /* pointer to "a" */
+
+ wref(abc)[1] = L'B'; /* overwrite 'b' with 'B' */
+
+On a platform where this hack isn't needed, these w* macros are no-ops.
+
3. Garbage Collection
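
The tagging and recovery steps described in the new section can be sketched in
C as follows. This is only an illustration under stated assumptions, not the
code from the TXR sources: the names wch16, cell_buf, tag_lit, untag_lit and
check_all are hypothetical, uint16_t stands in for a two-byte wchar_t, and the
value 3 for TAG_LIT is taken from the "tag bits 11" wording above. It also
assumes that uint32_t is four-byte aligned and that converting a pointer to
uintptr_t exposes its low-order address bits, which holds on mainstream
platforms but is implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t wch16;            /* hypothetical stand-in for a two-byte wchar_t */

#define TAG_MASK ((uintptr_t) 3)   /* the two low-order tag bits */
#define TAG_LIT  ((uintptr_t) 3)   /* tag bits 11, as described in the text */

/* Buffer whose start is assumed to be four-byte aligned, so the position
   of each element within a four-byte cell is known. */
typedef union {
  uint32_t align;
  wch16 ch[8];
} cell_buf;

/* Case 1: "abc" starts at ch[2], a four-byte boundary; ch[1] is the lead pad. */
static cell_buf case1 = { .ch = { 0xFFFF, 0, 'a', 'b', 'c', 0, 0 } };

/* Case 2: "abc" starts at ch[1], two bytes into a cell; ch[0] is the lead pad. */
static cell_buf case2 = { .ch = { 0, 'a', 'b', 'c', 0, 0 } };

/* Empty string in the case 1 position: lead pad, terminator, trailing pad. */
static cell_buf empty1 = { .ch = { 0xFFFF, 0, 0, 0 } };

/* Setting both tag bits moves the pointer to byte 3 of its four-byte cell. */
static uintptr_t tag_lit(const wch16 *str)
{
  return (uintptr_t) str | TAG_LIT;
}

/* Clearing the tag yields the cell base. If a null character is found there,
   it is either the lead pad (case 2) or an empty string's terminator
   (case 1), and the string is taken to start one unit further on. */
static const wch16 *untag_lit(uintptr_t tagged)
{
  const wch16 *p = (const wch16 *) (tagged & ~TAG_MASK);
  return (*p == 0) ? p + 1 : p;
}

static void check_all(void)
{
  assert((tag_lit(&case1.ch[2]) & TAG_MASK) == TAG_LIT);
  assert(untag_lit(tag_lit(&case1.ch[2])) == &case1.ch[2]);
  assert(untag_lit(tag_lit(&case2.ch[1])) == &case2.ch[1]);
  /* The trailing pad keeps this read in bounds and still yields "". */
  assert(untag_lit(tag_lit(&empty1.ch[2])) == &empty1.ch[3]);
  assert(*untag_lit(tag_lit(&empty1.ch[2])) == 0);
}
```

Calling check_all() from a main function exercises all three layouts. The
empty string check shows why the trailing pad matters: in the case 1 position
the null test succeeds on the terminator itself, so the recovered pointer
lands one unit further on, and the pad guarantees that unit exists and is
also a null character.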