2 files changed, 63 insertions, 1 deletions
diff --git a/ChangeLog b/ChangeLog
index d43a419b..3f691d57 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,5 +1,9 @@
 2011-10-09  Kaz Kylheku  <kaz@kylheku.com>
 
+	* HACKING: Documented portability hacks for narrow wchar_t.
+
+2011-10-09  Kaz Kylheku  <kaz@kylheku.com>
+
 	Version 039
 
 	Ported to Cygwin.
diff --git a/HACKING b/HACKING
index e925566b..69c29927 100644
--- a/HACKING
+++ b/HACKING
@@ -17,6 +17,7 @@ CONTENTS:
 2.3  The COBJ type
 2.4  Strings
 2.4.1  Encapsulated C Strings
+2.4.2  Representation Hacks for 2 Byte wchar_t
 3.  Garbage Collection
 3.1  Root Pointers
 3.2  GC-safe Code
@@ -28,6 +29,7 @@ CONTENTS:
 4.3.  Debugging GC Issues
 4.4.  Valgrind: Your Friend
 
+
 0. Overview
 
 This is an internals guide to someone who wants to understand, and possibly
@@ -235,7 +237,7 @@ The object system provides three kinds of strings: encapsulated
 C strings, regular strings and lazy strings (type tags LIT, STR and LSTR,
 respectively).
 
-2.4.1 Encapsulated C Strings
+2.4.2 Encapsulated C Strings
 
 The design of the dynamic type system recognizes that programs contain literals
 and static strings, and that sometimes transient strings are are used which
@@ -279,6 +281,62 @@ string.  Note that it is okay if garbage objects contain auto_str values, which
 refer to strings that no longer exist, because the garbage collector will
 recognize these pointers by their type tag and not use them.
 
+2.4.1 Representation Hacks for 2 Byte wchar_t
+
+On some systems (notably Cygwin), the wide character type wchar_t is only
+two bytes wide, and string literals and arrays are only two-byte aligned.
+This creates a problem: we need a two-bit type tag in the pointer, but
+pointers have only one spare bit due to their two-byte alignment.
+
+It turns out that this is not a problem provided that we can ensure that no two
+distinct string objects share the same four byte word, and if we're willing to
+incur a small performance penalty to find the beginning of the string when we
+need it.
+
+On these systems, what we do is add a null character at the beginning of the
+string, and an extra one at the end. So the literal L"abc" is actually
+represented by L"\0" L"abc" L"\0".  We then take the pointer to the 'a'
+character as the string, which falls into one of two cases: it is either
+four-byte aligned (case 1), or it is two-byte aligned (case 2). Either way, it
+falls into some four byte cell, either at its base or at its third byte. When
+we add the tag bits 11 (TAG_LIT), we make this pointer point to the fourth byte
+(byte 3) of the four byte cell.  To recover the pointer, we remove the tag
+(replace it with bits 00), which leaves us pointing to the base of the
+four-byte cell. The string either starts there (case 1) or two bytes higher
+(case 2). The case is distinguished by looking at the pointed-at wchar_t. If it
+is the null character, then the pointer is incremented to the next character.
+
+The padding at the end of the string ensures that this trick works for the
+null string, where the test for the null character always succeeds.
+
+The lit macro, which existed before this hack, takes care of doing this so most
+code doesn't know the difference.
+
+The new wli macro helps manage this representation when access is needed to C
+string literals which are assigned to wchar_t * variables, and also provides
+type safety by using a different pointer type for strings which have been
+treated with the padding.
+
+  const wchli_t *abc = wli("abc"); /* special type */
+
+  val abc_obj = static_str(abc); /* good: requires const wchli_t * pointer */
+
+  val xyz_obj = static_str(L"xyz"); /* error */
+
+  val def_obj = static_str(lit("abc")); /* error */
+
+The wini and wref macros manage this representation when character arrays are
+used. The wini macro abstracts away the initializer, so the programmer doesn't
+have to be aware of the extra null characters:
+
+  wchar_t abc[] = wini("abc"); /* potentially six wchar_t units! */
+
+  wchar_t *ptr_a = wref(abc); /* pointer to "a" */
+
+  wref(abc)[1] = L'B'; /* overwrite 'b' with 'B' */
+
+On a platform where this hack isn't needed, these w* macros are noops.
+
 
 3. Garbage Collection
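
Note on the HACKING hunk above: the tag and untag steps described under
"Representation Hacks for 2 Byte wchar_t" can be sketched in portable C as
follows. This is a minimal illustration under stated assumptions, not the
project's actual code: wch2_t (a uint16_t standing in for a two-byte wchar_t),
cell_buf, tag_lit, untag_lit and check_both_cases are hypothetical names, and
the sketch assumes that converting a pointer to uintptr_t preserves its low
address bits, which holds on common platforms but is not guaranteed by the
C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a two-byte wchar_t. */
typedef uint16_t wch2_t;

#define TAG_MASK ((uintptr_t) 3)
#define TAG_LIT  ((uintptr_t) 3)   /* tag bits 11, as in the text */

/* Forces four-byte alignment so both cases can be laid out on purpose. */
union cell_buf {
  uint32_t align;
  wch2_t ch[8];
};

/* Setting both low bits points at byte 3 of the four-byte cell,
   whether the string starts at the cell's base or two bytes in. */
static uintptr_t tag_lit(const wch2_t *str)
{
  return (uintptr_t) str | TAG_LIT;
}

/* Clearing the tag yields the base of the cell. If the character there
   is null, it is the leading pad (or an empty string), so step past it. */
static const wch2_t *untag_lit(uintptr_t tagged)
{
  const wch2_t *base = (const wch2_t *) (tagged & ~TAG_MASK);
  return (*base == 0) ? base + 1 : base;
}

/* Returns 1 if the string is recovered in each layout. */
static int check_both_cases(void)
{
  /* Case 2: pad at index 0, so 'a' sits two bytes into the cell. */
  static const union cell_buf odd = { .ch = { 0, 'a', 'b', 'c', 0, 0 } };
  /* Case 1: pad at index 1, so 'x' sits at the base of the next cell. */
  static const union cell_buf even = { .ch = { 0, 0, 'x', 'y', 0, 0 } };
  /* Empty string, case 1: leading pad, end pad and terminator are null. */
  static const union cell_buf empty = { .ch = { 0, 0, 0, 0 } };

  const wch2_t *a = &odd.ch[1];
  const wch2_t *x = &even.ch[2];
  const wch2_t *e = &empty.ch[2];

  return untag_lit(tag_lit(a)) == a
         && untag_lit(tag_lit(x)) == x
         && *untag_lit(tag_lit(e)) == 0;
}
```

In the empty-string layout the recovered pointer is one character past the
original, but it still points at a null character, which is what the trailing
pad is there to guarantee.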