diff options
author | Kaz Kylheku <kaz@kylheku.com> | 2011-09-01 10:32:52 -0700 |
---|---|---|
committer | Kaz Kylheku <kaz@kylheku.com> | 2011-09-01 10:32:52 -0700 |
commit | 0a22f631031a8465a289a2c58c57ef1986162f38 (patch) | |
tree | 006d96368c6df1c5f2f9379bec54de5ca231ff78 /HACKING | |
parent | 7a8da4a56c50d7134e78e9b86314c33675c857c9 (diff) | |
download | txr-0a22f631031a8465a289a2c58c57ef1986162f38.tar.gz txr-0a22f631031a8465a289a2c58c57ef1986162f38.tar.bz2 txr-0a22f631031a8465a289a2c58c57ef1986162f38.zip |
This should be under version control.
Diffstat (limited to 'HACKING')
-rw-r--r-- | HACKING | 516 |
1 files changed, 516 insertions, 0 deletions
diff --git a/HACKING b/HACKING new file mode 100644 index 00000000..308d7c95 --- /dev/null +++ b/HACKING @@ -0,0 +1,516 @@ + Txr Internals Guide + Kaz Kylheku <kkylheku@gmail.com> + + +0 Overview + +This is an internals guide to someone who wants to understand, and possibly +change or extend the txr program. The purpose is to give explanations, +provide rationale and make coding recommendations. + + +1 Coding Practice + +1.1 Language + +Txr is written in a language that consists of the common dialect between C90 and C++98. +The code can be built with either the GNU C compiler or the GNU C++ compiler. +Use is made of some POSIX.2 functions, which are requested by means of -D_POSIX_2_SOURCE. +Also, the <wchar.h> header is used, which was introduced by a 1995 addendum +to the C language, so it may be said that the actual C dialect is C95. + +In coding new features or fixing bugs, care must be taken to preserve this. Code must +continue to compile as C and C++, and not increase the portability requirements. + +C++ compilation can be arranged using ./configure --ccname=g++ (for instance). + +Note that txr is not written takes some nonportable liberties with the +language, such as encoding bit fields into pointers, and treating automatic +storage as a flat stack which can be treated as an array that can be walked by +a garbage collector looking for references to objects. There are assumptions +about the alignment of objects too. + +1.2 Program File Structure + +The txr code has a simple flat structure: a collection of .c files +(and also a .l flex file and a .y yacc file) and headers. +The txr project follows the include header style that every C source file +includes all needed headers, in the proper order. Headers do not include other +headers. + +The generation of the dependency makefile dep.mk depends on this; the +depend.txr script does not scan headers for inclusion of other headers. If +this stylistic decision is ever changed, the dependency generation will have to +be updated. + + +1.3 Style + +Tab characters are avoided in txr source files. The indentation is two characters. +Formatting is similar to K&R, though the yacc grammar files use a Lispy formatting. +Expression or statement elements which are syntactically parallel, but +on separate lines, must be horizontally aligned with each other: + + if (function(argument1, + argument)) + +father than: + + if (function(argument1, + argument)) + +The opening brace of a function goes on a separate line. if/else braces +``cuddle'' into the previous line, except when the condition spans multiple +lines: + + if (multi + + line + + condition) + { + /* brace doesn't cuddle */ + } else { + + } + +switch cases indent with the switch: + + switch (x) + { + case ... + ... + break; + } + +switches handle all enumeration members; default cases have a break +even if they are last in the block. The following style is permitted + + if (...) { + ... + } else switch (...) { + case ... + + } + +Forward and backward goto are permitted, unless it is /glaringly/ +obvious that the code can be written better without it. + + +1.3 Error Handling + +Multiple return points from functions are encouraged. Txr has a garbage +collector, so there is usually no need to branch to a common cleanup just for +the sake of freeing memory. + +Txr also has exceptions; code that must free some resource other than +garbage collected memory if a failure occurs, must be exception safe. + +Exceptions should be used for both internal errors and environmental +situations. The internal_error macro is preferred to calling abort. + + +1.4 I/O + +Use of the C streams and printf must be avoided. Txr has its own +streams and its own formatter function called format. Printing to +a dynamic string is supported. There are three global streams: +std_output, std_input and std_error. These streams don't do everything +that standard I/O streams can do, such as binary I/O, but +their capabilities can be extended. + + +1.5 Regression + +All changes must be verified not to break the test cases. This is done by +running ``make tests''. Running ``make tests'' is not possible if the code is +being cross-compiled; in that case run ``make install-tests'' after ``make +install''. This will add the test cases and a shell script to run them to the +installation. The cases can then be installed and run on the cross target. + + +2. Dynamic Types + +The txr code is organized around a dynamic typing paradigm implemented in C. +Values are represented by the C type val, which is a typedef name for +a pointer to obj_t, i.e obj_t *. + + +2.1 Two Kinds of Values + +A value of type val falls into two kinds: heaped and immediate. + +An heaped val points to an obj_t object, which is a union of a number +of structure types, discriminated by a type field. + +An immediate val actually contains the value inside the pointer, and does not +point to anything. + + +2.1 Pointer Bitfield + +Immediate and heaped values are distinguished by a two-bit field in the least +signficant bits of the pointer. If the two bits are 00 (i.e. the pointer is +four-byte-aligned) then the value is a pointer to a heaped object, unless +it is the null pointer. The null pointer is understood to be the object +nil. The is_ptr(v) macro evaluates true for a value v which is not nil, +and which points to a heap object (at least according to its bit field; +is_ptr does not validate the pointer). + +The codes 01 10 and 11 indicate immediate values: values of type NUM, CHR and +LIT, respectively. That is to say, if the tag bits are 01, then then remaining +upper bits of the pointer constitute a signed integer. The range of this +integer is NUM_MIN to NUM_MAX, defined in lib.h. The code 10 is for +characters: the remaining bits of the pointer encode a wchar_t value. The bits +11 indicate that the object is a pointer to an encapsulated C string (of +wide characters), which is most often a literal. See the subsection +Encapsulated C Strings below. Only C strings whose first character is suitably +aligned can be represented as LIT objects. The address of the first character +of the string is formed by masking out the 11 code, leaving a pointer which is +four-byte aligned. + + +2.2 Heap Objects + +Heap types are an union of various structures: union obj. The obj_t name is a +typedef. All of the structures are no larger than four pointer-sized words, +including the type tag, and it should be kept that way. Heaps are managed as +arrays of this union obj. If any one of the union members is made larger than +four words, then the heap size will increase. + +Though the type tag is defined by a enumeration, for memory management +purposes, the type field is overloaded with additional bitmasked values. +The FREE flag in a type field marks an object on the free list. +The REACHABLE flag is the marking bit used during garbage collection. + + +2.3 The COBJ type + +The COBJ type is a mechanism whereby a ``native'' C type can be integrated into +the dynamic type system. Under the COBJ model, the heap allocated object of +type COBJ serves as a handle which points to a separately allocated C object, +which can be an arbitrary structure. The relationship between the dynamic world +and this object is managed through a registered table of operations. The module +managing that object must provide functions for dealing with garbage +collection, printing, equality and hashing. The garbage collector hooks allow +the object's module to be notified when the associated COBJ handle becomes +unreachable. The associated C object may contain references to dynamic objects +(i.e. members of type val). In that case, it must provide the mark function, +which, when invoked, must traverse the object's members of type and recursively +invoke mark_obj on all of them. + + + + +2.4 Strings + +All string manipulation should be done using the dynamic object system. +The object system provides three kinds of strings: encapsulated +C strings, regular strings and lazy strings (type tags LIT, STR and LSTR, +respectively). + +2.4.1 Encapsulated C Strings + +The design of the dynamic type system recognizes that programs contain literals +and static strings, and that sometimes transient strings are are used which +have temporary lifetimes. Therefore, a special provision is made in the val +type to be able to represent C strings directly, without having to create +dynamically allocated copies in heap storage. These C strings represented as +values of type val are referred to in this document as encapsulated C strings. +A C string whose address is aligned on a four-byte boundary, or more strictly, +is converted to an encapsulated C string by masking the bits 11 into the least +significant two bit positions of its pointer, and then manipulated as a value +of type val (pointer to obj_t). + +Enapsulated C strings can be transparently used wherever the other kinds of +strings can be used, so the benefit is immense, for the small cost of a bit +operation. + +Most often, this feature is used for literals, and the lit macro is provided +for this situation. The macro call lit("abc") produces a value of type val +which represents the wide string L"abc". + +However, C strings other than literals can be encapsulated as values also. The +most obvious candidates are static strings which are arrays, rather than +literals, and stack-allocated strings, which C programs often use as +efficient temporary buffers for character manipulation. Two functions are +provided for converting these kinds of strings to encapsulated strings: the +functions static_str and auto_str. They do the same thing: simply take the +wchar_t * pointer and convert it to a obj_t * pointer with the bits 11 in the +tag field (thus requiring that the C string pointer be aligned such +that these bits are originally 00). Two different functions which do the same +thing are provided, because it is generally much safer to convert a static +string to a val (due to its indefinite lifetime) than an automatic string +(which becomes indeterminate when the enclosing block terminates). Care should +be taken to only ever use auto_str to wrap a stack-allocated string as a val, +so that such usage can be found in the program bys searching for occurences of +``auto_str''. Secondly, care should be taken to ensure that values produced by +auto_str do not try to escape beyond the lifetime of the enclosing block. If +they are passed to functions those functions must not retain the value in any +persistent place. For instance if an object is constructed which contains an +automatic string, that object must not be used beyond the lifetime of that +string. Note that it is okay if garbage objects contain auto_str values, which +refer to strings that no longer exist, because the garbage collector will +recognize these pointers by their type tag and not use them. + + +3 Garbage Collection + +Txr has a fairly simple mark-and-sweep garbage collector. The collector marks +objects by performing a depth-first-search over the graph formed by +inter-object references, starting at certain root values. Objects which are +not marked are identified during the sweep phase, which is a linear scan +through the object heaps, and placed on the free list. + +During the marking phase, the bit value 0x100 (denoted by the symbolic constant +REACHABLE) is used to mark reachable objects. This flag is reset during +the sweep phase, but the flag 0x200 (the value of the symbolic constant FREE) +is added to the type field of objects on the free list. This FREE flag has +the effect of ``poisoning'' free objects: if an object is prematurely +reclaimed (indicating a bug in the garbage collection system), uses +of that object will see a bad type tag, so that there is a good chance +the program will throw an exception due to a failed type check. + +3.1 Root Pointers + +The marking phase of the garbage collector looks in two places for root +pointers: by scanning the entire call stack, and by looking at a registered +list of global variables. + +Scanning the stack means that the garbage collector is conservative: it could +encounter values which look like valid object references, but are actually only +accidentally so due to having the right bit pattern. When this happens, +objects that should be considered garbage will remain live. + +Global root pointers are registered individually using the prot1 function, +or many at once using the protect function. Care must be taken to properly +null-terminate the variable argument list to protect. It does not use the +nao convention, but rather (val *) 0. + +The garbage collector takes care to also scan the machine registers. This is +currently done using a broadly portable approach, namely recording the machine +state into the stack with the setjmp macro. + + +3.2 GC-safe Code + +Since garbage collection is being used in code processed by a compiler which +knows nothing about garbage collection, it is important to obey certain rules +so that the code is gc-safe. Code which is not gc-safe is susceptible +to two potential serious problems: the premature garbage collection of an +object, and accesses, in the garbage allocator, to uninitialized parts of an +object. + +The rules for gc-safe code are not difficult in txr, due to the immense +simplification that the garbage collector scans the stack and registers. If a +value is in an automatic local variable, or if the code is working with the +value as the result of an expression, function return, or passing it as a +function parameter, that value is visible to the gc and protected. Thus, the +rules only have to be followed in lower-level code which is close to the +allocator. Normal application code does not have to follow any special rules. + +The garbage collector is called implicitly by code which calles make_obj to +pull a raw object from the garbage collector's free list. + +3.2.1 Rule One: Full Initialization + + A function which calls make_obj must not be hanging on to any + references to a partially initialized object. + +Any partially initialized object may be visited by the garbage collector during +the call to make_obj. A partially initialized object may have a type code which +still indicates that it is free. If the garbage collector encounters an object +on the stack which is free, it will simply skip that object. This means that +the sweep phase may then return that object to the free list. If a free object +is encountered during transitive marking, the garbage collector will abort. + +In other words, if the program allocates an object from the free list, but then +accidentally invokes the garbage collector prior to completing the +initialization of that object, the object may be reclaimed back to the +free list and the program is then working with a freed object; or +the program may even abort. + +If the program initializes only the type field of the object from make_obj, +but not the other fields that may contain a value of type val, and then invokes +the garbage collector, the garbage collector will treat that object as visible, +and then try to mark the val-typed fields of that object, thereby using +uninitialized memory. + +The full initialization rule is therefore that after make_obj is called, the +object must be fully initialized before doing any other operation that may +allocate gc memory. Fully initialized means that the type field is initialized, +as well as any other field that is visited during garbage collection. + +3.2.2 Rule Two: Make it Reachable + + A function which constructs an object must place it in live, reachable + storage before attempting to construct another object. + +The garbage allocator does not scan all of memory for root pointers, only the +stack and registered globals. So for instance, if the only reference to an +object is inside a dynamically allocated structure, and that structure is not +visible to the allocator, then if gc is invoked, that object will be reclaimed. +So the following pattern is incorrect. + + { + some_struct_type *t = (some_struct_type *) chk_malloc(sizeof *t); + t->value = cons(foo, bar); + return cobj((mem_t *) t, some_type_symbol, &some_type_ops); + } + +There are three allocations in the code. The allocation of the structure +assigned to pointer t, the allocation of the cons cell stored in t->value, and +the allocation of the COBJ. The issue is that the object t is not known to the +allocator. It is a ``native'' C type, which the garbage collector will not +traverse. The garbage allocator can see the pointer t, because it scans the +stack and registers, but that object means nothing to the garbage collector, +and so the collector cannot find and mark the t->value member. + +Of course, the operations structure ``some_type_ops'' presumably contains a +mark function which knows how to traverse this object and find values inside +it. But that does not come into play until this object is registered with a +COBJ, which does not happen until the last line in the above block +where the cobj function is called. After the cobj call, the t pointer +is hooked into the COBJ object, and visible to the garbage collector. + +So the object allocated by cons(foo, bar) is put into a structure which is +yet invisible to the allocator, and that reference is the only live reference +which the program has to that cons cell. Consequently, the subsequent call to +the allocator, hidden inside the cobj function, may trigger gc, and cause this +cons cell to be reclaimed into the free list! + +The following adjustment does not fix the problem: + + { + val c = cons(foo, bar); + some_struct_type *t = (some_struct_type *) chk_malloc(sizeof *t); + t->value = c; /* still wrong */ + return cobj((mem_t *) t, some_type_symbol, &some_type_ops); + } + +Even though the cons cell is now also held in a local variable, as well as in +the structure, it is still not necessarily visible to the garbage collector. +The problem is that after the ``t->value = c'' assignment, the variable c is no +longer live. Variable liveness is a concept from dataflow analysis, which +is a process implemented in optimizing compilers. A variable is live at some +point in the code if the value stored in it has a next use: another code can be +reached from that point which uses the value. The variable c has no next use +after the t->value = c assignment. There is only one execution path from that +point in the code, and that path leads to the termination of the block, which +destroys c. Essentially, the t->value struture member is the sink for the +data flow which carries the cons cell: The data flow emanates from the call +cons(foo, bar), and terminates in t->value. + +There are several right ways to fix this: + + { + val co; + some_struct_type *t = (some_struct_type *) chk_malloc(sizeof *t); + t->value = nil; + co = cobj((mem_t *) t, some_type_symbol, &some_type_ops); + t->value = cons(foo, bar); + return co; + } + +The above properly initializes the structure, and then associate it with the +COBJ. This makes the structure visible to the garabge collector (through the co +variable, which is live at the point where the cobj function is called, due +to having a next use in the return statement!) Now we can safely stash a newly allocated +cons cell into that structure, allowing that structure to hold the one and only +reference to that object. + +Another approach, which avoids two-step initialization of the structure: + + { + val c = cons(foo, bar); + some_struct_type *t = (some_struct_type *) chk_malloc(sizeof *t); + co = cobj((mem_t *) t, some_type_symbol, &some_type_ops); + t->value = c; + return co; + } + +In this situation, the variable c maintains a live, gc-visible reference to the +cons across the cobj allocation. The variable c is live at the point of the +cobj call because it has a next use: its value is used in the subsequent +assignment to t->value. We don't initialize the structure because even if +the cobj function triggers gc, the gc cannot yet see that structure and +so there is no danger. After cobj returns, the first thing we do is +initialize the structure (obeying the first rule of gc-safe code). +Just after cobj returns, the structure is uninitialized and visible to the +garbage collector, but there is nothing that will trigger gc prior to +the initialization. + +Note that this premature collection problem also affects functions which simply +take an existing object and put it into a structure, where it is not obvious +that an object may have been allocated which is not visible to gc, + + /* Looks harmless: allocate structure, stick the argument object + into it and make a COBJ! */ + + val make_foo(val member) + { + foo *f = (foo *) chk_malloc(sizeof *foo); + f->mem = member; /* Oops, member is no longer live. */ + return cobj((mem_t *) f, ...); + } + +The problem is that the caller which invokes foo might not maintain any live +reference to the argument object either, and so the f->mem = member might +be the one and only sink for the data flow carrying that object; i.e. +the one and only reference to that object in the entire program. +One way that can happen is that the object is just a temporary that is +allocated in the function call itself: + + make_foo(string("abc")); /* oops! */ + +The make_foo function can be correct like this: + + val make_foo(val member) + { + cobj co; + foo *f = (foo *) chk_malloc(sizeof *foo); + co = cobj((mem_t *) f, ...); + f->mem = member; + return co; + } + + +3.3 Weak Reference Support + +COBJ objects can support weak pointers, but there is no fully encapsulated +interface for this; to be more specific, adding a new module of objects that +have weak references, it is necessary to to add a function call code into the +garbage collection functino. + +Modules with weak references should closely follow the pattern used by the hash +module. Hash tables are implemented using COBJ, and provide weak key and value +support thanks to cooperation with the gc module. + +Weak references work as follows. During gc marking, the COBJ module +must maintain a list of all objects of its kind which are marked +(or at least just that subset of them which contains weak references). +It must refrain from marking the weak references contained in these +objects, but rather leave them unmarked. + +After the initial marking phase, gc will call a global function in each module +that manages objects with weak references. (Currently there is only one such +function: hash_process_weak; a similar function must be writen +for a new module and added). + +This function must process and clear the weak list gathered during the +initial marking. Each weak reference in each object on this weak +marked list must be inspected to see whether it refers to an object which is +still reachable. Weak references which point to values which have not been +reached (do not have the REACHABLE bit) must be lapsed according to the +object's rules for lapsing weak references. For instance, a hash table with +weak keys will delete a key/value pair if the key reference lapses. A weak +pointer container object might convert a lapsed weak reference to the value +nil. + +Weak objects can defer marking certain other non-weak objects. For instance +the hash module, during marking, does not mark the vector object that serves as +the hash chain table (at least not for weak hashes), and neither does it mark +the conses which make up the hash chains emanating from that vector. This +marking is completed in hash_process_weak. After the lapsed entries are removed +(their conses are spliced out of the chains), then the vector is marked, which +transitively causes the chain conses to be marked. The conses that were removed +due to the lapsing of weak keys thus stay unmarked and are reclaimed during +the sweep phase of the gc, which soon follows. |