From ddb0601e8e26255b8b9b536a5e6a47b86c33b011 Mon Sep 17 00:00:00 2001
From: Kaz Kylheku
Date: Thu, 12 Nov 2009 11:44:25 -0800
Subject: Regular expression module updated to do unicode character sets.

Most of the changes are in the area of representing sets.

Also, a bug was found in the compilation of regex character sets:
ranges straddling two adjacent blocks of 32 characters were
not being added to the character set. However, ranges falling
within a single 32 block, or spanning three or more such blocks,
worked properly. This bug is not tickled by common ranges
such as A-Z, or 0-9, which land within a 32 block.
---
 ChangeLog | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

(limited to 'ChangeLog')

diff --git a/ChangeLog b/ChangeLog
index 1cc9198c..4fbbf5bb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,57 @@
+2009-11-12 Kaz Kylheku
+
+	Regular expression module updated to do unicode character sets.
+	Most of the changes are in the area of representing sets.
+
+	Also, a bug was found in the compilation of regex character sets:
+	ranges straddling two adjacent blocks of 32 characters were
+	not being added to the character set. However, ranges falling
+	within a single 32 block, or spanning three or more such blocks,
+	worked properly. This bug is not tickled by common ranges
+	such as A-Z, or 0-9, which land within a 32 block.
+
+	* regex.h (BITCELL_LIT): Macro removed.
+	(CHAR_SET_SIZE): Macro does not depend on UCHAR_MAX any more,
+	but hard-codes a set size of 256. UCHAR_MAX means nothing to us any
+	more since we are using wchar_t. The number 256 is simply an
+	arbitrarily chosen size for representing the small character
+	sets (or the leaves of the radix tree for representing large sets).
+	(chset_type_t): New enum typedef.
+	(cset_L0_t, cset_L1_t, cset_L2_t, cset_L3_t): New array typedefs.
+	(struct char_set): Replaced by union char_set.
+	(struct any_char_set, struct small_char_set, struct displaced_char_set,
+	struct large_char_set, struct xlarge_char_set): New struct types.
+	(char_set_clear): Declaration removed.
+	(char_set_create, char_set_destroy): Declared.
+	(char_set_add, char_set_add_range, char_set_contains,
+	nfa_state_single, nfa_state_set, nfa_machine_feed): Declarations
+	updated for wchar_t.
+	(struct nfa_state_single): member ch changed to wchar_t.
+
+	* regex.c (char_set_clear): Function removed.
+	(CHAR_SET_L0, CHAR_SET_L1, CHAR_SET_L2, CHAR_SET_L3, CHAR_SET_L2_L0,
+	CHAR_SET_L2_HI, CHAR_SET_L1_L0, CHAR_SET_L1_HI, CHAR_SET_L0_L0,
+	CHAR_SET_L0_HI): New macros.
+	(L0_full, L0_fill_range, L0_contains, L1_full, L1_fill_range,
+	L1_contains, L1_free, L2_full, L2_fill_range, L2_contains,
+	L2_free, L3_fill_range, L3_contains, char_set_create,
+	char_set_destroy): New functions.
+	(char_set_compl): Works using a flag rather than by actually
+	computing a complemented set. Also, is no longer a toggle (and
+	was never used that way).
+	(char_set_add, char_set_add_range, char_set_contains): Polymorphic over
+	the different set types.
+	(nfa_state_single, nfa_move, nfa_run, nfa_machine_feed): Converted
+	to wchar_t.
+	(nfa_state_free): Use char_set_destroy to free set.
+	(nfa_state_set): Does not construct the set internally but
+	takes it as a parameter.
+	(nfa_compile_set): Rewritten to perform two passes over the
+	s-expression representing the list of characters and ranges
+	making up the set. The first pass determines what representation
+	will be used for the set. The second pass stuffs the characters and
+	ranges into the set.
+
 2009-11-11 Kaz Kylheku
 
 	* txr.c (main): call setlocale to set the LC_CTYPE to en_US.UTF-8,
-- 
cgit v1.2.3
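
Note on the range bug described in the commit message: the three cases it distinguishes (a range inside one block of 32, a range straddling two adjacent blocks, and a range spanning three or more blocks) can be illustrated with a small bitmap fill routine. The following is only a sketch under stated assumptions, not the code from regex.c: it assumes a 256-character set stored as eight 32-bit cells, and the names cell_t, bit_range, fill_range and contains are hypothetical. The two-block case is the one where the loop over whole interior blocks runs zero times, so the first and last cells must each be filled on their own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; these are not the identifiers in regex.c. */
#define CELL_BITS 32                      /* characters per block (assumed) */
#define SET_SIZE  256                     /* matches the hard-coded set size */
#define NUM_CELLS (SET_SIZE / CELL_BITS)

typedef uint32_t cell_t;

/* Mask with bits lo..hi (inclusive) set; requires 0 <= lo <= hi <= 31. */
static cell_t bit_range(unsigned lo, unsigned hi)
{
    /* Avoid shifting by 32, which is undefined for a 32-bit type. */
    cell_t upto_hi = (hi == CELL_BITS - 1) ? ~(cell_t)0
                                           : (((cell_t)1 << (hi + 1)) - 1);
    cell_t below_lo = ((cell_t)1 << lo) - 1;
    return upto_hi & ~below_lo;
}

/* Add characters from..to (inclusive) to the set; requires from <= to < 256. */
static void fill_range(cell_t set[NUM_CELLS], unsigned from, unsigned to)
{
    unsigned first = from / CELL_BITS;
    unsigned last = to / CELL_BITS;
    unsigned i;

    if (first == last) {
        /* Case 1: the whole range lies inside one block. */
        set[first] |= bit_range(from % CELL_BITS, to % CELL_BITS);
        return;
    }

    /* Cases 2 and 3: partial first block, from the start bit to bit 31. */
    set[first] |= bit_range(from % CELL_BITS, CELL_BITS - 1);

    /* Whole interior blocks; this loop is empty when the blocks are adjacent. */
    for (i = first + 1; i < last; i++)
        set[i] = ~(cell_t)0;

    /* Partial last block, from bit 0 to the end bit. */
    set[last] |= bit_range(0, to % CELL_BITS);
}

/* Nonzero if ch (0..255) is a member of the set. */
static int contains(const cell_t set[NUM_CELLS], unsigned ch)
{
    return (int)((set[ch / CELL_BITS] >> (ch % CELL_BITS)) & 1);
}
```

With this layout, fill_range(set, 30, 33) touches cells 0 and 1 only, which is the straddling case the commit message says was previously dropped, while 'A'-'Z' (65 to 90) stays inside cell 2.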