txr - TXR: A data munging language.

	Commit message	Author	Age	Files	Lines
*	* eval.c (eval_init): Update registration of regex-compile	Kaz Kylheku	2013-12-06	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	to reflect that it has two arguments now. * parser.y (grammar): Update calls to regex_compile to pass two arguments. Since we don't expect regex_compile to parse, we specify the error stream as nil. (spec): The "secret syntax" for a regex is simplified not to include the slashes. This provides better diagnostics for unterminated syntax and requires less string processing to generate. Also, the form returned doesn't have the regex symbol consed onto it, which parse_regex just throws away. * regex.c (regex_compile): Now takes a stream argument. * regex.h (regex_compile): Declaration updated. * txr.1: Updated
*	* regex.c (regex_compile): Handle string input.	Kaz Kylheku	2013-12-05	1	-1/+5
\| \| \| \| \| \| \|	* regex.h (regex_compile): Don't call argument regex_sexp, since it can be a string. * txr.1: Updated.
*	* regex.c (regex_space_chars): Variable removed.	Kaz Kylheku	2012-04-20	1	-22/+16
\| \| \| \| \| \| \| \| \|	(char_set_addr_str): New function. (char_set_compile): Use char_set_addr_str to add spaces to set. (init_special_char_sets): Use char_set_addr_str to add spaces to set. Bugfix: word_cs, cword_cs wrongly initialized. (regex_init): Removed reference to regex_space_chars.
*	* parser.y (regtoken): New nonterminal symbol.	Kaz Kylheku	2012-04-20	1	-1/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	(regterm): REGTOKEN production factored out to regtoken. (regclass): Reverted prior commit's changes. (regclassterm): Reverted prior commit, removing REGTOKEN production for character classes, and introduced a regtoken production. So now the keyword symbols are part of the character class abstract syntax. (regtoken): New production rule. * regex.c (regex_space_chars): Converted to internal linkage. (char_set_compile): Handle token keywords in character class abstract syntax. * regex.h (regex_space_chars): External declaration removed.
*	First cut at implementing \s, \d, \w, \S, \D and \W regex tokens.	Kaz Kylheku	2012-04-19	1	-3/+104
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* lib.c (init): Call regex_init. * parser.l: return new REGTOKEN kind. * parser.y (REGTOKEN): New token type. (REGTERM): Translate REGTERM to keyword. (regclass): Restructured to handle inherited nodes as lists. (regclassterm): Produce $$ as list. Add handling for REGTOKEN occurring inside character class by expanding it. This might not be the best approach. (yybadtoken): Handle REGTOKEN in switch. * regex.c (struct any_char_set, struct small_char_set, struct displaced_char_set, struct large_char_set, struct xlarge_char_set): New bitfield member, stat. (char_set_create): New parameter for indicating static char set. (char_set_destroy): Do not free a static char set. (char_set_compile): Pass zero to new parameter of char_set_create. (spaces): New static array. (space_cs, digit_cs, word_cs, cspace_cs, cdigit_cs, cword_cs): New static pointers to char_set_t. (init_special_char_sets, nfa_compile_given_set): New static function. (nfa_compile_regex, dv_compile_regex): Handle new character set token keywords. (space_k, digit_k, word_char_k, cspace_k, cdigit_k, cword_char_k, regex_space_chars): New variables. (regex_init): New function. * regex.h (space_k, digit_k, word_char_k, cspace_k, cdigit_k, cword_char_k, regex_space_chars, regex_init): Declared.
*	Improve the regex Lisp syntax by allowing strings to specify	Kaz Kylheku	2012-04-12	1	-4/+12
\| \| \| \| \| \| \| \| \| \| \|	character compounds. I.e. the syntax "foo" is equivalent to the cumbersome canonical form (compound #\f #\o #\o). * regex.c (nfa_compile_regex, dv_compile_regex): Use chrp function instead of typeof. Handle stringp case by forming a compound out of the characters and recursing. Check for some bad objects in the regex that would never come out of our regex parser but could occur in a "hand crafted" syntax tree.
*	* eval.c (eval_init): Expose regex-compile and regexp as intrinsics.	Kaz Kylheku	2012-04-10	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \|	* lib.c (obj_init): Change spelling of nongreedy operator and put it into the user package so that it is available for use with regex-compile. * regex.c (match_regex, search_regex): Bugfix: optional start position argument not defaulting to zero. * txr.1: Documented regex-compile and regexp. * txr.vim: Highlighting regex-compile and regexp.
*	Changing type function to not blow up on nil, which makes a lot of code	Kaz Kylheku	2012-03-17	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	simpler. A pseudo type code is introduced called NIL with value 0. * lib.h (enum type): New enumeration value, NIL. (type): Function accepts object nil and maps it to code NIL. * eval.c (dwim_loc, op_dwim): test for nil obj and goto hack is gone, just handle NIL in the switch. * gc.c (make_obj, mark): Handle new NIL type code in switch. * hash.c (equal_hash): Handle NIL in the switch instead of nil test. * lib.c (code2type): Map new NIL type code to null. (typeof, typecheck): Code simplified. (class_check, car): Move nil test into switch. (eql, equal, consp, bignump, stringp, lazy_stringp, symbolp, functionp, vectorp, cobjp): Simplified. (length, sub, ref, refset, replace, obj_print, obj_pprint): Handle NIL in switch instead of nil test. goto hack removed from refset. * match.c (do_match_line, do_output_line): switch condition simplified. * regex.c (regexp): Simplified. (regex_nfa): Assert condition simplified.
*	* regex.c (regsub): the replacement argument	Kaz Kylheku	2012-03-13	1	-1/+4
\| \| \| \| \| \|	can now be a function of one argument which maps the original piece of text matched by the regex to a replacement text.
*	Bug #35718. Workaround good enough to get some code working.	Kaz Kylheku	2012-03-04	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	* eval.c (cons_find): New function. (expand_op): Use cons_find rather than tree_find to look for rest_gensym. * regex.c (regsub): Rearranged arguments so that the string is last. This is better for partial evaluation via the op operator. * regex.h (regsub): Updated declaration.
*	* eval.c (eval_init): New intrinsic function, regsub.	Kaz Kylheku	2012-03-04	1	-0/+28
\| \| \| \| \| \| \| \|	* regex.c (regsub): New function. * regex.h (regsub): Declared. * txr.1: Doc stub added.
*	* arith.c: Updated copyright year.	Kaz Kylheku	2012-02-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* arith.h: Likewise. * debug.c: Added copyright header. * debug.h: Updated copyright year. * eval.c: Likewise. * eval.h: Likewise. * filter.c: Likewise. * filter.h: Likewise. * gc.c: Likewise. * gc.h: Likewise. * hash.c: Likewise. * hash.h: Likewise. * lib.c: Likewise. * lib.h: Likewise. * match.c: Likewise. * match.h: Likewise. * parser.h: Likewise. * regex.c: Likewise. * regex.h: Likewise. * stream.c: Likewise. * stream.h: Likewise. * txr.c: Likewise, and e-mail address. * txr.h: Updated copyright year. * unwind.c: Likewise. * unwind.h: Likewise.
*	We don't include headers in headers in this project.	Kaz Kylheku	2011-10-30	1	-0/+1
\| \| \| \| \| \| \| \|	* parser.h: Do not include <stdio.h> * regex.c: Include <limits.h> * regex.h: Do not include <limits.h>
*	Improved support for broken unicode.	Kaz Kylheku	2011-10-10	1	-1/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Regex support for extra-large character sets not compiled in if wchar_t is not wide enough for it. The utf-8 code properly throws exceptions when encountering characters that it cannot represent, instead of silently ignoring the situation and continuing with incorrectly computed data. * regex.c (FULL_UNICODE): New macro. (CHAR_SET_L3, CHAR_SET_L2_LO, CHAR_SET_L2_HI): Only defined if full unicode is available. (CHSET_XLARGE, cset_L3_t, struct xlarge_char_set, L2_full, L3_fill_range, L3_contains): Ditto. (union char_set): Member x1 present only under FULL_UNICODE. (char_set_destroy, char_set_add, char_set_add_range, char_set_contains): CHSET_XLARGE cases only available on FULL_UNICODE. (char_set_compile): Default cst variable to CHSET_LARGE. * utf8.c (FULL_UNICODE): New macro. (conversion_error): New function. (utf8_from_uc): Throw error if not FULL_UNICODE and character is outside the BMP. (utf8_decode): Likewise.
*	* LICENSE, Makefile, configure, filter.c, filter.h, gc.c, gc.h, hash.c,	Kaz Kylheku	2011-10-04	1	-1/+1
\| \| \| \| \| \|	hash.h, lib.c, lib.h, match.c, match.h, parser.h, parser.l, parser.y, regex.c, regex.h, stream.c, stream.h, txr.1, txr.c, txr.h, unwind.c, unwind.h, utf8.c, utf8.h: Updated e-mail address.
*	* LICENSE, Makefile, configure, gc.c, gc.h, hash.c, hash.h, lib.c,	Kaz Kylheku	2011-09-23	1	-1/+1
\| \| \| \| \| \|	lib.h, match.c, match.h, parser.h, parser.l, parser.y, regex.c, regex.h, stream.c, stream.h, txr.1, txr.c, txr.h, unwind.c, unwind.h, utf8.c, utf8.h: Updated copyright year.
*	Bump copyrights to 2010.	Kaz Kylheku	2010-10-05	1	-1/+1
*	Fix inaccurate comment.	Kaz Kylheku	2010-01-26	1	-4/+4
*	Optimization in derivative-based regex engine.	Kaz Kylheku	2010-01-26	1	-1/+54
\| \| \| \| \| \| \| \|	Exponential memory consumption behavior was observed when matching the input aaaaaa.... against the regex a?a?a?a?....aaaa.... The fix is to eliminate common subexpressions from the derivative for the or operator.
*	* regex.c (reg_derivative_list, reg_derivative): Recognition	Kaz Kylheku	2010-01-18	1	-6/+29
\| \| \| \| \| \| \| \| \|	of cases to reduce consing. In reg_derivative_list, we avoid consing the full or expression if either branch is t, and also save a cons when the first element has a null derivative. In reg_derivative the oneplus and zeroplus cases are split, since zeroplus can re-use the input expression, when it's just a one-character match, deriving nil.
*	Adjust semantics of non-greedy operator R%S, to avoid the broken	Kaz Kylheku	2010-01-18	1	-3/+9
\| \| \| \| \| \| \| \|	case whereby R%S matches nothing at all when S is not empty but equivalent to empty, or more generally when S is nullable. A much nicer definition is ``the intersection of R* and the set of all strings that do not contain a non-empty substring that matches S, followed by S''.
*	Implemented non-greedy operator.	Kaz Kylheku	2010-01-15	1	-1/+20
*	* regex.c (reg_derivative_list): Bugfix: wrong algebra,	Kaz Kylheku	2010-01-15	1	-1/+1
\| \| \| \|	taking a double derivative of the first item.
*	* regex.c (reg_derivative): Bugfix: remove invalid	Kaz Kylheku	2010-01-14	1	-9/+1
\| \| \| \|	algebraic reductions in the derivative for the operator.
*	Dynamically determine which regex implementation to use:	Kaz Kylheku	2010-01-13	1	-2/+30
\| \| \| \| \| \| \|	NFA or derivatives. The default behavior is NFA, with derivatives used if the regular expression contains uses of complement or intersection. The --dv-regex option forces derivatives always.
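The reason complement and intersection push matching onto the derivative engine can be seen in a minimal Brzozowski-derivative sketch. This is not the code in regex.c: the names `re_t`, `mk`, `nullable`, `deriv` and `re_match` are hypothetical, the sketch performs no algebraic simplification, and it never frees the nodes it allocates. It only shows that, with derivatives, the complement and intersection cases are one-liners.

```c
#include <stdlib.h>

/* Hypothetical regex node kinds for this sketch. */
typedef enum { EMPTY_SET, EPSILON, CHR, CAT, OR, AND, NOT, STAR } kind_t;

typedef struct re {
  kind_t kind;
  char ch;                 /* used by CHR only */
  const struct re *a, *b;  /* operands; b is unused by NOT and STAR */
} re_t;

/* Allocate a node. Nodes are never freed in this sketch. */
const re_t *mk(kind_t kind, char ch, const re_t *a, const re_t *b)
{
  re_t *r = malloc(sizeof *r);
  if (r == NULL)
    abort();
  r->kind = kind; r->ch = ch; r->a = a; r->b = b;
  return r;
}

/* Does the regex match the empty string? */
int nullable(const re_t *r)
{
  switch (r->kind) {
  case EPSILON: case STAR: return 1;
  case CAT: case AND: return nullable(r->a) && nullable(r->b);
  case OR: return nullable(r->a) || nullable(r->b);
  case NOT: return !nullable(r->a);
  default: return 0;       /* EMPTY_SET, CHR */
  }
}

/* Derivative of r with respect to the character c. */
const re_t *deriv(const re_t *r, char c)
{
  switch (r->kind) {
  case CHR: return mk(r->ch == c ? EPSILON : EMPTY_SET, 0, NULL, NULL);
  case OR: return mk(OR, 0, deriv(r->a, c), deriv(r->b, c));
  case AND: return mk(AND, 0, deriv(r->a, c), deriv(r->b, c));
  case NOT: return mk(NOT, 0, deriv(r->a, c), NULL);
  case STAR: return mk(CAT, 0, deriv(r->a, c), r);
  case CAT: {
    const re_t *left = mk(CAT, 0, deriv(r->a, c), r->b);
    return nullable(r->a) ? mk(OR, 0, left, deriv(r->b, c)) : left;
  }
  default: return mk(EMPTY_SET, 0, NULL, NULL);  /* EMPTY_SET, EPSILON */
  }
}

/* Match the whole string: derive by each character, then test nullability. */
int re_match(const re_t *r, const char *s)
{
  for (; *s != '\0'; s++)
    r = deriv(r, *s);
  return nullable(r);
}
```

Because nothing is simplified here, the derivative terms grow with every input character; a practical engine has to reduce them as it goes.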
*	Implement derivative-based regular expressions.	Kaz Kylheku	2010-01-13	1	-248/+557
*	Remove incorrect implementation of extended	Kaz Kylheku	2010-01-06	1	-273/+32
\| \| \| \| \|	regex operations (complement, intersection). The syntax extensions and their documentation are retained.
*	Implemented the regular expression ~ and & operators.	Kaz Kylheku	2010-01-05	1	-32/+273
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This turns out to be easy to do in NFA land. The complement of an NFA has exactly the same number and configuration of states and transitions, except that the states have an inverted meaning; and furthermore, failed character transitions are routed to an extra state (which in this implementation is permanently allocated and shared by all regexes). The regex & is implemented trivially using De Morgan's laws. Also, bugfix: regular expressions like A\|B\|C are allowed now by the syntax, rather than constituting syntax error. Previously, this would have been entered as (A\|B)\|C.
*	All COBJ operations have default implementations now;	Kaz Kylheku	2009-12-08	1	-6/+5
\| \| \| \| \| \|	no null pointer check over struct cobj_ops operations. New typechecking function for COBJ objects.
*	Eliminate the void * disease. Generic pointers are of mem_t *	Kaz Kylheku	2009-12-04	1	-1/+1
\| \| \| \| \|	from now on, which is compatible with unsigned char *. No implicit conversion to or from this type, in C or C++.
*	Code cleanup. All private functions static. Private stuff	Kaz Kylheku	2009-11-28	1	-36/+136
\| \| \| \|	in regex module not exposed in header. Etc.
*	Changes to make the code portable to C++ compilers, which	Kaz Kylheku	2009-11-24	1	-9/+9
\| \| \| \|	can be taken advantage of for better diagnostics.
*	Renaming global variables that denote symbols, such that they	Kaz Kylheku	2009-11-24	1	-16/+16
\| \| \| \|	have a _s suffix.
*	Improving portability. It is no longer assumed that pointers	Kaz Kylheku	2009-11-23	1	-5/+6
\| \| \| \| \| \| \| \|	can be converted to a type long and vice versa. The configure script tries to detect the appropriate type to use. Also, some run-time checking is performed in the streams module to detect which conversion specifier strings to use for printing numbers.
*	Introducing symbol packages. Internal symbols are now in	Kaz Kylheku	2009-11-21	1	-1/+2
\| \| \| \| \| \| \| \| \| \|	a system package instead of being hacked with the $ prefix. Keyword symbols are provided. In the matcher, evaluation is tightened up. Keywords, nil and t are not bindable, and errors are thrown if attempts are made to bind them. Destructuring in dest_bind is strict in the number of items. String streams are exploited to print bindings to objects that are not strings or characters. Numerous bugfixes.
*	Changing ``obj_t *'' occurrences to a ``val'' typedef. (Ideally,	Kaz Kylheku	2009-11-20	1	-22/+22
\| \| \| \| \|	we wouldn't have to declare object variables at all, so why use an obtuse syntax to do so?)
*	Following-up on diagnostics obtained by running code through C++	Kaz Kylheku	2009-11-18	1	-8/+8
\| \| \| \| \| \|	compiler. Idea: allocator functions return char * instead of void *, like malloc did in classic pre-ANSI C. That way we are forced to use a cast except when the target pointer is char * already.
*	Warning fixes.	Kaz Kylheku	2009-11-17	1	-1/+1
*	* regex.c (nfa_all_states, nfa_closure): visited parameter	Kaz Kylheku	2009-11-17	1	-2/+2
\| \| \| \|	should be unsigned.
*	Regular expression module updated to do unicode character sets.	Kaz Kylheku	2009-11-12	1	-49/+433
\| \| \| \| \| \| \| \| \| \| \|	Most of the changes are in the area of representing sets. Also, a bug was found in the compilation of regex character sets: ranges straddling two adjacent blocks of 32 characters were not being added to the character set. However, ranges falling within a single 32 block, or spanning three or more such blocks, worked properly. This bug is not tickled by common ranges such as A-Z, or 0-9, which land within a 32 block.
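The three range cases mentioned above (inside one 32-bit block, straddling two adjacent blocks, spanning three or more) can be illustrated with a small bitmap sketch. This is not the char_set code from regex.c: `bitset_t`, `bitset_add_range` and the 256-character limit are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SET_BITS 256u                  /* hypothetical set size for this sketch */
#define SET_WORDS (SET_BITS / 32u)

typedef struct { uint32_t w[SET_WORDS]; } bitset_t;

void bitset_clear(bitset_t *s)
{
  memset(s->w, 0, sizeof s->w);
}

int bitset_contains(const bitset_t *s, unsigned ch)
{
  return ch < SET_BITS && ((s->w[ch / 32u] >> (ch % 32u)) & 1u);
}

/* Add the inclusive range [lo, hi] to the set. */
void bitset_add_range(bitset_t *s, unsigned lo, unsigned hi)
{
  assert(lo <= hi && hi < SET_BITS);

  unsigned first = lo / 32u, last = hi / 32u;
  uint32_t head = (uint32_t) (UINT32_MAX << (lo % 32u));        /* bits lo%32 .. 31 */
  uint32_t tail = (uint32_t) (UINT32_MAX >> (31u - hi % 32u));  /* bits 0 .. hi%32 */

  if (first == last) {                 /* range lies within a single block */
    s->w[first] |= head & tail;
    return;
  }

  s->w[first] |= head;                 /* partial leading block */
  for (unsigned i = first + 1; i < last; i++)
    s->w[i] = UINT32_MAX;              /* whole blocks in between, if any */
  s->w[last] |= tail;                  /* partial trailing block */
}
```

When `last == first + 1` the loop body never runs, so the two-block case depends entirely on the leading and trailing partial blocks being handled on their own; that is the case the commit message describes as broken.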
*	Big conversion to wide characters and UTF-8 support.	Kaz Kylheku	2009-11-11	1	-3/+3
\| \| \| \| \| \| \| \| \|	This is incomplete. There are too many dependencies on wide character support from the C stream I/O library, and implicit use of some encoding which may not be UTF-8. The regex code does not handle wide characters properly. Character type is still int in some places, rather than wchar_t. Test suite passes though.
*	Version 019	Kaz Kylheku	2009-11-03	1	-7/+7
\| \| \| \| \| \|	Regexps can be bound to variables. New freeform directive.
*	Got regex working over lazy strings from freeform.	Kaz Kylheku	2009-11-02	1	-25/+82
\| \| \| \|	Bugfixes.
*	Start of implementation for freestyle matching.	Kaz Kylheku	2009-11-02	1	-0/+76
\| \| \| \| \| \| \| \| \| \| \|	Lazy strings implemented, incompletely. Changed string function to implicitly strdup; non-strdup version changed to string_own. Fixed wrong uses of strdup rather than chk_strdup. Functions added to regex module to provide regex matching as a state machine to which characters are fed.
*	Trivial change allows regexps to be bound to variables,	Kaz Kylheku	2009-10-30	1	-0/+5
\| \| \| \| \|	and used for matching. This Just Works because of the way match_line treats variables.
*	txr-015 2009-10-15	Kaz Kylheku	2017-07-31	1	-7/+10
*	txr-011 2009-09-25	Kaz Kylheku	2017-07-31	1	-0/+631