path: root/regex.c
Commit message | Author | Age | Files | Lines
* * eval.c (eval_init): Expose regex-compile and regexp as intrinsics. | Kaz Kylheku | 2012-04-10 | 1 | -0/+5
    * lib.c (obj_init): Change spelling of nongreedy operator and put it into the user package so that it is available for use with regex-compile.
    * regex.c (match_regex, search_regex): Bugfix: optional start position argument not defaulting to zero.
    * txr.1: Documented regex-compile and regexp.
    * txr.vim: Highlighting regex-compile and regexp.
* Changing type function to not blow up on nil, which makes a lot of code | Kaz Kylheku | 2012-03-17 | 1 | -3/+2
    simpler. A pseudo type code is introduced called NIL with value 0.
    * lib.h (enum type): New enumeration value, NIL. (type): Function accepts object nil and maps it to code NIL.
    * eval.c (dwim_loc, op_dwim): test for nil obj and goto hack is gone, just handle NIL in the switch.
    * gc.c (make_obj, mark): Handle new NIL type code in switch.
    * hash.c (equal_hash): Handle NIL in the switch instead of nil test.
    * lib.c (code2type): Map new NIL type code to null. (typeof, typecheck): Code simplified. (class_check, car): Move nil test into switch. (eql, equal, consp, bignump, stringp, lazy_stringp, symbolp, functionp, vectorp, cobjp): Simplified. (length, sub, ref, refset, replace, obj_print, obj_pprint): Handle NIL in switch instead of nil test. goto hack removed from refset.
    * match.c (do_match_line, do_output_line): switch condition simplified.
    * regex.c (regexp): Simplified. (regex_nfa): Assert condition simplified.
* * regex.c (regsub): the replacement argument | Kaz Kylheku | 2012-03-13 | 1 | -1/+4
    can now be a function of one argument which maps the original piece of text matched by the regex to a replacement text.
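The idea of a replacement that is computed by a function can be sketched in plain C. This is only an illustration, not the code from regex.c: POSIX `regcomp`/`regexec` stand in for TXR's own regex engine, and the names `regsub_sketch`, `repl_fn` and `upcase` are hypothetical. The argument order (regex, replacement, string last) follows the neighbouring log entry about regsub's arguments; replacing every match rather than only the first is an assumption.

```c
#include <ctype.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback type: maps the matched text to a newly
   allocated replacement string, which the caller frees. */
typedef char *(*repl_fn)(const char *text, size_t len);

/* Append n bytes of src to a growing, null-terminated buffer. */
static void append(char **buf, size_t *len, const char *src, size_t n)
{
  char *p = realloc(*buf, *len + n + 1);
  if (p == NULL)
    abort();
  memcpy(p + *len, src, n);
  *len += n;
  p[*len] = 0;
  *buf = p;
}

/* Replace each match of pattern in str with fn(match).
   Returns a malloc'ed string, or NULL if the pattern is invalid. */
static char *regsub_sketch(const char *pattern, repl_fn fn, const char *str)
{
  regex_t re;
  regmatch_t m;
  char *out = NULL;
  size_t len = 0;
  int flags = 0;

  if (regcomp(&re, pattern, REG_EXTENDED) != 0)
    return NULL;
  append(&out, &len, "", 0);            /* start with an empty string */
  while (regexec(&re, str, 1, &m, flags) == 0) {
    size_t mlen = (size_t) (m.rm_eo - m.rm_so);
    char *rep = fn(str + m.rm_so, mlen);
    if (rep == NULL)
      abort();
    append(&out, &len, str, (size_t) m.rm_so);  /* text before the match */
    append(&out, &len, rep, strlen(rep));       /* computed replacement */
    free(rep);
    str += m.rm_eo;
    if (mlen == 0) {                    /* empty match: step forward */
      if (*str == 0)
        break;
      append(&out, &len, str, 1);
      str++;
    }
    flags = REG_NOTBOL;                 /* no longer at line start */
  }
  append(&out, &len, str, strlen(str)); /* unmatched tail */
  regfree(&re);
  return out;
}

/* Example replacement function: upper-case the matched text. */
static char *upcase(const char *text, size_t len)
{
  char *r = malloc(len + 1);
  size_t i;
  if (r == NULL)
    abort();
  for (i = 0; i < len; i++)
    r[i] = (char) toupper((unsigned char) text[i]);
  r[len] = 0;
  return r;
}
```

Calling `regsub_sketch("[a-z]+", upcase, "ab 12 cd")` upper-cases each lowercase run and leaves the rest of the string alone.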
* Bug #35718. Workaround good enough to get some code working. | Kaz Kylheku | 2012-03-04 | 1 | -1/+1
    * eval.c (cons_find): New function. (expand_op): Use cons_find rather than tree_find to look for rest_gensym.
    * regex.c (regsub): Rearranged arguments so that the string is last. This is better for partial evaluation via the op operator.
    * regex.h (regsub): Updated declaration.
* * eval.c (eval_init): New intrinsic function, regsub. | Kaz Kylheku | 2012-03-04 | 1 | -0/+28
    * regex.c (regsub): New function.
    * regex.h (regsub): Declared.
    * txr.1: Doc stub added.
* * arith.c: Updated copyright year. | Kaz Kylheku | 2012-02-25 | 1 | -1/+1
    * arith.h: Likewise.
    * debug.c: Added copyright header.
    * debug.h: Updated copyright year.
    * eval.c: Likewise.
    * eval.h: Likewise.
    * filter.c: Likewise.
    * filter.h: Likewise.
    * gc.c: Likewise.
    * gc.h: Likewise.
    * hash.c: Likewise.
    * hash.h: Likewise.
    * lib.c: Likewise.
    * lib.h: Likewise.
    * match.c: Likewise.
    * match.h: Likewise.
    * parser.h: Likewise.
    * regex.c: Likewise.
    * regex.h: Likewise.
    * stream.c: Likewise.
    * stream.h: Likewise.
    * txr.c: Likewise, and e-mail address.
    * txr.h: Updated copyright year.
    * unwind.c: Likewise.
    * unwind.h: Likewise.
* We don't include headers in headers in this project. | Kaz Kylheku | 2011-10-30 | 1 | -0/+1
    * parser.h: Do not include <stdio.h>
    * regex.c: Include <limits.h>
    * regex.h: Do not include <limits.h>
* Improved support for broken unicode. | Kaz Kylheku | 2011-10-10 | 1 | -1/+38
    Regex support for extra-large character sets not compiled in if wchar_t is not wide enough for it. The utf-8 properly throws exceptions when encountering characters that it cannot represent, instead of silently ignoring the situation and continuing with incorrectly computed data.
    * regex.c (FULL_UNICODE): New macro. (CHAR_SET_L3, CHAR_SET_L2_LO, CHAR_SET_L2_HI): Only defined if full unicode is available. (CHSET_XLARGE, cset_L3_t, struct xlarge_char_set, L2_full, L3_fill_range, L3_contains): Ditto. (union char_set): Member x1 present only under FULL_UNICODE. (char_set_destroy, char_set_add, char_set_add_range, char_set_contains): CHSET_XLARGE cases only available on FULL_UNICODE. (char_set_compile): Default cst variable to CHSET_LARGE.
    * utf8.c (FULL_UNICODE): New macro. (conversion_error): New function. (utf8_from_uc): Throw error if not FULL_UNICODE and character is outside the BMP. (utf8_decode): Likewise.
* * LICENSE, Makefile, configure, filter.c, filter.h, gc.c, gc.h, hash.c, | Kaz Kylheku | 2011-10-04 | 1 | -1/+1
    hash.h, lib.c, lib.h, match.c, match.h, parser.h, parser.l, parser.y, regex.c, regex.h, stream.c, stream.h, txr.1, txr.c, txr.h, unwind.c, unwind.h, utf8.c, utf8.h: Updated e-mail address.
* * LICENSE, Makefile, configure, gc.c, gc.h, hash.c, hash.h, lib.c, | Kaz Kylheku | 2011-09-23 | 1 | -1/+1
    lib.h, match.c, match.h, parser.h, parser.l, parser.y, regex.c, regex.h, stream.c, stream.h, txr.1, txr.c, txr.h, unwind.c, unwind.h, utf8.c, utf8.h: Updated copyright year.
* Bump copyrights to 2010. | Kaz Kylheku | 2010-10-05 | 1 | -1/+1
* Fix inaccurate comment. | Kaz Kylheku | 2010-01-26 | 1 | -4/+4
* Optimization in derivative-based regex engine. | Kaz Kylheku | 2010-01-26 | 1 | -1/+54
    Exponential memory consumption behavior was observed when matching the input aaaaaa.... against the regex a?a?a?a?....aaaa.... The fix is to eliminate common subexpressions from the derivative for the or operator.
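A minimal sketch of the idea, assuming a textbook Brzozowski-derivative matcher rather than the data structures actually used in regex.c: when the derivative of an alternation is built, a branch that is structurally equal to one already present is dropped, so repeated `a?` prefixes cannot multiply identical subexpressions. All names here (`re_t`, `alt`, `deriv`, `matches`, and so on) are hypothetical, and nodes are deliberately never freed to keep the example short.

```c
#include <stdlib.h>

typedef enum { NONE, EMPTY, CHR, CAT, OR, STAR } kind_t;

typedef struct re {
  kind_t kind;
  int ch;                     /* used by CHR only */
  const struct re *a, *b;     /* operands; b is unused by STAR */
} re_t;

static const re_t none = { NONE, 0, NULL, NULL };   /* matches nothing */
static const re_t empty = { EMPTY, 0, NULL, NULL }; /* matches "" only */

static const re_t *mk(kind_t kind, int ch, const re_t *a, const re_t *b)
{
  re_t *r = malloc(sizeof *r);  /* leaked on purpose in this sketch */
  if (r == NULL)
    abort();
  r->kind = kind; r->ch = ch; r->a = a; r->b = b;
  return r;
}

/* Structural equality of two expressions. */
static int same(const re_t *x, const re_t *y)
{
  if (x == y) return 1;
  if (x->kind != y->kind) return 0;
  switch (x->kind) {
  case NONE: case EMPTY: return 1;
  case CHR: return x->ch == y->ch;
  case STAR: return same(x->a, y->a);
  default: return same(x->a, y->a) && same(x->b, y->b);
  }
}

static const re_t *chr(int c) { return mk(CHR, c, NULL, NULL); }
static const re_t *star(const re_t *a) { return mk(STAR, 0, a, NULL); }

static const re_t *cat(const re_t *a, const re_t *b)
{
  if (a->kind == NONE || b->kind == NONE) return &none;
  if (a->kind == EMPTY) return b;
  if (b->kind == EMPTY) return a;
  return mk(CAT, 0, a, b);
}

/* Is y already one of the branches of the alternation tree x? */
static int or_has(const re_t *x, const re_t *y)
{
  if (x->kind == OR) return or_has(x->a, y) || or_has(x->b, y);
  return same(x, y);
}

/* Alternation that drops empty-language and duplicate branches. */
static const re_t *alt(const re_t *a, const re_t *b)
{
  if (a->kind == NONE) return b;
  if (b->kind == NONE) return a;
  if (b->kind == OR) return alt(alt(a, b->a), b->b);  /* flatten */
  if (or_has(a, b)) return a;                         /* common subexpression */
  return mk(OR, 0, a, b);
}

static const re_t *opt(const re_t *a) { return alt(&empty, a); }

/* Does r match the empty string? */
static int nullable(const re_t *r)
{
  switch (r->kind) {
  case EMPTY: case STAR: return 1;
  case CAT: return nullable(r->a) && nullable(r->b);
  case OR: return nullable(r->a) || nullable(r->b);
  default: return 0;
  }
}

/* Derivative of r with respect to the character c. */
static const re_t *deriv(const re_t *r, int c)
{
  switch (r->kind) {
  case CHR: return r->ch == c ? &empty : &none;
  case CAT: {
    const re_t *d = cat(deriv(r->a, c), r->b);
    return nullable(r->a) ? alt(d, deriv(r->b, c)) : d;
  }
  case OR: return alt(deriv(r->a, c), deriv(r->b, c));
  case STAR: return cat(deriv(r->a, c), r);
  default: return &none;
  }
}

/* Whole-string match: derive by each character, then test nullability. */
static int matches(const re_t *r, const char *s)
{
  for (; *s; s++)
    r = deriv(r, (unsigned char) *s);
  return nullable(r);
}
```

With `a = chr('a')`, the expression `cat(opt(a), cat(opt(a), cat(a, a)))` models `a?a?aa`, and `matches` accepts exactly two to four `a` characters.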
* * regex.c (reg_derivative_list, reg_derivative): Recognition | Kaz Kylheku | 2010-01-18 | 1 | -6/+29
    of cases to reduce consing. In reg_derivative_list, we avoid consing the full or expression if either branch is t, and also save a cons when the first element has a null derivative. In reg_derivative the oneplus and zeroplus cases are split, since zeroplus can re-use the input expression, when it's just a one-character match, deriving nil.
* Adjust semantics of non-greedy operator R%S, to avoid the broken | Kaz Kylheku | 2010-01-18 | 1 | -3/+9
    case whereby R%S matches nothing at all when S is not empty but equivalent to empty, or more generally when S is nullable. A much nicer definition is ``the intersection of R* and the set of all strings that do not contain a non-empty substring that matches S, followed by S''.
* Implemented non-greedy operator. | Kaz Kylheku | 2010-01-15 | 1 | -1/+20
* * regex.c (reg_derivative_list): Bugfix: wrong algebra, | Kaz Kylheku | 2010-01-15 | 1 | -1/+1
    taking a double derivative of the first item.
* * regex.c (reg_derivative): Bugfix: remove invalid | Kaz Kylheku | 2010-01-14 | 1 | -9/+1
    algebraic reductions in the derivative for the operator.
* Dynamically determine which regex implementation to use: | Kaz Kylheku | 2010-01-13 | 1 | -2/+30
    NFA or derivatives. The default behavior is NFA, with derivatives used if the regular expression contains uses of complement or intersection. The --dv-regex option forces derivatives always.
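The selection rule described in this entry can be sketched as a walk over the regex syntax tree. The node layout and every name below (`rx_node_t`, `choose_engine`, `force_derivatives`) are hypothetical illustrations, not the types used in regex.c; `force_derivatives` stands in for the effect of the --dv-regex option.

```c
#include <stddef.h>

/* Hypothetical syntax-tree operators; RX_COMPL and RX_AND are the
   extended operators (complement, intersection). */
typedef enum { RX_CHAR, RX_CAT, RX_OR, RX_STAR, RX_COMPL, RX_AND } rx_op_t;

typedef struct rx_node {
  rx_op_t op;
  const struct rx_node *left, *right;   /* NULL where unused */
} rx_node_t;

typedef enum { ENGINE_NFA, ENGINE_DERIV } engine_t;

/* True if complement or intersection occurs anywhere in the tree. */
static int uses_extended_ops(const rx_node_t *n)
{
  if (n == NULL)
    return 0;
  if (n->op == RX_COMPL || n->op == RX_AND)
    return 1;
  return uses_extended_ops(n->left) || uses_extended_ops(n->right);
}

/* NFA by default; derivatives when forced or when the regex needs them. */
static engine_t choose_engine(const rx_node_t *n, int force_derivatives)
{
  if (force_derivatives || uses_extended_ops(n))
    return ENGINE_DERIV;
  return ENGINE_NFA;
}
```

Deciding once, at compile time, keeps the matching loop free of per-character checks on which engine is in use.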
* Implement derivative-based regular expressions. | Kaz Kylheku | 2010-01-13 | 1 | -248/+557
* Remove incorrect implementation of extended | Kaz Kylheku | 2010-01-06 | 1 | -273/+32
    regex operations (complement, intersection). The syntax extensions documentation are retained.
* Implemented the regular expression ~ and & operators. | Kaz Kylheku | 2010-01-05 | 1 | -32/+273
    This turns out to be easy to do in NFA land. The complement of an NFA has exactly the same number and configuration of states and transitions, except that the states have an inverted meaning; and furthermore, failed character transitions are routed to an extra state (which in this implementation is permanently allocated and shared by all regexes). The regex & is implemented trivially using DeMorgan's.
    Also, bugfix: regular expressions like A|B|C are allowed now by the syntax, rather than constituting syntax error. Previously, this would have been entered as (A|B)|C.
* All COBJ operations have default implementations now; | Kaz Kylheku | 2009-12-08 | 1 | -6/+5
    no null pointer check over struct cobj_ops operations. New typechecking function for COBJ objects.
* Eliminate the void * disease. Generic pointers are of mem_t * | Kaz Kylheku | 2009-12-04 | 1 | -1/+1
    from now on, which is compatible with unsigned char *. No implicit conversion to or from this type, in C or C++.
* Code cleanup. All private functions static. Private stuff | Kaz Kylheku | 2009-11-28 | 1 | -36/+136
    in regex module not exposed in header. Etc.
* Changes to make the code portable to C++ compilers, which | Kaz Kylheku | 2009-11-24 | 1 | -9/+9
    can be taken advantage of for better diagnostics.
* Renaming global variables that denote symbols, such that they | Kaz Kylheku | 2009-11-24 | 1 | -16/+16
    have a _s suffix.
* Improving portability. It is no longer assumed that pointers | Kaz Kylheku | 2009-11-23 | 1 | -5/+6
    can be converted to a type long and vice versa. The configure script tries to detect the appropriate type to use. Also, some run-time checking is performed in the streams module to detect which conversion specifier strings to use for printing numbers.
* Introducing symbol packages. Internal symbols are now in | Kaz Kylheku | 2009-11-21 | 1 | -1/+2
    a system package instead of being hacked with the $ prefix. Keyword symbols are provided. In the matcher, evaluation is tightened up. Keywords, nil and t are not bindable, and errors are thrown if attempts are made to bind them. Destructuring in dest_bind is strict in the number of items. String streams are exploited to print bindings to objects that are not strings or characters. Numerous bugfixes.
* Changing ``obj_t *'' occurrences to a ``val'' typedef. (Ideally, | Kaz Kylheku | 2009-11-20 | 1 | -22/+22
    we wouldn't have to declare object variables at all, so why use an obtuse syntax to do so?)
* Following-up on diagnostics obtained by running code through C++ | Kaz Kylheku | 2009-11-18 | 1 | -8/+8
    compiler. Idea: allocator functions return char * instead of void *, like malloc did in classic pre-ANSI C. That way we are forced to use a cast except when the target pointer is char * already.
* Warning fixes. | Kaz Kylheku | 2009-11-17 | 1 | -1/+1
* * regex.c (nfa_all_states, nfa_closure): visited parameter | Kaz Kylheku | 2009-11-17 | 1 | -2/+2
    should be unsigned.
* Regular expression module updated to do unicode character sets. | Kaz Kylheku | 2009-11-12 | 1 | -49/+433
    Most of the changes are in the area of representing sets. Also, a bug was found in the compilation of regex character sets: ranges straddling two adjacent blocks of 32 characters were not being added to the character set. However, ranges falling within a single 32 block, or spanning three or more such blocks, worked properly. This bug is not tickled by common ranges such as A-Z, or 0-9, which land within a 32 block.
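The three cases named in this entry (a range within one block, across two adjacent blocks, or across three or more) can be illustrated with a small bitmap set. This is a sketch under assumptions, not the character set code from regex.c: the type `bitset_t`, the 256-character limit and the function names are all hypothetical. The point is that the first and last blocks receive partial masks and only the blocks strictly between them are filled, so the two-adjacent-blocks case simply runs the fill loop zero times.

```c
#include <assert.h>
#include <stdint.h>

#define WORD_BITS 32u   /* characters per block */
#define SET_BITS 256u   /* hypothetical small set: characters 0..255 */

typedef struct {
  uint32_t word[SET_BITS / WORD_BITS];
} bitset_t;

/* Add the inclusive range lo..hi to the set. */
static void bitset_add_range(bitset_t *s, unsigned lo, unsigned hi)
{
  unsigned wlo = lo / WORD_BITS, whi = hi / WORD_BITS, w;
  /* Bits from lo's position upward within its block. */
  uint32_t mlo = UINT32_MAX << (lo % WORD_BITS);
  /* Bits from hi's position downward within its block. */
  uint32_t mhi = UINT32_MAX >> (WORD_BITS - 1u - hi % WORD_BITS);

  assert(lo <= hi && hi < SET_BITS);

  if (wlo == whi) {             /* range inside a single block */
    s->word[wlo] |= mlo & mhi;
    return;
  }
  s->word[wlo] |= mlo;          /* partial first block */
  for (w = wlo + 1; w < whi; w++)
    s->word[w] = UINT32_MAX;    /* full middle blocks, possibly none */
  s->word[whi] |= mhi;          /* partial last block */
}

/* Membership test for a single character code. */
static int bitset_contains(const bitset_t *s, unsigned ch)
{
  return ch < SET_BITS
         && ((s->word[ch / WORD_BITS] >> (ch % WORD_BITS)) & 1u) != 0;
}
```

A range such as 30..33 exercises the straddling case: characters 30 and 31 land in block 0, and 32 and 33 in block 1.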
* Big conversion to wide characters and UTF-8 support. | Kaz Kylheku | 2009-11-11 | 1 | -3/+3
    This is incomplete. There are too many dependencies on wide character support from the C stream I/O library, and implicit use of some encoding which may not be UTF-8. The regex code does not handle wide characters properly. Character type is still int in some places, rather than wchar_t. Test suite passes though.
* Version 019 (tag txr-019) | Kaz Kylheku | 2009-11-03 | 1 | -7/+7
    Regexps can be bound to variables. New freeform directive.
* Got regex working over lazy strings from freeform. | Kaz Kylheku | 2009-11-02 | 1 | -25/+82
    Bugfixes.
* Start of implementation for freestyle matching. | Kaz Kylheku | 2009-11-02 | 1 | -0/+76
    Lazy strings implemented, incompletely. Changed string function to implicitly strdup; non-strdup version changed to string_own. Fixed wrong uses of strdup rather than chk_strdup. Functions added to regex module to provide regex matching as a state machine to which characters are fed.
* Trivial change allows regexps to be bound to variables, | Kaz Kylheku | 2009-10-30 | 1 | -0/+5
    and used for matching. This Just Works because of the way match_line treats variables.
* txr-015 2009-10-15 (tag txr-015) | Kaz Kylheku | 2017-07-31 | 1 | -7/+10
* txr-011 2009-09-25 (tag txr-011) | Kaz Kylheku | 2017-07-31 | 1 | -0/+631