summaryrefslogtreecommitdiffstats
path: root/regex.c
Commit message (Collapse)AuthorAgeFilesLines
* Copyright year bump 2018.Kaz Kylheku2018-02-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * LICENSE, LICENSE-CYG, METALICENSE, Makefile, args.c, args.h, arith.c, arith.h, buf.c, buf.h, cadr.c, cadr.h, combi.c, combi.h, configure, debug.c, debug.h, eval.c, eval.h, ffi.c, ffi.h, filter.c, filter.h, ftw.c, ftw.h, gc.c, gc.h, glob.c, glob.h, hash.c, hash.h, itypes.c, itypes.h, jmp.S, lib.c, lib.h, lisplib.c, lisplib.h, match.c, match.h, parser.c, parser.h, parser.l, parser.y, protsym.c, rand.c, rand.h, regex.c, regex.h, share/txr/stdlib/awk.tl, share/txr/stdlib/build.tl, share/txr/stdlib/cadr.tl, share/txr/stdlib/conv.tl, share/txr/stdlib/doloop.tl, share/txr/stdlib/error.tl, share/txr/stdlib/except.tl, share/txr/stdlib/ffi.tl, share/txr/stdlib/getopts.tl, share/txr/stdlib/getput.tl, share/txr/stdlib/hash.tl, share/txr/stdlib/ifa.tl, share/txr/stdlib/keyparams.tl, share/txr/stdlib/op.tl, share/txr/stdlib/package.tl, share/txr/stdlib/path-test.tl, share/txr/stdlib/place.tl, share/txr/stdlib/pmac.tl, share/txr/stdlib/socket.tl, share/txr/stdlib/stream-wrap.tl, share/txr/stdlib/struct.tl, share/txr/stdlib/tagbody.tl, share/txr/stdlib/termios.tl, share/txr/stdlib/txr-case.tl, share/txr/stdlib/type.tl, share/txr/stdlib/with-resources.tl, share/txr/stdlib/with-stream.tl, share/txr/stdlib/yield.tl, signal.c, signal.h, socket.c, socket.h, stream.c, stream.h, struct.c, struct.h, strudel.c, strudel.h, sysif.c, sysif.h, syslog.c, syslog.h, termios.c, termios.h, txr.1, txr.c, txr.h, unwind.c, unwind.h, utf8.c, utf8.h, win/cleansvg.txr: Extended Copyright line to 2018.
* regex: bugfix: squash duplicates in move set.Kaz Kylheku2017-09-131-2/+1
| | | | | | | | * regex.c (nfa_move_closure): The move set calculation is wrongly assuming that all of the states are new and not testing their visited color. This could result in the same state being added twice. Though harmless, it wastefully inflates the set size.
* regex: factor out repeated visit-coloring pattern.Kaz Kylheku2017-09-131-13/+15
| | | | | | | * regex.c (nfa_test_set_visited): New inline function. (nfa_map_states, nfa_thread_epsilons, nfa_closure, nfa_move_closure): Use function instead of coding pattern which tests the state and sets the visited member.
* regex: re-introduce nfa_accept states.Kaz Kylheku2017-09-131-13/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The nfa_accept state label is re-introduced. This state type has the same representation as nfa_empty; essentially, this replaces the flag. This makes the state type smaller, and we don't have to access the flag to tell if a state is an acceptance state. * regex.c (nfa_kind_t): New enum label, nfa_accept. (struct nfa_state_empty): Member accept removed. (nfa_accept_state_p): Macro tests only for nfa_accept type. (nfa_empty_state_p): New macro. (nfa_state_accept): Set type of new state to nfa_accept; do not set accept flag. (nfa_state_empty): Do not set accept flag. (nfa_state_empty_convert): Do not clear accept flag. (nfa_map_states): Handle nfa_accept in switch, in the same case as nfa_empty. (nfa_thread_epsilons): Don't test for accept state in nfa_empty case; it would be always false now. Add nfa_accept case to switch which only arranges for a traversal of the two transitions. (Though these are expected to be null at the stage of the graph when this function is applied). (nfa_fold_accept): Switch type to nfa_accept rather than setting accept flag. (nfa_closure, nfa_move_closure): Use new macro for testing whether a state is empty.
* regex: retain unoptimized form for printing.Kaz Kylheku2017-09-121-5/+1
| | | | | | | regex.c (regex_compile): Take the source code to be the original code, rather than the version with AST-level optimizations and expansions related to the nongreedy operator.
* regex: bug printing #/abc(def|ghi)/Kaz Kylheku2017-09-121-1/+1
| | | | | | | | | | | | This was broken by the July 16 commit "regex: don't print superfluous parens around classes", 2411f779f47c441659720ad0ddcabf91df1d2529. * regex.c (print_rec): If an (or ...) appears as a compound element, it must be rendered in parentheses; or_s must be handled here just like and_s. Prior to the faulty commit, this was implicitly true because the logic was inverted and wasn't ruling out or_s.
* regex: accept-folding optimization.Kaz Kylheku2017-09-121-0/+23
| | | | | | | | | | | | In this optimization, we identify places in the NFA graph where empty states transition to accept states. We eliminate these transitions and turn the empty states into accept states. Accept states which thereby become unreachable from the start state are pruned away. * regex.c (nfa_fold_accept): New static function. (nfa_optimize): Add a pass which applies nfa_fold_accept to all states.
* regex: eliminate nfa_accept state type.Kaz Kylheku2017-09-121-20/+26
| | | | | | | | | | | | | | | | | | | | | | | | | Acceptance states are instead represented by adding an accept flag to nfa_empty states. This will support an optimization which eliminates accept states that are referenced only by empty states. * regex.c (nfa_kind_t): Enum member nfa_accept removed. (struct nfa_state_accept): Renamed to nfa_state_any. (struct nfa_state_empty): New member, accept. (union nfa_state): Member a changes from struct nfa_state_accept to struct nfa_state_any. (nfa_accept_state_p): New macro. (nfa_state_accept): Now makes an nfe_empty type state, with no transitions out and the accept flag set. (nfa_state_empty): Initialize accept flag to zero. (nfa_state_empty_convert): Set the accept flag to zero. (nfa_state_merge): Use new macro in assertion. (nfa_map_states): Remove nfa_accept switch case. (nfa_thread_epsilons): We must not eliminate an empty state which is an acceptance state, even if it meets the other conditions. (nfa_closure): Use a local variable to eliminate repetition of set[i] expression. Test for accept state using new macro. (nfa_move_closure): Test for accept state using new macro.
* regex: epsilon-threading optimization on NFA graph.Kaz Kylheku2017-09-121-1/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is something done for the sake of faster NFA simulation. I think it's glossed over in regex literature because it makes no difference in a NFA to DFA transformation. Fewer states in the graph means less work in the nfa_move_closure function at regex run time. In the NFA graph, there can occur empty states which are useless. These are states which hold only one epsilon transition. Those states can be replaced by whatever state their epsilon transition points to. The terminology is inspired by "jump threading" in code optimization, whereby if a branch takes place to an instruction which holds an unconditional branch, the original branch can be retargetted to go to the ultimate target. I.e. we turn: x e S0 ---> S1 ----> S2 where e denotes an epsilon transition, and x represents any kind of transition to: x S0 ---> S2 We can also eliminate empty states which have no transition; those can be replaced by a null pointer. I suspect these do not actually occur. * regex.c (nfa_thread_epsilons, nfa_noop, nfa_optimize): New static functions. (regex_compile): Invoke nfa_optimize on the results of nfa_compile_regex.
* regex: elide needless increments of visited counter.Kaz Kylheku2017-09-121-5/+5
| | | | | | | | | | | | | * regex.c (nfa_run): Start with the existing value of the visited counter and pre-increment it when calling nfa_closure and nfa_move_closure. Thus it is incremented only as many times as those functions are called: one fewer than before. (regex_machine_reset): regm->n.visited is already incremented, since 1 was added to the prior value; there is no need to increment it again. (nfa_handle_wraparound): More prudent approach here. If we get within 8 increments of wraparound, we fast forward to UINT_MAX and then 0.
* regex: bugfix: incorrect use of nfa_move_closure.Kaz Kylheku2017-09-121-1/+1
| | | | | | | | | | | | | | This goes back to November 2, 2009 commit, "Start of implementation for freestyle matching", 6191fbb2ca7a9ac339dd3994bdea8273ceb0a24d It is exposed if we perform an epsilon threading optimization on the NFA graph. * regex.c (regex_machine_feed): We must pass the new, incremented value of the visited counter, to nfa_move_closure. States have already been tagged with the old value before the call to regex_machine_feed, so we risk failing to visit some states and include them in the closure.
* regex: new function, regex-prefix-match.Kaz Kylheku2017-09-111-0/+48
| | | | | | | | | | | | | | | | | This new function allows a program to determine whether a given string is the prefix of any of the strings denoted by a regular expression; or, in alternative words, whether a given string is the prefix of a possibly longer string which matches a regular expression. * regex.c (regex_machine_infer_init_state): New static function. (regex_prefix_match): New function. (regex_init): regex-prefix-match intrinsic registered. regex.h (regex_prefix_match): Declared. * txr.1: Documented.
* regex: remove nfa_reject representation.Kaz Kylheku2017-09-111-40/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reject states are no longer explicitly represented in the NFA graph. Any transition that previously would have led to a reject state is just a null pointer now: no transition. This just means we have to check one place for a null pointer. This change fixes a flaw in nfa_closure: the function was propagating nfa_reject states to the closure set. When the input set consisted only of reject states (or rather exactly one reject state in that special case), it would add it to the output set and return 1, making it seem as if the regex is one that can potentially match something. * regex.c (nfa_kind_t): Enum member nfa_reject removed. (nfa_state_reject): Static function removed. (nfa_compile_regex): Compile a t regex (match nothing) to a NFA graph with no transition at all into any start state; effectively an empty graph with no state nodes. (nfa_map_states): Remove nfa_reject case from switch. Also, stylistic change; state pointers are tested as Booleans elsewhere rather than by comparison to null pointer. (nfa_count_states, nfa_handle_wraparound): Handle null state pointer. (nfa_move_closure): Test the mov transition for null; anything can potentially now have a null transition, not only epsilon nodes. (nfa_run, regex_machine_reset): Handle nfa.start being null indicating an empty NFA graph. In that case we don't have to calculate the closure; we just have an empty set of states and set nfa.nclos to zero. The visited disambiguation counter is irrelevant since there are no states to visit. (regex_machine_feed): Don't try to propagate visited counter stored in regex machine to empty NFA graph which has no states.
* regex: don't print superfluous parens around classes.Kaz Kylheku2017-07-161-2/+2
| | | | | | | | * regex.c (print_rec): Switching from negative tests to positive: print parens around elements of a compound which have a lower precedence. The previously incorrectly omitted set and cset remain unmentioned, so under this inversion of logic, they print without parentheses.
* cobj: rename poorly named default operation.Kaz Kylheku2017-05-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Renaming cobj_hash_op to cobj_eq_hash_op. This function is only appropriate to use with COBJ objects which use eq as their equal funtion. I've spotted one instance of an inappropriate use which have to be addressed by a different commit: the equal function is other than eq, but cobj_hash_op is used for the equal hash. * lib.h (cobj_hash_op): Declaration renamed to cobj_eq_hash_op. * hash.c (cobj_hash_op): Renamed to cobj_eq_hash_op. (hash_iter_ops): Refer to renamed cobj_hash_eq_op. * ffi.c (ffi_type_builtin_ops, ffi_type_struct_ops, ffi_type_ptr_ops, ffi-closure_ops, ffi_call_desc_ops): Likewise. * lib.c (cptr_ops): Likewise. * parser.c (parser_ops): Likewise. * rand.c (random_state_ops): Likewise. * regex.c (char_set_ops, regex_obj_ops): Likewise. * socket.c (dgram_strm_ops): Likewise. * stream.c (null_ops, stdio_ops, tail_ops, pipe_ops, dir_ops, string_in_ops, byte_in_ops, strlist_in_ops, string_out_ops, strlist_out_ops, cat_stream_ops, record_adapter_ops): Likewise. * struct.c (struct_type_ops): Likewise. * sysif.c (cptr_dl_ops): Likewise. * syslog.c (syslog_strm_ops): Likewise. * unwind.c (cont_ops): Likewise.
* Rename badly named default_bool_argKaz Kylheku2017-03-171-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * lib.h (default_bool_arg): Inline function renamed to default_null_arg. * eval.c (if_fun, pad, ginterate, giterate, range_star, range, constantp, macroexpand_1, macro_form_p, expand_with_free_refs, do_expand, eval_intrinsic, func_get_name, make_env_intrinsic): Follow rename. * arith.c (lognot): Likewise. * gc.c (gc_finalize): Likewise. * glob.c (glob_wrap): Likewise. * hash.c (group_reduce, gethash_n): Likewise. * lib.c (print, multi_sort, lazy_str, vector, iff, tok_str, split_str_keep, search_str, remove_if, val): Likewise. * match.c (match_fun): Likewise. * parser.c (lisp_parse_impl, regex_parse): Likewise. * rand.c (make_random_state): Likewise. * regex.c (read_until_match, search_regex, regex_compile): Likewise. * socket.c (sock_accept, sock_connect): Likewise. * stream.c (open_files_star, open_files, run, open_process, open_tail, get_string, record_adapter): Likewise. * struct.c (static_slot_ensure, static_slot_ens_rec, clear_struct, make_struct_type): Likewise. * sysif.c (exec_wrap, errno_wrap, cobj_ops_init): Likewise. * unwind.c (uw_capture_cont, uw_find_frames_impl): Likewise.
* Fix missing nao terminator in formatted printing.Kaz Kylheku2017-03-131-1/+1
| | | | | | | | | | | | | | * arith.c (trunc1, trunc, floorf, ceili): Add missing nao terminator to uw_throwf calls. * debug.c (debug): Missing nao terminator in format call. * eval.c (expand_opt_params_rec, me_equot): Missing nao terminator in eval_error call. * lib.c (use_sym): Missing nao in uw_throw call. * regex.c (reg_derivative): Missing nao in uw_throwf.
* Bump copyright year to 2017.Kaz Kylheku2017-01-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | * LICENSE, LICENSE-CYG, METALICENSE, Makefile, args.c, args.h, arith.c, arith.h, cadr.c, cadr.h, combi.c, combi.h, configure, debug.c, debug.h, eval.c, eval.h, filter.c, filter.h, ftw.c, ftw.h, gc.c, gc.h, glob.c, glob.h, hash.c, hash.h, jmp.S, lib.c, lib.h, lisplib.c, lisplib.h, match.c, match.h, parser.c, parser.h, parser.l, parser.y, rand.c, rand.h, regex.c, regex.h, signal.c, signal.h, stream.c, stream.h, struct.c, struct.h, sysif.c, sysif.h, syslog.c, syslog.h, termios.c, termios.h, txr.1, txr.c, txr.h, unwind.c, unwind.h, utf8.c, utf8.h, share/txr/stdlib/awk.tl, share/txr/stdlib/build.tl, share/txr/stdlib/cadr.tl, share/txr/stdlib/conv.tl, share/txr/stdlib/except.tl, share/txr/stdlib/getopts.tl, share/txr/stdlib/getput.tl, share/txr/stdlib/hash.tl, share/txr/stdlib/ifa.tl, share/txr/stdlib/package.tl, share/txr/stdlib/path-test.tl, share/txr/stdlib/place.tl, share/txr/stdlib/socket.tl, share/txr/stdlib/struct.tl, share/txr/stdlib/tagbody.tl, share/txr/stdlib/termios.tl, share/txr/stdlib/txr-case.tl, share/txr/stdlib/type.tl, share/txr/stdlib/with-resources.tl, share/txr/stdlib/with-stream.tl, share/txr/stdlib/yield.tl: Add 2017 to all copyright headers and strings.
* Fix inconsistency in regex-source.Kaz Kylheku2016-12-251-2/+8
| | | | | | | | | | | | | | | | If we compile the regex expression (compound "str*"), calling regex-source on the compiled regex object yields "str*". That, of course, is treated as regex character syntax if fed back to regex-compile, and the * becomes an operator. We want the source to be (compound "str*"). This happens because the AST optimizer reduces (compound X) -> X. * regex.c (regex_compile): If the optimized expression is just a character string atom S, then for the purposes of maintaining the source code, convert it to (compound S).
* Adding functions fr^$, fr^, fr$ and frr.Kaz Kylheku2016-12-011-0/+28
| | | | | | | | * regex.c (regex_range_full_fun, regex_range_left_fun, regex_range_right_fun, regex_range_search_fun): New functions. (regex_init): Register fr^$, fr^, fr$ and frr intrinsics. * txr.1: Documented.
* Add stream printing context.Kaz Kylheku2016-10-201-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is some infrastructure which will support *print-circle*. * lib.h (struct strm_ctx): Forward declared. (struct cobj_ops): Add context parameter to print function pointer. (cobj_print_op, obj_print_impl): Add context parameter to declarations. * hash.c (hash_print_op): Take context argument and pass it down in obj_print_impl calls. * lib.c (cobj_print_op, out_quasi_str): Likewise (obj_print_impl): Likewise, and also pass to COBJ print method. (obj_print, obj_pprint): Pass null pointer as context argument to obj_print_impl. * regex.c (regex_print): Take context parameter and ignore it. * socket.c (dgram_print): Likewise. * stream.h (struct strm_ctx): New struct type. (struct strm_base): New ctx member, pointer to struct strm_ctx. (stream_print_op): Add context parameter to declaration. (get_set_ctx, get_ctx): Declared. * stream.c (strm_base_init): Add null pointer to initializer. (strm_base_cleanup): Add assertion against context pointer being non-null: that indicates that some stream operation installed a context pointer and neglected to restore it to null before returning, which is bad because context will be stack allocated. (stream_print_op, stdio_stream_print, cat_stream_print): Take context parameter and ignore it. (get_set_ctx, get_ctx): New functions. * struct.c (struct_type_print): Take context parameter and ignore it. (struct_inst_print): Take context parameter and pass down to obj_print_impl.
* Support n-ary and and or operators in regex.Kaz Kylheku2016-10-101-1/+63
| | | | | | | | | | | Since much regex code assumes these are binary, the easiest and briefest approach is to implement a code transformation pass which rewrites n-ary forms into binary. * regex.c (reg_nary_unfold, reg_nary_to_bin): New functions. (regex_compile): Put raw sexp through reg_nary_to_bin to expand the nary syntax.
* Simplify some regex tree walking code.Kaz Kylheku2016-10-101-18/+10
| | | | | | | | | | * regex.c (reg_expand_nongreedy, reg_compile_csets): Generalize the compound_s case slightly by referring to sym rather than hard-coded compound_s. Then handle most of the regex operators under this same case. Their semantics are not relevant to the expansions being performed in these functions: all their arguments are regexes to be recursed over.
* New function rra.Kaz Kylheku2016-10-031-0/+50
| | | | | | | | | * regex.c (range_regex_all, regex_range_all): New functions. (regex_init): Register rra intrinsic function. * regex.c (range_regex_all, regex_range_all): Declared. * txr.1: Documented rra.
* New rr function.Kaz Kylheku2016-10-031-0/+12
| | | | | | | | | | * regex.c (regex_range_search): New function. (regex_init): Register regex_range_search as rr intrinsic. * regex.h (regex_range_search): Declared. * txr.1: Documented rr, and added reference to it in description of regex-range.
* search-regex improvement: negative start and more.Kaz Kylheku2016-10-031-40/+52
| | | | | | | | | | | | * regex.c (search_regex): Handle negative starting positions according to the convention elsewhere and fail excessively negative ones. Consistently fail on starting positions exceeding the length of the string. Handle zero length matches by reporting them against the start position or position one past the last character, based on the value of from-end. * txr.1: search-regex documentation updated.
* Synchronize license comments with LICENSE.Kaz Kylheku2016-10-011-16/+17
| | | | | | | | | | | | | | | | | | | | * Makefile, args.c, args.h, arith.c, arith.h, cadr.c, cadr.h, combi.c, combi.h, configure, debug.c, debug.h, eval.c, eval.h, filter.c, filter.h, ftw.c, ftw.h, gc.c, gc.h, glob.c, glob.h, hash.c, hash.h, jmp.S, lib.c, lib.h, lisplib.c, lisplib.h, match.c, match.h, parser.c, parser.h, parser.l, parser.y, rand.c, rand.h, regex.c, regex.h, share/txr/stdlib/awk.tl, share/txr/stdlib/build.tl, share/txr/stdlib/cadr.tl, share/txr/stdlib/conv.tl, share/txr/stdlib/except.tl, share/txr/stdlib/hash.tl, share/txr/stdlib/ifa.tl, share/txr/stdlib/path-test.tl, share/txr/stdlib/place.tl, share/txr/stdlib/socket.tl, share/txr/stdlib/struct.tl, share/txr/stdlib/termios.tl, share/txr/stdlib/txr-case.tl, share/txr/stdlib/type.tl, share/txr/stdlib/with-resources.tl, share/txr/stdlib/with-stream.tl, share/txr/stdlib/yield.tl, signal.c, signal.h, socket.c, socket.h, stream.c, stream.h, struct.c, struct.h, sysif.c, sysif.h, syslog.c, syslog.h, termios.c, termios.h, txr.1, txr.c, txr.h, unwind.c, unwind.h, utf8.c, utf8.h: Revert to verbatim 2-Clause BSD.
* Flurry of regex bugfixes.Kaz Kylheku2016-09-251-11/+27
| | | | | | | | | | | | | | | | | | | | | | | | | * regex.c (match_regex): Bail if pos is too positive, beyond length of string. (match_regex_right): Include the pos == end case in the iteration, so we can match an empty suffix of the string. The inner loop guard takes care of not feeding any characters from the string into the regex machine in this case; we just feed the terminating zero to get the final state. (match_regst): Normalize a negative pos, otherwise the sub_str calculation will be junk, since match_regex returns a normalized position. After normalizing, check that if the position is still negative, the match must fail. (match_regst_right_old, match_regst_right): Use zero rather than t as the range end in sub_str. That way if len is zero and neg(len) produces zero, an empty string will be sliced out. For negative values, the zero serves as one position beyond the last char, just like t. (do_match_full_offs, regex_match_full, regex_range_full, regex_range_left): Fail match if normalized starting pos is negative. (regex_range_right): Fix completely bogus calculation of the returne range in the case when the end position defaults to the string length.
* regex.c: code formatting.Kaz Kylheku2016-09-251-1/+1
| | | | * regex.c (puts_clear_flag): Fix bad indentation.
* New function: regex-source.Kaz Kylheku2016-09-251-0/+7
| | | | | | | | | * regex.c (regex_source): New function. (regex_init): regex-source intrinsic registered. * regex.h (regex_source): Declared. * txr.1: Documented.
* Bugfix in regex printing: & operator.Kaz Kylheku2016-09-251-1/+1
| | | | | * regex.c (print_rec): Fix checking arg1 for consp but accessing arg2.
* New regex functions: m^$, m^, m$, and others.Kaz Kylheku2016-09-231-0/+130
| | | | | | | | | | | | | | | | | | | * regex.c (do_match_full, do_match_full_offs, do_match_left, do_match_left_offs, do_match_right, do_match_right_offs): New static functions. (regex_match_full_fun, regex_match_right_fun, regex_match_full, regex_match_left, regex_match_right, regex_range_full, regex_range_left, regex_range_right): New functions. (regex_init): Register f^$, f^, f$, m^$, m^, m$, r^$, r^ and r$ intrinsics. * regex.h (regex_match_full_fun, regex_match_right_fun, regex_match_full, regex_match_left, regex_match_right, regex_range_full, regex_range_left, regex_range_right): Declared. * txr.1: Documented new functions.
* Semantics change in match-regex-right.Kaz Kylheku2016-09-221-3/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | The way the end-position argument works in match-regex-right and match-regst-right is poorly considered. It basically enforces a constraint that there is a match which ends at that position and does not go beyond. This patch changes it work right: the functions test that the regex matches up to that position, as if the string ended there. * regex.c (match_regex_right_old): New static function, identical to the previous match_regex_right. Since we won't ever be using this inside TXR from any other module, we don't make it external. (match_regex_right): Rewritten to new semantics. (match_regst_right_old): New static function; provides the semantics of the old match_regst_right based on match_regex_right_old. (regex_init): Register match-regex-right and match-regst-right intrinsics to the match_regex_right_old and match_regst_right_old functions if compatibility <= 150 is requested. Otherwise they go to the rewritten new functions. * txr.1: Documentation updated, and compat notes added.
* Fix match-regex not conforming to documentation.Kaz Kylheku2016-09-221-3/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The documentation says that match-regex returns the length. Actually, it returns the position after the last character matched. This makes a difference when the match doesn't begin at character zero. The actual behavior is that of the match_regex C function which has behaved that way since the dawn of TXR, and internals depend on it behaving that way. So the internal function is being retained, and a new function is being registered as the match-regex intrinsic. The choice of binding for match-regex is subject to the compatibility option. The behavior of match-regst is also being fixed since its return value is incorrect due to this issue. Since its return value makes no sense at all (does not represent the matched text), it is not subject to the compatibility option; it is just fixed to conform with the documentation. * regex.c (match_regex_len): New function. (match_regst): Keep using match_regex, but use its return value properly. This simplifies the range extraction code, which is why match_regex works that way in the first place. (regex_init): Register match-regex to match_regex_len, unless compatibility <= 150 is requested; then register to match_regex. * regex.h (match_regex_len): Declared. * txr.1: Compatibility notes added.
* Support functional argument in regsub.Kaz Kylheku2016-09-221-23/+41
| | | | | | | | | * regex.c (regsub): Allow the second argument to be a function, which is called with str as an argument, and returns a range which indicates what part of the string is to be replaced, or else nil. * txr.1: Documented functional argument of regsub.
* Support negative positions in regex matching funs.Kaz Kylheku2016-09-211-1/+9
| | | | | | | | | * regex.c (match_regex, match_regex_right): Detect a negative start or end position, respectively, and add the string length to it. If it is still negative, bail with nil. * txr.1: Documented.
* Move regex intrinsic registrations to regex.c.Kaz Kylheku2016-09-211-0/+14
| | | | | | | | * eval.c (eval_init): Remove all regex-related function registrations from here. * regex.c (regex_init): Move regex-related function registrations here.
* regex: optimize double complement.Kaz Kylheku2016-09-161-40/+46
| | | | * regex.c (reg_optimize): Implement ~~R -> R reduction.
* regex: add case to complement optimization.Kaz Kylheku2016-09-151-0/+2
| | | | | | | | * regex.c (reg_optimize): Based on the reasoning in the previous commit, we can also statically optimize a complement whose argument is the t regex: match nothing. We convert that to match everything: the .* regex. Now (regex-compile "~[]") -> #/.*/.
* regex: fix broken complement operator.Kaz Kylheku2016-09-151-1/+3
| | | | | | | | | | | | | | | | | | | The form (match-regex "xy" #/~ab/) should return 2 (full match) because "xy" is in the complement of the set { "ab" }. It wrongly returns 1. * regex.c (reg_derivative): Handle the case when the derivative of the complement's constituent expression yields nil. This means that the complemented regex matches the input. In this case, the complement must lapse to the .+ regex: match one or more characters. That is to say, if the input has at least one more character, there is a match, which covers all such characters. Otherwise there is no match: the input matches the complemented regex. In the t case, the return value is also wrong. If the complemented regex hits a brick wall (matches nothing, not even the empty string), the correct complement is "match everything": the .* regex. Not the match empty string regex!
* NFA regex optimization: use just one set array.Kaz Kylheku2016-07-191-48/+31
| | | | | | | | | | | | | | | | | | | | | | We don't have to flip between two arrays, since the nfa_closure and and nfa_move_closure can write the output set into the same array. * regex.c (struct nfa_machine): Replace flip and flop members with a single set. (nfa_closure, nfa_move_closure): out array parameter removed; in renamed to set. References to in and out simply replaced with set. (nfa_run): Allocate one set instead of two, plus the stack. Remove code to swap the two pointers on each iteration. (regex_machine_reset): Prepare initial closure in the one and only set array. (regex_machine_init): Allocate set array, rather than flip an flop. (regex_machine_cleanup): Free set array and null out pointer rather than flip and flop arrays. (regex_machine_feed): Pass just the set ot the nfa_move_closure function. Remove flip/flop pointer swapping
* NFA regex optimization: combine move and closure.Kaz Kylheku2016-07-191-37/+90
| | | | | | | | | | | | | | | | | | | | * regex.c (struct nfa_machine_t): Remove move and clos array pointers, replace with flip and flop. Remove nmove member. (nfa_move): Static function removed. (nfa_move_closure): New static function, based on nfa_move and logic from nfa_closure. (nfa_run): Use nfa_move_closure and flip between two arrays. (regex_machine_reset): Remove reference to nmove member in nfa_machine_t. Prepare initial closure in flip array. (regex_machine_init): Allocate flip and flop arrays, rather than removed move and clos. (regex_machine_cleanup): Free flip and flop arrays and zero out the pointers, rather than removed move and clos. (regex_machine_feed): Replace nfa_move and nfa_closure with combined nfa_move_closure from flip to flop, and exchange of flip and flop arrays.
* New --free-all option for freeing memory on exit.Kaz Kylheku2016-06-071-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although we are garbage-collected, being able to clean up on shutdown is nevertheless useful for uncovering leaks. Leaks can occur, for instance, due to neglect to free out-of-heap satellite data from objects that are reclaimed by gc. This feature is long overdue. * arith.c, arith.h (arith_free_all): New function. * gc.c, gc.h (gc_free_all): New function. * lib.c (init): Remove program name parameter and redundant initialization of progname globl variable. * lib.h (progname): Superfluous declaration removed. This is already declared in txr.h. (init): Declaration updated. * regex.c (char_set_destroy): Do not check the static allocation flag here; just destroy the object. Do check for a null pointer, though. (char_set_cobj_destroy): This cobj destructor now checks the static flag of the char set object and avoids freeing it. Thus our char set singletons are left alone by gc, but our global freeing function takes care of them. (wide_cs): New static variable moved out of wide_display_char_p to static scope. (regex_free_all): New function. * regex.h (regex_free_all): Declared. * txr.c (progname): const qualifier and initializer removed. (main): Ensure progname is always dynamically allocated, even in the argv[0] == 0 case. Do not pass progname to init; it doesn't take that argument any more. (free_all): New static function. (txr_main): Implement --free-all option. * txr.h (progname): Declaration updated.
* Some streamlining in the cons recycling.Kaz Kylheku2016-05-151-1/+1
| | | | | | | | | | | * lib.c (rcyc_pop): Just assume that *plist points to a cons and access the fields directly. (rcyc_cons): Don't bother with rplacd. (rcyc_list): Don't bother with set macro. * regex.c (read_until_match): Defensive coding: locally ensure that rcyc_pop won't be called on a nil stack, which will now segfault.
* Recycle conses in unget-char and read-until-match.Kaz Kylheku2016-04-201-3/+7
| | | | | | | | | | | | * regex.c (ead_until_match): Use rcyc_pop instead of pop to move the conses to the recycle list. We know these are not shared with anything. Adding additional logic to completely recycle the stack. * socket.c (dgram_get_char): Use rcyc_pop to get the character from the push-back list. * stream.c (stdio_get_char): Likewise.
* read-until-match can optionally keep matched text.Kaz Kylheku2016-04-201-21/+19
| | | | | | | | | | | | | | | | | | | | * regex.c (read_until_match): New argument, include_match. Three times repeated termination code refactored into block reached by forward goto. (regex_init): Registration of read-until-match updated. * regex.h (read_until_match): Declaration updated. * stream.c (struct record_adapter_base): New member, include_match. (record_adapter_get_line): Pass match to read_until_match as new argument. (record_adapater): New argument, include_match. (stream_init): Update registration of record-adapter. * stream.h (record_adapter): Declaration updated. * txr.1: Updated.
* Fix broken read_until_match.Kaz Kylheku2016-04-191-17/+51
| | | | | * regex.c (read_until_match): Completely rewrite broken, unsalvageable, garbage logic.
* Header file cleanup.Kaz Kylheku2016-01-221-1/+0
| | | | | | | * arith.c, cadr.c, debug.c, eval.c, filter.c, gencadr.txr, glob.c, hash.c, linenoise/linenoise.c, lisplib.c, match.c, parser.c, rand.c, regex.c, signal.c, stream.c, struct.c, sysif.c, syslog.c, txr.c, unwind.c, utf8.c: Remove unncessary header files.
* Regex printing not escaping [ and ].Kaz Kylheku2016-01-121-1/+2
| | | | | * regex.c (print_rec): Handle '[' and ']' in backslash-adding switch.
* Print control chars in regexes using \x.Kaz Kylheku2016-01-121-53/+70
| | | | | | | | | | | | | | | | | | | | * lib.c (out_str_char): Static function becomes extern. * lib.h (out_str_char): Declared. * regex.c (puts_clear_flag, putc_clear_flag): New static functions. (print_class_char): Take semicolon flag argument. Use out_str_char to render characters not escaped locally. Clear the semicolon flag. (paren_print_rec): Take semicolon flag argument, and pass it down. Clear it when printing parentheses. (print_rec): Take semicolon flag argument, and pass down to lower level functions. Use putc_clear_flag and puts_clear_flag instead of put_string and put_char. Use out_str_char for char object not esaped locally. (regex_print): define semi_flag and pass it down to print_rec.