| Commit message (Collapse) | Author | Age | Files | Lines |
|
to reflect that it has two arguments now.
* parser.y (grammar): Update calls to regex_compile to
pass two arguments. Since we don't expect regex_compile to
parse, we specify the error stream as nil.
(spec): The "secret syntax" for a regex is simplified
not to include the slashes. This provides better diagnostics for
unterminated syntax and requires less string processing to generate.
Also, the form returned doesn't have the regex symbol
consed onto it, which parse_regex just throws away.
* regex.c (regex_compile): Now takes a stream argument.
* regex.h (regex_compile): Declaration updated.
* txr.1: Updated.
|
* regex.h (regex_compile): Don't call argument
regex_sexp, since it can be a string.
* txr.1: Updated.
|
(char_set_addr_str): New function.
(char_set_compile): Use char_set_addr_str to
add spaces to set.
(init_special_char_sets): Use char_set_addr_str to
add spaces to set. Bugfix: word_cs, cword_cs wrongly initialized.
(regex_init): Removed reference to regex_space_chars.
|
(regterm): REGTOKEN production factored out to regtoken.
(regclass): Reverted prior commit's changes.
(regclassterm): Reverted prior commit, removing REGTOKEN
production for character classes, and introduced a regtoken
production. So now the keyword symbols are part of the
character class abstract syntax.
(regtoken): New production rule.
* regex.c (regex_space_chars): Converted to internal linkage.
(char_set_compile): Handle token keywords in character class
abstract syntax.
* regex.h (regex_space_chars): External declaration removed.
|
* lib.c (init): Call regex_init.
* parser.l: return new REGTOKEN kind.
* parser.y (REGTOKEN): New token type.
(REGTERM): Translate REGTERM to keyword.
(regclass): Restructured to handle inherited nodes as lists.
(regclassterm): Produce $$ as list. Add handling for REGTOKEN
occurring inside character class by expanding it. This might not
be the best approach.
(yybadtoken): Handle REGTOKEN in switch.
* regex.c (struct any_char_set, struct small_char_set,
struct displaced_char_set, struct large_char_set,
struct xlarge_char_set): New bitfield member, stat.
(char_set_create): New parameter for indicating static char set.
(char_set_destroy): Do not free a static char set.
(char_set_compile): Pass zero to new parameter of char_set_create.
(spaces): New static array.
(space_cs, digit_cs, word_cs, cspace_cs, cdigit_cs, cword_cs): New
static pointers to char_set_t.
(init_special_char_sets, nfa_compile_given_set): New static functions.
(nfa_compile_regex, dv_compile_regex): Handle new character set token
keywords.
(space_k, digit_k, word_char_k, cspace_k, cdigit_k, cword_char_k,
regex_space_chars): New variables.
(regex_init): New function.
* regex.h (space_k, digit_k, word_char_k, cspace_k, cdigit_k,
cword_char_k, regex_space_chars, regex_init): Declared.
|
character compounds. I.e. the syntax "foo" is equivalent to the
cumbersome canonical form (compound #\f #\o #\o).
* regex.c (nfa_compile_regex, dv_compile_regex): Use chrp function
instead of typeof. Handle stringp case by forming a compound out of the
characters and recursing. Check for some bad objects in the regex
that would never come out of our regex parser but could occur
in a "hand crafted" syntax tree.
|
* lib.c (obj_init): Change spelling of nongreedy operator and put
it into the user package so that it is available for use with
regex-compile.
* regex.c (match_regex, search_regex): Bugfix: optional start
position argument not defaulting to zero.
* txr.1: Documented regex-compile and regexp.
* txr.vim: Highlighting regex-compile and regexp.
|
simpler. A pseudo type code is introduced called NIL with value 0.
* lib.h (enum type): New enumeration value, NIL.
(type): Function accepts object nil and maps it to code NIL.
* eval.c (dwim_loc, op_dwim): test for nil obj and goto hack is gone,
just handle NIL in the switch.
* gc.c (make_obj, mark): Handle new NIL type code in switch.
* hash.c (equal_hash): Handle NIL in the switch instead of nil test.
* lib.c (code2type): Map new NIL type code to null.
(typeof, typecheck): Code simplified.
(class_check, car): Move nil test into switch.
(eql, equal, consp, bignump, stringp, lazy_stringp,
symbolp, functionp, vectorp, cobjp): Simplified.
(length, sub, ref, refset, replace, obj_print, obj_pprint): Handle NIL
in switch instead of nil test. goto hack removed from refset.
* match.c (do_match_line, do_output_line): switch condition simplified.
* regex.c (regexp): Simplified.
(regex_nfa): Assert condition simplified.
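
The effect of the NIL pseudo type code can be sketched in C roughly as follows. This is a minimal illustration, not the project's actual code: the `obj_t` layout, the `NUM` and `STR` codes and the `type_name` helper are assumptions made for the example. Only the idea that `type` maps the nil object to the code `NIL` (value 0), which is then reported as null, comes from the entry above.

```c
#include <stddef.h>

/* Hypothetical type codes; only NIL = 0 is taken from the commit message. */
typedef enum { NIL = 0, NUM, STR } type_t;

/* Hypothetical object header: every heap object starts with its type code. */
typedef struct obj { type_t type; } obj_t;

#define nil ((obj_t *) 0)

/* Maps the nil object to the pseudo type code NIL instead of dereferencing it. */
static type_t type(const obj_t *obj)
{
    return obj ? obj->type : NIL;
}

/* With NIL as an ordinary code, callers need no separate nil test before the switch. */
static const char *type_name(const obj_t *obj)
{
    switch (type(obj)) {
    case NIL: return "null";
    case NUM: return "num";
    case STR: return "str";
    }
    return "unknown";
}
```

Under these assumptions, `type_name(nil)` goes through the same switch as any other object, which is the simplification the entry describes.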
|
can now be a function of one argument which maps
the original piece of text matched by the regex
to a replacement text.
|
* eval.c (cons_find): New function.
(expand_op): Use cons_find rather than tree_find to look for
rest_gensym.
* regex.c (regsub): Rearranged arguments so that the string
is last. This is better for partial evaluation via the op
operator.
* regex.h (regsub): Updated declaration.
|
* regex.c (regsub): New function.
* regex.h (regsub): Declared.
* txr.1: Doc stub added.
|
* arith.h: Likewise.
* debug.c: Added copyright header.
* debug.h: Updated copyright year.
* eval.c: Likewise.
* eval.h: Likewise.
* filter.c: Likewise.
* filter.h: Likewise.
* gc.c: Likewise.
* gc.h: Likewise.
* hash.c: Likewise.
* hash.h: Likewise.
* lib.c: Likewise.
* lib.h: Likewise.
* match.c: Likewise.
* match.h: Likewise.
* parser.h: Likewise.
* regex.c: Likewise.
* regex.h: Likewise.
* stream.c: Likewise.
* stream.h: Likewise.
* txr.c: Likewise, and e-mail address.
* txr.h: Updated copyright year.
* unwind.c: Likewise.
* unwind.h: Likewise.
|
* parser.h: Do not include <stdio.h>
* regex.c: Include <limits.h>
* regex.h: Do not include <limits.h>
|
Regex support for extra-large character sets not compiled in
if wchar_t is not wide enough for it.
The UTF-8 code now properly throws exceptions when encountering characters
that it cannot represent, instead of silently ignoring the
situation and continuing with incorrectly computed data.
* regex.c (FULL_UNICODE): New macro.
(CHAR_SET_L3, CHAR_SET_L2_LO, CHAR_SET_L2_HI): Only defined
if full Unicode is available.
(CHSET_XLARGE, cset_L3_t, struct xlarge_char_set,
L2_full, L3_fill_range, L3_contains): Ditto.
(union char_set): Member x1 present only under FULL_UNICODE.
(char_set_destroy, char_set_add, char_set_add_range,
char_set_contains): CHSET_XLARGE cases only available on
FULL_UNICODE.
(char_set_compile): Default cst variable to CHSET_LARGE.
* utf8.c (FULL_UNICODE): New macro.
(conversion_error): New function.
(utf8_from_uc): Throw error if not FULL_UNICODE and character is
outside the BMP.
(utf8_decode): Likewise.
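
A compile-time switch of this kind might look like the sketch below. How the project actually defines FULL_UNICODE is not shown in the entry; testing `WCHAR_MAX` is an assumption, and `representable` is a hypothetical helper standing in for the check that precedes the conversion error.

```c
#include <wchar.h>   /* WCHAR_MAX */

/* Assumption: full Unicode support is available when wchar_t is wider than 16 bits. */
#if WCHAR_MAX > 65535
#define FULL_UNICODE
#endif

/* Returns 1 if code point cp can be held in one wchar_t on this platform.
   A caller would report a conversion error when this returns 0. */
static int representable(unsigned long cp)
{
    if (cp > 0x10FFFFul)
        return 0;               /* beyond the Unicode code space */
#ifdef FULL_UNICODE
    return 1;
#else
    return cp <= 0xFFFFul;      /* only the BMP fits in a 16-bit wchar_t */
#endif
}
```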
|
hash.h, lib.c, lib.h, match.c, match.h, parser.h, parser.l, parser.y,
regex.c, regex.h, stream.c, stream.h, txr.1, txr.c, txr.h, unwind.c,
unwind.h, utf8.c, utf8.h: Updated e-mail address.
|
lib.h, match.c, match.h, parser.h, parser.l, parser.y, regex.c,
regex.h, stream.c, stream.h, txr.1, txr.c, txr.h, unwind.c, unwind.h,
utf8.c, utf8.h: Updated copyright year.
|
Exponential memory consumption behavior was observed when
matching the input aaaaaa....
against the regex a?a?a?a?....aaaa....
The fix is to eliminate common subexpressions
from the derivative for the or operator.
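
The idea can be illustrated with a small Brzozowski-style derivative matcher. This is a sketch under stated assumptions, not the project's regex.c: the node representation and all names (`re_t`, `mk_or`, `deriv`, `pathological`) are invented for the example, nodes are never freed, and duplicates are detected by a plain structural comparison. The point is the `mk_or` constructor, which flattens nested alternatives and drops any alternative already present.

```c
#include <stdlib.h>

typedef enum { R_NONE, R_EPS, R_CHR, R_SEQ, R_OR } kind_t;
typedef struct re { kind_t k; char c; const struct re *a, *b; } re_t;

static const re_t none = { R_NONE, 0, 0, 0 };  /* matches nothing */
static const re_t eps = { R_EPS, 0, 0, 0 };    /* matches the empty string */

/* Allocates a node; never freed, to keep the sketch short. */
static const re_t *mk(kind_t k, char c, const re_t *a, const re_t *b)
{
    re_t *r = (re_t *) malloc(sizeof *r);
    if (r == 0) abort();
    r->k = k; r->c = c; r->a = a; r->b = b;
    return r;
}

/* Structural equality of two expressions. */
static int same(const re_t *x, const re_t *y)
{
    if (x == y) return 1;
    if (x->k != y->k) return 0;
    if (x->k == R_CHR) return x->c == y->c;
    if (x->k == R_SEQ || x->k == R_OR)
        return same(x->a, y->a) && same(x->b, y->b);
    return 1;  /* R_NONE or R_EPS */
}

static int nullable(const re_t *r)
{
    switch (r->k) {
    case R_EPS: return 1;
    case R_SEQ: return nullable(r->a) && nullable(r->b);
    case R_OR:  return nullable(r->a) || nullable(r->b);
    default:    return 0;
    }
}

static const re_t *mk_seq(const re_t *a, const re_t *b)
{
    if (a->k == R_NONE || b->k == R_NONE) return &none;
    if (a->k == R_EPS) return b;
    if (b->k == R_EPS) return a;
    return mk(R_SEQ, 0, a, b);
}

/* Does the left-nested chain of alternatives already contain x? */
static int contains(const re_t *alts, const re_t *x)
{
    for (; alts->k == R_OR; alts = alts->a)
        if (same(alts->b, x)) return 1;
    return same(alts, x);
}

/* Builds a|b, flattening b and dropping alternatives already present in a. */
static const re_t *mk_or(const re_t *a, const re_t *b)
{
    if (b->k == R_OR) return mk_or(mk_or(a, b->a), b->b);
    if (a->k == R_NONE) return b;
    if (b->k == R_NONE || contains(a, b)) return a;
    return mk(R_OR, 0, a, b);
}

/* Derivative of r with respect to character c. */
static const re_t *deriv(const re_t *r, char c)
{
    switch (r->k) {
    case R_CHR: return r->c == c ? &eps : &none;
    case R_SEQ: {
        const re_t *d = mk_seq(deriv(r->a, c), r->b);
        return nullable(r->a) ? mk_or(d, deriv(r->b, c)) : d;
    }
    case R_OR:  return mk_or(deriv(r->a, c), deriv(r->b, c));
    default:    return &none;
    }
}

static int matches(const re_t *r, const char *s)
{
    for (; *s; s++) r = deriv(r, *s);
    return nullable(r);
}

/* Builds a?a?...a? (n times) followed by aa...a (n times). */
static const re_t *pathological(int n)
{
    const re_t *a = mk(R_CHR, 'a', 0, 0), *r = &eps;
    int i;
    for (i = 0; i < n; i++) r = mk_seq(a, r);
    for (i = 0; i < n; i++) r = mk_seq(mk_or(&eps, a), r);
    return r;
}
```

In this sketch every derivative of the pathological regex is an alternation of suffixes of the original expression, so once duplicates are dropped the number of alternatives is limited by the number of distinct suffixes rather than multiplying with each input character.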
|
of cases to reduce consing. In reg_derivative_list, we avoid
consing the full or expression if either branch is t, and
also save a cons when the first element has a null derivative.
In reg_derivative the oneplus and zeroplus cases are split,
since zeroplus can re-use the input expression, when it's
just a one-character match, deriving nil.
|
case whereby R%S matches nothing at all when S is not empty
but equivalent to empty, or more generally when S is nullable.
A much nicer definition is ``the intersection of R* and
the set of all strings that do not contain a non-empty substring
that matches S, followed by S''.
|
taking a double derivative of the first item.
|
algebraic reductions in the derivative for the operator.
|
NFA or derivatives. The default behavior is NFA, with
derivatives used if the regular expression contains
uses of complement or intersection. The --dv-regex
option forces derivatives always.
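
The selection rule described here could be sketched as below. The syntax-tree layout, the operator codes and the function names are assumptions made for illustration. Only the rule itself (NFA by default, derivatives when complement or intersection appears, derivatives always under --dv-regex) comes from the entry.

```c
#include <stddef.h>

/* Hypothetical regex syntax tree. */
typedef enum { OP_CHAR, OP_SEQ, OP_OR, OP_STAR, OP_COMPL, OP_AND } op_t;
typedef struct node { op_t op; const struct node *left, *right; } node_t;

typedef enum { BACKEND_NFA, BACKEND_DERIV } backend_t;

/* Returns 1 if the tree uses complement or intersection anywhere. */
static int needs_derivatives(const node_t *n)
{
    if (n == NULL)
        return 0;
    if (n->op == OP_COMPL || n->op == OP_AND)
        return 1;
    return needs_derivatives(n->left) || needs_derivatives(n->right);
}

/* dv_regex_opt is nonzero when the --dv-regex option was given. */
static backend_t choose_backend(const node_t *n, int dv_regex_opt)
{
    return (dv_regex_opt || needs_derivatives(n)) ? BACKEND_DERIV : BACKEND_NFA;
}
```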
|
regex operations (complement, intersection).
The documentation of the syntax extensions is retained.
|
This turns out to be easy to do in NFA land.
The complement of an NFA has exactly the same number
and configuration of states and transitions, except
that the states have an inverted meaning; and furthermore,
failed character transitions are routed to an extra
state (which in this implementation is permanently
allocated and shared by all regexes). The regex &
is implemented trivially using De Morgan's law.
Also, bugfix: regular expressions like A|B|C are allowed
now by the syntax, rather than constituting syntax error.
Previously, this would have been entered as (A|B)|C.
|
no null pointer check over struct cobj_ops operations.
New typechecking function for COBJ objects.
|
from now on, which is compatible with unsigned char *.
No implicit conversion to or from this type, in C or C++.
|
in regex module not exposed in header. Etc.
|
can be taken advantage of for better diagnostics.
|
have a _s suffix.
|
can be converted to a type long and vice versa. The configure
script tries to detect the appropriate type to use. Also,
some run-time checking is performed in the streams module
to detect which conversion specifier strings to use for
printing numbers.
|
a system package instead of being hacked with the $ prefix.
Keyword symbols are provided. In the matcher, evaluation
is tightened up. Keywords, nil and t are not bindable, and
errors are thrown if attempts are made to bind them.
Destructuring in dest_bind is strict in the number of items.
String streams are exploited to print bindings to objects
that are not strings or characters. Numerous bugfixes.
|
we wouldn't have to declare object variables at all, so why
use an obtuse syntax to do so?)
|
compiler. Idea: allocator functions return char * instead of void *,
like malloc did in classic pre-ANSI C. That way we are forced to
use a cast except when the target pointer is char * already.
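
A minimal sketch of the idea, with hypothetical names (`chk_malloc` and `struct pair` are assumptions, not taken from the source): because the allocator returns `char *` rather than `void *`, assigning its result to any other pointer type requires an explicit cast.

```c
#include <stdlib.h>

/* Hypothetical checked allocator that returns char * instead of void *. */
static char *chk_malloc(size_t size)
{
    char *p = (char *) malloc(size);
    if (p == 0)
        abort();
    return p;
}

struct pair { int first, second; };

static struct pair *make_pair(int first, int second)
{
    /* The cast is mandatory here: char * does not convert implicitly
       to struct pair *, so omitting it draws a diagnostic. */
    struct pair *p = (struct pair *) chk_malloc(sizeof *p);
    p->first = first;
    p->second = second;
    return p;
}
```

With `void *` the cast could be silently omitted in C; with `char *` the omission is caught wherever the target is not already `char *`.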
|
should be unsigned.
|
Most of the changes are in the area of representing sets.
Also, a bug was found in the compilation of regex character sets:
ranges straddling two adjacent blocks of 32 characters were
not being added to the character set. However, ranges falling
within a single 32 block, or spanning three or more such blocks,
worked properly. This bug is not tickled by common ranges
such as A-Z, or 0-9, which land within a 32 block.
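
The three cases mentioned (a range inside one 32-character block, a range straddling two adjacent blocks, and a range spanning three or more) can be illustrated with a bitmap of 32-bit words. This is only a sketch of the general technique: the `bitset_t` type, its 256-character size and the function names are assumptions, not the character set representation used in regex.c.

```c
#include <stdint.h>

#define SET_BITS 256
#define ALL_ONES ((uint32_t) 0xFFFFFFFFu)

typedef struct { uint32_t w[SET_BITS / 32]; } bitset_t;

/* Adds the inclusive range lo..hi; requires lo <= hi < SET_BITS. */
static void set_add_range(bitset_t *s, unsigned lo, unsigned hi)
{
    unsigned wlo = lo / 32, whi = hi / 32;
    uint32_t mlo = ALL_ONES << (lo % 32);        /* bits lo%32 .. 31 */
    uint32_t mhi = ALL_ONES >> (31 - hi % 32);   /* bits 0 .. hi%32 */
    unsigned i;

    if (wlo == whi) {                            /* range within one block */
        s->w[wlo] |= mlo & mhi;
        return;
    }

    s->w[wlo] |= mlo;                            /* tail of the first block */
    for (i = wlo + 1; i < whi; i++)              /* whole blocks in between, */
        s->w[i] |= ALL_ONES;                     /* none if blocks are adjacent */
    s->w[whi] |= mhi;                            /* head of the last block */
}

static int set_contains(const bitset_t *s, unsigned ch)
{
    return (s->w[ch / 32] >> (ch % 32)) & 1;
}
```

For example, the range from `[` (91) to `e` (101) straddles the blocks 64..95 and 96..127. In this sketch the loop body simply never runs for that case, and the first and last blocks are still updated.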
|
This is incomplete. There are too many dependencies on
wide character support from the C stream I/O library,
and implicit use of some encoding which may not be UTF-8.
The regex code does not handle wide characters properly.
Character type is still int in some places, rather than wchar_t.
Test suite passes though.
|
Regexps can be bound to variables.
New freeform directive.
|
Bugfixes.
|
Lazy strings implemented, incompletely.
Changed string function to implicitly strdup; non-strdup
version changed to string_own. Fixed wrong uses of strdup
rather than chk_strdup.
Functions added to regex module to provide regex matching
as a state machine to which characters are fed.
|
and used for matching. This Just Works because of
the way match_line treats variables.
|