author Kaz Kylheku <kaz@kylheku.com> 2016-10-26 20:19:42 -0700
committer Kaz Kylheku <kaz@kylheku.com> 2016-10-26 20:19:42 -0700
commit e0dbcc3a6455d990c0a0ecde74e279e8f3b53843
tree 835afaf66a49e1e9b0183f13705d83be76c7b07a /txr.1
parent 88268ee75421084cc412d26250beb7483f49c1b3
Fix tok-str semantics once again.
The problem is that when the regular expression is capable of matching empty strings, tok-str will extract an empty token immediately following a non-empty token. For instance (tok-str "a,b" /[^,]*/) extracts ("a" "" "b") instead of just ("a" "b").

This is a poor behavior and the way to fix it is to impose a rule that an empty token must not be extracted immediately at the ending position of a previous token. Only a non-empty token can be consecutive to a token.

* lib.c (tok_str): Rewrite the logic of the loop, using the prev_empty flag to suppress empty tokens which immediately follow non-empty tokens. The addition of 1 to the position when the token is empty to skip a character is done at the bottom of the loop and a new last_end variable keeps track of the end position of the last extracted token for the purposes of extracting the keep-between area if keep_sep is true. The old loop is preserved intact and enabled by compatibility.

* tests/015/split.tl: Multiple empty-regex test cases for tok-str updated.

* txr.1: Updated tok-str documentation and also added a note about the conditions under which split-str and tok-str, invoked with keep-sep true, produce equivalent output. Added compatibility notes.
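The rule described above can be illustrated with a minimal Python sketch. This is not the code in lib.c: the Python function tok_str below, its parameter names, and the use of Python's re module in place of TXR's regex engine are all assumptions made for illustration. It tracks the end of the last extracted token, as the commit message describes for last_end, and discards an empty match that begins exactly there.

```python
import re

def tok_str(string, pattern, keep_between=False):
    """Hypothetical model of the revised tok-str rule (not TXR's code)."""
    regex = re.compile(pattern)
    out = []
    pos = 0
    last_end = None  # end of the last extracted token; None if no token yet

    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        start, end = m.span()
        empty = (start == end)

        if empty and start == last_end:
            # Empty match sitting right at the end of the previous
            # token: discard it instead of producing an empty token.
            pass
        else:
            if keep_between:
                # Text between the previous token and this one.
                out.append(string[(last_end or 0):start])
            out.append(string[start:end])
            last_end = end

        # After an empty match, skip one character so the scan advances.
        pos = end + 1 if empty else end

    if keep_between:
        # Trailing text after the last extracted token.
        out.append(string[(last_end or 0):])
    return out

print(tok_str("a,b", r"[^,]*"))  # ['a', 'b']
print(tok_str("abc", r"a?"))     # ['a', '', '']
```

Both printed results correspond to the outcomes stated in the commit message and in the updated documentation. Python's re module may choose a different match than TXR's regex engine for some patterns, so the sketch should only be read as a model of the suppression rule, not of regex matching in TXR.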
Diffstat (limited to 'txr.1')
-rw-r--r-- txr.1 69
1 files changed, 58 insertions, 11 deletions
diff --git a/txr.1 b/txr.1
index 007feb36..d2d45722 100644
--- a/txr.1
+++ b/txr.1
@@ -19021,7 +19021,7 @@ into the resulting list, such that if the resulting
list is catenated, a string equivalent to the original
string will be produced.
-Note: To split a string into pieces of length one such that an empty string
+Note: to split a string into pieces of length one such that an empty string
produces
.code nil
rather than
@@ -19032,6 +19032,31 @@ use the
.cble
pattern.
+Note: the function call
+.code "(split-str s r t)"
+produces a resulting list identical to
+.codn "(tok-str s r t)" ,
+for all values of
+.code r
+and
+.codn s ,
+provided that
+.code r
+does not match empty strings. If
+.code r
+matches empty strings, then the
+.code tok-str
+call returns extra elements compared to
+.codn split-str ,
+because
+.code tok-str
+allows empty matches to take place and extract empty tokens
+before the first character of the string, and after the
+last character, whereas
+.code split-str
+does not recognize empty separators at these outer limits
+of the string.
+
.coNP Function @ split-str-set
.synb
.mets (split-str-set < string << set )
@@ -19089,25 +19114,36 @@ matches an empty string, then an empty token is returned, and
the search for another token within
.meta string
resumes after advancing by one
-character position. So for instance,
+character position. However, if an empty match occurs immediately
+after a non-empty token, that empty match is not turned into
+a token.
+
+So for instance,
.cblk
(tok-str "abc" #/a?/)
.cble
-returns the
+returns
.cblk
-("a" "" "" "").
+("a" "" "").
.cble
After the token
.str "a"
is extracted from a non-empty match
-for the regex, the regex is considered to match three more times: before the
-.strn "b" ,
-between
-.str "b"
+for the regex, an empty match for the regex occurs just
+before the character
+.codn b .
+This match is discarded because it is an empty match which
+immediately follows the non-empty match. The character
+.code b
+is skipped. The next match is an empty match between the
+.code b
and
-.strn "c" ,
-and after the
-.strn "c" .
+.code c
+characters. This match causes an empty token to be
+extracted. The character
+.code c
+is skipped, and one more empty match occurs after that
+character and is extracted.
If the
.meta keep-between
@@ -47785,6 +47821,17 @@ of these version values, the described behaviors are provided if
is given an argument which is equal or lower. For instance
.code "-C 103"
selects the behaviors described below for version 105, but not those for 102.
+.IP 155
+After version 155, the
+.code tok-str
+and
+.code tok-where
+functions changed semantics. Previously, these functions exhibited the
+flaw that under some conditions they extracted an empty token immediately
+following a non-empty token. This behavior was working as designed and
+documented, but the design was flawed, creating a major difficulty in simple
+tokenizing tasks when tokens may be empty strings. Requesting compatibility
+with version 155 or earlier restores the behavior.
.IP 154
After version 154, changes were introduced in the semantics of struct
literals. Previously, the syntax