Fix tok-str semantics once again.

The problem is that when the regular expression is capable of matching empty strings, tok-str will extract an empty token immediately following a non-empty token. For instance (tok-str "a,b" /[^,]*/) extracts ("a" "" "b") instead of just ("a" "b"). This is poor behavior, and the way to fix it is to impose a rule that an empty token must not be extracted immediately at the ending position of a previous token. Only a non-empty token can be consecutive to a token.

* lib.c (tok_str): Rewrite the logic of the loop, using the prev_empty flag to suppress empty tokens which immediately follow non-empty tokens. The addition of 1 to the position when the token is empty, to skip a character, is done at the bottom of the loop. A new last_end variable keeps track of the end position of the last extracted token, for the purposes of extracting the keep-between area if keep_sep is true. The old loop is preserved intact and is enabled when compatibility is requested.

* tests/015/split.tl: Multiple empty-regex test cases for tok-str updated.

* txr.1: Updated tok-str documentation and also added a note about the conditions under which split-str and tok-str, invoked with keep-sep true, produce equivalent output. Added compatibility notes.
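
The rule described above can be illustrated with a small model of the loop. This is only a sketch in Python, not the code from lib.c: the function name tok_str, the compat parameter and the prev_end variable are hypothetical, Python's re module stands in for the TXR regex engine (the two can differ on other patterns), and the keep-sep argument is not modeled.

```python
import re


def tok_str(string, pattern, compat=False):
    """Model of tok-str token extraction (illustrative sketch only).

    compat=False: an empty match located exactly at the end of the
    previous token is discarded instead of becoming a token.
    compat=True: every match, empty or not, becomes a token.
    """
    regex = re.compile(pattern)
    tokens = []
    pos = 0
    prev_end = None  # end position of the last non-empty match

    while pos <= len(string):
        match = regex.search(string, pos)
        if match is None:
            break
        start, end = match.span()
        empty = start == end

        # Suppress an empty match that sits right where the previous
        # non-empty token ended.
        suppressed = empty and start == prev_end and not compat
        if not suppressed:
            tokens.append(match.group())

        if empty:
            pos = end + 1  # skip one character, done at the bottom of the loop
        else:
            pos = end
            prev_end = end

    return tokens


print(tok_str("a,b", r"[^,]*"))           # ['a', 'b']
print(tok_str("abc", r"a?"))              # ['a', '', '']
print(tok_str("abc", r"a?", compat=True)) # ['a', '', '', '']
```

The second and third calls mirror the (tok-str "abc" #/a?/) example in the txr.1 change below: the empty match just before b is discarded under the new rule, while the empty matches between b and c and after c are still extracted.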
author: Kaz Kylheku <kaz@kylheku.com> 2016-10-26 20:19:42 -0700
committer: Kaz Kylheku <kaz@kylheku.com> 2016-10-26 20:19:42 -0700
commit: e0dbcc3a6455d990c0a0ecde74e279e8f3b53843 (patch)
tree: 835afaf66a49e1e9b0183f13705d83be76c7b07a /txr.1
parent: 88268ee75421084cc412d26250beb7483f49c1b3 (diff)
download: txr-e0dbcc3a6455d990c0a0ecde74e279e8f3b53843.tar.gz
txr-e0dbcc3a6455d990c0a0ecde74e279e8f3b53843.tar.bz2
txr-e0dbcc3a6455d990c0a0ecde74e279e8f3b53843.zip
1 files changed, 58 insertions, 11 deletions
diff --git a/txr.1 b/txr.1
index 007feb36..d2d45722 100644
--- a/txr.1
+++ b/txr.1
@@ -19021,7 +19021,7 @@ into the resulting list, such that if the resulting
 list is catenated, a string equivalent to the original
 string will be produced.
 
-Note: To split a string into pieces of length one such that an empty string
+Note: to split a string into pieces of length one such that an empty string
 produces
 .code nil
 rather than
@@ -19032,6 +19032,31 @@ use the
 .cble
 pattern.
 
+Note: the function call
+.code "(split-str s r t)"
+produces a resulting list identical to
+.codn "(tok-str s r t)" ,
+for all values of
+.code r
+and
+.codn s ,
+provided that
+.code r
+does not match empty strings. If
+.code r
+matches empty strings, then the
+.code tok-str
+call returns extra elements compared to
+.codn split-str ,
+because
+.code tok-str
+allows empty matches to take place and extract empty tokens
+before the first character of the string, and after the
+last character, whereas
+.code split-str
+does not recognize empty separators at these outer limits
+of the string.
+
 .coNP Function @ split-str-set
 .synb
 .mets (split-str-set < string << set )
@@ -19089,25 +19114,36 @@ matches an empty string, then an empty token is returned, and
 the search for another token within
 .meta string
 resumes after advancing by one
-character position. So for instance,
+character position. However, if an empty match occurs immediately
+after a non-empty token, that empty match is not turned into
+a token.
+
+So for instance,
 .cblk
 (tok-str "abc" #/a?/)
 .cble
-returns the
+returns
 .cblk
-("a" "" "" "").
+("a" "" "").
 .cble
 After the token
 .str "a"
 is extracted from a non-empty match
-for the regex, the regex is considered to match three more times: before the
-.strn "b" ,
-between
-.str "b"
+for the regex, an empty match for the regex occurs just
+before the character
+.codn b .
+This match is discarded because it is an empty match which
+immediately follows the non-empty match. The character
+.code b
+is skipped. The next match is an empty match between the
+.code b
 and
-.strn "c" ,
-and after the
-.strn "c" .
+.code c
+characters. This match causes an empty token to be
+extracted. The character
+.code c
+is skipped, and one more empty match occurs after that
+character and is extracted.
 
 If the
 .meta keep-between
@@ -47785,6 +47821,17 @@ of these version values, the described behaviors are provided if
 is given an argument which is equal or lower. For instance
 .code "-C 103"
 selects the behaviors described below for version 105, but not those for 102.
+.IP 155
+After version 155, the
+.code tok-str
+and
+.code tok-where
+functions changed semantics. Previously, these functions exhibited the
+flaw that under some conditions they extracted an empty token immediately
+following a non-empty token. This behavior was working as designed and
+documented, but the design was flawed, creating a major difficulty in simple
+tokenizing tasks when tokens may be empty strings.  Requesting compatibility
+with version 155 or earlier restores the behavior.
 .IP 154
 After version 154, changes were introduced in the semantics of struct
 literals. Previously, the syntax