On 2023-11-28 23:50, halloleo wrote:
> Hi TXR people!
>
> I have CSV files like this sample:
>
>
> col1,col2,col3
> aaa,“a,b,c“,ccc
> 111,222,“1,200.30“
> ...

Are those the actual double quote characters?
You have: “ (U+201C)
The ASCII double quote is: " (U+0022)
I'm going to assume that some e-mail text editor ate your ASCII double quotes, replacing them with Unicode "sixty-sixes".

> and I want to string-concat the first two columns into a third column directly after the two first columns. So the result file of the sample would be:
>
> col1,col2,col1col3,col3
> aaa,“a,b,c“,“aaaa,b,c“,ccc
> 111,222,“111222“,“1,200.30“
>
> How can I do this with TXR? Or is TXR not a good tool for this?

There are many ways to solve this.

How important is it to correctly treat all the quoted CSV fields?
Is it something you will be running regularly on new data?
Is this just a one-off problem you need solved, without any further investment in doing interesting things with the data?

We can treat it as a dumb text processing problem, where the example data
captures all the variations that we need to handle:

$ cat data
col1,col2,col3
aaa,"a,b,c",ccc
111,222,"1,200.30"
$ cat cols.txr
@(repeat)
@ (cases)
@c1,"@c2",@rest
@ (bind out `@c1,"@c2","@c1@c2",@rest`)
@ (or)
@c1,@c2,@rest
@ (bind out `@c1,@c2,@c1@c2,@rest`)
@ (end)
@ (do (put-line out))
@(end)
$ txr cols.txr data
col1,col2,col1col2,col3
aaa,"a,b,c","aaaa,b,c",ccc
111,222,111222,"1,200.30"

We can handle CSV properly in TXR Lisp, with quoted fields and with a
doubled quote ("") inside a quoted field standing for one literal quote:

$ cat csv.tl
(defun csv-split (str)
  (flow str
    (tok #/[^,]*|"([^"]|"")+"/)
    (mapcar (do if (starts-with "\"" @1)
              (regsub `""` `"` [@1 1..-1])
              @1))))

(defun csv-fmt (list)
  (flow list
    (mapcar [iffi #/[,"]/ (ret `"@(regsub "\"" "\"\"" @1)"`)])
    `@{@1 ","}`))

(defun csv-test ()
  (whilet ((str (get-line)))
    (let* ((fields (csv-split str))
           (csv (csv-fmt fields)))
      (put-line `@str -> @(tostring fields) -> @csv`))))
$ txr -i csv.tl
TXR's no-spray organic production means every bug is carefully removed by hand.
1> (csv-test)
a,b,c
a,b,c -> ("a" "b" "c") -> a,b,c
a,b c,c
a,b c,c -> ("a" "b c" "c") -> a,b c,c
a,"b,c",d
a,"b,c",d -> ("a" "b,c" "d") -> a,"b,c",d
a,"b,""c,d""e f",g
a,"b,""c,d""e f",g -> ("a" "b,\"c,d\"e f" "g") -> a,"b,""c,d""e f",g
nil
This is a strict CSV implementation which doesn't allow spurious
spaces around quoted fields.
With these CSV functions we can write a loop like this, to do that
col1,col2,col1col2,col3... thing:
(whilet ((line (get-line)))
  (flow line
    csv-split
    (tree-bind (col1 col2 . rest) @1
      ^(,col1 ,col2 ,`@col1@col2` ,*rest))
    csv-fmt
    put-line))
Or:
(whilet ((line (get-line)))
  (flow line
    csv-split
    (let f)
    ^(,[f 0] ,[f 1] ,(join [f 0] [f 1]) ,*[f 2..:])
    csv-fmt
    put-line))
Or other ways.
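
To run one of these loops over a file rather than typing it at the listener, it needs the csv-split and csv-fmt definitions loaded first. Here is a minimal sketch of how that could be put together. The file name concat.tl is made up for illustration, and it assumes csv.tl from above sits in the same directory:

```txr
;; concat.tl (hypothetical file name)
;; Assumes csv.tl, as shown earlier, is in the same directory as this file.
;; Possible invocation, reading CSV on standard input:
;;   txr concat.tl < data
(load "csv.tl")

;; For each input line: split into fields, insert the concatenation
;; of the first two fields as a new third field, re-format, print.
(whilet ((line (get-line)))
  (flow line
    csv-split
    (tree-bind (col1 col2 . rest) @1
      ^(,col1 ,col2 ,`@col1@col2` ,*rest))
    csv-fmt
    put-line))
```

Because csv-fmt only quotes fields that contain a comma or a double quote, fields that were needlessly quoted in the input come out unquoted. A line with fewer than two fields will not match the tree-bind pattern, so a real script would want to decide what to do with those.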