ClickHouse: import data having double quotes escaped by backslash

I am trying to import an HTML snippet that is part of one of the columns in a CSV file.
The HTML snippet contains double quotes, and they are escaped with backslashes. The CSV was created using Apache Spark.
To illustrate the issue, I created two columns with minimal data:
CREATE TABLE logs.processing ( ts String,text String) ENGINE = Log
cat sample.csv   # contents of the file
"Fri, 01 May 2020 06:47:05 UTC","<html id=\"html-div\">"
When the import command is issued, the following exception is thrown:
cat sample.csv | clickhouse-client --query="INSERT INTO logs.processing FORMAT CSV"
Exception
Code: 117. DB::Exception: Expected end of line
If I change the content of sample.csv to
"Fri, 01 May 2020 06:47:05 UTC","col2"
It works fine.
Could you please help me with this issue?
Thanks.

The CSV spec requires:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
You need to either generate a valid CSV file in the first place or fix it before passing it to clickhouse-client:
cat sample.csv | sed 's/\\"/""/g' | clickhouse-client --query="INSERT INTO logs.processing FORMAT CSV"
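If a scripted preprocessor is preferred over sed, the same rewrite can be sketched in Python (this assumes, as in the sample above, that \" is the only backslash escape present in the Spark output):

```python
# Convert backslash-escaped quotes (as emitted by Spark) into the
# doubled quotes required by RFC 4180 and expected by clickhouse-client.
def fix_csv_line(line: str) -> str:
    # Assumption: \" is the only backslash escape in the data.
    return line.replace('\\"', '""')

line = '"Fri, 01 May 2020 06:47:05 UTC","<html id=\\"html-div\\">"'
print(fix_csv_line(line))
# -> "Fri, 01 May 2020 06:47:05 UTC","<html id=""html-div"">"
```

The rewritten lines can then be piped into clickhouse-client exactly as in the sed variant above.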

I posted this question on the ClickHouse GitHub. It looks like, as of now, the double quote is the only supported escape character:
https://github.com/ClickHouse/ClickHouse/issues/10624

Related

How to insert text starting with double quotes in a column delimited with | in an import command in db2

The table contains 3 columns:
ID - integer
Name - varchar
Description - varchar
A file with a .FILE extension has data delimited by |
Eg: 12|Ramu|"Ramu" is an architect
The command I am using to load the data into DB2:
db2 "Load CLIENT FROM ABC.FILE of DEL MODIFIED BY coldel0x7x keepblanks REPLACE INTO tablename(ID,Name,Description) nonrecoverable"
Data is loaded as follows:
12 Ramu Ramu
but I want it as:
12 Ramu "Ramu" is an architect
Take a look at how the format of delimited ASCII files is defined. The double quote (") is an optional delimiter for character data. You would need to escape it. I have not tested it, but I would assume that you double the quote as you would do in SQL:
12|Ramu|"""Ramu"" is an architect"
Delimited files (CSV) are defined in RFC 4180. You need to either use quotes for the entire field or none at all. Only fields that begin and end with a quote may contain other quotes, and those need to be escaped as shown.
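For illustration only (the .FILE export is presumably produced by some other tool), Python's csv module generates exactly this doubled-quote form when asked to write the example row with '|' as the delimiter:

```python
import csv
import io

# Write the example row; the csv module quotes the third field because it
# contains the quote character, and doubles the embedded quotes (RFC 4180).
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|', quotechar='"', quoting=csv.QUOTE_MINIMAL)
writer.writerow([12, 'Ramu', '"Ramu" is an architect'])
print(buf.getvalue().strip())
# -> 12|Ramu|"""Ramu"" is an architect"
```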
Use the nochardel modifier.
If you use '|' as a column delimiter, you must use 0x7C and not 0x7x:
MODIFIED BY coldel0x7C keepblanks nochardel
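Putting the two corrections together, the full load command from the question would presumably become (an untested sketch; coldel0x7C and nochardel are from this answer, the rest is the original command):

```
db2 "LOAD CLIENT FROM ABC.FILE OF DEL MODIFIED BY coldel0x7C keepblanks nochardel REPLACE INTO tablename(ID,Name,Description) NONRECOVERABLE"
```

With nochardel, the double quotes around "Ramu" are loaded as literal data instead of being treated as character delimiters.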

Converting a TXT file with double quotes to a pipe-delimited format using sed

I'm trying to convert TXT files into pipe-delimited text files.
Let's say I have a file called sample.csv:
aaa",bbb"ccc,"ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","nnn"ooo,ppp"qqq",rrr" sss,"ttt,""uuu",Z
I'd like to convert this into an output that looks like this:
aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
After a lot of searching, the closest I have come is this sed command:
sed -r 's/""/\v/g;s/("([^"]+)")?,/\2\|/g;s/"([^"]+)"$/\1/;s/\v/"/g'
However, the output that I received was:
aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|pppqqq|rrr" sss|ttt,"uuu|Z
The expected value for the 9th column was ppp"qqq", but the result removed the double quotes and what I got was pppqqq.
I have been playing around with this for a while, but to no avail.
Any help regarding this would be highly appreciated.
As suggested in the comments, sed or any other line-oriented Unix tool is not recommended for this kind of complex CSV string. It is much better to use a dedicated CSV parser, like this one in PHP:
$s = 'aaa",bbb"ccc,"ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","nnn"ooo,ppp"qqq",rrr" sss,"ttt,""uuu",Z';
echo implode('|', str_getcsv($s));
aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|nnnooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
The problem with sample.csv is that it mixes non-quoted fields (containing quotes) with fully quoted fields (that should be treated as such).
You can't have both at the same time. Either all fields are (treated as) unquoted and quotes are preserved, or all fields containing a quote (or separator) are fully quoted and the quotes inside are escaped with another quote.
So, sample.csv should become:
"aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z
to give you the desired result (using a csv parser):
aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
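The PHP example above can be mirrored with Python's csv module; any RFC 4180-compliant parser should produce the same result from the corrected line:

```python
import csv

# Parse the corrected, consistently quoted line and re-join with '|'.
line = ('"aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk",'
        '"lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z')
fields = next(csv.reader([line]))
print('|'.join(fields))
# -> aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
```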
I had the same problem.
I found the right result with https://www.papaparse.com/demo
It is FOSS on GitHub, so you can check how it works.
With this source:
"aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z
The result appears in the browser console:
[1]: https://i.stack.imgur.com/OB5OM.png

INSERT using CSV file with single quote in string field causes error

INSERT using a CSV file with a single quote in a string field causes an error for rows like this:
"'Catbug' Animated Series In The Works From 'Adventure Time ..."
But other rows with single quotes load successfully. Is there a workaround for this issue?
Disable the interpretation of single quotes as quote characters by using the format_csv_allow_single_quotes setting:
echo "'Catbug' Animated Series In The Works From 'Adventure Time ..." |
clickhouse-client --query "insert into test format CSV" --format_csv_allow_single_quotes 0
or pass it as a SETTINGS clause in the query:
echo "'Catbug' Animated Series In The Works From 'Adventure Time ..." |
clickhouse-client --query "insert into test format CSV settings format_csv_allow_single_quotes=0"
or set it for the whole client invocation:
clickhouse-client --format_csv_allow_single_quotes=0
This happens because ClickHouse auto-detects the quote character from the first character of the field (' or ").

Apache Pig store delimiters

I'm using Pig Latin to store values from an alias into the HDFS. The alias contains a semicolon in one of its fields.
dump A;
(Richard & John, 1993)
(Albert, 1994)
Here is a table showing the data in HDFS; the semicolon makes "John" spill into the next column:
| Name | Year |
|--------------|------|
| Richard &amp | John |
| Albert | 1994 |
Trying to use store like this is also not working as expected:
STORE A INTO '/user/hive/warehouse/test.db/names' using PigStorage('\t');
but even when telling PigStorage to use tab as the delimiter, the semicolon breaks the table data. How can I fix it?
I created a file locally, say a.txt, and copied your data into it.
(Richard & John, 1993)
(Albert, 1994)
I see that your data is not tab-delimited, and that is why it splits at the semicolon. To work around this, I wrote a query like this:
data = load '/home/hduser/Desktop/a.txt' using PigStorage(',');
dump data;
and my output is:
((Richard & John, 1993))
((Albert, 1994))
I split on , because that is the delimiter your data uses.
Note: I ran this on my local file system, so to run it locally you must start Pig with pig -x local and give the relevant path.
It turned out there was a bug in my CREATE TABLE statement:
create table test.names
(
name varchar(40),
year varchar(40)
)
row format delimited fields terminated by '\073'
lines terminated by '\n';
The delimiter I used was \073 (semicolon), so changing the PigStorage delimiter had no effect.
I'm using \072 (colon) now and it works. I think any other delimiter would work, as long as it is not a common or possible character in the input data.
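As a quick sanity check on the octal codes used in the ROW FORMAT clause (shown here in Python):

```python
# \073 and \072 in the Hive DDL are octal character codes:
print(chr(0o73))  # the old delimiter: ';' (also present in the data via '&amp;')
print(chr(0o72))  # the new delimiter: ':'
```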

Loading data using Hive Sed command

I have my data in this format.
"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";
The fields are enclosed in double quotes and delimited by ';'. The book title may also contain ';' inside it.
Can you tell me how to load this data from the file into a Hive table?
The query below, which I am using now, obviously does not work:
create table books (isbn string,title string,year string,publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
If possible, I want the isbn and year fields to be stored as int.
Also, I don't want to use the RegexSerDe.
How can I use the sed command from Unix to clean the data and get my output?
I tried to learn about sed and found the replace option, so I can remove the double quotes. But how can I handle the extra semicolon that comes in the middle of the data?
Please help.
I think you can preprocess with sed and then use the MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES:
sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*);/\1XXXXX/g; t a; s/;/ /g; s/XXXXX/;/g' file
This sed matches quote pairs so as to avoid touching what is between quotes, putting a placeholder on the semicolons outside quoted text. Afterward it replaces the remaining semicolons (the ones inside the book titles) with a space and restores the semicolons outside quotes.
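The same preprocessing can be sketched in Python (a minimal state-machine version, assuming quotes in the data are always balanced, as in the sample):

```python
def blank_inner_semicolons(line: str) -> str:
    # Replace ';' with a space only inside quoted fields, leaving the
    # ';' field delimiters outside the quotes untouched.
    out, in_quotes = [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif ch == ';' and in_quotes:
            out.append(' ')
        else:
            out.append(ch)
    return ''.join(out)

print(blank_inner_semicolons('"456";"mybook2;the best seller";"2004";"publisher2";'))
# -> "456";"mybook2 the best seller";"2004";"publisher2";
```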
See here for more on how to load data using Hive, including an example of MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES:
https://svn.apache.org/repos/asf/hive/trunk/serde/README.txt
create external table books (isbn int, title string, year int, publisher string)
row format SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ('separatorChar' = '\;', 'quoteChar' = '\"')
location 'S3 path/HDFS path for the file';
