Phoenix -> csv -> invalid char between encapsulated token and delimiter - hadoop

I need to upload a CSV dump file to a Phoenix database.
Files that did not contain any special characters loaded without problems:
./psql.py -t TTT localhost /home/isaev/output.csv -d';'
But as soon as I tried to load the same kind of file, in which some data fields contained quotes, I got an error:
java.lang.RuntimeException: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398)
at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407)
at org.apache.phoenix.util.UpsertExecutor.execute(UpsertExecutor.java:132)
at org.apache.phoenix.util.CSVCommonsLoader.upsert(CSVCommonsLoader.java:217)
at org.apache.phoenix.util.CSVCommonsLoader.upsert(CSVCommonsLoader.java:182)
at org.apache.phoenix.util.PhoenixRuntime.main(PhoenixRuntime.java:308)
Caused by: java.io.IOException: (line 1) invalid char between encapsulated token and delimiter
at org.apache.commons.csv.Lexer.parseEncapsulatedToken(Lexer.java:275)
at org.apache.commons.csv.Lexer.nextToken(Lexer.java:152)
at org.apache.commons.csv.CSVParser.nextRecord(CSVParser.java:450)
at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:395)
... 5 more
For example, the first line (line 1) contains this entry:
5863355029;007320071; ZAO "With a smile for life";True;
I found the solution myself: pass the quote-character option
-q'\'
Maybe this will come in handy for someone.

You can resolve your problem by doubling the quotes:
5863355029;007320071; ZAO ""With a smile for life"";True;
Each field may or may not be enclosed in double quotes. If a field is not enclosed in double quotes, then double quotes may not appear inside it.
Check this link if you are interested in why: https://www.marklogic.com/blog/delimited_text_mlcp/
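To see the doubling rule in action, here is a quick sketch using Python's csv module (not Phoenix's loader, but it applies the same RFC 4180 rule). Per the RFC, the field should also be wrapped in outer quotes so the parser treats the doubled quotes as escapes:

```python
import csv

# The problematic value, with embedded quotes doubled and the whole field
# wrapped in outer quotes, as RFC 4180 expects:
line = '5863355029;007320071;" ZAO ""With a smile for life""";True;'

row = next(csv.reader([line], delimiter=";"))
print(row)
# ['5863355029', '007320071', ' ZAO "With a smile for life"', 'True', '']
```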

Related

How to insert text starting with double quotes in a column delimited with | in an import command in DB2

The table contains 3 columns:
ID - integer
Name - varchar
Description - varchar
A file with a .FILE extension has data with | as the delimiter.
Eg: 12|Ramu|"Ramu" is an architect
The command I am using to load data into DB2:
db2 "Load CLIENT FROM ABC.FILE of DEL MODIFIED BY coldel0x7x keepblanks REPLACE INTO tablename(ID,Name,Description) nonrecoverable"
The data is loaded as follows:
12 Ramu Ramu
but I want it as:
12 Ramu "Ramu" is an architect
Take a look at how the format of delimited ASCII files is defined. The double quote (") is an optional delimiter for character data. You would need to escape it. I have not tested it, but I would assume that you double the quote as you would do in SQL:
|12|Ramu|"""Ramu"" is an architect"
Delimited files (CSV) are defined in RFC 4180. You need to either quote the entire field or not at all. Only in fields that begin and end with a quote can other quotes appear, and they need to be escaped as shown.
Use the nochardel modifier.
Also, if you use '|' as the column delimiter, you must write 0x7C, not 0x7x:
MODIFIED BY coldel0x7C keepblanks nochardel
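As a sanity check of the escaped form shown above, here is a sketch using Python's csv writer (not DB2, purely to illustrate the RFC 4180 doubling rule):

```python
import csv
import io

# Write a '|'-delimited row whose third field contains embedded double quotes;
# the default QUOTE_MINIMAL behaviour wraps the field in quotes and doubles
# the embedded ones.
buf = io.StringIO()
csv.writer(buf, delimiter="|").writerow(["12", "Ramu", '"Ramu" is an architect'])
print(buf.getvalue().rstrip("\r\n"))
# 12|Ramu|"""Ramu"" is an architect"
```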

YAML syntax error with sed in GitLab CI

I've made a mistake in the file below, but I cannot see where it is. I have this command in my .gitlab-ci.yml configuration file:
- sed "s/use_scm_version=True/use_scm_version={'write_to': '..\/version.txt', 'root': '..'},\/"setup.py
It seems that the ":" is interpreted as a key/value separator even though I surround the entire sed expression with double quotes:
(<unknown>): did not find expected key while parsing a block mapping at line 109 column 11
Any ideas?
Since your double quotes are not at the beginning of the scalar node, they have no special meaning in YAML, and the colon is seen as the normal value indicator (so both the key and the value end up with an embedded double quote).
I recommend you quote the whole scalar:
- "sed s/use_scm_version=True/use_scm_version={'write_to': '..\/version.txt', 'root': '..'},\/setup.py"
And optionally add \" (backslash-escaped double quotes) within it as necessary if that doesn't work.

Regex test if value is valid format

I have a task where I need to check whether a value is a properly quoted CSV column.
Cases:
no quotation - OK
"with quotation" - OK
"opening quote - Not Good
improper"quote" - Not Good
closing quote" - Not Good
Ruby's CSV library flags an error like the one below:
Illegal quoting in line 5. (CSV::MalformedCSVError)
Question: how would I get this working using a single regex? I need to flag an error for cases 3-5.
And if you have any other ideas about what should be checked to decide whether a CSV value is valid, please share.
EDIT: I have added 2 scenarios/cases below:
"quote "inside quotes" - Not Good
"quotes ""inside quotes" - Not Good
EDIT: added 1 more case:
"" - OK
Without considering escaped quotes:
/^("[^"]*"|[^"]+)$/m
It means:
beginning of line
one quote + anything except quotes + one quote, or
anything except quotes (at least one character)
end of line
^"{1}.+"{1}$|^[^"]*$
This matches all lines that either start and end with exactly one quotation mark, or contain no quotation marks at all.
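A quick check of the first pattern against all of the question's cases (written in Python here; the pattern itself works the same in Ruby):

```python
import re

# The first answer's pattern, anchored to the whole value.
pattern = re.compile(r'^("[^"]*"|[^"]+)$')

cases = {
    'no quotation': True,                # OK
    '"with quotation"': True,            # OK
    '"opening quote': False,             # Not Good
    'improper"quote"': False,            # Not Good
    'closing quote"': False,             # Not Good
    '"quote "inside quotes"': False,     # Not Good
    '"quotes ""inside quotes"': False,   # Not Good (escapes not considered)
    '""': True,                          # OK
}
for value, expected in cases.items():
    assert bool(pattern.match(value)) == expected
print("all cases behave as expected")
```

Note that the second pattern (^"{1}.+"{1}$|^[^"]*$) rejects the `""` case from the last edit, since .+ requires at least one character between the quotes.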

Escaping double quotes - Prepared Statement in JMeter

I am trying to execute a stored procedure in JMeter using the JDBC Request sampler. One of the parameters includes XML that contains quotes.
I am getting the following error:
Response message: java.io.IOException: Cannot have quote-char in plain field:[ <xmlns:r="]
The setup:
Query Type: Prepared Update Statement
SQL Query: {CALL SPINSERT(?, ?)}
Parameter Values: Y, <xmlns:r="">
Parameter Types: CHAR, VARCHAR
I suppose I need to escape the double quotes; any ideas how this should be done properly?
It's actually done as per the documentation on the JMeter website, but the problem appeared to be that you cannot have whitespace after the last double quote:
The list must be enclosed in double-quotes if any of the values
contain a comma or double-quote, and any embedded double-quotes must
be doubled-up, for example: "Dbl-Quote: "" and Comma: ,"
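Following that documentation, the parameter value would be wrapped in quotes with its embedded quotes doubled. A quick illustration of the convention, parsed here with Python's csv module only because it applies the same quoting rules (this is not JMeter itself):

```python
import csv

# JMeter-style parameter values: the XML fragment <xmlns:r=""> is wrapped in
# quotes and each embedded double quote is doubled.
raw = 'Y,"<xmlns:r="""">"'

row = next(csv.reader([raw]))
print(row)
# ['Y', '<xmlns:r="">']
```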

Hadoop - textoutputformat.separator use Ctrl-A ( ^A )

I'm trying to use ^A as the separator between Key and Value in my reduce output files.
I found that the config setting "mapred.textoutputformat.separator" is what I want and this correctly switches the separator to ",":
conf.set("mapred.textoutputformat.separator", ",");
But it can't handle the ^A character:
conf.set("mapred.textoutputformat.separator", "\u0001");
throws this error:
ERROR security.UserGroupInformation: PriviledgedActionException as:user (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException; lineNumber: 68; columnNumber: 94; Character reference "&#
I found this ticket https://issues.apache.org/jira/browse/HADOOP-7542 and see that they tried to fix this but reverted the patch due to XML 1.1 concerns.
So I'm wondering if anyone has had success setting the separator to ^A (it seems a pretty common choice) using an easy workaround, or if I should just settle for the tab separator.
Thanks!
I'm running Hadoop 0.20.2-cdh3u5 on CentOS 6.2
Looking around, it looks like there are three options I've found for solving this problem:
Character reference “&#1” is an invalid XML character - similar SO question
Unicode characters/Ctrl G or Ctrl A as TextOutputFormat (Hadoop) delimiter
The possible solutions, as detailed in the links above, are:
Base64-encode the separator character. You then need to create a custom TextOutputFormat that overrides the getRecordWriter method and decodes the Base64-encoded separator.
Create a custom TextOutputFormat again, but change the default separator character from a tab to ^A.
Provide the delimiter through an XML resource file. You can specify a custom resource file using the addResource() method of the job's Configuration.
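The Base64 idea from the first option can be sketched as follows (shown in Python for brevity; in practice the decode would live in the custom Java TextOutputFormat's getRecordWriter):

```python
import base64

# Encode the \u0001 separator so it can be stored safely in an XML 1.0
# config value, then decode it back the way a custom output format would.
separator = "\u0001"

encoded = base64.b64encode(separator.encode("utf-8")).decode("ascii")
print(encoded)
# AQ==

decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == separator
```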
