Using the ReplaceText processor to convert a fixed-width file to a delimited one works with ordinary delimiter characters such as ';', '|' or ','. However, using \u0001 (i.e. ^A / Ctrl-A) as the delimiter is not working as expected.
To get special characters in, you can combine the literal and unescapeXml NiFi Expression Language functions:
${literal('&#x01;'):unescapeXml()}
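For example, in a ReplaceText processor configured to split fixed-width lines, that expression would go into the Replacement Value; the column widths and evaluation mode below are only placeholder assumptions for the sketch:

Evaluation Mode: Line-by-Line
Replacement Strategy: Regex Replace
Search Value: ^(.{10})(.{20})(.*)$
Replacement Value: $1${literal('&#x01;'):unescapeXml()}$2${literal('&#x01;'):unescapeXml()}$3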
The table contains 3 columns:
ID - integer
Name - varchar
Description - varchar
A file with a .FILE extension contains data delimited by |
E.g.: 12|Ramu|"Ramu" is an architect
The command I am using to load the data into DB2:
db2 "Load CLIENT FROM ABC.FILE of DEL MODIFIED BY coldel0x7x keepblanks REPLACE INTO tablename(ID,Name,Description) nonrecoverable"
Data is loaded as follows:
12 Ramu Ramu
but I want it as:
12 Ramu "Ramu" is an architect
Take a look at how the format of delimited ASCII (DEL) files is defined. The double quote (") is an optional delimiter for character data. You would need to escape it. I have not tested it, but I would assume that you double the quote, as you would do in SQL:
|12|Ramu|"""Ramu"" is an architect"
Delimited files (CSV) are defined in RFC 4180. You need to either quote the entire field or not at all. Additional quotes may only appear in fields that begin and end with a quote, and they need to be escaped as shown.
Use the nochardel modifier.
If you use '|' as a column delimiter, you must use 0x7C and not 0x7x:
MODIFIED BY coldel0x7C keepblanks nochardel
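Putting both fixes together, the corrected command would look roughly like this (untested sketch based on the original statement above):

db2 "LOAD CLIENT FROM ABC.FILE OF DEL MODIFIED BY coldel0x7C keepblanks nochardel REPLACE INTO tablename(ID,Name,Description) NONRECOVERABLE"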
I'm trying to write the grammar rules for strings in a pseudo-language based on Python, and I'm wondering how I can express the following:
The string starts and ends with " or '. It can include any character except \, ", ' and \n; those characters can only be included when preceded by a backslash, for example:
'Mark said, \"Boo!\"\n' (accepted)
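For illustration only, a sketch of that rule as a regular expression (this assumes the excluded characters are the backslash, both quote characters and the newline, and that the opening and closing quotes must match):

"(\\.|[^"'\\\n])*"|'(\\.|[^"'\\\n])*'

Each alternative allows either an escaped character (\\.) or any single character that is not a quote, a backslash or a newline.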
I'm trying to load a CSV that contains the character "|", without success.
Can I escape it, or use some other technique?
Can you help?
Thanks
If you are using '|' as your delimiter and some fields also contain '|', you can escape them as '\|'. (Or with some other character, if you've changed your escape character. But by default, '\'.)
If you have a lot of these, it might be easier to change your delimiter character. It doesn't have to be '|'. For example, you can do this:
=> COPY t1 FROM '/data/*.csv' DELIMITER '+';
You can use any ASCII value in the range E'\000' to E'\177', inclusive. See the documentation for COPY parameters.
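For example, keeping '|' as the delimiter and escaping it in the data, or switching to a non-printing delimiter, might look like this (untested sketch reusing the t1 table and path from the example above):

=> COPY t1 FROM '/data/*.csv' DELIMITER '|';
=> COPY t1 FROM '/data/*.csv' DELIMITER E'\001';

In the first case a field containing a pipe would appear in the file as a\|b; in the second, the data files would have to use the 0x01 byte between fields.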
For Hive version 0.14:
Can we provide a custom record delimiter "\r\r\n" instead of the defaults ["\r", "\n", "\r\n"]?
Because of the default line separators, in my case 2 lines become 4 lines in Hive, whereas I need "\r\r\n" to be the line separator.
Though there is a custom field-delimiter loader (org.apache.pig.piggybank.storage.MyRegExLoader), there is no custom record delimiter, so I converted the extra newlines to nulls using Pig and used newline as the record delimiter.
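The Pig cleanup step described above might look roughly like this (a hypothetical reconstruction, assuming the records themselves contain no embedded newlines, so the \r\r\n terminators only leave empty lines behind; the paths are placeholders):

-- split on the default line terminators, then drop the empty lines left over from \r\r\n
raw   = LOAD '/data/input_with_crcrlf' USING TextLoader() AS (line:chararray);
clean = FILTER raw BY line != '';
STORE clean INTO '/data/newline_delimited' USING PigStorage();

The cleaned output then has plain \n record delimiters, which Hive handles by default.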
How do we handle data in Hive when \t appears in a value and the field delimiter is also \t? Suppose, for example, there is a column Street of type String with the value XXX\tYYY, and the table was created with \t as the field delimiter. How will the delimiter work? Will the \t inside the value also be treated as a delimiter?
If your columns with \t in the values are enclosed by a quote character such as ", then you could use csv-serde to parse the data, like this:
Here is a sample dataset that I have loaded:
R1Col1 R1Col2 "R1Col3 MoreData" R1Col4
R2Col1 R2Col2 "R2Col3 MoreData" R2Col4
Register the jar from hive console
hive> add jar /path/to/csv-serde-1.1.2-0.11.0-all.jar;
Create a table with the specified serde and custom properties
hive> create table test_table(c1 string, c2 string, c3 string, c4 string)
> row format serde 'com.bizo.hive.serde.csv.CSVSerde'
> with serdeproperties(
> "separatorChar" = "\t",
> "quoteChar" = "\"",
> "escapeChar" = "\\"
> )
> stored as textfile;
Load your dataset into the table:
hive> load data inpath '/path/to/file/in/hdfs' into table test_table;
Do a select * from test_table to check the results
You can download the csv-serde jar from its project page.
Yes, it will treat it as a delimiter, the same as if you had a semicolon (;) in a value and told it to split on semicolons: when the text is scanned, it will see the character and interpret it as the edge of the field.
To get around this, I have either used sed to find-and-replace characters before loading the data into Hive, or created the Hive table with different delimiters, or left it at the default ^A (\001) and then, when I extracted it, used sed on the output to replace the \001 with commas, tabs, or whatever I needed. Running sed -i 's/oldval/newval/g' file on the command line will replace the characters in your file in place.
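For example, with GNU sed (which understands \x01 and \t escapes), converting the \001 delimiters in an extracted file to tabs would be something like this; the file name is just a placeholder:

sed -i 's/\x01/\t/g' extracted_output.txt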
Is there a reason you chose to make the table with \t as the delimiter, instead of the default Hive field delimiter of ^A? Since tab is a fairly common character in text, and Hadoop/Hive is used a lot for handling text, it is tough to find a good character for delimiting.
We have faced the same issue in our data loads into Hadoop clusters. What we did was add \\t wherever the delimiter appeared inside a data field, and add the clause below to the table definition.
row format delimited fields terminated by '\t' escaped by '\\' lines terminated by '\n'
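A complete table definition using that clause might look like this (table and column names are made up for illustration):

hive> create table addresses(street string, city string)
    > row format delimited
    >   fields terminated by '\t'
    >   escaped by '\\'
    >   lines terminated by '\n'
    > stored as textfile;

With escaped by '\\', a tab that is preceded by a backslash in the data file is loaded as part of the value instead of being treated as a field separator.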