Can we split a file column by :: delimiter in pig - hadoop

I am trying to read a file that uses a double colon (::) as its delimiter. I am using CSVExcelStorage, but it gives this error:
could not instantiate 'org.apache.pig.piggybank.storage.CSVExcelStorage' with arguments '[::]'
So is there any way to read a file using a custom delimiter?

You can use PigStorage with your custom delimiter.

You are probably missing the quotes.
REGISTER /usr/lib/pig/piggybank.jar;
A = LOAD 'Test.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage('::');
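If the loader only accepts a single-character delimiter (an assumption; the error above does not say why instantiation failed), one workaround is to rewrite the file before loading it. The following Python sketch is a preprocessing step that runs outside Pig. The function names and the sample record are hypothetical, and it assumes that no field contains :: or the replacement character.

```python
def split_double_colon(line):
    """Split one record on the two-character delimiter '::'."""
    return line.rstrip("\r\n").split("::")


def to_single_char_delimited(line, new_delim="\t"):
    """Rewrite a '::'-delimited record to use a one-character delimiter."""
    fields = split_double_colon(line)
    # Refuse to emit ambiguous output if a field already holds the new delimiter.
    if any(new_delim in field for field in fields):
        raise ValueError("field already contains the replacement delimiter")
    return new_delim.join(fields)


if __name__ == "__main__":
    # Hypothetical sample record, not taken from the question's file.
    print(to_single_char_delimited("1::Toy Story (1995)::Animation"))
```

The rewritten file could then be loaded with a single-character delimiter, for example PigStorage('\t').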

Related

How to insert text starting with double quotes in a column delimited with | in an import command in db2

Table contains 3 columns
ID - integer
Name - varchar
Description - varchar
A file with .FILE extension has data with delimiter as |
Eg: 12|Ramu|"Ramu" is an architect
Command I am using to load data to db2:
db2 "Load CLIENT FROM ABC.FILE of DEL MODIFIED BY coldel0x7x keepblanks REPLACE INTO tablename(ID,Name,Description) nonrecoverable"
Data is loaded as follows:
12 Ramu Ramu
but I want it as:
12 Ramu "Ramu" is an architect
Take a look at how the format of delimited ASCII files is defined. The double quote (") is an optional delimiter for character data, so you would need to escape it. I have not tested it, but I would assume that you double the quote as you would in SQL:
12|Ramu|"""Ramu"" is an architect"
Delimited files (CSV) are defined in RFC 4180. You need to either quote the entire field or not quote it at all. Quotes may appear inside a field only when the field itself begins and ends with a quote, and they must then be escaped as shown.
Use the nochardel modifier.
If you use '|' as a column delimiter, you must use 0x7C and not 0x7x:
MODIFIED BY coldel0x7C keepblanks nochardel
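The two approaches above (doubling the quotes inside a fully quoted field, or switching quote handling off) can be illustrated outside DB2. This is a minimal Python sketch that uses the standard csv module as a stand-in; DB2's own parser may differ in details, and the function name parse_pipe_line is hypothetical.

```python
import csv
import io


def parse_pipe_line(line, honor_quotes=True):
    """Parse one '|'-delimited record.

    honor_quotes=True treats '"' as a character delimiter, so doubled
    quotes inside a quoted field become one literal quote.
    honor_quotes=False keeps every '"' as ordinary data, which is similar
    in spirit to the nochardel modifier.
    """
    quoting = csv.QUOTE_MINIMAL if honor_quotes else csv.QUOTE_NONE
    reader = csv.reader(io.StringIO(line), delimiter="|", quoting=quoting)
    return next(reader)


if __name__ == "__main__":
    # Escaped form: the whole field is quoted and inner quotes are doubled.
    print(parse_pipe_line('12|Ramu|"""Ramu"" is an architect"'))
    # Raw form, read with quote handling switched off.
    print(parse_pipe_line('12|Ramu|"Ramu" is an architect', honor_quotes=False))
```

Both calls yield the same third field, with the quotes around Ramu preserved.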

Converting a TXT file with double quotes to a pipe-delimited format using sed

I'm trying to convert TXT files into pipe-delimited text files.
Let's say I have a file called sample.csv:
aaa",bbb"ccc,"ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","nnn"ooo,ppp"qqq",rrr" sss,"ttt,""uuu",Z
I'd like to convert this into an output that looks like this:
aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
Now after tons of searching, I have come the closest using this sed command:
sed -r 's/""/\v/g;s/("([^"]+)")?,/\2\|/g;s/"([^"]+)"$/\1/;s/\v/"/g'
However, the output that I received was:
aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|pppqqq|rrr" sss|ttt,"uuu|Z
Where the expected for the 9th column should have been ppp"qqq" but the result removed the double quotes and what I got was pppqqq.
I have been playing around with this for a while, but to no avail.
Any help regarding this would be highly appreciated.
As suggested in the comments, sed (or any other line-oriented Unix tool) is not recommended for this kind of complex CSV string. It is much better to use a dedicated CSV parser, like this in PHP:
$s = 'aaa",bbb"ccc,"ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","nnn"ooo,ppp"qqq",rrr" sss,"ttt,""uuu",Z';
echo implode('|', str_getcsv($s));
aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|nnnooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
The problem with sample.csv is that it mixes non-quoted fields (containing quotes) with fully quoted fields (that should be treated as such).
You can't have both at the same time. Either all fields are (treated as) unquoted and quotes are preserved, or all fields containing a quote (or separator) are fully quoted and the quotes inside are escaped with another quote.
So, sample.csv should become:
"aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z
to give you the desired result (using a csv parser):
aaa"|bbb"ccc|ddd,eee|fff|ggg,hhh,iii|jjj kkk|lll" mmm|"nnn"ooo|ppp"qqq"|rrr" sss|ttt,"uuu|Z
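For comparison, the same conversion of the fully quoted line can be sketched in Python with the standard csv module. This is only an illustration, not the method used in the answers above; the function name csv_line_to_pipes is hypothetical, and the plain join assumes that no field contains a | character.

```python
import csv
import io

# The corrected, fully quoted line from above.
SAMPLE = '"aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z'


def csv_line_to_pipes(line):
    """Parse one CSV record and join its fields with '|'."""
    fields = next(csv.reader(io.StringIO(line)))
    # Plain join, like PHP's implode: the output is not re-quoted.
    return "|".join(fields)


if __name__ == "__main__":
    print(csv_line_to_pipes(SAMPLE))
```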
I had the same problem.
I got the right result with https://www.papaparse.com/demo
Papa Parse is FOSS on GitHub, so you can check how it works.
With the source of [ "aaa""","bbb""ccc","ddd,eee",fff,"ggg,hhh,iii","jjj kkk","lll"" mmm","""nnn""ooo","ppp""qqq""","rrr"" sss","ttt,""uuu",Z ]
The result appears in the browser console:
[1]: https://i.stack.imgur.com/OB5OM.png

How to use Regex Capturing Group variable in NiFi Expression Language?

I am trying to replace a date format in all lines of text file using NiFi. The file looks like this:
ABCDE,20200619,23.8
FGHIJ,20200619,14.5
...
I am trying to do this using the ReplaceText processor to change 20200619 to 2020-06-19. I've written a regular expression matching the date, (20\d{6},), and I have checked that it works: when I write $1 TESTING, in Replacement Value it behaves as expected (a single line of the file looks like ABCDE,20200619, TESTING,23.8).
The problem is when I try to use Expression Language and :substring function. This is my code in Replacement value:
${$1:substring(0, 4)}-${$1:substring(4, 6)}-${$1:substring(6, 8)}
But I get the following error:
[image: NiFi Error]
It looks like the Expression Language can't access my $1 variable. How can I access my Regex Capturing Group variable inside Expression Language?
This is my processor:
[image: NiFi Processor]
I found the answer: to access a regex capturing group inside ${...}, we need to wrap it in single quotes, so code like this works:
${'$1':substring(0, 4)}-${'$1':substring(4, 6)}-${'$1':substring(6, 8)}
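The slicing that the substring calls perform can be checked outside NiFi. Below is a small Python sketch of the same transformation; it is not NiFi Expression Language, the names are hypothetical, and unlike the pattern in the question it keeps the trailing comma outside the capturing group and puts it back explicitly.

```python
import re

# Date-like field: "20" followed by six digits, then the field separator.
DATE_FIELD = re.compile(r"(20\d{6}),")


def add_date_dashes(line):
    """Rewrite yyyyMMdd fields as yyyy-MM-dd, keeping the trailing comma."""
    def _format(match):
        raw = match.group(1)
        # Same index ranges as substring(0, 4), (4, 6) and (6, 8).
        return f"{raw[0:4]}-{raw[4:6]}-{raw[6:8]},"
    return DATE_FIELD.sub(_format, line)


if __name__ == "__main__":
    print(add_date_dashes("ABCDE,20200619,23.8"))
```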

Multi-line JSON read using Apache PIG

I have a JSON file and want to read using Apache Pig.
I tried using the regular JsonLoader, but it looks like JsonLoader works only with single-line JSON. Then I tried Elephant-Bird, but I am still not able to see the results correctly. Can anyone please suggest a solution?
Input :
{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}
Note: I don't want to convert the input into a single line.
Script:
A = LOAD 'input' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
B = FOREACH A GENERATE FLATTEN($0#'employees');
Dump B;
Expected result should be :
([firstName#John,lastName#Doe])
([firstName#Anna,lastName#Smith])
([firstName#Peter,lastName#Jones])
As mentioned in the comments by siva, the answer is basically that you do need to convert your input to a single line.
JsonLoader and the Elephant-Bird loader only work with single-line JSON; they will not work with multi-line input. You need to convert your input to a single line before passing it to Pig. One workaround is to write a shell script that uses sed to collapse the multi-line input into a single line and then calls the Pig script.
This link will help you call Pig through a shell script.
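As an alternative to sed for the conversion step, here is a minimal Python sketch that collapses a multi-line JSON document into one line with the standard json module. The function name is hypothetical, and it assumes the whole file is a single JSON document that fits in memory.

```python
import json


def json_to_single_line(text):
    """Re-serialize a JSON document on one line, without extra whitespace."""
    return json.dumps(json.loads(text), separators=(",", ":"))


if __name__ == "__main__":
    multi_line = """{"employees":[
{"firstName":"John", "lastName":"Doe"},
{"firstName":"Anna", "lastName":"Smith"},
{"firstName":"Peter", "lastName":"Jones"}
]}"""
    print(json_to_single_line(multi_line))
```

Parsing and re-serializing, rather than deleting newline characters, leaves any newlines inside string values correctly escaped.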

Exporting delimited Text file to Excel file using Shell Script

My text file is delimited by the pipe character '|'.
I want to export it to an Excel file (xls) using a script in Unix.
Can anyone please help?
My suggestion would be:
Convert the delimiter | to ,
Save the file with a csv extension.
Open the file in Excel.
Note: if the file contents include commas other than as the field separator, this approach will not work.
If you want to convert your file to .xls format then you will have to use the Apache POI library. It has Perl support.
If you just want to open it in Excel then you can directly use "open with Excel" and set the separator to |.
Or put every field in " " and use , as the separator. If a field is within "" then a comma within the text will not be an issue, but double quotes within the text will be a problem.
To avoid all of this you can use some other ASCII character as the separator.
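The comma and double-quote problems mentioned above can also be handled by letting a CSV writer do the quoting. This is a minimal Python sketch using the standard csv module; it produces CSV text that Excel can open, not a native .xls file. The function name and the sample rows in the comments are hypothetical, and it assumes that | never appears inside a field.

```python
import csv
import io


def pipe_to_csv(pipe_text):
    """Convert '|'-delimited text to comma-separated text.

    Fields containing commas or double quotes are quoted by the writer,
    and embedded double quotes are doubled.
    """
    # QUOTE_NONE: read '"' in the input as ordinary data.
    reader = csv.reader(io.StringIO(pipe_text), delimiter="|",
                        quoting=csv.QUOTE_NONE)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical sample rows.
    print(pipe_to_csv('id|name|note\n1|Ramu|likes a, b\n2|Anna|said "hi"\n'))
```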
