In Pig, when using the LOAD data-flow step, what is the difference between using `USING PigStorage` and not using it? - hadoop

I want to know the difference between the two steps below:
movie2 = load 'movie/part-m-00000' as (mid:int, mname:chararray, myr:int);
movie2 = load 'movie/part-m-00000' using PigStorage(',') as (mid:int, mname:chararray, myr:int);

The default is PigStorage, which reads text files whose fields are separated by a delimiter; the default delimiter is the tab character.
Specifying USING PigStorage(',') changes the delimiter to a comma.
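To see why the delimiter matters for the file above, here is a minimal Python sketch of how the two LOAD statements split a record (an illustration only; PigStorage does a plain split on the delimiter, with no quote handling):

```python
line = "1,Toy Story,1995"

# load ... as (...) -> default PigStorage, tab delimiter:
# the whole line becomes one field, so mname and myr come out null
print(line.split("\t"))   # ['1,Toy Story,1995']

# load ... using PigStorage(',') as (...) -> comma delimiter:
print(line.split(","))    # ['1', 'Toy Story', '1995']
```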

Adding to rsp's answer, there are 2 advantages of using PigStorage explicitly:
1) the option to specify the field delimiter
2) the option to load the schema of the input or not
More details here: http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/PigStorage.html

Related

Adding column at the end to pipe delimited file in NiFi

I have this pipe-delimited file on an SFTP server:
PROPERTY_ID|START_DATE|END_DATE|CAPACITY
1|01-JAN-07|31-DEC-30|101
2|01-JAN-07|31-DEC-30|202
3|01-JAN-07|31-DEC-30|151
4|01-JAN-07|31-DEC-30|162
5|01-JAN-07|31-DEC-30|224
I need to transfer this data to an S3 bucket using NiFi. In the process I need to add another column at the end containing today's date:
PROPERTY_ID|START_DATE|END_DATE|CAPACITY|AS_OF_DATE
1|01-JAN-07|31-DEC-30|101|20-10-2020
2|01-JAN-07|31-DEC-30|202|20-10-2020
3|01-JAN-07|31-DEC-30|151|20-10-2020
4|01-JAN-07|31-DEC-30|162|20-10-2020
5|01-JAN-07|31-DEC-30|224|20-10-2020
What is the simplest way to implement this in NiFi?
#Naga here is a very similar post that describes ways to add a new column to a CSV:
Apache NiFi: Add column to csv using mapped values
The simplest way is ReplaceText: append the same "|20-10-2020" to each line. In the ReplaceText settings, set Evaluation Mode to Line-by-Line, Search Value to (.+), and Replacement Value to $1|20-10-2020. The other methods are additional ways to do it more dynamically, for example if the date isn't static.

Loading a CSV file with newline characters into Hive

We have a file of the following type:
1- Sam, Joshua , "52 DD dr,
Lake Hiawatha" , New Jersey, 07034
2- Ruchi,kumari,SNN Raj serenity,Bengaluru, 560068
Line 1 is split into 2 rows in the external table: the rest of the columns are null in the 1st row, and the 2nd row has the rest of the data.
I need assistance on the best way to load this so the quoted value lands in a single column. I went through a couple of solutions on the web, but they were not clear.
Tried the following options:
1) Used the Regex Serde
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
but it did not work.
2) CSVInputFormat from github
https://github.com/mvallebr/CSVInputFormat
But not able to use it.
I tried the following options, and the last one worked for me:
1) Regex - for this newline scenario the regex is very complicated, and it did not work.
2) The CSV parser provided by https://github.com/mvallebr/CSVInputFormat; I also had a chat with the author on how to use it. I tried multiple options, but it did not work.
3) The quick, simple fix is the legacy method: replace the newlines in the file using a shell or Perl command before loading. That worked smoothly, and it seems to be the more feasible and easy option.
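A minimal Python sketch of that pre-processing step (a simplified illustration, not the exact command used: Python's csv module already joins quoted fields that span physical lines, so we only have to rewrite the embedded newlines as spaces):

```python
import csv
import io

raw = '''1,Sam,Joshua,"52 DD dr,
Lake Hiawatha",New Jersey,07034
2,Ruchi,kumari,SNN Raj serenity,Bengaluru,560068'''

# csv.reader reassembles the quoted field that spans two physical lines;
# replacing "\n" inside each field flattens every record onto one line
cleaned = []
for row in csv.reader(io.StringIO(raw)):
    cleaned.append([field.replace("\n", " ") for field in row])

for row in cleaned:
    print(",".join(row))
```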

Issue with comma as a delimiter in Pig Latin for a free-text column

I am loading a file with PigStorage. The file has a column, Newvalue, a free-text column which includes commas. When I specify comma as the delimiter, this gives me a problem. I am using the following code:
inpt = load '/home/cd36630/CRM/1monthSample.txt' USING PigStorage(',')
AS (BusCom:chararray,Operation:chararray,OperationDate:chararray,
ISA:chararray,User:chararray,Field:chararray,Oldvalue:chararray,
Newvalue:chararray,RecordId:chararray);
Any help is appreciated.
If the input is in CSV form then you can use CSVLoader to load it. This may fix your issue.
If that doesn't work, you can load each line into a single chararray and then write a UDF to split the line in a way that respects the commas in Newvalue. E.g.:
register 'myudfs.py' using jython as myudfs ;
A = LOAD '/home/cd36630/CRM/1monthSample.txt' AS (total:chararray) ;
B = FOREACH A GENERATE myudfs.prepare_input(total) ;
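A sketch of what such a prepare_input function could look like (the body is an assumption for illustration: it relies on Newvalue being the only comma-containing field, sitting between the 7 leading fields and the trailing RecordId):

```python
def prepare_input(total):
    # Split off the 7 leading fields (BusCom .. Oldvalue); the remainder
    # is "Newvalue,RecordId", where Newvalue may itself contain commas.
    parts = total.split(",", 7)
    rest = parts.pop()
    # Peel RecordId off the right, leaving Newvalue intact.
    newvalue, record_id = rest.rsplit(",", 1)
    return tuple(parts + [newvalue, record_id])
```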

PIG - HBASE - Casting values

I'm using PIG to process rows in an HBase table. The values in the HBase table are stored as bytearrays.
I can't figure out if I have to write a UDF that casts bytearrays to various types, or if pig does that automatically.
I have the following script:
raw = LOAD 'hbase://TABLE' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('CF:I') AS (product_id:bytearray);
ids = FOREACH raw GENERATE (int)product_id;
dump ids;
The dump outputs a list of empty parentheses, '()'.
According to the docs, it should work. I checked the values in the HBase shell; they are all value=\x00\x00\x00\x02.
How can I get this to work?
I needed to add the following option to get it to cast:
LOAD 'hbase://TABLE' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('CF:I','-caster HBaseBinaryConverter') AS (product_id:bytearray);
Thanks to this post.
If you have non-text values in the columns, you need to specify the -caster option with HBaseBinaryConverter (the default is Utf8StorageConverter) and map the columns to their respective types so that Pig casts them properly before serializing them as text.
a = load 'hbase://TESTTABLE_1' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('TESTCOLUMN_A TESTCOLUMN_B TESTCOLUMN_C ','-loadKey -caster HBaseBinaryConverter') as (rowKey:chararray,col_a:int, col_b:double, col_c:chararray);
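The difference between the two converters can be sketched in Python (a simplified illustration, not the actual converter code): HBase stored the int 2 as four big-endian bytes, which a binary converter decodes numerically, while treating the same bytes as UTF-8 text produces a string that cannot be cast to an int.

```python
import struct

raw = b"\x00\x00\x00\x02"             # value=\x00\x00\x00\x02 from the hbase shell

# Like HBaseBinaryConverter: decode as a big-endian 4-byte int
as_int = struct.unpack(">i", raw)[0]
print(as_int)                         # 2

# Like Utf8StorageConverter: interpret the bytes as text - the result is
# control characters, so Pig's (int) cast yields null, hence the empty tuples
as_text = raw.decode("utf-8")
```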

Reading files in Pig where the delimiter appears in the data

I want to read a CSV file using Pig. What should I do? I used LOAD with PigStorage(','), but it fails to read the CSV file properly because wherever it encounters a comma in the data, it splits on it. How should I specify the delimiter when commas also appear in the data?
It's generally impossible to distinguish a comma in the data from a comma used as a delimiter.
You will need to escape the commas in your data and use a custom load function (for Pig) that can recognize escaped commas.
Take a look here:
http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html
http://pig.apache.org/docs/r0.7.0/udf.html#Load%2FStore+Functions
Have you had a look at the CSVLoader in the PiggyBank, if you want to read a CSV file? (Of course, the file format needs to be valid.)
First make sure you have a valid CSV file. If you haven't, try changing the source file through Excel (if the file is small) or another tool and export a new CSV with a delimiter that does not occur in your data (e.g. \t tab, ;, etc.). Even better, do another extract with a "good" delimiter.
Example of your load can be then something like this:
TABLE = LOAD 'input.csv' USING PigStorage(';') AS ( site_id: int,
name: chararray, ... );
Example of your DUMP:
STORE TABLE INTO 'clean.csv' USING PigStorage(','); -- use the delimiter that suits you best
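When re-exporting is not an option, the usual CSV answer is quoting, which is what CSVLoader understands; Python's csv module shows the convention in one line (an illustration, not Pig code):

```python
import csv
import io

# A comma inside a quoted field is data, not a delimiter:
row = next(csv.reader(io.StringIO('1,"Smith, John",NJ')))
print(row)   # ['1', 'Smith, John', 'NJ']
```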
