How to load data in Pig from .tsv file? - hadoop

I have a .tsv file containing data in HDFS and I am not able to load it into Pig.
The commands I am using are:
A = load 'file_location' as (name:chararray, age:int, gpa:float);
B = foreach A generate (name, age);
DUMP B;
Error returned: Unable to find operator for alias A

If you do not specify a delimiter, PigStorage falls back to its default. Make the tab delimiter explicit so the fields are split as intended:
A = LOAD 'file_location' USING PigStorage('\t') AS (name:chararray, age:int, gpa:float);
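You can verify that the schema took effect with DESCRIBE (a quick check; the second line is roughly what Pig prints for this schema):
DESCRIBE A;
A: {name: chararray,age: int,gpa: float}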

Do it like this:
A = load 'path/of/file' using PigStorage('\t') AS (name:chararray,age:int,gpa:float);
B = foreach A generate name, age;
DUMP B;
PS: I don't think there is any fault with your commands, as tab ('\t') is the default delimiter for Pig, and I am getting the correct output with your commands. Can you please send me the logs or a screenshot of your terminal?

Related

How to join header row to detail rows in multiple files with Apache Pig

I have several CSV files in an HDFS folder, which I load into a relation with:
source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command
When I dump it, the structure of the source relation is as follows (note that the data is text-qualified, but I will deal with that using the REPLACE function):
("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")
<.... more records ....>
("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")
<.... more records ....>
So each file has a header which provides some information about the data set that follows it, such as the provider of the data and the date range it covers.
So now, how can I transform the above structure into a new relation like the following?
{
(HEADER,20110118,20101218,20110118,T00002),{(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..},
(HEADER,20110224,20110109,20110224,T00002),{(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..},..more tuples..
}
Here each header tuple is followed by a bag of record tuples belonging to that header.
Unfortunately there is no common key field between the header and the detail rows, so I don't think I can use any JOIN operation.
I am quite new to Pig and Hadoop, and this is one of the first proof-of-concept projects I am engaging in.
I hope my question is clear and I look forward to some guidance here.
This should get you started.
Code:
Source = LOAD '$data' USING PigStorage(',', '-tagFile'); -- '-tagFile' prepends the source file name as $0
SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE; -- note: SPLIT does not produce an alias of its own
B = GROUP FileData BY $0;    -- group detail rows by file name
C = GROUP FileHeaders BY $0; -- group header rows by file name
D = JOIN B BY group, C BY group; -- 'group' is the field name GROUP generates
...
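One possible continuation (a sketch, untested; it assumes each file contains exactly one HEADER row, so flattening the header bag yields a single tuple per file):
E = FOREACH D GENERATE FLATTEN(C::FileHeaders), B::FileData AS details; -- header fields followed by a bag of that file's detail rows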

Pig: Read multiple files and append column-wise

I want to achieve this in Pig but am not sure about an efficient way.
I have an input file (with header: COL1,COL2,COL3,COL4,TAG) and multiple "value" files, all with the same format (TAG,VALUE). I want to append the "VALUE" column of each "value" file to the input file, using "TAG" as the key column. So if there are 3 "value" files, the format of the final combined file will be (COL1,COL2,COL3,COL4,TAG,VALUE1,VALUE2,VALUE3).
One approach I can think of is to read each "value" file and join it with the input file incrementally, producing multiple intermediate files.
First, join the input file with one value file; the output will be: COL1,COL2,COL3,COL4,TAG,VALUE1.
This then becomes the new input file to join with the next "value" file, and the output will be COL1,COL2,COL3,COL4,TAG,VALUE1,VALUE2.
Is there a better way ?
You could use COGROUP with multiple relations; it results in only one MapReduce job. The following code was typed without testing, but the idea should work:
header = LOAD 'header_path' using PigStorage(',') AS (COL1,COL2,COL3,COL4,TAG);
tv_1 = LOAD 'tv_1' using PigStorage(',') AS (TAG,VALUE);
tv_2 = LOAD 'tv_2' using PigStorage(',') AS (TAG,VALUE);
tv_3 = LOAD 'tv_3' using PigStorage(',') AS (TAG,VALUE);
joined = COGROUP header BY TAG, tv_1 BY TAG, tv_2 BY TAG, tv_3 BY TAG;
result = FOREACH joined GENERATE
    FLATTEN(header),
    FLATTEN((IsEmpty(tv_1) ? TOBAG(TOTUPLE((chararray)null)) : tv_1.VALUE)) AS VALUE1,
    FLATTEN((IsEmpty(tv_2) ? TOBAG(TOTUPLE((chararray)null)) : tv_2.VALUE)) AS VALUE2,
    FLATTEN((IsEmpty(tv_3) ? TOBAG(TOTUPLE((chararray)null)) : tv_3.VALUE)) AS VALUE3;
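For illustration, with hypothetical inputs (not from the original post):
-- header file: 1,2,3,4,t1
-- tv_1:        t1,10
-- tv_2:        t1,20
-- tv_3:        (no row for tag t1)
-- result row:  (1,2,3,4,t1,10,20,)  -- VALUE3 is null because tv_3 had no match for t1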

Cannot use -tagPath and schema at the same time in PigStorage LOAD

I'm seeing some interesting behaviour with PigStorage and its -tagPath option, and I do not know whether I am doing something wrong (a wrong schema definition?) or whether this is a limitation/bug in Pig.
My file looks like this (the most basic one I was able to come up with):
A
B
Now I can load this file and subselect a column from it just fine:
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';') AS (char: chararray);
DUMP vals;
one_column = FOREACH vals GENERATE char;
DUMP one_column;
Results in:
(A)
(B)
(A)
(B)
However, when I try to fetch the file path with -tagPath (I need it when I access a whole folder of data), the data gets loaded correctly into the first relation, but I cannot subselect a column from it.
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';', '-tagPath')
AS (filepath: chararray, char: chararray);
DUMP vals;
one_column = FOREACH vals GENERATE char;
DUMP one_column;
Results in:
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
However, when I first read the data without a schema and then add one using FOREACH, it works fine again:
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';', '-tagPath');
vals_n = FOREACH vals GENERATE (chararray)$0 AS filepath, (chararray)$1 AS char;
DUMP vals_n;
one_column = FOREACH vals_n GENERATE char;
DUMP one_column;
Results in:
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(A)
(B)
So is there any way I can use -tagPath and a schema in the LOAD phase at the same time?
This happens because Pig tries to work out automatically which columns the script uses and loads only those. When -tagFile or -tagPath is used, this column pruning seems to get confused.
The solution is to run the Pig script with that optimization disabled:
pig -x mapreduce -t ColumnMapKeyPrune
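For example, assuming the script above is saved as tagpath_test.pig (a hypothetical file name):
pig -x mapreduce -t ColumnMapKeyPrune tagpath_test.pig
The -t flag turns off the named optimizer rule, here ColumnMapKeyPrune, for the whole run.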

Store pig result in a text file

Hi Stack Overflow community,
I'm totally new to Pig. I want to STORE the result in a text file and name it as I want. Is it possible to do this using the STORE function?
My code:
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput';
Thanks.
Yes, you will be able to store your result in myoutput.txt, and you can write the data with any delimiter you want using PigStorage:
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput.txt' USING PigStorage(';');
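Note that 'myoutput.txt' here will actually be an HDFS directory containing part files, not a single file; Pig always writes output this way. A common way to collapse it into one local file (a sketch, assuming standard Hadoop tooling; the local path is illustrative):
hdfs dfs -getmerge myoutput.txt /tmp/myoutput.txt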
Yes, it is possible. For each input row, b will hold the 15 fields you project from positions $0 to $25.

Pig reading data as databytearray

Hey guys, I have one more question; I am just not able to understand the behavior of Pig.
I am loading the data into Pig and, after some transformation, storing it using PigStorage() on HDFS (/user/sga/transformeddata).
But when I load the data from the /user/sga/transformeddata location and do
temp = load '/user/sga/transformeddata' using PigStorage();
gen = foreach temp generate page_type;
dump gen;
I get the following error:
DataByteArray cannot be cast to java.lang.String
but if I do
gen = foreach temp generate *;
dump gen;
it works fine.
Any help in understanding this behavior is appreciated.
As requested, here is the code:
STORE union_of_all_records INTO '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
union_of_all_records is an alias in Pig.
Now, another script consumes this data:
lookup_data =
LOAD '/staged/google/page_type_map_file/' using PigStorage() AS (page_type:chararray,page_type_classification:chararray);
load_denorm_clickstream_record =
LOAD '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
and I join these two aliases:
denorm_clickstream_record = LIMIT load_denorm_clickstream_record 100;
join_with_lookup =
JOIN denorm_clickstream_record BY page_type LEFT OUTER, lookup_data BY page_type;
step x: final_output =
FOREACH join_with_lookup
GENERATE denorm_clickstream_record::page_type as page_type;
At step x I get the above error.
I think you have two options:
1) Tell Pig the schema of the data explicitly. For example:
temp = load '/user/sga/transformeddata' using PigStorage() AS (page_type:chararray);
2) When you first store the data, tell PigStorage to store the schema information as well: PigStorage('\t', '-schema'). When you then load the data as you do above, PigStorage should read the schema from the stored schema file.
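A minimal sketch of option 2, reusing the paths from the question (untested, and assuming the stored schema names a page_type field):
STORE union_of_all_records INTO '/staged/google/data_after_denormalization' USING PigStorage('\t', '-schema'); -- writes a .pig_schema file next to the data
temp = LOAD '/staged/google/data_after_denormalization' USING PigStorage('\t', '-schema');
gen = FOREACH temp GENERATE page_type; -- page_type now resolves as chararray instead of bytearray
DUMP gen;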
