Having trouble reading a variable using FOREACH in Pig Latin - hadoop

I am having trouble with the following Pig code.
The previous alias I need to read via FOREACH has the following DESCRIBE:
UnionD1D2_Distinct: {UnionD1D2_Foreach1::null::display_site: chararray,
                     UnionD1D2_Foreach1::efectivos_click: long,
                     UnionD1D2_Foreach2::null::display_site: chararray,
                     UnionD1D2_Foreach2::total_click: long}
And here is some example data:
(linuxlife.example.com,113,linuxlife.example.com,5343)
(mobilesource.example.com,211,mobilesource.example.com,8120)
(siliconshore.example.com,170,siliconshore.example.com,7764)
(printoperator.example.com,62,printoperator.example.com,2724)
So, the FOREACH that reads the data is:
UnionD1D2_Calc = FOREACH UnionD1D2_Distinct
GENERATE
(UnionD1D2_Distinct.UnionD1D2_Foreach1::efectivos_click1/UnionD1D2_Distinct.UnionD1D2_Foreach2::total_click2)*100 AS ctr;
But I always get the following:
ERROR 1066: Unable to open iterator for alias UnionD1D2_Calc. Backend
error : Scalar has more than one row in the output. 1st :
(filmport.example.com,121,filmport.example.com,5395), 2nd
:(firesale.example.com,129,firesale.example.com,5452)
What am I doing wrong?

When you're using FOREACH on an alias, you don't repeat the alias name to refer to a field. Prefixing a field with the relation name (UnionD1D2_Distinct.UnionD1D2_Foreach1::efectivos_click) makes Pig treat the relation as a scalar, which is only valid when it has exactly one row; that is exactly why you get the "Scalar has more than one row in the output" error. Just use UnionD1D2_Foreach1::efectivos_click. Also note that per your DESCRIBE the fields are named efectivos_click and total_click, without the trailing digits.
Please try:
UnionD1D2_Calc = FOREACH UnionD1D2_Distinct GENERATE
    (UnionD1D2_Foreach1::efectivos_click / UnionD1D2_Foreach2::total_click) * 100 AS ctr;
And let us know if you get the same error.
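One more thing to watch: both fields are longs, so the division above is integer division and truncates to zero whenever clicks are less than totals. A minimal sketch with explicit casts (same aliases as above):

UnionD1D2_Calc = FOREACH UnionD1D2_Distinct GENERATE
    ((double)UnionD1D2_Foreach1::efectivos_click /
     (double)UnionD1D2_Foreach2::total_click) * 100.0 AS ctr;
-- e.g. (113 / 5343) * 100 gives roughly 2.11 instead of 0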

Related

PIG: Process tuples in a bag

I have a data set that looks like this after a GROUP operation:
input = key1|{(a1,b1,c1),(a2,b2,c2)}
key2|{(a3,b3,c3),(a4,b4,c4),(a5,b5,c5)}
I need to traverse the above to generate the final output like this:
<KEY>key1</KEY>|
<VALUES><VALUE><VALUE1>a1</VALUE1><VALUE2>b1</VALUE2><VALUE3>c1</VALUE3></VALUE><VALUE><VALUE1>a2</VALUE1><VALUE2>b2</VALUE2><VALUE3>c2</VALUE3></VALUE></VALUES>
<KEY>key2</KEY>| ...
I have tried to use FLATTEN and CONCAT to achieve this result, in the following manner:
A = FOREACH input GENERATE key, FLATTEN(input);
output = FOREACH A GENERATE CONCAT('<KEY>',CONCAT(input.key,'</KEY>')),
CONCAT('<VALUE>',''),
CONCAT('<VALUE1>',CONCAT(input.col1,'</VALUE1>')
...
But this does not give the desired output. I'm fairly new to Pig, so I don't know if this is possible.
If you FLATTEN your bag, you'll end up with as many new 'rows' as there were elements in the bag:
key1|(a1,b1,c1)
key1|(a2,b2,c2)
If I understand your problem correctly, what you want is this:
Use the BagToTuple built-in function.
Then you'll get
key1|(a1,b1,c1,a2,b2,c2)
After this you can format your data, e.g. with a UDF.
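A minimal sketch, assuming the grouped relation is called input with a key field and a bag field vals (both names are assumptions):

-- BagToTuple (built in since Pig 0.11) packs all tuples of a bag into one tuple
one_row = FOREACH input GENERATE key, FLATTEN(BagToTuple(vals));
DUMP one_row;
-- (key1,a1,b1,c1,a2,b2,c2)
-- (key2,a3,b3,c3,a4,b4,c4,a5,b5,c5)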

Cannot use -tagPath and schema at the same time in PigStorage LOAD

I'm seeing some interesting behaviour with PigStorage and its -tagPath option, and I do not know whether I am doing something wrong (a wrong schema definition?) or whether this is a limitation/bug in Pig.
My file looks like this (the most basic file I was able to come up with):
A
B
Now I can load this file and project a column from it just fine:
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';') AS (char: chararray);
DUMP vals;
one_column = FOREACH vals GENERATE char;
DUMP one_column;
Results in:
(A)
(B)
(A)
(B)
However, when I try to fetch the file path with -tagPath (I need it when I access a whole folder of data), the data gets loaded correctly into the first alias, but I cannot project a column from it.
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';', '-tagPath')
AS (filepath: chararray, char: chararray);
DUMP vals;
one_column = FOREACH vals GENERATE char;
DUMP one_column;
Results in:
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt)
However, when I first read the data without a schema and then add a schema using a FOREACH, it works fine again:
vals = LOAD '/user/guest/test.txt'
USING PigStorage(';', '-tagPath');
vals_n = FOREACH vals GENERATE (chararray)$0 AS filepath, (chararray)$1 AS char;
DUMP vals_n;
one_column = FOREACH vals_n GENERATE char;
DUMP one_column;
Results in:
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,A)
(hdfs://sandbox.hortonworks.com:8020/user/guest/test.txt,B)
(A)
(B)
So is there any way, I can use -tagPath and schema in the LOAD phase at the same time?
This happens because Pig tries to determine automatically which columns the script uses and loads only those. When -tagFile or -tagPath is used, this column pruning seems to get confused.
The solution is to run the Pig script with this column detection disabled:
pig -x mapreduce -t ColumnMapKeyPrune
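For example, with the script saved as myscript.pig (a hypothetical name):

pig -x mapreduce -t ColumnMapKeyPrune myscript.pig

Here -t is short for -optimizer_off, and it disables only the named optimizer rule, so the rest of the plan is still optimized.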

Store pig result in a text file

Hi Stack Overflow community;
I'm totally new to Pig. I want to STORE the result in a text file and name it as I want. Is it possible to do this using the STORE function?
My code:
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput';
Thanks.
Yes, you will be able to store your result under myoutput.txt, and you can write the data with any delimiter you want using PigStorage. Note that Pig creates myoutput.txt as a directory containing part files, not as a single file.
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput.txt' USING PigStorage(';');
Yes, it is possible. b will contain 15 fields per row, selected from columns $0 through $25 of the input.
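If you need the result as one local file with an exact name, a common follow-up step (assuming the myoutput.txt output directory above) is to merge the part files:

hadoop fs -getmerge myoutput.txt /local/path/myoutput.txt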

Pig reading data as databytearray

Hey guys, I have one more question; I am just not able to understand this behavior of Pig.
I am loading data into Pig and, after some transformation, storing it using PigStorage() on HDFS (/user/sga/transformeddata).
But when I load the data from /user/sga/transformeddata and do
temp = load '/user/sga/transformeddata' using PigStorage();
gen = foreach temp generate page_type;
dump gen;
I get the following error:
DataByteArray cannot be cast to java.lang.String
But if I do
gen = foreach temp generate *;
dump gen;
it works fine.
Any help understanding this is appreciated.
As requested, here is the code:
STORE union_of_all_records INTO '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
union_of_all_records is a Pig alias.
Now, another script consumes this data:
lookup_data =
LOAD '/staged/google/page_type_map_file/' using PigStorage() AS (page_type:chararray,page_type_classification:chararray);
load_denorm_clickstream_record =
LOAD '/staged/google/data_after_denormalization' using PigStorage('\t','-schema');
and I join these two aliases:
denorm_clickstream_record = LIMIT load_denorm_clickstream_record 100;
join_with_lookup =
JOIN denorm_clickstream_record BY page_type LEFT OUTER, lookup_data BY page_type;
step x : final_output =
FOREACH join_with_lookup
GENERATE denorm_clickstream_record::page_type as page_type;
At step x I get the above error.
I think you have two options:
1) Tell Pig the schema that the data has. For example:
temp = load '/user/sga/transformeddata' using PigStorage() AS (page_type:chararray);
2) When you first store the data, tell PigStorage to store the schema information as well: PigStorage('\t', '-schema'). When you then load the data as you do above, PigStorage should read the schema from that schema file.
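If you go with option 1 but don't want to spell out the whole schema, an explicit cast of the positional field also avoids the DataByteArray cast error (a minimal sketch; the field position is an assumption):

temp = load '/user/sga/transformeddata' using PigStorage('\t');
gen = foreach temp generate (chararray)$0 AS page_type;
dump gen;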

Pig multiple store commands creating duplicate work

I have a Pig script which reads input from a file and passes each record to our custom UDF, which returns a map with 2 key/value pairs. After that we have to save each key's value in 2 different locations, which we do using STORE. The problem we are facing is that each STORE command in our Pig script invokes our custom UDF again.
REGISTER MyUDF.jar;
LOADFILE = LOAD '$file' AS (record:chararray);
MAPREC = FOREACH LOADFILE GENERATE MyUDF(record);
ERRLIST = FOREACH MAPREC GENERATE $0#'errorRecord' AS ErrorRecord;
ERRLIST = FILTER ERRLIST BY ErrorRecord IS NOT NULL;
MLIST = FOREACH MAPREC GENERATE $0#'mInfo' AS MRecord;
MLIST = FILTER MLIST BY MRecord IS NOT NULL;
STORE MLIST INTO 'fileOut';
STORE ERRLIST INTO 'errorDir';
Is there a way in a Pig script to have the UDF invoked only once, even when we have multiple STORE statements?
I think that what's happening under the covers is that MAPREC isn't materialized by its assignment statement. Pig waits until MAPREC is used (which is twice here) to figure out what it contains, and computes it each time. I suggest creating an intermediate structure with a single FOREACH over MAPREC that extracts both map values at once; that way MyUDF's output is consulted in one place, and the two branches can reuse that intermediate result instead of MAPREC. Hope that makes sense.
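A sketch of that idea, using the aliases from the script above (whether this actually stops the recomputation can depend on the Pig version and its multi-query optimizer, so it is worth checking with EXPLAIN):

-- consult the UDF's output map once, projecting both keys together
BOTH = FOREACH MAPREC GENERATE $0#'errorRecord' AS ErrorRecord, $0#'mInfo' AS MRecord;
ERRREC = FILTER BOTH BY ErrorRecord IS NOT NULL;
ERRLIST = FOREACH ERRREC GENERATE ErrorRecord;
MREC = FILTER BOTH BY MRecord IS NOT NULL;
MLIST = FOREACH MREC GENERATE MRecord;
STORE MLIST INTO 'fileOut';
STORE ERRLIST INTO 'errorDir';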
