Pig multiple store commands creating duplicate work - performance

I have a Pig script which reads input from a file and sends it to our custom UDF, which sends back a Map with 2 key/value pairs. After that we have to save each key/value pair in 2 different locations. We are doing it using STORE. The problem we are facing is that each STORE command in our Pig script invokes our custom UDF.
REGISTER MyUDF.jar;
LOADFILE = LOAD '$file' AS (record:chararray);
MAPREC = FOREACH LOADFILE GENERATE MyUDF(record);
ERRLIST = FOREACH MAPREC {
    GENERATE $0#'errorRecord' AS ErrorRecord;
};
ERRLIST = FILTER ERRLIST BY ErrorRecord IS NOT NULL;
MLIST = FOREACH MAPREC {
    GENERATE $0#'mInfo' AS MRecord;
};
MLIST = FILTER MLIST BY MRecord IS NOT NULL;
STORE MLIST INTO 'fileOut';
STORE ERRLIST INTO 'errorDir';
Is there a way, in the Pig script, to have the UDF invoked only once, even if we have multiple STORE statements?

I think what's happening under the covers is that MAPREC isn't populated by its assignment statement. Pig waits until MAPREC is used (which here is twice) to figure out what it contains. I suggest creating an intermediate structure by using a FOREACH to iterate over MAPREC. That would force MyUDF to be called once, and that intermediate result could then be used in place of MAPREC in the following FOREACH statements. Hope that makes sense.
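For example, something along these lines (untested; it just pulls both map keys out in one intermediate FOREACH, so MyUDF appears only once in the script):

MAPREC = FOREACH LOADFILE GENERATE MyUDF(record);
-- one intermediate FOREACH extracts both keys from the map returned by the UDF
BOTH = FOREACH MAPREC GENERATE $0#'errorRecord' AS ErrorRecord, $0#'mInfo' AS MRecord;
ERRLIST = FILTER BOTH BY ErrorRecord IS NOT NULL;
MLIST = FILTER BOTH BY MRecord IS NOT NULL;
STORE MLIST INTO 'fileOut';
STORE ERRLIST INTO 'errorDir';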

Related

Having trouble reading a var using FOREACH in Pig Latin

I am having trouble with the following Pig code.
The previous variable I need to read via FOREACH has the following DESCRIBE:
UnionD1D2_Distinct: {UnionD1D2_Foreach1::null::display_site: chararray,
                     UnionD1D2_Foreach1::efectivos_click: long,
                     UnionD1D2_Foreach2::null::display_site: chararray,
                     UnionD1D2_Foreach2::total_click: long}
And here, example data:
(linuxlife.example.com,113,linuxlife.example.com,5343)
(mobilesource.example.com,211,mobilesource.example.com,8120)
(siliconshore.example.com,170,siliconshore.example.com,7764)
(printoperator.example.com,62,printoperator.example.com,2724)
So, the FOREACH that reads the data is:
UnionD1D2_Calc = FOREACH UnionD1D2_Distinct
GENERATE
(UnionD1D2_Distinct.UnionD1D2_Foreach1::efectivos_click1/UnionD1D2_Distinct.UnionD1D2_Foreach2::total_click2)*100 AS ctr;
But, I'm always getting the following:
ERROR 1066: Unable to open iterator for alias UnionD1D2_Calc. Backend
error : Scalar has more than one row in the output. 1st :
(filmport.example.com,121,filmport.example.com,5395), 2nd
:(firesale.example.com,129,firesale.example.com,5452)
What am I doing wrong?
When you're using FOREACH on an alias, you don't need to use the alias name again to refer to a field. For example, instead of UnionD1D2_Distinct.UnionD1D2_Foreach1::efectivos_click1 you can just use UnionD1D2_Foreach1::efectivos_click1.
Please try:
UnionD1D2_Calc = FOREACH UnionD1D2_Distinct GENERATE
(UnionD1D2_Foreach1::efectivos_click1/UnionD1D2_Foreach2::total_click2)*100 AS ctr;
And let us know if you get the same error.
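Also worth checking: both fields are longs according to the DESCRIBE, so the division is integer division and ctr will come out truncated to a whole number. If you want a fractional CTR, casting one operand to double (and using the field names exactly as DESCRIBE reports them) should do it, roughly:

UnionD1D2_Calc = FOREACH UnionD1D2_Distinct GENERATE
    ((double)UnionD1D2_Foreach1::efectivos_click /
     UnionD1D2_Foreach2::total_click) * 100 AS ctr;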

PIG: Process tuples in a bag

I have a data set that looks like this after a GROUP operation:
input = key1|{(a1,b1,c1),(a2,b2,c2)}
key2|{(a3,b3,c3),(a4,b4,c4),(a5,b5,c5)}
I need to traverse the above to generate final output like this:
<KEY>key1</KEY>|
<VALUES><VALUE><VALUE1>a1</VALUE1><VALUE2>b1</VALUE2><VALUE3>c1</VALUE3></VALUE><VALUE><VALUE1>a2</VALUE1><VALUE2>b2</VALUE2><VALUE3>c2</VALUE3></VALUE></VALUES>
<KEY>key2</KEY>| ...
I have tried to use FLATTEN and CONCAT to achieve this result in the below manner:
A = FOREACH input GENERATE key, FLATTEN(input);
output = FOREACH A GENERATE CONCAT('<KEY>',CONCAT(input.key,'</KEY>')),
CONCAT('<VALUE>',''),
CONCAT('<VALUE1>',CONCAT(input.col1,'</VALUE1>')
...
But this does not give the desired output. Fairly new to pig, so don't know if this is possible.
If you FLATTEN your bag, you'll end up with as many new 'rows' as there were elements in the bag:
key1|(a1,b1,c1)
key1|(a2,b2,c2)
If I understand your problem correctly, you want this:
Use the BagToTuple built-in function. Then you'll get
key1|(a1,b1,c1,a2,b2,c2)
After this you can format your data with, e.g., a UDF.
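A minimal sketch of that, assuming the grouped relation is called grpd and its bag column is called data (both names are illustrative):

-- BagToTuple is a built-in function in recent Pig versions
flat = FOREACH grpd GENERATE group AS key, BagToTuple(data) AS vals;
-- each row now looks like: key1|(a1,b1,c1,a2,b2,c2)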

Store pig result in a text file

Hi Stack Overflow community,
I'm totally new to Pig. I want to STORE the result in a text file and name it as I want. Is it possible to do this using the STORE function?
My code:
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput';
Thanks.
Yes, you will be able to store your result in myoutput.txt, and you can write the data to the file with any delimiter you want using PigStorage.
a = LOAD 'example.csv' USING PigStorage(';');
b = FOREACH a GENERATE $0,$1,$2,$3,$6,$7,$8,$9,$11,$12,$13,$14,$20,$24,$25;
STORE b INTO 'myoutput.txt' USING PigStorage(';');
Yes, it is possible. b will project 15 columns for every row, drawn from positions $0 through $25 of the input.
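One thing to keep in mind: on Hadoop, STORE treats 'myoutput.txt' as a directory, and the actual data ends up in part files inside it (e.g. part-m-00000). If you need a single local text file with exactly that name, you would typically merge the part files afterwards, for example with hadoop fs -getmerge myoutput.txt myoutput.txt.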

Pig LOADER for Splunk-like records

I am trying to use PIG to read data from HDFS where the files contain rows that look like:
"key1"="value1", "key2"="value2", "key3"="value3"
"key1"="value10", "key3"="value30"
In a way the rows of the data are essentially dictionaries:
{"key1":"value1", "key2":"value2", "key3":"value3"}
{"key1":"value10", "key3":"value30"}
I can read and dump a portion of this data easily enough with something like:
data = LOAD '/hdfslocation/weirdformat*' USING PigStorage(',');
sampled = SAMPLE data 0.00001;
dump sampled;
My problem is that I can't parse it efficiently. I have tried to use
org.apache.pig.piggybank.storage.MyRegExLoader
but it seems extremely slow.
Could someone recommend a different approach?
It seems like one way is to use a Python UDF.
This solution is heavily inspired by bag-to-tuple.
In myudfs.py write:
#!/usr/bin/python
def FieldPairsGenerator(dataline):
    # split the line on commas, then each piece on '=', stripping whitespace and quotes
    for x in dataline.split(','):
        k, v = x.split('=')
        yield (k.strip().strip('"'), v.strip().strip('"'))

@outputSchema("foo:map[]")
def KVDataToDict(dataline):
    return dict(kvp for kvp in FieldPairsGenerator(dataline))
Then write the following Pig script:
REGISTER 'myudfs.py' USING jython AS myfuncs;
data = LOAD 'whereyourdatais*.gz' AS (foo:chararray);
A = FOREACH data GENERATE myfuncs.KVDataToDict(foo);
A now has the data stored as a Pig map.
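As a quick usage sketch (the key names are just the ones from the sample rows above), individual values can then be pulled out of the map with the # operator:

B = FOREACH A GENERATE $0#'key1' AS key1, $0#'key3' AS key3;
DUMP B;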

Carry fields around, or store and join?

In Hadoop using Pig, I have a large number of fields in a few separate sources which I load, filter, project, group, run through a couple Java UDFs, join, project and store. (That's everyday life in Hadoop.) Some of the fields in the original load of data aren't used by the UDFs and aren't needed until the final store.
When is it better to pass unused fields through UDFs than to store and join them later?
A trivial toy example is a data source with columns name,weight,height and I ultimately want to store name,weight,heightSquared. My UDF is going to square the height for me. Which is better:
inputdata = LOAD 'data' AS (name, weight, height);
outputdata = FOREACH inputdata
GENERATE myudf.squareHeight(name,weight,height)
AS (name,weight,heightSquared);
STORE outputdata INTO 'output';
or
inputdata = LOAD 'data' AS (name, weight, height);
name_weight = FOREACH inputdata
GENERATE name,weight;
intdata1 = FOREACH inputdata
GENERATE myudf.squareHeight(name,height)
AS (iname,heightSquared);
intdata2 = JOIN intdata1 BY iname, name_weight BY name;
outputdata = FOREACH intdata2
GENERATE name,weight,heightSquared;
STORE outputdata INTO 'output';
In this case it looks pretty obvious: the first case is better. But the UDF does have to read and store and output the weight field. When you have 15 fields the UDF doesn't care about and one it does, is the first case still better?
If you have 15 fields the UDF doesn't care about, then don't send them to the UDF. In your example, there's no reason to write your UDF to take three fields if it's only going to use the third one. The best script for your example would be
inputdata = LOAD 'data' AS (name, weight, height);
outputdata =
FOREACH inputdata
GENERATE
name,
weight,
myudf.squareHeight(height) AS heightSquared;
STORE outputdata INTO 'output';
So that addresses the UDF case. If you have a bunch of fields that you'll want to store, but you are not going to use them in any of the next several map-reduce cycles, you may wish to store them immediately and then join them back in. But that would be a matter of empirically testing which approach is faster for your specific case.
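If you do go the store-and-join route, a rough sketch could look like this (the paths and the join key are illustrative assumptions, not taken from the question):

inputdata = LOAD 'data' AS (name, weight, height);

-- set aside the fields that are not needed for a while
carry = FOREACH inputdata GENERATE name, weight;
STORE carry INTO 'tmp/carry';

-- ...several map-reduce cycles later, join the carried fields back in...
computed = LOAD 'tmp/computed' AS (name, heightSquared);
carried = LOAD 'tmp/carry' AS (name, weight);
joined = JOIN computed BY name, carried BY name;
outputdata = FOREACH joined GENERATE computed::name, carried::weight, computed::heightSquared;
STORE outputdata INTO 'output';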
